
One of the hobby horses of this blog is that debugging is hard because of human problems, and not necessarily because of computers. A recent story linked on Slashdot about multiple computer failures aboard the International Space Station is a great illustration of that point, in a very high-stakes debugging situation.
From Space Station: Internal NASA Reports Explain Origins of June Computer Crisis:

During the first days of the computer failure in June, the station's atmosphere control system seized up. The failure also knocked out the autopilot's ability to fire maneuvering thrusters to hold the station steady during the undocking of the space shuttle, which had arrived on 10 June. The terse description in the NASA internal technical report on the crisis, obtained by IEEE Spectrum, put it this way: On 13 June, a complete shutdown of secondary power to all [three] central computer and terminal computer channels occurred, resulting in the loss of capability to control ISS Russian segment systems.

That's really, really bad. But what's worse is the response to the problem:

Russian officials were quick to blame NASA for 'zapping their computers' with 'dirty' 28-volt power from a newly installed solar power wing. Another Russian explanation was that the expanded station structure (the main purpose of the shuttle visit) might be excessively charging up due to its orbital speed through Earth's magnetic field. These were the first of many bad guesses by top Russian program managers that would distract engineers trying to get at the real problem. [My italics added for emphasis]

Now, I'm not willing to hold the US team as blameless as this author (a NASA team member) does, despite the clear faults in the Russian systems that were eventually uncovered. However, I agree with his characterization of these explanations as "guesses" and with his point that they were "distracting". Interestingly, the crew then took an action that fixed the problem, although no one could explain why:

The initial assumption was that some external interference, such as noise on the power supply, was responsible for generating false commands inside the computer system. On the assumption that the bad commands were coming from inside a power-monitoring device, the crew bypassed it on two of the three downed computers, using jumper cables. By the time the shuttle undocked on 19 June, the computers began to function normally, or so it seemed. Replacement parts were quickly manifested on a robot supply ship, while ground engineers wrestled with the fundamental question of cause and effect.
Analysis teams still had to determine why the computers failed, and why the jumper cables seemed to fix the problem. More important, they needed to know whether the problem really was fixed, or whether something could again trigger the systemwide crash of the supposedly triply redundant architecture.

In the end, it turned out that they had guessed right: the source of the problem was bad commands from the power-monitoring device. The why took some explaining, though. Essentially, a short circuit resulting from corrosion, itself the result of poor design, was causing the power-monitoring device to send a "shut down" message to the computers. That shutdown command was a "misfeature" designed to protect the computers from damaging voltage spikes. Instead of providing triple redundancy, all three computers were brought down by the same underlying cause.
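
To make the common-cause point concrete, here is a minimal sketch in Python. It is only an illustration, not a model of the actual station hardware: the names `Computer`, `PowerMonitor`, and `system_is_up` are hypothetical, and both the single monitor wired to all three channels and the "any one computer keeps the system up" rule are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Computer:
    name: str
    running: bool = True

    def shut_down(self) -> None:
        self.running = False


@dataclass
class PowerMonitor:
    """Hypothetical protective device shared by every computer (an assumption)."""
    computers: list = field(default_factory=list)

    def report(self, fault_detected: bool) -> None:
        # A genuine voltage spike and a false signal (for example, from a
        # short circuit) look identical to the computers downstream.
        if fault_detected:
            for computer in self.computers:
                computer.shut_down()


def system_is_up(computers) -> bool:
    # Assumed rule for this sketch: one running computer is enough.
    return any(c.running for c in computers)


def independent_failure() -> bool:
    """One computer fails on its own; the other two are unaffected."""
    computers = [Computer(f"channel-{i}") for i in range(3)]
    computers[0].shut_down()
    return system_is_up(computers)


def common_cause_failure() -> bool:
    """A single false signal from the shared monitor reaches all three."""
    computers = [Computer(f"channel-{i}") for i in range(3)]
    monitor = PowerMonitor(computers)
    monitor.report(fault_detected=True)
    return system_is_up(computers)


if __name__ == "__main__":
    print("survives an independent failure:", independent_failure())
    print("survives a common-cause failure:", common_cause_failure())
```

In this toy model the redundancy absorbs an independent failure, but it offers no protection once all three computers obey the same upstream signal.
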
This situation first brings to mind one of the rules from Debugging Rules: Stop Thinking and Look! In this case, it turned out that simply opening the boxes made it immediately clear that things weren't right: the electronics were wet.

In the weeks that followed the crisis and apparent recovery, station commander Fyodor Yurchikhin and his fellow cosmonaut Oleg Kotov disassembled the boxes and cabling and inspected every angle of the hardware, occasionally assisted by their American crewmate, Clayton Anderson. Multiple scopes and probes had failed to find the flaw, but their eyes and fingers eventually did.
The connection pins from the power-monitoring device they'd bypassed earlier, they found, were wet, and corroded. The final report described the change in appearance of fasteners on one box's connectors and noted the presence of deposits and residue on the housings, and residue and spots on the contact surfaces.

If they had just stopped theorizing and looked inside, it would have been immediately clear that something was very, very wrong. From a Distance Debugging standpoint, the issue was "the danger of the unfamiliar", more technically known as Ingroup Bias. Here the issue was directly attributable to general mistrust between US and Russian astronauts, and to the social boundaries that exist between them. However, in any system, we tend to blame "other" systems: the ones that we didn't build, or those about which we know less. I would guess that even if the two groups had both been from the US but had developed their technologies independently, the finger-pointing would have been the same.
In any case, we can be thankful that they were able to work around and eventually resolve the problem, but it shows how pervasive debugging problems are. Even the most highly trained astronauts and scientists resorted to the same heuristics and biases that tend to guide the rest of us when we go about fixing problems.
