I have taken to watching a show called Mystery Diagnosis on, I believe, the Discovery Channel (or maybe Discovery Health). The idea is simple: it tells the true story of someone with a disease for which they had trouble getting a correct diagnosis, until they eventually talk to the right person or get the right information and find out what the problem is. In many cases, they've needlessly endured a treatable illness or condition for years, and the resulting diagnosis and treatment are life-changing. It's a feel-good program, but it raises a lot of critical issues.

I've noted a few recurring themes in the program:

  • Almost invariably, the sufferer is given a cursory diagnosis that, at least in hindsight, appears to match their observed symptoms extremely poorly. Often friends and family members are extremely frustrated by the failure to dig deeply into the illness, while the patient is often oddly satisfied by the explanation despite their ongoing suffering.
  • The "mystery" is often not very mysterious at all once a doctor cares to take a serious look at the problem. In one recent episode, the initial doctor repeatedly dismisses a man's massive abdominal pain and horrific bleeding as "bad hemorrhoids". After the man switches insurance and is essentially forced to see a different doctor because of network coverage changes, he is immediately given a full colonoscopy which reveals stage III colon cancer. That's far from a "mystery" diagnosis and more like massive malpractice by the initial doctor.
  • The successful doctor usually solves the mystery by using a combination of careful inspection of the patient's condition, listening to and really trusting the patient's self-reported problems, targeted diagnostics, and reasoning about pathology. Once a careful catalog is made of the issues, history, and some simple bloodwork, the problem is almost always fairly obvious.

I can't help but draw parallels to my experience diagnosing software failures. A mysterious problem often turns out to be fairly obvious when the symptoms are actually catalogued and investigated, and there are often one or more "experts" who have come up with a variety of hypotheses, none of which seems to match the available facts. In the world of medicine, certainly the stakes are much higher, but a serious software bug can still cost dozens of companies hundreds of hours in downtime or lost productivity. I wonder the same thing about both medicine and software: what will it take for the field to decide to get better at diagnosis?

I've been working on a project that ties together a few different systems and technologies, and the whole thing is written in C++ and uses a lot of MFC and Windows libraries (don't ask, it's a long story). I was working on integrating a third-party application that produces XML documents that are parsed by the application under development when I hit a weird snag. The third-party application would produce a schema file with a namespace declaration like this:

 

xmlns:xml="http://www.w3.org/XML/1998/namespace"

 

Note that this is technically legal. The namespace prefix "xml" is reserved, but if it is declared, it must be bound to the W3C XML namespace, http://www.w3.org/XML/1998/namespace. The binding is implied, so it is absent from most documents, but making the legal, yet redundant, declaration is not wrong. However, the older Microsoft XML library I have been working with (MSXML, which predates .NET) complains about it, saying "the namespace prefix 'xml' is reserved and may not be declared". It does not check whether the prefix has been declared correctly.

So I'm stuck between two pieces of code I don't control: one that is doing something that is technically legal but unnecessary, and the other that is performing an error check that is sort of correct, but not really.  So where is the bug?  It's a little bit on both sides, or what I call a "bug in the cracks": something that emerges from the interaction of two pieces that are doing something fairly reasonable, but in an incompatible way.  My workaround is somewhat unfortunate. I have to do a string-level manipulation of the XML document before it gets passed to the MSXML parser to strip out the offending namespace declaration, since I can't actually parse it to fix it.  Hopefully we'll soon be able to upgrade to a newer version of the XML libraries (or maybe even move the whole thing to C#/.NET) and then this problem and the associated workaround will simply vanish.
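The string-level workaround can be sketched roughly as follows. This is a minimal illustration in Python rather than the C++/MFC code from the actual project, and the function name strip_redundant_xml_prefix and the sample document are hypothetical. It assumes the declaration only ever appears as an attribute, since a plain text substitution cannot tell an attribute apart from the same characters inside a comment or CDATA section.

```python
import re

# The one value the reserved "xml" prefix may legally be bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Matches the redundant declaration with either quote style and optional
# whitespace around "=". The leading \s+ also consumes the separator, so
# the attributes that remain stay correctly spaced.
_REDUNDANT_XML_DECL = re.compile(
    r"""\s+xmlns:xml\s*=\s*(["'])""" + re.escape(XML_NS) + r"""\1"""
)


def strip_redundant_xml_prefix(document: str) -> str:
    """Remove legal-but-redundant xmlns:xml declarations from raw XML text."""
    return _REDUNDANT_XML_DECL.sub("", document)


# Hypothetical sample input, not the real third-party schema.
sample = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
    'xmlns:xml="http://www.w3.org/XML/1998/namespace">'
    "</xs:schema>"
)
cleaned = strip_redundant_xml_prefix(sample)
print(cleaned)
```

Only the correct binding is removed. A document that binds "xml" to some other URI is passed through untouched, so the parser still gets the chance to reject a declaration that really is wrong.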

As I have been programming for a fairly long time now, I have noticed a very odd thing happen to me emotionally when I am writing code. I get very invested in writing and debugging it, almost like I would a novel or movie. When I put in a fix for a problem and rerun the test case, I feel anxious, as if some sort of gruesome fate awaits me if my assert() fails, and not just an angry red bar telling me the test failed. On the flip side, when it runs cleanly, I get a rush that I've dubbed the "Coder's High" because it feels like you are the smartest person on the face of the earth for at least 5 seconds.

The effect carries over to the process itself, such as when I'm interrupted by the end of the workday or a phone call when I am in the middle of writing or debugging something.  It's more than the mild discomfort experienced when leaving the flow state associated with programming. It feels like I've been deprived of the last 10 pages of a mystery, or the last 5 minutes of a suspense thriller.  I actually want to return to the keyboard to find out how it ends.

I had one of those moments this morning when I finally conquered the jQuery demons that were circling around my project (in case you were wondering, .children() only looks at direct children, while .find() searches the DOM recursively; I actually knew that, but for some reason I looked at the function 10 times before I noticed I was calling the wrong one). The psychological benefit has carried me through most of my day, and I feel like I was a much better programmer because of it. I have discovered that I like that effect more than the "find out how it ends" annoyance, and I've learned to work on a problem up until I believe I have the solution, then actually wait until the next day to try it out. That way I have something to work on right away when I get in instead of procrastinating, and I might just get a little shot of adrenaline and endorphins out of it.
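The same distinction between direct children and all descendants shows up in most tree APIs. As a rough analog (not jQuery itself), here is a small Python sketch using the standard library's xml.etree.ElementTree, where the path "li" matches only direct children and ".//li" matches at any depth. The markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<ul>"
    "<li>one</li>"
    "<li>two<ul><li>nested</li></ul></li>"
    "</ul>"
)

# Direct children only, comparable to jQuery's .children("li").
direct = [li.text for li in root.findall("li")]

# Descendants at any depth, comparable to jQuery's .find("li").
recursive = [li.text for li in root.findall(".//li")]

print(direct)     # ['one', 'two']
print(recursive)  # ['one', 'two', 'nested']
```

The two calls look nearly identical on the page, which is exactly why this kind of bug survives ten readings of the same function.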

A few weeks ago, I described the probability-testability matrix, and emphasized that you should consider both how likely a theory is, and how difficult it is to test that theory, when choosing how to proceed in a debugging effort. I have always been surprised that people seem to have trouble with this, and choose very low probability things to try first regardless of testability. I haven't had the resources to actually carry out a study on this, but I have a strong feeling that a cognitive bias is to blame, although I don't know if it already has a name in the literature. I've dubbed it the "Hold out Hope" bias. The idea is that people are biased against taking actions that leave them with worse options, even if selecting the better options means that the likelihood of success is higher early on.

To be more concrete, let's say that I have 4 theories for a problem, that I've guessed are 70, 50, 5, and 1 percent likely, respectively (note that they do not sum to 100% since I am simply giving a probability for each to be the outcome, and I might feel strongly about more than one). If I start with the 70% theory and I'm wrong, then I've got something that's less likely to try. If I test the 50% and I'm still wrong, then I'm looking at a couple of low-probability theories and I'm starting to feel very uneasy. If I instead start with 1% and it's wrong, I can say "well I didn't think that was it, and I've got some much better options". I can continue to "Hold out Hope" that my higher likelihood theories are correct. It results in failure feeling like I'm improving my chances instead of reducing them.

Of course, by trying the things that we think are less likely first, we are greatly increasing the number of theories that have to be tested, and that makes the whole effort take longer, even if it helps with morale. Remember that it's the fixing part that feels good, so avoid the Hold out Hope bias and go for the high probability stuff first.
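To put a rough number on that cost, here is a small Python sketch using the four guesses from the example. It rests on simplifying assumptions that are mine, not part of the original argument: each theory is right independently of the others, every test costs the same (so testability is ignored), and testing stops as soon as one theory is confirmed. The function name expected_tests is made up for illustration.

```python
def expected_tests(probabilities):
    """Expected number of theories tested, stopping at the first correct one.

    Assumes each theory is right independently with the given probability,
    and that testing stops once a theory is confirmed or the list runs out.
    """
    expected = 0.0
    still_searching = 1.0  # chance that every earlier theory was wrong
    for p in probabilities:
        expected += still_searching  # this test only runs if we got this far
        still_searching *= 1.0 - p
    return expected


guesses = [0.70, 0.50, 0.05, 0.01]
likely_first = expected_tests(sorted(guesses, reverse=True))
hopeful_first = expected_tests(sorted(guesses))
print(f"most likely first:  {likely_first:.2f} tests on average")
print(f"least likely first: {hopeful_first:.2f} tests on average")
```

Under these assumptions, starting with the most likely theory takes about 1.59 tests on average, while starting with the least likely takes about 3.40, roughly twice the work for the sake of feeling better about each failure.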

Most of us are introduced to the scientific method early on. We learn about making hypotheses, collecting data, and analyzing results. Unfortunately, we are rarely given an opportunity to study the second part of science, which is establishing a theoretical basis for the result that meshes with current understanding. One simple example is the classic "ESP" experiment. Let's say that I could demonstrate successful "mind reading" to a level that is incredibly statistically unlikely, and I could do so repeatably. In order for it to become "scientific", that's simply not sufficient.  I would need to explain the following:

  1. In what form is the information being transmitted from the other person to me?
  2. How might that be observable through other means?
  3. What about me is different such that I have this capability while others do not?

Otherwise, I'm making an even bolder claim: not only can I read minds, but I also rely on some method of data transmission that is undetectable or unknown to science. To avoid that territory, we certainly could invent answers to these questions that made scientific sense. Perhaps I am especially attuned to the magnetic fields created by neuron firings, and I can decode that information. Maybe I have some sort of genetic anomaly that amplified a natural capability, and so on. Many incredibly bizarre and counterintuitive findings have proven to be totally explainable (or have invalidated previously established scientific concepts) once the theoretical basis is established. The need to link new findings in with established scientific precedent strikes many people as a kind of dampening effect on research that challenges the status quo. In some sense, that's true, but it's really about burden of proof. If you find something that is significantly at odds with long-standing research with a significant experimental basis, you had better have a good explanation and have triple-checked your work.

The problem isn't that people don't create good theoretical bases for their experiments, it's that they don't consider them enough in the first place. I remember a class in graduate school where the professor recounted the "surprising" finding that deaf children acquire sign language in the same way (progressing from babbling of word elements through individual words through sentences and so on) and at the same rate that hearing children learn spoken language. I immediately thought: "Why is that surprising?" Since we have such a wide basis of theory and experimental knowledge about language and the brain's innate capability to acquire it, wouldn't we be a lot more surprised if deaf children were somehow stunted in their language development? That would be a very interesting finding, suggesting that humans are predisposed to spoken language specifically, which would fly in the face of many other theories, such as the general belief that sign language predated spoken language. A finding that spoken language has since become favored would need its own theoretical basis.

What does this have to do with debugging? As I have mentioned in this space many times before, you have to have a theory when working, and that theory had better be plausible, and it should hopefully be probable. However, on the data collection side of things, those same rules must apply. When lacking a solid theory, we often begin running test cases and other perturbations of the system to see if anything unusual turns up. Frequently though, those experiments break down into two outcomes: one that is trivial in that it fails to challenge our current understanding, and the other that is wildly at odds with our current understanding. To avoid these kinds of polarizing tests, it's good practice to look at the expected possible outcomes and say "If the test fails in the expected way, what does that mean? If the test succeeds, what does that mean?". Ideally, both should tell you something interesting, although that's not always possible. Most importantly, if your test does give you back a surprising result, start from the assumption that your test is wrong. You will save a lot of heartache by throwing out bad tests instead of becoming a true believer in a bizarre theory with one measly test case to support it.

One of the hobby horses of this blog is that debugging is hard because of human problems, and not necessarily because of computers. A recent story linked on Slashdot about multiple computer failures aboard the International Space Station is a great illustration of that problem, in a very high-stakes debugging situation.
From Space Station: Internal NASA Reports Explain Origins of June Computer Crisis:

During the first days of the computer failure in June, the station's atmosphere control system seized up. The failure also knocked out the autopilot's ability to fire maneuvering thrusters to hold the station steady during the undocking of the space shuttle, which had arrived on 10 June. The terse description in the NASA internal technical report on the crisis, obtained by IEEE Spectrum, put it this way: On 13 June, a complete shutdown of secondary power to all [three] central computer and terminal computer channels occurred, resulting in the loss of capability to control ISS Russian segment systems.

That's really, really bad. But what's worse is the response to the problem:

Russian officials were quick to blame NASA for 'zapping their computers' with 'dirty' 28-volt power from a newly installed solar power wing. Another Russian explanation was that the expanded station structure (the main purpose of the shuttle visit) might be excessively charging up due to its orbital speed through Earth's magnetic field. These were the first of many bad guesses by top Russian program managers that would distract engineers trying to get at the real problem. [My italics added for emphasis]

Now I'm not willing to hold the US team as blameless as this author does (a NASA team member) despite the clear faults with the Russian systems that were eventually uncovered. However, I agree with his characterization of these problems as "guesses" and that they were "distracting". Interestingly, in this case, they took an action that fixed the problem, although no one could explain why:

The initial assumption was that some external interference, such as noise on the power supply, was responsible for generating false commands inside the computer system. On the assumption that the bad commands were coming from inside a power-monitoring device, the crew bypassed it on two of the three downed computers, using jumper cables. By the time the shuttle undocked on 19 June, the computers began to function normally -- or so it seemed. Replacement parts were quickly manifested on a robot supply ship, while ground engineers wrestled with the fundamental question of cause and effect. Analysis teams still had to determine why the computers failed, and why the jumper cables seemed to fix the problem. More important, they needed to know whether the problem really was fixed, or whether something could again trigger the systemwide crash of the supposedly triply redundant architecture.

In the end, it turns out that they had guessed right about the source of the problem being bad commands from the power-monitoring device, but the why took some explaining. Essentially a short-circuit resulting from corrosion, itself the result of poor design, was causing the power-monitoring device to send a "shut down" message to the computers, which was a "misfeature" designed to protect them from voltage spikes that would damage them. Instead of there being triple redundancy, all three were brought down by the same underlying cause.
This situation first brings to mind one of the rules from Debugging Rules: Quit Thinking and Look! In this case it turns out that by simply opening the case, it was immediately clear that things weren't right: the electronics were wet.

In the weeks that followed the crisis and apparent recovery, station commander Fyodor Yurchikhin and his fellow cosmonaut Oleg Kotov disassembled the boxes and cabling and inspected every angle of the hardware, occasionally assisted by their American crewmate, Clayton Anderson. Multiple scopes and probes had failed to find the flaw, but their eyes and fingers eventually did.
The connection pins from the power-monitoring device they'd bypassed earlier, they found, were wet -- and corroded. The final report described the change in appearance of fasteners on one box's connectors and noted the presence of deposits and residue on the housings, and residue and spots on the contact surfaces.

If they had just stopped theorizing and looked inside, it would have been immediately clear that something was very, very wrong. From a Distance Debugging standpoint, the issue was "the danger of the unfamiliar", more technically known as Ingroup Bias. Here the issue was directly attributable to general mistrust between US and Russian astronauts, and the social boundaries that exist between them. However, in any system, we will tend to blame "other" systems, the ones that we didn't build or those of which we have less knowledge. I would guess that in this case, even if the two groups were both US but had developed their technologies independently, the finger-pointing would have been the same.
In any case, we can be thankful that they were able to work around and eventually resolve the problem, but it shows how pervasive debugging problems are. Even the most highly trained astronauts and scientists resorted to the same heuristics and biases that tend to guide the rest of us when we go about fixing problems.

I am constantly admonishing companies to take IT seriously, in the way that they would take accounting or marketing seriously. In fact, I would argue that technology management should be a core part of a modern business curriculum beyond generic "Management of Business Technology" courses. Most businesses I know take only a passing interest in trying to keep tabs on where they are in terms of the applications they are running, their infrastructure, and in particular, having any idea what their employees could use to do their job better. So when I saw this post on TreeHugger about major power savings being found simply by turning totally unused machines off, it seemed like a nice metaphor for corporate IT problems in general. From the article:

In some companies it may be the case that there are many servers that are left on for no good reason, simply to serve legacy applications. Mark Monroe, Sun Microsystems' director of sustainable computing, gave a talk where he explained that they were able to turn off 10% of their servers in this way. He called the phenomenon "data center drift".

He went on to explain that a survey had found two companies had 504 "mystery machines" out of 4,300 servers. When they were turned off they had no actual impact on the companies' operations. This is something that should be simple to implement, but can have a dramatic impact on energy bills.

I particularly love the characterization of the phenomenon as 'data center drift', as it transforms incompetence into something that sounds almost natural. In the above example, nearly an eighth of their servers were simply doing nothing that affected a single person if they were turned off. Imagine if the accounting staff did an audit and determined that an eighth of the corporate budget was just being thrown into a giant pit and buried every day, but that they didn't notice because it used to be that the money was funding things, and it slowly shifted to the pit. They would probably be fired on the spot. I somehow doubt any of the IT staff were taken to task for this gross oversight.

When people discuss the creation of software, they make a clean separation between the tools that one can use, such as an IDE, and the techniques one can use, such as Object-Oriented Programming. In general, good tools can offer efficiencies, analysis, or insight that would be missed without them, and they might even help to reinforce the use of certain techniques, such as Eclipse's support for refactoring as a core piece of functionality.

For whatever reason, people have trouble making this same distinction with debugging. When people talk about it, they tend to refer to the tools they use for the job rather than their theoretical approach to problem solving. Case in point: a post about debugging on the Meebo Blog. It didn't contain any misinformation, but it only mentioned the tools that they use: gdb, core dumps, strace, etc. It left me wondering: so you open up the core dump with gdb, then what? How do you search through the massive dump to find the information you need? Do you keep a detailed history of common errors? Do you generally have an idea of what is wrong before you start?

The notion that the availability or knowledge of certain tools shapes or empowers your thinking has been called "The Fingertip Effect" by Dr. David Perkins, referring to the concept of having resources "at your fingertips". He used it to formulate an argument in the context of educational technologies, where the fingertip effect is generally taken as a given. Parents frequently demand that a certain number of computers, laptops, PDAs, or technology X be present in every classroom or within the school in order to maximize student opportunities. The implicit premise is that by making these resources available "at students' fingertips" they will suddenly be able to use them in an appropriate way, and they will naturally benefit.

Most research seems to suggest that this is not true (and Dr. Perkins coined the term for the purposes of investigating, and eventually discrediting, the idea). The problem is that students vary widely in their ability to make use of new technologies without corresponding instruction in their affordances (now I'm really pouring on the ed jargon). So the fingertip effect isn't necessarily wrong, it's just not a good educational strategy; technology has to be combined with an effective instructional program so that students actually understand what it can do for them, how it does it, and why.

The same can be said for debugging. It isn't that a focus on tools such as gdb, strace, or even Delta Debugging is wrong, it's that if you don't have a framework for understanding how to isolate and fix bugs, tools won't help you. In the same way, having Microsoft Word will not make a poor writer into a good one, even though it will allow a good writer the opportunity to become better since they can spend less time on copyediting and more time on style and content. In retrospect, the idea that tools cannot be substituted for techniques seems obvious, yet parents continue to call for more computers without calling for more computer-savvy instructors, and the debugging field continues to offer a wider variety of tools without regard for instructing people in their use. My hope is that the material on this blog helps people make better use of the tools that are already at their fingertips.

From Engadget yesterday:

Merely three days after hearing of one user's run-in with Apple over his unlocked iPhone, the company has released an official statement warning users that "unauthorized iPhone unlocking programs" could cause "irreparable damage to the iPhone's software." Furthermore, the firm stated that these apps could result in the handset becoming "permanently inoperable when a future Apple-supplied iPhone software update is installed" -- you know, like the one coming "later this week" that includes the iTunes WiFi Music Store.

The team that developed the unlocking software offered a response today:

Based on download numbers, the iPhone Dev Team believes that, worldwide, several hundred thousand people have unlocked their iPhones. That number continues growing every day. The removal of the lock, a bug, was a major step forward in the iPhone development. It made the iPhone free and useful to anyone, not only to those in certain countries.

Apple now announces that the next firmware update, expected later this week, will possibly break the handset of all of us free users in the World. It speaks of "damage" done to the firmware and "unauthorized access" to our own property, The removal of those firmware problems, which were built in in favor for AT&T, does not cause "damage" as they want to make us believe.

We will provide you with a tool in the next week which will be able to recover your nck counter and seczones and even enables you to restore your phone to a Factory-like state.

Apple has taken a lot of flak for this statement, with people arguing that they are fearmongering with their claim that the unlock will break their phone when the firmware update comes out, and by making such an update in the first place. My question is, why would Apple go to the trouble of telling people about the problem ahead of time? Why not just release the update and break all the hacked phones, thereby making people wary of unlocking things at all?

To me, this whole exchange seems like some tricky subtextual communication between Apple and the iPhone hackers to keep customers, who want unlocked iPhones and give Apple $, and vendors, who want locked iPhones and give Apple $, happy while still allowing everyone to go about their business:

Apple: I'm going to give you until the count of 3 to put your iPhone back the way you found it! I'd hate to have to take away your privileges altogether (wink, wink)!

iPhone Dev Team: Okay, okay, you win. I'll put the iPhone back the way it was. You can roll out that upgrade whenever you'd like (wink, wink).

I guess we'll see what happens the next time around...

When called in to fix a computer problem at someone else's location, what should you bring with you to increase the chances of success? Certainly, it helps to know something about the problem before walking in the door, but having a generic set of tools that will assist in a wide variety of situations is critical. Here is a list of what I put in my toolbox.

Hardware:

  • Laptop - You need your own computer since there may not be a working one present. Dual boot Linux/Windows, although you can stay in Linux, generally speaking. Better yet, run Linux and have a few virtual machines with different OSes ready to go.
  • Wireless (cellular) modem card - This is not necessary, but I like to have a high-speed, dedicated internet connection that keeps me from having to rely on the customer environment, especially if the network is the problem.
  • Wireless (802.11x) card - For working with/sniffing wireless networks
  • Hub - You can use a switch that has a sniffer port that allows you to see all the traffic, but a 4-port hub makes it easy to insert yourself into the network quickly and unobtrusively
  • Ethernet cables - Probably at least two standard and one crossover, just in case.
  • Computer screwdrivers - in case you need to open the case and look for evidence of physical problems, or pop out flaky hardware.
  • Thumb drive with all the software described below, for as many OSes as each tool supports - Since a 2GB thumb drive costs so little, this is easy, and allows you to quickly copy analysis and dev tools onto other machines.

Software:

  • Standard network and communication utilities - ping, traceroute, ssh, etc - helpful for checking the status of machines and for answering questions about networks
  • Standard network service daemons - dhcpd, named, etc - helpful for allowing your laptop to pose as various services.
  • Advanced network utilities - nmap, wireshark - for really looking at what is coming over the network, and analyzing hosts
  • Standard gcc toolchain - don't leave home without it
  • A serious IDE such as Eclipse - I like Eclipse because I can use it to quickly examine Java, Ruby, C++, PHP, etc with all the plugins I've installed over the years. If you have a Windows partition/VM and some extra cash, Visual Studio can help too.
  • Linux rescue CD from your favorite flavor - That way you can boot into an OS where you can do whatever you want, including inspect partition tables, mount various drives to access content without needing passwords, and generally, take the machine's possibly broken configuration out of the equation to separate out hardware and software issues.
  • Password crackers and recovery tools - This one can be ethically questionable, but when you need some files that a developer left in their account before leaving the country for a one-month vacation, a customer will be begging you to break them out. I recommend something to recover BIOS passwords, a zip file password cracker, and if you want to lug a big drive around, a generic password cracker that uses rainbow tables to break systems protected by weak password hashes.

That's a good starting list. I may post again if I add other tools that may be of interest, and feel free to add your own suggestions in the comments.
