One of the hobby horses of this blog is that debugging is hard because of human problems, and not necessarily because of computers. A recent story linked on Slashdot about multiple computer failures aboard the International Space Station is a great illustration of that problem, in a very high-stakes debugging situation.
From Space Station: Internal NASA Reports Explain Origins of June Computer Crisis:

During the first days of the computer failure in June, the station's atmosphere control system seized up. The failure also knocked out the autopilot's ability to fire maneuvering thrusters to hold the station steady during the undocking of the space shuttle, which had arrived on 10 June. The terse description in the NASA internal technical report on the crisis, obtained by IEEE Spectrum, put it this way: On 13 June, a complete shutdown of secondary power to all [three] central computer and terminal computer channels occurred, resulting in the loss of capability to control ISS Russian segment systems.

That's really, really bad. But what's worse is the response to the problem:

Russian officials were quick to blame NASA for 'zapping their computers' with 'dirty' 28-volt power from a newly installed solar power wing. Another Russian explanation was that the expanded station structure (the main purpose of the shuttle visit) might be excessively charging up due to its orbital speed through Earth's magnetic field. These were the first of many bad guesses by top Russian program managers that would distract engineers trying to get at the real problem. [My italics added for emphasis]

Now I'm not willing to hold the US team as blameless as this author does (a NASA team member) despite the clear faults with the Russian systems that were eventually uncovered. However, I agree with his characterization of these problems as "guesses" and that they were "distracting". Interestingly, in this case, they took an action that fixed the problem, although no one could explain why:

The initial assumption was that some external interference, such as noise on the power supply, was responsible for generating false commands inside the computer system. On the assumption that the bad commands were coming from inside a power-monitoring device, the crew bypassed it on two of the three downed computers, using jumper cables. By the time the shuttle undocked on 19 June, the computers began to function normally, or so it seemed. Replacement parts were quickly manifested on a robot supply ship, while ground engineers wrestled with the fundamental question of cause and effect. Analysis teams still had to determine why the computers failed, and why the jumper cables seemed to fix the problem. More important, they needed to know whether the problem really was fixed, or whether something could again trigger the systemwide crash of the supposedly triply redundant architecture.

In the end, it turns out that they had guessed right about the source of the problem being bad commands from the power-monitoring device, but the why took some explaining. Essentially, a short circuit resulting from corrosion, itself the result of poor design, was causing the power-monitoring device to send a "shut down" message to the computers, which was a "misfeature" designed to protect them from voltage spikes that would damage them. Instead of there being triple redundancy, all three were brought down by the same underlying cause.
This situation first brings to mind one of the rules from Debugging Rules: Stop Thinking and Look! In this case, it turns out that simply opening the box made it immediately clear that things weren't right: the electronics were wet.

In the weeks that followed the crisis and apparent recovery, station commander Fyodor Yurchikhin and his fellow cosmonaut Oleg Kotov disassembled the boxes and cabling and inspected every angle of the hardware, occasionally assisted by their American crewmate, Clayton Anderson. Multiple scopes and probes had failed to find the flaw, but their eyes and fingers eventually did.
The connection pins from the power-monitoring device they'd bypassed earlier, they found, were wet, and corroded. The final report described the change in appearance of fasteners on one box's connectors and noted the presence of deposits and residue on the housings, and residue and spots on the contact surfaces.

If they had just stopped theorizing and looked inside, it would have been immediately clear that something was very, very wrong. From a Distance Debugging standpoint, the issue was "the danger of the unfamiliar", more technically known as Ingroup Bias. Here the issue was directly attributable to general mistrust between US and Russian astronauts, and the social boundaries that exist between them. However, in any system, we will tend to blame "other" systems, the ones that we didn't build or those of which we have less knowledge. I would guess that in this case, even if the two groups were both US but had developed their technologies independently, the finger-pointing would have been the same.
In any case, we can be thankful that they were able to work around and eventually resolve the problem, but it shows how pervasive debugging problems are. Even the most highly trained astronauts and scientists resorted to the same heuristics and biases that tend to guide the rest of us when we go about fixing problems.

I am constantly admonishing companies to take IT seriously, in the way that they would take accounting or marketing seriously. In fact, I would argue that technology management should be a core part of a modern business curriculum beyond generic "Management of Business Technology" courses. Most businesses I know take only a passing interest in trying to keep tabs on where they are in terms of the applications they are running, their infrastructure, and in particular, having any idea what their employees could use to do their job better. So when I saw this post on TreeHugger about major power savings being found simply by turning totally unused machines off, it seemed like a nice metaphor for corporate IT problems in general. From the article:

In some companies it may be the case that there are many servers that are left on for no good reason, simply to serve legacy applications. Mark Monroe, Sun Microsystems' director of sustainable computing, gave a talk where he explained that they were able to turn off 10% of their servers in this way. He called the phenomenon "data center drift".

He went on to explain that a survey had found two companies had 504 "mystery machines" out of 4,300 servers. When they were turned off they had no actual impact on the companies' operations. This is something that should be simple to implement, but can have a dramatic impact on energy bills.

I particularly love the characterization of the phenomenon as 'data center drift', as it transforms incompetence into something that sounds almost natural. In the above example, nearly an eighth of their servers were simply doing nothing that affected a single person if they were turned off. Imagine if the accounting staff did an audit and determined that an eighth of the corporate budget was just being thrown into a giant pit and buried every day, but that they didn't notice because it used to be that the money was funding things, and it slowly shifted to the pit. They would probably be fired on the spot. I somehow doubt any of the IT staff were taken to task for this gross oversight.

When choosing an avenue of attack during the Isolation stage, it's important to keep in mind two different dimensions: probability and testability. Probability is your informed estimate of how likely you believe a particular problem is the cause. Testability is how much time and effort you suspect it will take to rule that particular cause in or out. Ideally, your most probable causes would be the most testable, but it rarely works out so nicely. Ultimately, you have a simple 2x2 matrix of possibilities, and you can place each theory in one of the sectors:

[Figure: Probability vs. Testability matrix]

The first theories to try are of course the Highly Probable-Easily Testable ones, labeled "Ideal" in the matrix. Next is a judgement call. If you have some very Easily Testable theories that are fairly Improbable, it might make sense to take an hour to knock them all out. These are labeled "Why Not?". On the other hand, if you have a Highly Probable theory that might take some effort to test, it could be much more valuable. These are labeled "Necessary Evil". Finally, if you've exhausted all other possibilities, it's time for the Low Probability and Hard to Test theories, labeled "Last Resort". Before you start trying to follow up on these ones, take another long look at what has already been tried, the data you've already collected, and any other information that might help you see a glimmer of another possibility before wasting a lot of effort on an unlikely theory. However, sometimes there's no other choice.
Rather than just assigning a label to each theory, it can also help to simply draw out the matrix above and plot your theories on an X/Y axis, where upper-left is best and lower-right is worst. This can help you easily see both where you ought to start, and how much work you are in for before starting in on an extended isolation exercise.
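
If you would rather keep the list in a script than on a whiteboard, here is a minimal Python sketch of the same idea. Everything in it is an assumption made for illustration: the cutoffs that separate "probable" from "improbable" and "easy" from "hard", the probability-per-hour tie-breaker used to order the two judgement-call sectors, and the example theories themselves, which are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    name: str
    probability: float  # informed estimate (0.0 to 1.0) that this is the cause
    test_hours: float   # rough effort needed to rule the theory in or out


def quadrant(theory, prob_cutoff=0.5, hours_cutoff=2.0):
    """Label a theory with its sector of the 2x2 matrix (cutoffs are assumptions)."""
    likely = theory.probability >= prob_cutoff
    easy = theory.test_hours <= hours_cutoff
    if likely and easy:
        return "Ideal"
    if easy:
        return "Why Not?"
    if likely:
        return "Necessary Evil"
    return "Last Resort"


def plan(theories):
    """Order theories: Ideal first, Last Resort last, judgement calls in between."""
    tier = {"Ideal": 0, "Why Not?": 1, "Necessary Evil": 1, "Last Resort": 2}

    def key(t):
        # Assumed tie-breaker: more probability per hour of testing goes first.
        payoff = t.probability / max(t.test_hours, 0.1)
        return (tier[quadrant(t)], -payoff)

    return sorted(theories, key=key)


# Hypothetical theories for an imaginary outage.
theories = [
    Theory("memory corruption in a driver", 0.05, 40.0),
    Theory("bad config value", 0.60, 0.5),
    Theory("loose cable", 0.10, 0.25),
    Theory("race condition in the cache", 0.70, 16.0),
]

for t in plan(theories):
    print(f"{quadrant(t):<15} {t.name}")
```

The tie-breaker is only one way to settle the "Why Not?" versus "Necessary Evil" judgement call; if a batch of cheap checks adds up to more time than the one expensive test, you may well want to reorder by hand.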

Recently, Steve Ballmer made some comments regarding social networking that were widely ridiculed (and, probably more appropriately, labeled as self-serving, since Microsoft has been looking to acquire a stake in Facebook and would be happy to drive down the price):

"I think these things [social networks] are going to have some legs, and yet there's a faddishness, a faddish nature about anything that basically appeals to younger people," Mr. Ballmer told Times Online yesterday.

On his blog, Marc Andreessen wrote a response making use of a common conceit: applying comments about a modern phenomenon to historical phenomena in a kind of reductio ad absurdum argument. A brief excerpt:

"I think these things [televisions] are going to have some legs, and yet there's a faddishness, a faddish nature about anything that basically appeals to younger people."

"I think these things [hip hop music] are going to have some legs, and yet there's a faddishness, a faddish nature about anything that basically appeals to younger people."

"I think these things [mobile phones] are going to have some legs, and yet there's a faddishness, a faddish nature about anything that basically appeals to younger people."

Now, I assume his point is that so-called disruptive technologies are often dismissed at the time as a fad and, quite frankly, it can be very hard to tell a fad from something truly transformative. This brings up a larger question though: is social networking more like television, hip-hop, and mobile phones, or is it more like video arcades, pocket bikes, and the "Rachel" haircut? What features of a trend might we use to determine this?

I started considering this question recently because I have a guilty secret: I don't really get mainstream social networking sites, even though I make heavy use of technology in general. I certainly understand why teenagers and college students use them, and I take part in lots of implicit social networks via listservs and other online communities, and I've even made use of some sites like last.fm, but I don't see why I would care to get seriously involved in Facebook or MySpace (Andreessen's company, Ning, makes more sense to me for reasons that will become clear in a moment).

My issue is fundamentally about what one might call personal power. I remember reading one of the Carlos Castaneda books when I was younger, and Don Juan at one point counsels the narrator to cut off his ties with friends and colleagues back home in an effort to erase his personal history and thereby increase his personal power, which is diluted by his past and relationships (I'm strongly paraphrasing; I read this book probably 15 years ago, but this part stuck with me for some reason). However you feel about the mystical mumbo-jumbo in these books, it's hard to not see the kernel of truth in this idea: you can increase your perceived status simply by limiting others' access to you.

This brings me back to the dilemma of social networking sites, and a rule that I've just made up that I'll call the inverse social power rule. Simply put, the likelihood of finding a contact on one of these sites is inversely proportional to the quality of the contact. The problem is that people who have power have no need of additional access paths to themselves, while those who are trying to rise in the ranks are much more willing to be "promiscuous" in allowing social access in the hopes of making a connection with someone of higher status. Blogging uses the same logic: I divulge information about what I'm doing and thinking in the hopes that I might attract smarter, more interesting people to say or think nice things about me, and maybe give me some money to do some work for them (hint, hint).

So if I am a high-school or college student, and therefore I am generally on the weak side of the power equation in most relationships, social networking makes sense. If I'm a CEO or a celebrity, I want to limit my access as much as possible and avoid social networking like the plague, since that's just giving the milk away for free. If I'm somewhere in the middle, I want to be more like the CEO, not more like the college student, so I want to make extremely judicious use of these types of sites lest I give the appearance of a weaker social status.

You'll note that Marc Andreessen does have a page on MySpace, but he hasn't logged in in nearly 2 years, and has 0 friends, revealing basically nothing about himself. Now that's a MySpace page for a CEO. As far as I can tell, there is no real Steve Ballmer listed there, although there are at least 2 parody profiles. It's pretty much the same story on Facebook, although Marc does list his companies. I somehow doubt he would respond to a poke though.

And therein lies my problem with social networking sites, and why I tend to agree a bit more with Ballmer than with Andreessen on this one, although I think Marc has a very different perspective because Ning is for building sites that allow for topical rather than status-oriented social connections, which breaks my power rule completely. I honestly believe that at this point in my life and career, I am better served by avoiding them than by joining them, and I wonder how many upwardly mobile 20-somethings are going to be frantically deleting their profiles from these sites when they realize that they have moved to the strong side of the power equation.

In conclusion, I have no doubt that entrepreneurs can profit from social networking sites, since they do benefit those who need them. However, I ultimately believe that they will not have transformative power because, unlike a technology such as a cellphone, which has become essential as a tool for increasing one's social status and only becomes more vital to the owner over time, social networking sites will continue to lose members just as they are becoming truly valuable, draining their ability to make a significant cultural impact.

Fantasy football is an incredibly popular pastime for many fans this time of year. I caught the bug a few years back, and over my time as a fantasy owner, I've noticed a lot of similarities between the lessons I've learned managing my fantasy team, and those I've learned managing software projects. Here are a few big ones:

  • The mythical man-month - Sure, we've all read the book and pay lip service to the concept that people and time are not fungible, but nothing will make this hit home like a disastrous 1-for-2 player trade that looks fair, but actually cripples your team. Each week, you can only start a limited number of players, and so it is in your best interest to concentrate the talent in as few players as possible. It's easy to get suckered into a trade where you give up one 16 point/week player for two 10 point/week players. Don't do it. Even though the total output is higher, you are getting a terrible deal. It doesn't matter if your bench players score 80 points every week if your starters do just the same.

    This is true of your software team as well: 8 great members are infinitely better than 16 mediocre ones, yet software teams tend to "staff up" to solve hard problems instead of just trying to concentrate talent in a smaller number of team members. It's not as easy to address this problem in real life as in fantasy football, but it's still a worthy goal.

  • Look out for bye weeks - It's all too easy to salivate over the prospect of a great player dropping into your lap at some later round of the draft, only to discover that you will wind up with both of your starters on a bye the same week, which cripples your chances of a win that week. What's the equivalent in software terms? Choosing multiple outstanding team members who have the same holes in their game. Three great coders are useless if they are all terrible at design. You have to learn to put together the right mix of skills.
  • Dance with the one that brung ya - You can easily outsmart yourself if you start worrying about weekly matchups, such as a running back facing a tough run defense, because you will pull good players in favor of mediocre players who look like they have a better chance to succeed. Except in a few rare cases, just start the same set of good players as much as possible, and you will be better off in the long run. In software projects we have a tendency to call on a "specialist" to look at a problem, such as calling in a DBA to try to help us address database performance problems. While a true expert can occasionally offer some insight (much like a bench running back can occasionally put up big numbers on a bad defense), generally you just wind up angering the team that worked hard on the application by discounting their ability to work through it themselves, and then getting generic advice from the expert who knows very, very little about your actual application. "Playing the matchup" by calling in the specialist is a risky play when your own staff already understands the problem, and is hopefully highly skilled themselves.
  • Balance your risk - It's easy to draft a team that is all guys with big potential upside: rookie running backs, breakout stars from the previous year, an up-and-coming defense, a no-name "sleeper" tight end. It's fine to take some of these guys, especially if you have a really solid top of the draft, but you have to balance your risk and take a bunch of established performers that will give you a base 50-60 points every week, and then swap in the gambles that pay off. In the same way, your software team needs to consist of some sobering, get-the-job-done influences so that you meet your deadlines, while also bringing on some lateral thinking, risk-taking staff that will solve problems in novel ways. They'll sometimes take a part of the system off a cliff and have to be reined in, but that's what the established low-risk team members are for. The right balance is critical.
  • Draft a team you like - I can't remember where I picked up this piece of advice, but the idea is simple: rather than (or more likely, in addition to) running elaborate analyses of who the best players are or mock drafts to try to pin down who you might get, simply make a list of players you'd be happy to have on your team, and go after those players. Nothing is worse than following a strict value-based draft spreadsheet to get the "best" player you can at each place in the draft, only to realize that you have a team that you aren't really that excited about.

    A software team is just the same, and the effect is exacerbated because you don't have to actually interact with the people on your fantasy team. When interviewing or selecting team members for a project, ask yourself: "setting aside what I see in the resume, the quality of their sample code, and their demeanor, would I actually be happy with them on my team?" You will be surprised how often with an apparently great candidate the answer to that question is no, and how often with a marginal candidate the answer is yes.