Technically speaking, blogs are supposed to be places where you talk about things you've seen or read and comment on them, at least in one connotation of the word, which I rarely do here. However, I came across this article discussing the origins and popularity of the "brain teaser" interview/recruiting format, and it seemed like a good way to return to my blogging roots.

I have previously railed against brainteasers, as I feel that they have little or nothing to do with what software engineers actually do. This article, though, goes beyond the classic brain teasers and also discusses the use of estimation problems such as "How much would you charge to wash all the windows in Seattle?". The article fails to mention the historical context for these questions as Fermi Problems, but it offers a similar justification, via a former Google employee:

Such questions are more relevant to a high-tech job interview than you might think. "Employers want to see if you can make an estimate in the ballpark, within an order of magnitude," says Mark Jen, a former Google employee who is now a program manager at Tagged.

and it goes on to posit:

Coders are constantly making educated guesses rather than calculating exact answers, so a good interview should probe how well a candidate handles such estimates. That's why Amazon.com interviewers, for example, have been known to ask job candidates to guess how many gas stations there are in the United States or to ballpark that bill for washing all of Seattle's windows.

I agree that we often need to make educated guesses instead of direct calculations, so it certainly is a more authentic assessment than a brainteaser (in my opinion). The bigger question is: is this a useful skill to test in an interview setting? I believe that if the original intent of the Fermi Problem is actually observed, then the answer is yes. To me, that means two things:

  1. Ignoring the actual estimate, and the values plugged into it, in favor of looking at which inputs were chosen - In the classic Fermi Problem "How many piano tuners are there in Chicago?" the fact that someone wildly under- or overestimates the population of the city is less important than the fact that they chose population as an input.
  2. Paying attention to where they feel they need more information, and probing about how they might obtain it.
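
To make the first point concrete, here is a minimal sketch in Python of how the piano tuner estimate might be laid out. Every number in it is a guess I made up for illustration, not researched data, and the names (`GUESSES`, `estimate_piano_tuners`, `within_order_of_magnitude`) are hypothetical. What an interviewer would be looking at is the list of keys, not the values.

```python
import math

# Every value here is a rough guess for illustration, not researched data.
# The interesting part is WHICH inputs were chosen, not how accurate they are.
GUESSES = {
    "population": 3_000_000,
    "people_per_household": 2,
    "households_per_piano": 20,
    "tunings_per_piano_per_year": 1,
    "tunings_per_tuner_per_day": 4,
    "working_days_per_year": 250,
}


def estimate_piano_tuners(g):
    """Chain the guessed inputs into an estimate of the number of tuners."""
    households = g["population"] / g["people_per_household"]
    pianos = households / g["households_per_piano"]
    tunings_needed = pianos * g["tunings_per_piano_per_year"]
    tunings_per_tuner = g["tunings_per_tuner_per_day"] * g["working_days_per_year"]
    return tunings_needed / tunings_per_tuner


def within_order_of_magnitude(a, b):
    """True when a and b differ by no more than a factor of ten."""
    return abs(math.log10(a / b)) <= 1


if __name__ == "__main__":
    baseline = estimate_piano_tuners(GUESSES)
    # Badly overestimate the population and see how far the answer moves.
    doubled = estimate_piano_tuners({**GUESSES, "population": 6_000_000})
    print(baseline, doubled, within_order_of_magnitude(baseline, doubled))
```

Getting the population wrong by a factor of two still lands in the same ballpark, which is why the choice of inputs tells you more than their values.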

I would also suggest choosing more domain-specific estimation problems. Instead of the cost to wash the windows of Seattle, ask them to estimate the number of bytes that are actually transmitted across the network to load a 100K HTML file in the browser, or the amount of power needed to keep a 1000-machine server farm at 90 degrees for an hour. Beyond understanding their estimation process, you will be able to see their mental model of how these operations play out in the real world, which is the part I personally feel is the most critical.
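
As a rough illustration of what an answer to the 100K HTML question could look like, here is a sketch under a lot of stated assumptions: plain HTTP with no TLS and no compression, a single TCP connection, 40 bytes of IPv4 plus TCP headers per packet with no options, a 1460-byte maximum segment size, one ACK for every two data segments, and header sizes for the request and response that are pure guesses. The function name and its parameters are hypothetical. A candidate who lists even half of these inputs is showing you their mental model.

```python
import math


def estimate_bytes_on_wire(
    body_bytes,
    response_header_bytes=300,  # guess: HTTP response headers
    request_bytes=500,          # guess: HTTP request line plus headers
    mss=1460,                   # typical TCP payload per packet on Ethernet
    per_packet_overhead=40,     # 20-byte IPv4 header + 20-byte TCP header
    handshake_packets=3,        # SYN, SYN-ACK, ACK
    segments_per_ack=2,         # assumption: one ACK per two data segments
):
    """Back-of-the-envelope count of bytes needed to fetch one document."""
    response_bytes = body_bytes + response_header_bytes
    data_packets = math.ceil(response_bytes / mss)
    request_packets = math.ceil(request_bytes / mss)
    ack_packets = math.ceil(data_packets / segments_per_ack)

    packets = handshake_packets + request_packets + data_packets + ack_packets
    payload = request_bytes + response_bytes
    overhead = packets * per_packet_overhead
    return {"packets": packets, "payload": payload,
            "overhead": overhead, "total": payload + overhead}


if __name__ == "__main__":
    print(estimate_bytes_on_wire(100 * 1024))
```

The sketch ignores link-layer framing, connection teardown, retransmissions, and every image, stylesheet, and script the page pulls in, and each of those omissions is a good follow-up question.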

Isolation of a problem can be a tricky thing. If you have the luxury of being able to change the system and observe the effects of your changes, an easy way to approach the problem is to start from the known bad system run, turn that into a test case that you can run repeatedly, and then start lopping away stuff that you either assume is not related to the problem or have already verified is not related. You can tell how well you understand the problem by how good your choices are about what to slice off. If you repeatedly make cuts that either break the application or make the problem unexpectedly go away, then you should probably do some more digging before you do this kind of isolation.

Cutting away can take two forms: you can either simply skip a step that seems irrelevant, or replace something with a mock object (a topic that is too broad for this post) that gives you back some necessary data. Eventually, the goal is to cut away everything that is not relevant to illustrating the problem. This is usually some setup to get the right data structures, some actions to make them change in the correct way, and a comparison with the correct result (which will fail).
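
The lopping-away loop can be sketched in a few lines of Python. This is only a greedy, one-step-at-a-time reduction under the assumption that the failing run can be expressed as a list of steps and that you have a repeatable check for "the problem still shows up". The names `minimize_steps` and `still_fails`, and the step names in the example, are hypothetical.

```python
def minimize_steps(steps, still_fails):
    """Greedily drop steps one at a time while the failure keeps reproducing.

    `steps` is the known bad run; `still_fails(candidate)` must return True
    when the problem still shows up with only `candidate` steps executed.
    """
    if not still_fails(steps):
        raise ValueError("the starting run does not reproduce the problem")

    kept = list(steps)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if still_fails(candidate):
            kept = candidate   # the cut was safe: that step was irrelevant
        else:
            i += 1             # the cut hid the problem: keep the step
    return kept


if __name__ == "__main__":
    run = ["load_header", "load_sidebar", "fetch_rows", "render_rows", "load_footer"]

    def still_fails(candidate):
        # Stand-in for re-running the test case on the reduced system.
        return "fetch_rows" in candidate and "render_rows" in candidate

    print(minimize_steps(run, still_fails))
```

A cut that makes the problem vanish is not wasted effort here, since it tells you that step is part of the story.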

For instance, let's say that you have a PHP script that is generating bad output. The page you get back lays out in a completely unexpected way, and from eyeballing the HTML source, it's clear that you are not generating the expected HTML. First of all, debugging the problem is hindered by the fact that if you are omitting tags or adding strange attributes to things, the browser is covering up a lot of that in its attempt to be fault-tolerant. So you can cut away the browser and replace it with an HTML DOM parser that verifies the output is what you expect.
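
Here is one way the "replace the browser with a parser" step might look, sketched in Python with the standard library's `html.parser` rather than a full DOM parser. It assumes the page is supposed to close every non-void tag explicitly, which is stricter than what HTML actually requires, and the names `StrictTagChecker` and `find_markup_problems` are my own.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StrictTagChecker(HTMLParser):
    """Track open tags and record anything a browser would quietly repair."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def find_markup_problems(html):
    """Return a list of human-readable complaints; empty means balanced."""
    checker = StrictTagChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"unclosed <{tag}>" for tag in checker.open_tags]


if __name__ == "__main__":
    print(find_markup_problems("<div><p>fine</p><br></div>"))
    print(find_markup_problems("<div><p>missing close</div>"))
```

Unlike the browser, this check does no repair work of its own, so whatever the script really emitted is what gets reported.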

Next, all the pages have headers and footers and sidebars, which all lay out fine, as they do on every other page. So you cut the includes and such that create those parts. The DOM is still not what you would expect. Finally, you cut away some of the calls that insert database data into the layout and replace them with hardcoded text that approximates what you think is coming back from the database. Suddenly, the DOM is fine. You hadn't anticipated that the database data was the issue, but clearly it's relevant and you've cut too much.

So you then cut HTML layout generation from the testing completely, and just write tests to look at the values coming back from the database. It becomes clear that HTML has crept into certain fields unexpectedly and that is throwing off the layout. At this point, you have isolated the problem, and you have a simple, quick test that only does the thing necessary to illustrate the problem. This will greatly aid your Repair process, since it will be easy to verify any attempted fixes, and you can easily run the whole real world test case again when you get to the Validation stage.
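
The final, isolated test might be as small as the sketch below. It is written in Python against an in-memory SQLite database purely so that it runs on its own. The `articles` table, its columns, and the sample rows are all invented for the example, and the regular expression is a crude "looks like a tag" check rather than a real HTML detector.

```python
import re
import sqlite3

# Crude check for something shaped like an opening or closing tag.
TAG_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>")


def fields_containing_markup(rows, text_columns):
    """Return (id, column) pairs whose value contains tag-like text."""
    offenders = []
    for row in rows:
        for column in text_columns:
            value = row[column]
            if isinstance(value, str) and TAG_PATTERN.search(value):
                offenders.append((row["id"], column))
    return offenders


def demo_rows():
    """Build a throwaway database standing in for the real one."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, summary TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles (id, title, summary) VALUES (?, ?, ?)",
        [
            (1, "Plain title", "Plain summary"),
            (2, "Broken <div class='x'>title", "a < b is fine"),
        ],
    )
    rows = conn.execute(
        "SELECT id, title, summary FROM articles ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(fields_containing_markup(demo_rows(), ["title", "summary"]))
```

Pointed at the real data instead of `demo_rows()`, the same check is the quick test you rerun after each attempted fix.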

Debuggers are hindered by the lack of a language for talking about the stages of attacking a problem. When someone says, "I'm debugging that server crash", is it almost fixed? Do they know what the problem is but are unsure how to fix it? Do they even know what the problem is?

To address this problem, I am proposing the following six stages of debugging a problem:

Instantiation - A bug has been found, but it has not yet been clearly defined. In other words, someone has told you something is wrong, but the nature of the problem is not yet understood. The simple declaration of a bug is enough to get into this stage, and the bug remains instantiated until the verification process has begun.

Verification - After the bug has been instantiated, its existence must be verified. This means giving the bug the prima facie test: does the described behavior, on its face, actually constitute a bug? Many bug reports can be thrown out in this stage because they describe the expected behavior of the system (in which case the bug may be a request for change, or a simple misunderstanding), because they describe problems originating outside the application, or because they are so vague as to be impossible to fix, such as "System was slow". If the bug appears reasonable, the recreation stage is entered. Otherwise, it can be rejected outright or sent back to the creator for more information.

Recreation - The next stage is recreating the problem in some inspectable way. Originally, I wanted to call this stage "replication", but I don't want to overload that term. Some bugs don't have a natural "replication" mode, but can be recreated. For instance, "query performance is bad on query X". There is not much to replicate, other than to confirm that the problem exists as stated. However, in most cases, this stage will consist of the process of replicating the stated bug through a series of specific steps.

Isolation - This is the process of filtering out all the stuff that is not wrong, and reducing down to the point or points of failure. For many bugs, especially those that were easy to replicate, this is where the bulk of the work is spent. When isolation is complete, you should have a very clear understanding of what is wrong, and how to go about fixing it.

Repair - Once the bug has been isolated, one or more fixes must be applied. It may turn out that the isolation was incorrect, and in many cases, a debugging session will bounce back and forth between isolation and repair.

Validation - Finally, once the bug has been repaired, the fix has to be validated. In some instances, this stage will be trivial due to steps taken in the repair or isolation stages, such as when a test case that was used to isolate the problem now passes, or when a page refresh is all that is needed to see the improvement. In other cases, the fix must be tried in an operational setting, to verify that the thing that you fixed is the thing that was actually broken.

To recap:

  1. Instantiation
  2. Verification
  3. Recreation
  4. Isolation
  5. Repair
  6. Validation

So when someone asks you where you are with the server crash, you can now say, "I've verified the problem and am working on recreation", or "I've isolated the problem and I'm working on repair". This lets others better understand how much progress is being made, and improves communication with peers and with management.
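
If you wanted to carry these stages into a bug tracker or a status script, a minimal sketch in Python could look like the following. The `Stage` enum mirrors the list above. The `ALLOWED_MOVES` table is my own reading of the descriptions: treating "sent back for more information" as a return to Instantiation is an assumption, and outright rejection is left out entirely.

```python
from enum import Enum


class Stage(Enum):
    INSTANTIATION = 1
    VERIFICATION = 2
    RECREATION = 3
    ISOLATION = 4
    REPAIR = 5
    VALIDATION = 6


# Forward moves, plus the two backward moves described in the stage write-ups.
ALLOWED_MOVES = {
    Stage.INSTANTIATION: {Stage.VERIFICATION},
    # Assumption: "sent back for more information" returns to Instantiation.
    Stage.VERIFICATION: {Stage.RECREATION, Stage.INSTANTIATION},
    Stage.RECREATION: {Stage.ISOLATION},
    Stage.ISOLATION: {Stage.REPAIR},
    # A session can bounce between isolation and repair.
    Stage.REPAIR: {Stage.VALIDATION, Stage.ISOLATION},
    Stage.VALIDATION: set(),
}


def can_move(current, target):
    """True when the write-ups above describe a move from current to target."""
    return target in ALLOWED_MOVES[current]


def status_report(stage):
    """One-line answer to 'where are you with that bug?'."""
    return f"stage {stage.value} of {len(Stage)} ({stage.name.title()})"


if __name__ == "__main__":
    print(status_report(Stage.RECREATION))
    print(can_move(Stage.REPAIR, Stage.ISOLATION))
```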

During my long blogging hiatus, I've been up to a few things:

  • Kicking off my new business, Distance Software. The website isn't much to look at yet, but I've got the new logo (one is also coming for Distance Debugging shortly), and I'm working with that same designer to create the new site.
  • I've started a new Drupal setup to try to create a community debugging site/forum. There isn't much there yet, but I'm going to start porting over content that I've posted here, as well as new material. Check out FixIt!
  • Ported distancedebugging.com and distancesoftware.com over to new digs on a dedicated server at SuperbInternet.
  • Helping with the planning for BarCampMilwaukee. I had a lot of fun at last year's event, and I'm hoping to contribute a lot more time and energy this year. I'm going to have Distance Software sponsor, and I'm also running the BCM site here on my box to give it some bandwidth and horsepower. We'll see how it holds up!

Look for lots of new posts now that I'm back in the saddle.