When testing an application, slight differences between the test environment or usage pattern and the real system often lead us to discover "bugs" that would never happen under normal conditions. These bugs tend to be surprising because we wonder how the problem could have escaped our notice for so long or how it could have been introduced. Here are two examples of these bugs, followed by an explanation of how the "artifact" was created, identified, and resolved:

  1. I was working on a system where users had a foldering structure stored on the server. We were testing server performance, and were simulating a large number of users creating lots of folders over a long period of time. Things were getting slower and slower and it looked like there was a serious performance problem.
  2. Recently, I was working on porting an application from an older version of WebLogic to a newer version (9.2). We have a load testing rig that simulates the effects of many users calling the system over time. Everything was going smoothly with the port until we started ramping up our testing with the simulated clients. Each client used SSL to connect and a certificate to authenticate, and the server was supposed to keep track of who was authenticated for a given call so that their actions could be associated with them. Our test rig relied on multiple simulated users connecting from a single physical machine (a fairly standard practice for load testing), and when we tried it with the updated version, calls coming from the same machine suddenly seemed to have somewhat arbitrary credentials associated with them, as if the server code was not thread-safe and the authentication-related code was totally broken.

Now, the thrilling conclusion:

  1. We first had to separate the size of the data being created from other factors, such as the length of time since test initiation (since data sizes tend to grow as time goes on). When we looked at the actual data being created, we noticed that we hadn't set any limit on how many subfolders could be created for a given folder, and some folders wound up with thousands of child folders, something that was deemed very unlikely to happen in practice (and in fact it never has). We made a note of the fact that a performance problem could arise if a user chose to create a huge number of child folders, and changed our test rig to create deeper folder nesting rather than wider folder nesting, keeping the number of folders the same while avoiding an unlikely usage pattern.
  2. The original theory was that we had somehow failed to port our code to the new WebLogic version correctly, but this simply caused us to chase down a lot of dead ends. After putting in lots of extra logging on the server and writing a very simple, repeatable test to demonstrate the problem, we decided to start running only one client per physical machine to see if the problem still appeared. The problem disappeared in the multiple-machine test, and it became clear that the issue was related to running multiple clients on the same machine. At this point we were tired of dealing with the issue and accepted this workaround, since in practice we never had a situation where multiple clients would be connecting from the same machine and authenticating as different users. We still don't know if WebLogic somehow associates credentials with a particular IP address, and if so, whether there is some way to turn this off. To really verify this theory we would need to set up a machine with multiple IP addresses assigned to the same NIC, and somehow get different clients to use different IP addresses.
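
The change to the test rig in the first story can be sketched roughly as follows. This is not the rig we actually used; the function names and the breadth-first fill are assumptions made for illustration. The idea is simply to cap the number of subfolders per folder, so the same total number of folders comes out deeper instead of wider.

```python
from collections import Counter, deque


def build_folder_tree(total_folders, max_children):
    """Return {folder_id: parent_id} for a tree with a capped fan-out.

    Folder 0 is the root (its parent is None). Folders are attached
    breadth-first, and no folder receives more than max_children subfolders.
    (Hypothetical helper, not part of any real test rig.)
    """
    if total_folders < 1 or max_children < 1:
        raise ValueError("total_folders and max_children must be >= 1")
    parents = {0: None}
    open_folders = deque([0])  # folders that can still accept children
    child_counts = Counter()
    for folder_id in range(1, total_folders):
        parent = open_folders[0]
        parents[folder_id] = parent
        child_counts[parent] += 1
        if child_counts[parent] == max_children:
            open_folders.popleft()  # this parent is now full
        open_folders.append(folder_id)
    return parents


def max_fanout(parents):
    """Largest number of direct subfolders under any single folder."""
    counts = Counter(p for p in parents.values() if p is not None)
    return max(counts.values(), default=0)


def max_depth(parents):
    """Depth of the deepest folder, counting the root as depth 0."""
    depths = {0: 0}
    # Parent ids are always smaller than child ids, so one ordered pass works.
    for folder_id in sorted(parents):
        if folder_id != 0:
            depths[folder_id] = depths[parents[folder_id]] + 1
    return max(depths.values())


# Same number of folders, two very different shapes.
wide = build_folder_tree(1001, max_children=1000)  # the accidental test shape
deep = build_folder_tree(1001, max_children=10)    # closer to realistic usage
```

Comparing `max_fanout` and `max_depth` for the two trees makes the artifact visible: the folder count is identical, but only the wide tree exercises the "thousands of children in one folder" case.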

What's the moral of the story?

  • When you uncover a bug during testing that surprises you in that you would have expected to see it under production conditions, go back and verify that you are actually trying to do something that the production system does.
  • In the case where the bug is something that hinders your ability to test, but would have no effect on the actual system (as in bug #2 above), it can be a very tough call to determine how much energy to put into fixing it.
  • While testing for a broad range of conditions and situations can be beneficial for a system, especially in cases where you might anticipate a future problem (as in bug #1), you can also wind up plugging a lot of holes that won't ever leak.

I think my favorite artifact story is the one retold by Steve McConnell about a team trying to get better performance out of their OS using some profiler data:

Bentley also reports the case of a team that discovered that half an operating system's time was spent in a small loop. They rewrote the loop in microcode and made the loop 10 times faster, but it didn't change the system's performance -- they had rewritten the system's idle loop.

A method eating up 50% of the execution time sure looks like a nasty bug, but it was only an artifact of the system design. Keep this lesson in mind next time you see something so shocking.
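
One way to guard against this kind of artifact is to look at profiler data with idle time taken out before deciding what to optimize. Here is a minimal sketch, assuming the profile is available as a mapping of routine names to sample counts; the routine names and the numbers are made up for illustration.

```python
def busy_time_shares(samples, idle_names=("idle_loop",)):
    """Return each routine's share of non-idle samples.

    samples maps a routine name to the number of profiler samples seen in it.
    Routines listed in idle_names are dropped before shares are computed.
    (Hypothetical helper; real profilers report data in their own formats.)
    """
    busy = {name: n for name, n in samples.items() if name not in idle_names}
    total_busy = sum(busy.values())
    if total_busy == 0:
        return {}
    return {name: n / total_busy for name, n in busy.items()}


# Made-up profile: half of all samples land in the idle loop.
profile = {"idle_loop": 500, "scheduler": 200, "disk_io": 300}
shares = busy_time_shares(profile)
```

With the idle loop excluded, the remaining routines are the only candidates left, which is where any real speedup would have to come from.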

In the late 90s, I started hearing a lot about Linux and wanted to give it a shot. A friend of mine had the CDs for Red Hat 4-point-something and he lent it and a copy of his 2.5-inch thick Red Hat Linux Unleashed (or some such title) book to me, and I undertook a project that would ultimately change my life as a hacker and a computer scientist: I tried to get Linux installed on my "state-of-the-art" Pentium II computer. Installing Linux back then, while I'm sure it was light-years ahead of where many practitioners started, was still difficult enough that I actually had to learn something about computers to make it happen. The whole process took a couple of weeks, filled with reading, research, trial and error, and ultimately, it was the beginning of a process of knowledge and skill acquisition that continues today.

Here are a few of the things that I learned in those weeks:

  • Despite having an installation process that walked me through the steps necessary to install and configure the system, there were many questions along the way. How did I want my disk partitioned? How much swap space? Which packages did I want installed? How did I want my network configured? Each question sent me off on an investigation about the possibilities, and the advantages and drawbacks of each option.
  • I wanted a dual-boot system since I wasn't really ready to abandon my Windows 95 use yet given it was all I really knew. This meant learning about the boot sequence, boot loaders, the MBR, lilo, disk partitioning, and even a little about disk geometry.
  • Of course, I wanted an X Windows-based system so that I could run graphical apps. Back then, autodetection of video card and monitor settings was dicey. Getting it up and running meant learning about video timings, how to modify the XF86Config file, and reading the arcane spec sheets that came with my monitor and video card to find a compatible setting so that it actually started up.
  • Once I had gotten things installed, the next step was figuring out what I could do with the thing. I had some familiarity with a CLI from my DOS days, but a real shell is a little different. Even simple things such as "how do I run a program?" turned out to be tricky, and learning that I needed to prepend "./" to run a command in the current directory was a revelation.

Like most things, I got better with experience. Pretty soon I was figuring out how to burn CDs (and understanding file mounting, CD disc structure, and some basics of device drivers), download and compile software using the configure/make/make install pattern, and much more. Within a year I was setting up a Linux box to provide NAT for my cable modem (I was lucky to have access to an early cable modem service), meaning I learned a ton about the nitty-gritty of real world networking, iptables (it was probably ipchains then), and how to set up and configure a small home network including local DHCP and DNS services, port forwarding, and much more.

The freely available nature of Linux and its subcomponents, coupled with the vast resources of documentation and community, means that any self-motivated person can spin up on pretty much any technical topic and actually try out a working implementation to get a feel for how the idea plays out in practice. In the early days of computing, it was this way for the vast majority of users, with the tradeoff being that there was no easy alternative for non-technical users. Since the rise of Windows and Mac as the dominant operating systems, this option has been hidden or even taken off the table for many up-and-coming programmers.

The fact that you have to "think" to use Linux has been criticized, but few people seem to note the danger of having many developers learn on an operating system that requires little or no thought. Whether or not you agree with the sentiment that Linux makes you think too hard, there is no reason we can't have one OS for the "masses" and another for developers who actually care about what is going on in their system. So this isn't so much a criticism of Windows as it is a criticism of the complacency of computer science programs and software development shops in accepting Windows as their standard platform.

My advice to young programmers is, rather than always working with software that doesn't make you think, spend some time with some that does. Like learning a new programming language, it's a great way to expand your knowledge of computer science concepts, as well as develop important problem-solving skills. You will be surprised at what you don't know when the configuration tools and installation wizards are stripped away.

I worked as a computer administrator for a small Mac-based network during my college years. Things ran fairly smoothly most of the time, but one event sticks out in my head from my time there. I was sitting at my desk doing some routine maintenance when one of the staff ran up to me saying, "One of the students said that the computer she was working at has a virus!" Panicked and fearing that I'd forgotten to update the virus definitions or otherwise fallen asleep at the switch, I rushed over to the computer in question. Nothing immediately appeared out of the ordinary, but my first self-preservation instinct was to yank the network cord out of the back, shut all non-essential programs down, and run a fresh virus scan to see what we had been infected with. I paused though, because something didn't seem right.

It occurred to me: what did this student see that made them think there was a virus infection? I know what one looks like because I've had to fix infected computers and have seen the bizarre unkillable processes, random pop-up windows, sluggishness, etc. However, most people, when they think of a virus, picture a giant window popping up, critters dancing around the screen, and giant text reading "You have gotten the PDQ virus! I will now delete all your files!" I wish that viruses so readily advertised their presence, as it would save me a lot of time. While considering what might have conveyed the presence of a virus, I glanced at the browser window the student had left open, showing a page with a banner ad at the top. The banner was flashing blue and purple and said "YOUR COMPUTER IS INFECTED WITH A VIRUS!!!". With a chuckle, I explained what had happened to the staff and went back to my normal routine.

Distance debugging often means that you have to take someone else's account of a situation. It can be easy to forget that you are working from second-hand data and from someone else's interpretation of what they observed. What can you do to help understand others' perspectives and observations?

  1. Be very aware of the "layman's" terminology in common technical domains in order to help clarify seemingly bizarre support requests. For instance, I've noticed that many people use a kind of synecdoche and say "Internet" where they mean "Web", as in saying "the Internet is down" to indicate that they can't get to websites.
  2. When working with other technical people, think about their background and biases. Do they likely know what they are talking about in the domain they are working in? What evidence do you have that their mental model of the system matches or does not match your own? Are they naturally distrustful of certain applications or systems? Would there be any reason for them to obfuscate or otherwise manipulate the information being presented (for instance to cover their own or another's mistake)?
  3. Reflect on communications negotiations, successes, and failures.  Did you successfully solve a problem because you looked at it through another's eyes?  Were you able to translate from their description to a correct representation of the problem?  Did you get frustrated? Did you miss critical details? What words were used that might be useful to file away in a "translation guide" for dealing with a particular individual or a class of individuals the next time?
  4. Be careful of chronology.  We have a tendency to forget when a certain piece of knowledge became known to a certain person, and can either come to erroneous conclusions or dismiss valid ones by saying, "They wouldn't have done X because they knew Y", when in fact Y couldn't have been known by them at the time.

Cultivating theory of mind skills will not only serve you well in a debugging setting, but can help in almost any interpersonal situation.  Most of the time, we take for granted our ability to consider the minds of others but when we fail to do so, we risk making serious errors in judgment.

Let's say that you want to win the lottery (who doesn't?). You have two problems that need to be overcome:

  1. You have to guess the right numbers.
  2. Even if you guess right, you have to split the jackpot with a bunch of other lucky folks who also guess right.

Most people only worry about part 1, but if you are going for the best expected value, which is a combination of the odds of winning and the payoff, then you need to worry about 2. For instance, you would probably rather be the sole winner of a $200 million jackpot than be forced to split it 20 ways (even though $10 million is still a ton of money).

The ironic part is that there is little or nothing that you can do about 1, but you can do something about 2. Here's a simple tip: play numbers greater than 31. A common strategy is playing birthdays as one's numbers because they are easily mapped to lottery selections and perhaps in some primordial way, they provide a symbolic "offering" of your loved ones to the lottery gods. Assuming every combination of numbers has the same odds of winning, you might as well play some numbers that are unlikely to be played by others so that if you win, you reduce the likelihood of having to split the payout. Since birthdays must consist of numbers that are less than or equal to 31, you increase your expected value with some well-chosen numbers.
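
To put rough numbers on this, here is a small sketch of the expected-value argument. It assumes that the number of other winning tickets follows a Poisson distribution, which is a modeling assumption of mine and not something the lottery tells you; the jackpot, odds, and co-winner rates used below are made up for illustration.

```python
import math


def expected_share(jackpot, expected_other_winners):
    """Expected payout given that your ticket wins.

    Assumes the number of other winners K is Poisson-distributed with mean
    expected_other_winners, and the jackpot is split evenly among 1 + K
    winners. For a Poisson K with mean lam, E[1 / (1 + K)] = (1 - e^-lam) / lam.
    """
    lam = expected_other_winners
    if lam < 0:
        raise ValueError("expected_other_winners must be non-negative")
    if lam == 0:
        return float(jackpot)  # nobody to share with
    return jackpot * (1 - math.exp(-lam)) / lam


def ticket_expected_value(win_probability, jackpot, expected_other_winners):
    """Expected jackpot payout of a single ticket under the same assumptions."""
    return win_probability * expected_share(jackpot, expected_other_winners)


# Hypothetical numbers: same odds of winning, different popularity of picks.
odds = 1 / 100_000_000
popular_picks = ticket_expected_value(odds, 200_000_000, expected_other_winners=2.0)
unpopular_picks = ticket_expected_value(odds, 200_000_000, expected_other_winners=0.5)
```

The odds of winning are identical in both cases; only the expected number of people you would share with changes, and that alone makes `unpopular_picks` larger than `popular_picks`.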

What does this have to do with debugging? It illustrates that you can often improve your chances of success, in addition to saving time and energy, by remembering that computers are built, programmed, and used by people. Psychologists use the term "Theory of Mind" to describe people's ability to conceive of others as having mental states, intentions, beliefs and so on. The minds of those involved in the system, particularly users, should be taken into account when trying to understand reported problems.

Tomorrow: Using assumptions about mental states to fix things faster

I found a nifty drop-down menu that I want to use to conserve space, but right now everything is a little wonky so bear with me. Eventually I'm going to compress everything down, remove the bottom boxes, and create more space for the content, which I think will be easier on the eyes. Until then, there is some duplication and the menu is hard to navigate, so please accept my apologies.

Follow up: I've finished the main revision and I'm very satisfied with the results. I think there is less screen clutter, at the expense of less information. If you have problems viewing or have other suggestions, please leave them in the comments.