What does it mean to say that code must work all the time?

Everyone knows that we are becoming more and more dependent on computer software. Our lives can be upended (or even ended) by software malfunctions in many different areas, including avionics, medical, financial, and automotive software. Even a program as small as an alarm application on a smartphone may cause havoc if an alarm fails to go off. Generally, when we buy products of any kind, we expect them to work. If they don’t work, we expect to get them fixed under warranty, and if the failure is fundamental, we expect to see the manufacturer face consequences for producing faulty merchandise.

So it seems reasonable to take the attitude that code we depend on should work all the time. Otherwise, how can we depend on it? Although this seems an obvious principle, in practice we tolerate serious lapses in software all the time. I have been keeping an informal database of news articles that use the word “glitch” to describe serious software failures. A software “glitch” caused hundreds of dangerous prisoners to be released in California. A software “glitch” caused a major bank in England to seriously mess up the accounts of hundreds of thousands of customers. A software “glitch” caused the New York Times to solicit all their customers with an offer of a new subscription for those who had never subscribed before. The list goes on and on, ranging from the merely amusing to the seriously worrying. The word “glitch” is an interesting one. Its normal meaning is some minor error, with an implication that it could not easily have been avoided. The online definition at dictionary.com is of particular interest:

a defect or malfunction in a machine or plan.
Computers. any error, malfunction, or problem.

So isn’t that interesting? Normally a glitch is a defect or malfunction, but computers are special: any problem at all is to be referred to as a glitch. Indeed, this is the typical usage that I see in newspaper articles.

These articles feed a viewpoint that expects all big computer programs to have problems in them. I myself argued this position as an expert witness in a law case: “Yes Judge, this operating system had lots of serious problems, but Judge, that’s industry standard practice.” Well, that was decades ago, and there is quite a difference between a preliminary beta release (as was the case at issue here) and a final release. Nevertheless, today millions of people are using operating systems on PCs that routinely crash and burn, and people have come to expect this level of unreliability. In one infamous case in the navy, there was an inquiry into a software problem that had disabled the computers shipwide. The conclusion was that an application had made an illegal call to the operating system, and the operating system was exonerated on the grounds that it was not at fault.

I have even heard an eminent lawyer argue that product liability statutes needed to be rewritten specially for computer software, because they were drawn up on the assumption that it was reasonable to require manufacturers to produce safe, reliable products, and that was obviously impossible for software.

I have two thoughts here. First, the viewpoint that it is impossible to write software that works all the time is clearly contradicted by experience. We write avionics software that is remarkably reliable, and we all entrust our lives to such software when we travel by air. Well OK, it doesn’t work all the time, and we have had some failures, none of which have caused loss of life, but it works well enough that it is not the weak link in the chain, and that’s really what we are after. Nothing is 100% perfect, but in a product with many components, only some of which are software, there is no reason to tolerate software that is significantly less reliable than the other components.

Second, it is this tolerance that is in my view at the root of our difficulties. In the movie Network, there is a memorable quote: “I want you to get up right now and go to the window. Open it, and stick your head out, and yell, 'I'M AS MAD AS HELL, AND I'M NOT GOING TO TAKE THIS ANYMORE!'” I feel the same way about tolerating unreliable software. Now it’s all very well to make statements like this if the technology exists, but if it’s impossible, then we are in a position like that of an extreme environmentalist demanding we switch to all renewable energy sources RIGHT NOW! But in the software case, we do have the technology for generating reliable code. For example, the DO-178 standard used for writing avionics software is well established, and more recently we have been making great advances in understanding how to use formal mathematical techniques for proving that software behaves as we expect it to. Do these approaches add to the cost of producing software? Probably so, if all you count is the effort of rolling the first version out of the door. But if you include the horrendous costs associated with serious malfunctions later on in the product cycle, then a strong argument can be made that using such techniques saves money. Using such techniques may also make it harder to get new versions of software out of the door. But do we really need to replace working software with a new version once a year? I bet people would be happier with software that always worked and did not get updated quite so often with new fancy features!

There are many areas in life where we get what we deserve. I suspect that our tolerance of junk software leads to another instance of this phenomenon.