The Limits of Testing in Safe Systems

Nov. 11, 2011
Unfortunately, testing is no longer adequate to ensure the dependability of today's multi-threaded systems. When we design a safe system we must begin by defining sufficient dependability.

Figure 1: How faults become failures
Figure 2: A simple system with a fault
Figure 3: A simple two-threaded program

Traditionally, proofs that software systems meet safety standards have depended on exhaustive testing. This method is adequate for relatively simple, deterministic systems with single-threaded, run-to-completion processes. Unfortunately, testing is no longer adequate to ensure the dependability of today's multi-threaded systems. Though these systems are deterministic in theory, their complexity forbids our treating them as deterministic systems in practice.

From fault to failure

When we build a safe system, we must begin with the premise that all software contains faults and these faults may ultimately lead to failures.

Failures are the result of a chain of circumstances that start with a fault introduced into a design or implementation. Faults may lead to errors, and errors may lead to failures, though, fortunately, many faults never lead to errors, and many errors never cause failures. Table 1 describes faults, errors and failures.

Table 1: Faults, errors and failures
Fault: A mistake in the design or code. An example might be a protocol design that permits a deadlock to occur. Within code, specifying an array as int x[100] instead of int x[10] would be a fault, although it is unlikely to lead to an error.
Error: Unspecified behavior caused by a fault in the design or code. A fault in the design of the protocol might result in a deadlock during execution, and the recovery might cause a message that was in transit to be lost.
Failure: A failure to satisfy one of the safety claims about the system, due to an uncontained error. The loss of a message (an error) might be harmless, or it might become the direct cause of a hazardous situation.
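To make the distinction concrete, here is a small, self-contained C sketch of the table's code-level example (the constant names and values are ours, not the article's): a fault that sits harmlessly in the code because it never produces an error.

    /* Hedged illustration of Table 1: a fault that never becomes an error. */
    #include <stdio.h>

    #define EXPECTED_READINGS 10        /* the design calls for 10 readings */

    int main(void)
    {
        /* Fault: the array is declared with 100 elements instead of 10.
         * Because it is larger than needed, every valid access still
         * succeeds, so the fault never surfaces as an error. */
        int x[100];

        for (int i = 0; i < EXPECTED_READINGS; i++)
            x[i] = i;

        /* Had the fault gone the other way (say, int x[5]), the loop above
         * would write past the end of the array: an error that might, or
         * might not, go on to cause a visible failure. */
        printf("last reading: %d\n", x[EXPECTED_READINGS - 1]);
        return 0;
    }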

Figure 1, adapted from James Reason's Human Error (Cambridge UP, 1990), illustrates how faults at different points in the development cycle can eventually lead to a failure. We have subdivided the defenses to match the two layers in software: pre-shipment and post-shipment defenses (with attendant holes). Pre-shipment defenses are the validation and verification activities carried out before deployment, while post-shipment defenses are built into the system itself and activated to protect it during use. The causes of every failure can be traced back, at least in theory, to a lacuna at each stage.

When we build a safe system, we cannot prove that the system contains no faults (see "Inherent limitations of testing" below). What we can do, though, is demonstrate that errors in the system will not cause it to fail more often or for longer than the limits we claim. Or, to put it more directly, we can provide evidence to support our claims that our system will be as dependable as we say it is.

Sufficient dependability

A safe system is a system that is sufficiently dependable. Because it is impossible to design a system that both performs some useful function and is 100% safe, when we design a safe system we must begin by defining sufficient dependability. A system can be 100% dependable only if it performs no useful function. For example, we could design a train control system that ensures the train never crashes. Unfortunately, this train would also never move, and would therefore have no useful function, at least not as a train.

A system's dependability is its ability to respond correctly to events in a timely manner, for as long as required; that is, it is a combination of the system's availability (how often the system responds to requests in a timely manner) and its reliability (how often these responses are correct). Whether reliability or availability is more important for safety depends on how and for what the system is used.

Sufficient dependability is a precise expression of the criteria against which the system's dependability is to be measured. These criteria must stay clear of facile marketing claims of the five-nines sort: available 99.999% of the time; ergo completely dependable except for five minutes 16 seconds of the year. This type of claim is meaningless unless we offer more information about how failures are distributed throughout the year.

If we make a five-nines claim about a flight control system on an airliner, for instance, our claim has very different implications if the five minutes 16 seconds (0.001% failure) occurs all at once, or if it is spread across one million distinct instances of 316 microseconds (also 0.001% failure). Table 2 below shows some examples of possible implications of the phrase "five-nines availability" for a flight control system.

Five minutes, 16 seconds can mean a catastrophic failure, while one million distinct instances of 316 microseconds may have no effect on the system's dependability and may even go completely unnoticed. In fact, the airliner may well tolerate a flight control system that guarantees only four-nines availability (99.99%), if unavailability is distributed over one million instances of 3.16 milliseconds per year separated by sufficiently long periods of availability. We should also note the duty cycle of the software: few flight control systems run for more than 20 hours at a stretch, after which they can be restarted, forcing rejuvenation.

Table 2: Five-nines availability as it might affect a flight control system

Failures per year | Duration of each failure
1 | 5 minutes 16 seconds
10 | 32 seconds
100 | 3.2 seconds
1,000 | 316 milliseconds
10,000 | 32 milliseconds
100,000 | 3.2 milliseconds
1,000,000 | 316 microseconds
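The arithmetic behind Table 2 is easy to reproduce. The short C sketch below (ours, assuming a 365.25-day year) divides the annual five-nines downtime budget among the failure counts in the table:

    /* Reproduce the arithmetic of Table 2: the same 0.001% of annual
     * unavailability spread over different numbers of failures per year. */
    #include <stdio.h>

    int main(void)
    {
        const double seconds_per_year = 365.25 * 24.0 * 3600.0;            /* ~31.56 million s */
        const double unavailability   = 1.0 - 0.99999;                     /* five nines */
        const double downtime         = seconds_per_year * unavailability; /* ~316 s */
        const long   failures[]       = { 1, 10, 100, 1000, 10000, 100000, 1000000 };

        printf("Total downtime per year: %.0f seconds\n", downtime);
        for (size_t i = 0; i < sizeof failures / sizeof failures[0]; i++)
            printf("%8ld failures/yr -> %.6f seconds each\n",
                   failures[i], downtime / failures[i]);
        return 0;
    }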

In addition to helping ensure that a system is as dependable as safety requires, a precise expression of sufficient dependability helps us manage the cost of designing and developing the system, because it allows us to build the system so that it is just sufficiently dependable. We do not waste time and effort increasing, for example, reliability beyond the safety and performance needs we have identified.

Thus, a careful and comprehensive definition of dependability requirements serves a dual purpose. First, it provides an accurate measure against which a system's safety can be validated. Second, by clarifying what is indeed functionally required, it eliminates vague (and therefore meaningless) requirements, and removes from the project bill the effort and cost of trying to meet these requirements.

Inherent limitations of testing

Testing is designed to detect faults in the design or implementation indirectly, by uncovering the errors and failures that they can cause. Testing is of primary importance in detecting and isolating Bohrbugs: solid, reproducible bugs that remain unchanged even when a debugger is applied.

Even in that limited capacity (that is, ignoring Heisenbugs, which by definition are not reproducible), testing has an inherent limitation that we ignore at our peril: No matter how simple the system, testing can never prove the absence of errors.

Testing can only reveal the presence of errors. While it can increase our confidence in our system, it cannot provide convincing evidence that the system is fault free.

To illustrate this point, we can look at a simple system designed to control an elevator and its doors in a three-story building. This building contains a single elevator. On each of the building's three floors a door allows people to step in and out of the elevator. A central elevator controller sends signals to the cage to cause it to move up or down.

To keep our example simple, we have our elevator controller ignore requests from the building occupants to come to their floor: if this were a real building, the call button next to the elevator might, as we've all suspected at some time, have no effect whatsoever on when the elevator arrives, but lights up just to make us feel good. Also to keep things simple, our controller doesn't check whether the elevator doors on each floor are open or closed.

Figure 2 shows a state diagram for our elevator controller, which, in fact, contains a fault that may not be immediately apparent (adapted from B. Bérard et al., Systems and Software Verification. Berlin: Springer, 2001). This controller is single-threaded and simple enough for us to uncover the fault through testing, as long as we test the right state transitions.

A potential condition we might wish to test is that no door should open or remain open unless the elevator is at the floor with that door. This condition could, in principle, be tested, but it would be tested only if we have thought of the dangerous condition. If we do not think of this condition, we cannot test for it, and someone just might find an open door and fall down the elevator shaft. Further, even if we design the system so that doors open only when the elevator is at the appropriate floor, the system as we have designed it here fails to ensure that someone does not get stuck in the elevator.

If, for example, we get on the elevator on the bottom floor, the controller can send us on an endless journey from floor to floor. The elevator doors need never open. To save ourselves, we would have to find a way to get someone outside the system either to inject an !open instruction when the elevator reaches a floor and before the controller issues another !up or !down instruction, or to force the controller to issue an !open instruction after n ups and downs, in much the same way that a telecommunications network drops packets that cannot be delivered after n hops.
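As a rough sketch only (the controller in Figure 2 is not reproduced here, and the names and the limit below are our assumptions), the second remedy amounts to a counter that forces an !open after a bounded number of moves, just as a network discards a packet after n hops:

    /* Sketch: force an !open after MAX_MOVES consecutive !up/!down commands. */
    #include <stdio.h>

    #define TOP_FLOOR  2    /* floors 0..2 in a three-story building */
    #define MAX_MOVES  5    /* assumed limit "n" on consecutive moves */

    enum command { CMD_UP, CMD_DOWN, CMD_OPEN };

    /* Placeholder for the real controller's scheduling logic; this one
     * always wants to keep the cage moving, never opening a door. */
    static enum command next_command(int floor)
    {
        return (floor < TOP_FLOOR) ? CMD_UP : CMD_DOWN;
    }

    int main(void)
    {
        int floor = 0;
        int moves_since_open = 0;

        for (int step = 0; step < 20; step++) {
            enum command cmd = next_command(floor);

            /* Guard: after MAX_MOVES moves without an !open, force one so a
             * passenger cannot be carried up and down indefinitely. */
            if (cmd != CMD_OPEN && moves_since_open >= MAX_MOVES)
                cmd = CMD_OPEN;

            switch (cmd) {
            case CMD_UP:   floor++; moves_since_open++; printf("!up   -> floor %d\n", floor); break;
            case CMD_DOWN: floor--; moves_since_open++; printf("!down -> floor %d\n", floor); break;
            case CMD_OPEN: moves_since_open = 0;        printf("!open at floor %d\n", floor); break;
            }
        }
        return 0;
    }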

What we learned from Ariane 5

The elevator controller scenario described above underlines one of the key challenges of verifying even very simple systems: testing can reveal the presence of faults only if we have anticipated these faults. The real-life case of the first Ariane 5 launch illustrates this point perfectly.

Thirty-seven seconds after it was launched on June 4, 1996, the European Space Agency's (ESA) new Ariane 5 rocket rained back to earth in bits and pieces. This incident has become one of the best known instances of software failing, even though it had been exhaustively tested and even field-proven (in this case, more accurately, sky-proven). The Ariane 5's Inertial Reference System (SRI, Système de Référence Inertiel) had originally been designed and tested in the rocket's predecessor, the Ariane 4. But as the subsequent investigation revealed, the SRI could not function correctly in its new context, and no one had thought to test the effect of the Ariane 5's greater acceleration on the SRI.

Fortunately, the Ariane 5 incident did not cause any fatalities. It did cost the ESA some US $370 million and a great deal of embarrassment, however. On the other side of the balance sheet, this incident provided a dramatic demonstration of the limitations of testing. In the long run it may lead indirectly to saved lives and great savings, because it led to increased research into other means of proving that a software system meets its safety requirements.

The value of x

To further illustrate the point that testing cannot prove the absence of faults, while writing this article I have been running the simple two-threaded program shown in Figure 3 (adapted from Mordechai Ben-Ari, Principles of the SPIN Model Checker. Springer, 2008). In this program, two threads increment the global variable x. On first examination of the code, it appears that when the program finishes, x will always have a value between 10 and 20, its exact value depending on how the threads interleave during their execution. I have tested this program 10,000 times and counting, and so far the results have confirmed this observation: each time the program has ended, the value of x has been between 10 and 20.
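Figure 3 itself is not reproduced in this article. The C sketch below, modeled on Ben-Ari's example (the variable names, the pthread packaging, and the iteration count of ten are our assumptions), shows the kind of program under discussion: each thread reads x into a local temporary and writes the incremented value back, with no synchronization.

    /* Two unsynchronized threads each increment a shared counter ten times. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 10

    volatile int x = 0;               /* shared, deliberately unprotected */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            int temp = x;             /* read */
            x = temp + 1;             /* write back (not atomic with the read) */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Naive expectation: 10 <= x <= 20. Rare interleavings can leave
         * x as low as 2, as discussed below. */
        printf("x = %d\n", x);
        return 0;
    }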

In fact, for many years, this very example was used as a class exercise to teach students about threading, until an astute student discovered an error: x can end up with values as small as 2. Now, the same example is used to demonstrate the human inability to understand the complexity of multithreaded code, and the inability of testing to detect unlikely but possible cases. Who, in practice, would run 10,000 tests on such a trivial piece of code? And there is absolutely no guarantee that any amount of testing would reveal the fault.
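For the curious, one interleaving that drives x down to 2 (assuming the read-then-write loop body sketched above) runs roughly as follows:

1. Thread A reads x (0) into its temporary and is preempted before writing.
2. Thread B runs nine complete iterations, leaving x = 9.
3. Thread A resumes and writes its stale temporary plus one, so x = 1.
4. Thread B starts its tenth and final iteration, reads x (1) into its temporary, and is preempted.
5. Thread A runs its remaining nine iterations, leaving x = 10.
6. Thread B resumes and writes its stale temporary plus one, so x = 2.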

Testing of non-deterministic systems

The examples above show how testing can fail to uncover even good, solid and reproducible Bohrbugs in simple, deterministic systems. What can be said, then, of testing for more complex systems that are, in practice, non-deterministic?

Theoretically, any software system is deterministic; that is, the number of states it may assume and the number of possible transitions between these states are finite. In practice, however, software systems simple enough to be treated as deterministic belong to the distant past, if they ever existed at all. In fact, the Engineering Safety Management Yellow Book 3, Application Note 2: Software and EN 50128, published by Railway Safety on behalf of the UK railway industry, even goes so far as to state that "if a device has few enough internal stored states that it is practical to cover them all in testing, it may be better to regard it as hardware" (Application Note 2: Software and EN 50128. London: Railway Safety, 2003, p. 3).

Software systems are now so complex that in practice we cannot know or predict all their possible states and state changes. Multi-threaded systems running on multi-core systems have so many possible states and state transitions that we must treat them as non-deterministic systems. They cannot be exhaustively tested, and of necessity testing must be a statistical activity.

Because today's systems contain such a large number of states and trajectories through the state space, most of the bugs that remain in the code after module testing are not Bohrbugs, but Heisenbugs, which we cannot reproduce because we can determine neither the precise (multi-dimensional) state that lay at the start of the error, nor the trajectory from that original state to the state that directly caused the failure.

If not testing, then what?

What, then, are we to do? First, we should step back and take heart, noting that software is not always the guilty party. Though it must take the blame for many costly and highly publicized failures, such as the first Ariane 5 launch and the Mars Global Surveyor failure reported in 2007, software can also take a little of the credit for some spectacular successes. For instance, when US Airways Flight 1549 lost power in both its engines in January 2009, the flight crew was able to ditch in the Hudson River because the flight control software continued to function correctly and allowed them to control the plane. Hardware (the engines) had failed, but the 155 people on the Airbus survived. They owe their lives to the crew's cool heads and decisive actions, and to a well-designed and sufficiently dependable flight control system.

Second, we should remember that while testing is not sufficient for validation, it is a necessary part of our tool set. We can learn a lot through testing, especially when we apply statistical analysis to our test results and use techniques such as fault injection to estimate the faults remaining in a system and to observe how our system behaves under fault conditions: Are the safe shut-down or recovery processes correctly invoked? Are component failures properly contained and the overall system adequately protected?
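As a toy illustration of the fault-injection idea (the function names and rates below are invented for the sketch, not taken from any particular tool), a wrapper can corrupt a small fraction of results so that we can watch whether the caller detects and contains the error:

    /* Toy fault injection: occasionally corrupt a sensor reading and count
     * how often the caller detects and contains the injected fault. */
    #include <stdio.h>
    #include <stdlib.h>

    static int read_sensor(void)              /* stand-in for a real device read */
    {
        return 42;
    }

    static int read_sensor_with_injection(int rate)
    {
        int value = read_sensor();
        if (rate > 0 && rand() % rate == 0)
            return -1;                        /* injected fault: implausible value */
        return value;
    }

    int main(void)
    {
        srand(1);                             /* fixed seed: repeatable test run */
        int detected = 0;

        for (int i = 0; i < 1000; i++) {
            int v = read_sensor_with_injection(100);
            if (v < 0)                        /* caller's plausibility check */
                detected++;                   /* contained: detected and discarded */
        }
        printf("detected and discarded %d injected faults in 1000 reads\n", detected);
        return 0;
    }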

And, finally, we should incorporate other tools into our arsenal: quality management processes, fault-tree analysis using methods such as Bayesian belief networks, techniques such as static analysis and design validation that detect faults directly, and so on. It is worth noting that for the simple two-thread example above (which, incidentally, contains more than 77,000 states), formal design validation using a technique such as linear temporal logic (LTL) model checking can immediately find a sequence of 90 steps that leads to x finishing with a value of 2.
