Protect Software Components From Interference In ISO 26262 Systems

Automobile safety depends on the correct operation of software systems that are built from many different components. For good design, these components must be isolated, on multiple axes, so they don’t inadvertently interfere with each other. A formal set of standards and guidelines defines design ground rules and procedures to achieve this goal.

The ISO 26262: Road vehicles — Functional Safety standard¹ provides explicit guidance concerning interference, with special attention to the risks associated with interference between components having different automotive safety integrity levels, or ASILs (Fig. 1). It describes mechanisms to reduce these risks, suggests how partitioning must be handled, and specifies how the ASILs of composite systems must be calculated.

1. A typical digital instrument cluster shows information coming from subsystems (components) that provide different levels of dependability.

Interference

Many systems built in accordance with ISO 26262 include components developed to different ASILs. Incorrect behavior in one component could interfere with another component and lead to a violation of safety requirements. Such a safety-requirement violation occurs not just when a lower-ASIL component interferes with a higher-ASIL one, but also when two components with the same ASIL interfere with one another—or even when a higher-ASIL component interferes with a lower-ASIL component. To help manufacturers avoid having to develop all components to the standard of the highest ASIL in the system, ISO 26262-6, paragraph 7.4.10 permits the use of software partitioning.

There are different types of interference, depending on whether the components are actively co-operating or are actually meant to be independent. As this (partial) interference list shows, one component may:

Rob another component of system resources (file descriptors, memory, mutexes, etc.)
Rob another component of processing time, preventing it from completing its tasks on time
Access the private memory of another component. In the case of read access, this may be a security breach that could lead to a safety problem later; in the case of write access, this could immediately create a dangerous situation
Corrupt shared data, causing another component to behave in an unsafe manner
Create a deadlock or livelock with another component with which it is cooperating
Break its contract with a cooperating component by “babbling” (sending messages at a high rate or repeating messages) or sending messages with incorrect data

Isolation Axes And Techniques

Isolation along various axes is a fundamental strategy for protecting against interference (Fig. 2).² No single form of isolation is sufficient (see the table).

2. Partitioning functions along independent isolation axes greatly reduces the likelihood of interference among constituent hardware and software components.

In general, it’s best to isolate as many components as possible, using a variety of design and implementation techniques. These include use of an appropriate operating system (OS) architecture, rate- and deadline-monotonic scheduling, adaptive time partitioning, data diversification, anomaly detection, unsupervised learning, and progress monitors.

In a microkernel OS, components such as the file-system, device drivers, and network stack run in their own address spaces, isolated from each other (Fig. 3).³

3. A typical microkernel OS architecture gives each functions an isolated memory space to minimize interaction and the impact of individual problems.

Rate- and deadline-monotonic scheduling both refer to scheduling policies and tools that can be used in simple systems to ensure that processes will meet their real-time deadlines. In a real-time system it is incorrect to assign priorities according to a process’s “importance,” so other methods must be used. With rate-monotonic scheduling, for example, the processes with greater execution rates receive the higher priorities.³ It can then be mathematically proven for some systems that real-time deadlines will be met.

Adaptive time partitioning can ensure that processes are not starved of CPU cycles, while also making sure that system resources aren’t wasted. It assigns minimum levels of processor time to groups of threads to use if these threads need it (Fig. 4). This technique reduces the work required to ensure that processes aren’t starved of cycles, and it can be applied in complex systems where rate- or deadline-monotonic scheduling cannot be used.

4. Adaptive time partitioning dynamically adjusts the CPU resource minimum and maximum allocation to match individual process requirements so they are not starved of cycles or reserving an excessive amount.

Data diversification is the storing of data in different semantic ways.⁴ It provides an additional level of confidence in the integrity of the data, based on the premise that, while one algorithm may give a wrong result, it is unlikely that two different algorithms will give the same wrong result.

Predictor-corrector methods such as Kalman filters have traditionally been used for anomaly detection in external sensor inputs. These methods are less well suited for detecting anomalies in system variables, where there is rarely random noise on a value. However, they are useful for monitoring volatile resource usage.

As better unsupervised-learning techniques become available, more effective methods for detecting these sorts of anomalies are emerging. In supervised learning, a set of training vectors is made available to the program being taught. With unsupervised learning models, no training set is provided. In reinforcement learning, for instance, the program receives no help finding patterns beyond punishment for mistakes and rewards for correct answers.

Progress-monitor programs watch indicators in the system and take appropriate action should progress stop. They often include monitoring for process failures—the most extreme form of non-progress.

Validation Techniques

Auditors and regulatory bodies require assurances that isolation boundaries are not breached. Reference 5 describes the characteristics of tools that, with OS support, can trace interactions to demonstrate that inappropriate interactions are not occurring in the states invoked by the testing.

Validation techniques include discrete event simulation, formal design proving, static code analysis, deadlock and livelock avoidance, and other tools.

Discrete event simulation can provide a statistical level of confidence of correctness in cases where it is impossible to provide a proof of correctness of an algorithm, protocol, or isolation. An example would be for cases when event distribution is only available empirically.⁶

If a design can be formally proven correct and code generated automatically from the proven design, then verification and testing times can be reduced substantially. The widespread adoption of formal design methods has been hampered by a lack of automatic theorem provers, as well as a lack of designers able to handle the mathematical formalism of these methods. Recently, however, tools that hide much of the mathematics have become available, resulting in formal design being more accessible and practical.⁷

A weakness of static code analysis is its tendency to generate false positives. Further, the languages we tend to use (for instance, C with its use of pointers) can make static analysis less effective. However, more sophisticated static analysis tools are becoming available, with particular advances in techniques for exploiting contracts in the code and for symbolic execution.⁸

Deadlock And Livelock Avoidance

Conditions on the behavior of a system are often divided into safety conditions (“the system never does anything wrong”) and liveness conditions (“the system always eventually does something right”).⁹ Proving the correctness of liveness assertions is intrinsically difficult because the search space is unbounded, in principle: what might the system do in the future? Nonetheless, various tools can help detect deadlocks and livelocks at different stages of system development.

Tools are available to help detect deadlocks and livelocks, typically by analyzing an abstract model of the system, including both hardware and software. Some tools can generate code from the design, once it has been proven correct. Unfortunately, if the design is implemented manually, developers may introduce faults into the code and compromise the implementation.

Static analysis tools examine the actual written code. They can sometimes detect potential deadlocks, though this is more difficult for languages such as C than for fully defined and strongly and statically typed languages. These tools perform tasks ranging from extracting semantic information from contracts embedded in the code (as comments in languages that do not directly support programming by contract) to symbolic execution―a cross between dynamic testing and static analysis.¹⁰

When they are running, many systems provide a hardware watchdog timer that must be “kicked” periodically. However, it is difficult to determine which process(es) should kick the watchdog, because the watchdog remains oblivious to deadlocks from which it isn’t expecting a kick. Therefore, it is often useful to define conditions which guarantee progress is being made (for instance, a message from process A is received and handled by process B, or the value of an integer is changing monotonically), then put in place a software process to monitor the condition and take action should progress stop.

Conclusion

No single approach can provide isolation between software components in accordance with the requirements of ISO 26262, just as no single approach can provide adequate proof of this isolation. Used together, however, complementary design and validation techniques can provide an adequate level of confidence that isolation has been achieved. Some of these approaches need help from the underlying operating system, but others can be implemented at the application level.

References And Notes

First edition, 2011.
See many texts, including Jack P.C. Kleijnen, Design and Analysis of Simulation Experiments, 2007.
Briand, Loïc and Daniel Roy Meeting Deadlines in Hard Real-Time Systems: The Rate Monotonic Approach, IEEE Computer Society, 3rd ed., 1999.
Data diversification has a very long history; Charles Babbage described it and the associated pattern of code diversity in 1837: “When the formula is very complicated, it may be algebraically arranged for computation in two or more distinct ways, and two or more sets of cards may be made. If the same constants are now employed with each set, and if under these circumstances the results agree, we may be quite secure in the accuracy of them.” Ammann and Knight offer a more contemporary description in Ammann, Paul E. and John C. Knight, “Data diversity: An approach to software fault tolerance,” IEEE Transactions on Computers, 37(4):418–425, 1988.
Bonakdarpour, Borzoo and Sebastian Fischmeister, “Runtime Monitoring of Time-sensitive Systems–Tutorial Supplement”, Proceedings of the 2nd International Conference on Runtime Verification (RV), San Francisco, 2011.
See Kleinen, op cit.
Abrial, Jean-Raymond et al.“Rodin: an open toolset for modeling and reasoning in Event-B”, International Journal on Software Tools for Technology Transfer, 12(6):447-466, 2010.
Cadar, Cristian et al., “KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs,” Proceedings of the 8th USENIX conference on Operating systems design and implementation, 2008.
Cook, Byron et al., “Proving that programs eventually do something good,” SIGPLAN Not., 42(1):265–276, Jan. 2007. See also Edward G. Coffman Jr., et al., System Deadlocks, ACM Computing Surveys 3 (2): 67–78, 1971
See Cadar, op cit.