Trains are no longer the steaming leviathans of yesteryear run by men in overalls with lantern signals. Today, they are gleaming, articulated subways and light-rail transit (LRT) systems moving tens of millions of commuters daily. Or, they are behemoth 6000-hp (4.5 MW) locomotives hauling 100-car freight trains, totaling more tonnage than a navy destroyer. High-speed freights such as the “TGV postal” in France zoom along at more than 250 km/h, while passenger trains such as the “China Railway High-Speed” carry passengers at speeds reaching 350 km/h.
Controlling these trains requires increasingly sophisticated and complex embedded-system software. For example, a GE Evolution locomotive employs 20 Pentium-class systems to monitor and control the diesel engine, traction motors, compressors, battery chargers, radiator fans, and numerous other systems. These systems “measure and check 2500 to 5000 parameters with data latency varying from tens of microseconds to tens of seconds, depending on the system.” [1]
To ensure safety and efficiency, railways around the world now implement systems such as Automatic Train Protection (ATP), Positive Train Control (PTC), and Communication-Based Train Control (CBTC). Subways and other rail transit systems are adopting Automated Train Operations (ATO) systems and running “driverless” trains. These trains literally don’t have drivers. The train operator’s chief role is to assist in the case of failures and emergencies (Fig. 1).
Standards, Dependability, And Isolation
The railway industry has a long history of developing safe systems. So, it isn’t surprising that two decades ago the CENELEC (Comité Européen de Normalisation Électrotechnique) developed railway applications standards EN 50126, The specification and demonstration of Reliability, Availability, Maintainability and Safety (RAMS); EN 50128, Software for railway control and protection systems; and EN 50129, Safety-related electronic systems for signalling.
Interestingly, these standards recognize a safety integrity level (SIL) not set by the broad IEC 61508 Functional safety of electrical/electronic/programmable electronic safety-related systems standard. In addition to SILs 1 to 4 defined in IEC 61508, these railway standards define a SIL 0 for “non-safety related” software. Also interesting to note is that EN 50128:
- Specifically points out the importance of software architecture: “The software architecture is where the basic safety strategy is developed for the software and the software safety integrity level.” [3]
- Stipulates that if commercial off-the-shelf (COTS) software is used in systems requiring SIL 3 or SIL 4, “a strategy shall be defined to detect failures of the COTS software and to protect the system from these failures.” [4]
EN 50128 also explicitly states what is known to anyone who has had to design or validate a safety-related software system: “There is no known way to prove the absence of faults in reasonably complex safety-related software.” [5] In other words, “When we build a safe system, we cannot prove that the system contains no faults”; we can only “provide evidence to support our claims that our system will be as dependable as we say it is.” [6]
In a software system, dependability is a combination of availability (how often the system responds to requests in a timely manner) and reliability (how often these responses are correct). Both of these qualities depend heavily on the OS and, specifically (as noted in EN 50128), the OS architecture and its ability to isolate component failures to protect the system.
The OS architecture is vital for a couple of reasons. First, it’s fundamental to overall system dependability. Second, it determines how easy (or difficult) and costly it is to isolate and protect components with different SIL requirements.
For example, an ATO system may incorporate a multimedia component that displays non-critical information on an in-cab screen. A SIL of 1 or even 0 may be all that’s required by this component, while the critical components (handling communications with the wayside infrastructure, managing deceleration and braking, balance, alarms, etc.) demand SIL 3 certification or better.
An architecture that facilitates isolation of the SIL 0 component and demonstrates that it cannot compromise the safety-critical part of the system:
- simplifies the design, allowing use of, say, COTS software for the SIL 0 component with minimal integration work
- eliminates the cost of designing, building, and validating the non-critical component to SIL 3 (not required)
- makes for a safer system overall, because it reduces the scope of the safety-critical system and enables development and validation efforts to focus on the critical part of the system
Where dependability is an essential factor, as in any safety-related system, the OS should be designed to support guarantees of availability and reliability. These OSs are usually called real-time OSs (RTOSs). RTOSs differ primarily in their architectures, precisely the design characteristic that EN 50128 indicates is so important to a safety-critical design. The most common RTOS architectures are real-time executive, monolithic, and microkernel.
Though 50 years old, the real-time executive model still forms the basis of many RTOSs. With this model, all software components—kernel, networking stacks, file systems, drivers, and applications—run together in one memory address space.
This architecture, while efficient, carries two significant weaknesses. First, a single pointer error in any module can corrupt memory used by the kernel or any other module, leading to unpredictable behavior or system-wide failure. Second, the system can crash without leaving diagnostic information or traces that might help pinpoint the location of the bug.
Some RTOSs attempt to address the problem of a memory error provoking system-wide corruption by using a monolithic architecture. In this case, user applications run as memory-protected processes.
The monolithic architecture protects the kernel from errant user code, but kernel components still share the same address space as file systems, protocol stacks, and drivers. Consequently, a fault in any of these services can bring down the entire system.
In a Linux OS, for example, drivers make up some 75% of the code. Each line of code presents a potential fault that could reach the kernel. As with systems using real-time executive OSs, systems with monolithic OS architectures may have difficulty meeting dependability requirements.
In a microkernel RTOS, applications, device drivers, file systems, and networking stacks all reside outside the kernel in separate address spaces. Thus, they’re isolated from both the kernel and each other, which means a fault in one component will not percolate across the system. Further, because the rest of the system is still running predictably, it can restart the failed component.
On top of that, the separation of components from the kernel and from each other can be advantageous when designing a safety-related system. As described in the example above, not all components need to be designed to achieve the SIL required of the safety-critical part of the system. The only provision is that the components with lower SILs be isolated from the safety-critical parts of the system.
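The isolation provision can be expressed as a simple check over a component manifest. The following Python sketch is illustrative only: the `Component` type, the component names, and the address-space labels are hypothetical, and a real safety case would rest on evidence from the OS rather than on a table like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    sil: int            # 0 to 4, following the EN 50128 levels
    address_space: str  # hypothetical label for the process it runs in


def isolation_violations(components, critical_sil=3):
    """Names of lower-SIL components sharing an address space with a critical one."""
    critical_spaces = {c.address_space for c in components if c.sil >= critical_sil}
    return sorted(
        c.name
        for c in components
        if c.sil < critical_sil and c.address_space in critical_spaces
    )


# Hypothetical ATO layout: every component in its own address space.
isolated = [
    Component("braking_control", 3, "proc_braking"),
    Component("wayside_comms", 3, "proc_comms"),
    Component("cab_multimedia", 0, "proc_media"),
]

# Same components, but the multimedia code is linked into the braking process.
shared = [
    Component("braking_control", 3, "proc_braking"),
    Component("wayside_comms", 3, "proc_comms"),
    Component("cab_multimedia", 0, "proc_braking"),
]

print(isolation_violations(isolated))  # []
print(isolation_violations(shared))    # ['cab_multimedia']
```

In the second layout, the SIL 0 multimedia component would have to be developed and validated to SIL 3, because nothing prevents it from corrupting the braking code.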
This isolation could also be achieved with a virtual machine (hypervisor), but generally this strategy would require a more powerful processor, limiting the choice of suitable processors and increasing the cost of the hardware. It also adds a level of complexity to the system and may impact real-time performance.
Key RTOS Characteristics
A microkernel architecture is only one of many design characteristics that contribute to RTOS dependability. Other important characteristics include the ability to:
- meet real-time commitments by preempting lower-priority kernel calls
- prevent unpredictable behavior and system failure due to priority inversions
- guarantee CPU availability through scheduling that prevents starvation of critical processes
- monitor processes with a software watchdog and, in the event of a component failure, either restart the component or move the system to its design safe state
A preemptible kernel is essential to any system that requires tasks to complete on time, and is, therefore, a critical feature of any RTOS. OSs that don’t support preemptible kernel calls can be prone to unpredictable delays, causing critical activities to miss their deadlines and ultimately compromising a safe system’s ability to meet its dependability requirements.
These delays are triggered by high-priority user threads being obliged to wait for an entire kernel call to complete, even if this call was made by the lowest-priority process in the system. Even worse, priority information is usually lost when a driver or other system service (usually performed in a kernel call) executes on behalf of a client thread.
In a well-designed RTOS, these time windows when preemption may not occur are extremely brief (often on the order of nanoseconds). The RTOS imposes an upper bound on how long interrupts are disabled and preemption is held off. This upper bound allows developers to ascertain and subsequently accommodate worst-case latencies in the system design.
To ensure predictability and timely completion of critical activities, the RTOS kernel must be as simple and elegant as possible, so that there’s a clear upper bound on the longest non-preemptible code path through the kernel. The best way to achieve such simplicity is to design a kernel that includes only services with a short execution path, assigning work-intensive operations (such as process loading) to external processes or threads.
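Once such upper bounds are known, accommodating them in a design is a matter of arithmetic. The Python sketch below uses a deliberately conservative model that simply adds the bounds together. All of the figures and names are hypothetical and do not describe any particular RTOS or processor.

```python
# Hypothetical worst-case figures, in nanoseconds. None of these numbers
# come from a real system; they only illustrate the calculation.
MAX_INTERRUPTS_DISABLED_NS = 200    # longest window with interrupts masked
MAX_NON_PREEMPTIBLE_NS = 500        # longest non-preemptible kernel path
CONTEXT_SWITCH_NS = 1_500           # switch to the high-priority thread


def worst_case_dispatch_latency_ns(interrupts_disabled, non_preemptible, context_switch):
    """Conservative bound: assume every delay occurs back to back."""
    return interrupts_disabled + non_preemptible + context_switch


def meets_deadline(dispatch_latency_ns, handler_wcet_ns, deadline_ns):
    """True if the handler still finishes in time after the worst-case delay."""
    return dispatch_latency_ns + handler_wcet_ns <= deadline_ns


latency = worst_case_dispatch_latency_ns(
    MAX_INTERRUPTS_DISABLED_NS, MAX_NON_PREEMPTIBLE_NS, CONTEXT_SWITCH_NS
)
print(latency)                                   # 2200
print(meets_deadline(latency, 40_000, 50_000))   # True
print(meets_deadline(latency, 40_000, 42_000))   # False
```

Without a guaranteed bound on the non-preemptible path, the first term in this sum has no defined maximum, and the deadline check cannot be made at all.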
One of the more notorious errors to compromise system dependability is “priority inversion.” This problem, which plagued the Mars Pathfinder project in 1997 [7], is a condition where a lower-priority task prevents a higher-priority task from completing its work.
For example, a thread with lower priority may simply block a thread with a higher priority (Fig. 3). This blocking can be caused by synchronization (e.g., the alarm and the data logger share a resource controlled by a lock or semaphore, and the alarm is waiting for the data logger to unlock the resource), or by the alarm requesting a service currently used by the data logger.
In the example shown in Figure 3, a medium-priority thread (data aggregator) preempts the low-priority logger. However, it doesn’t require the resource used by the logger, which keeps control of this resource. When the alarm tries to run, it preempts the aggregator, but can’t access the resource still controlled by the logger and blocks. With the alarm blocked, the scheduler looks for the highest-priority task that can run and thus runs the aggregator, effectively inverting the thread priorities.
Priority inheritance is one mechanism that can prevent priority inversions. It temporarily assigns the priority of the blocked higher-priority task to the lower-priority thread that is blocking it, until that thread releases the contested resource. In the Figure 3 example, the data logger would inherit the alarm’s priority, and hence could not be preempted by the data aggregator. It would complete and revert to its original priority, and the alarm would unblock and continue, unaffected by the data aggregator (Fig. 4).
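The effect of inheritance on the alarm, logger, and aggregator can be reproduced with a small tick-by-tick simulation. This Python sketch is a toy model, not an RTOS scheduler: the release times, run lengths, and the simplification that a task holds the resource for its whole run are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    priority: int     # larger number = higher priority
    release: int      # tick at which the task becomes ready
    remaining: int    # ticks of CPU time still needed
    needs_lock: bool  # True if it holds the shared resource while running


def simulate(inheritance):
    """Run a fixed-priority preemptive schedule; return finish times and timeline."""
    tasks = [
        Task("logger", 1, 0, 4, True),
        Task("aggregator", 2, 1, 5, False),
        Task("alarm", 3, 2, 2, True),
    ]
    holder = None  # task currently holding the shared resource
    finish, timeline, tick = {}, [], 0

    while len(finish) < len(tasks):
        ready = [t for t in tasks if t.release <= tick and t.remaining > 0]
        # A task is blocked if it needs the resource and another task holds it.
        blocked = [t for t in ready
                   if t.needs_lock and holder is not None and holder is not t]
        runnable = [t for t in ready if t not in blocked]

        def effective_priority(t):
            # With inheritance, the holder runs at the priority of the
            # highest-priority task it is blocking.
            if inheritance and t is holder and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        if not runnable:
            timeline.append("idle")
            tick += 1
            continue

        current = max(runnable, key=effective_priority)
        if current.needs_lock:
            holder = current
        current.remaining -= 1
        timeline.append(current.name)
        tick += 1

        if current.remaining == 0:
            finish[current.name] = tick
            if holder is current:
                holder = None  # release the resource on completion

    return finish, timeline


print(simulate(inheritance=False)[0]["alarm"])  # 11
print(simulate(inheritance=True)[0]["alarm"])   # 7
```

Without inheritance, the alarm waits for the aggregator's entire run plus the remainder of the logger's work. With inheritance, it waits only for the logger to finish with the resource, and the aggregator runs last.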
If a subsystem is starved of CPU cycles in a safety-related system, the services it provides may become unavailable to other subsystems—with unwanted consequences. For example, in a subway system, if a process in an on-board ATP system’s communication stack fails to respond at the expected time, that ATP system may assume a loss of communications with the wayside ATP infrastructure. At that point, it begins invoking safety procedures, slowing or stopping trains and disrupting service up and down the line.
Time partitioning [8] addresses the problem of resource starvation by enforcing CPU budgets, either through hardware or software. It prevents processes or threads from monopolizing CPU cycles needed by other processes or threads. Two types of partitioning are possible: static and adaptive.
Static partitioning groups tasks into partitions, and each partition is allocated a percentage of CPU time. No task in any given partition can consume more than that partition's predetermined percentage of CPU time. By making sure that every partition has its guaranteed portion of CPU time, this limit ensures that all key processes are always available.
Unfortunately, no process can ever use more CPU cycles than the limit allocated to its partition, even if other partitions don’t use all of their allocated times. Static partitioning thus squanders CPU cycles and reduces a system’s ability to handle peak demands.
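A short calculation shows how static budgets leave cycles idle. In this Python sketch, the partition names, budgets, and demand figures are hypothetical percentages of CPU time over one scheduling window.

```python
def static_allocation(budgets, demand):
    """Grant each partition min(demand, budget); return grants and idle CPU (%)."""
    granted = {p: min(demand.get(p, 0), budget) for p, budget in budgets.items()}
    idle = 100 - sum(granted.values())
    return granted, idle


# Hypothetical budgets and a peak in braking demand.
budgets = {"braking": 40, "comms": 30, "hmi": 30}
demand = {"braking": 55, "comms": 10, "hmi": 20}

granted, idle = static_allocation(budgets, demand)
print(granted)  # {'braking': 40, 'comms': 10, 'hmi': 20}
print(idle)     # 30
```

The braking partition is capped at its 40% budget even though 30% of the CPU sits idle during the same window.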
Like static partitioning, adaptive partitioning reserves CPU cycles for a process or group of processes to create a system whose parts are all protected against resource starvation. Unlike static partitioning, however, adaptive partitioning uses a dynamic scheduling algorithm. It reassigns partitions’ unused CPU cycles to other partitions that can benefit from extra processing time (Fig. 5).
Partition budgets are enforced only when the CPU is running to capacity. As a result, adaptive partitioning allows systems to run at 100% capacity, enforcing partitioning budgets only when processes in more than one partition compete for cycles.
Furthermore, adaptive partitioning can adjust budgets while the system is running, based on predetermined criteria. For instance, a partition looking after braking adjustments might be allocated 30% of CPU time at speeds below 20 km/h and 45% at higher speeds.
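Both behaviors, lending out unused cycles and changing budgets at run time, can be sketched in a few lines of Python. The policy for handing out spare cycles (first come, in declaration order) and the comms and HMI budgets are assumptions made for this illustration. Only the 30% and 45% braking figures come from the example above, and a real adaptive scheduler would apply its own rules.

```python
def budgets_for_speed(speed_kmh):
    """Braking gets 30% below 20 km/h and 45% otherwise; the rest is hypothetical."""
    braking = 30 if speed_kmh < 20 else 45
    comms = 35
    return {"braking": braking, "comms": comms, "hmi": 100 - braking - comms}


def adaptive_allocation(budgets, demand):
    """Guarantee each budget first, then lend unused cycles to partitions that want more."""
    granted = {p: min(demand.get(p, 0), budget) for p, budget in budgets.items()}
    spare = 100 - sum(granted.values())
    for p in budgets:  # assumed policy: declaration order
        extra = min(demand.get(p, 0) - granted[p], spare)
        granted[p] += extra
        spare -= extra
    return granted


# Light load at low speed: braking borrows cycles the others do not need.
print(adaptive_allocation(budgets_for_speed(10),
                          {"braking": 50, "comms": 10, "hmi": 20}))
# {'braking': 50, 'comms': 10, 'hmi': 20}

# Overload at high speed: every partition is held to its budget.
print(adaptive_allocation(budgets_for_speed(80),
                          {"braking": 70, "comms": 50, "hmi": 40}))
# {'braking': 45, 'comms': 35, 'hmi': 20}
```

Budgets constrain a partition only in the second case, when demand across partitions exceeds the available CPU time.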
Safeguards against failures cascading through the system, along with self-healing capabilities, are crucial to a highly dependable OS. Systems requiring availability guarantees may implement hardware-oriented, high-availability solutions, as well as a software “watchdog.”
A software watchdog is a user-space process that monitors the system and performs multi-stage recoveries or clean shutdowns. In the event of a failure, the watchdog (depending on the implementation) can perform various types of operations to ensure system safety and recovery.
For instance, it can abort and then restart the process that failed, avoiding a system reboot. Alternatively, it can terminate the failed process and related processes, initialize the hardware to a safe state, and then restart the terminated processes in a coordinated manner. Or, finally, if the failure is critical (and especially if the failure might compromise safety), the watchdog can perform a controlled shutdown or reset of the entire system, and sound an alarm to system operators.
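The three recovery options suggest a staged policy. The Python sketch below models one possible policy with heartbeats and explicit timestamps. The class, the component names, and the escalation rule (a repeated failure triggers a coordinated restart) are hypothetical, and the sketch only selects an action; it does not restart or shut down anything.

```python
from enum import Enum


class Action(Enum):
    RESTART_PROCESS = "abort and restart the failed process"
    COORDINATED_RESTART = "stop related processes, reset hardware, restart together"
    CONTROLLED_SHUTDOWN = "controlled shutdown or reset, and alarm to operators"


class Watchdog:
    def __init__(self, heartbeat_timeout, max_simple_restarts=1):
        self.heartbeat_timeout = heartbeat_timeout
        self.max_simple_restarts = max_simple_restarts
        self.last_beat = {}
        self.restart_count = {}
        self.safety_critical = set()

    def register(self, name, now, safety_critical=False):
        self.last_beat[name] = now
        self.restart_count[name] = 0
        if safety_critical:
            self.safety_critical.add(name)

    def heartbeat(self, name, now):
        self.last_beat[name] = now

    def check(self, now):
        """Return the recovery action chosen for each component that went silent."""
        actions = {}
        for name, last in list(self.last_beat.items()):
            if now - last <= self.heartbeat_timeout:
                continue
            actions[name] = self._choose_action(name)
            self.last_beat[name] = now  # assume recovery restarts the timer
        return actions

    def _choose_action(self, name):
        if name in self.safety_critical:
            return Action.CONTROLLED_SHUTDOWN
        self.restart_count[name] += 1
        if self.restart_count[name] <= self.max_simple_restarts:
            return Action.RESTART_PROCESS
        return Action.COORDINATED_RESTART


wd = Watchdog(heartbeat_timeout=3)
wd.register("cab_multimedia", now=0)
wd.register("atp_comms", now=0, safety_critical=True)

wd.heartbeat("atp_comms", now=4)
first = wd.check(now=5)    # multimedia silent for 5 ticks: simple restart
wd.heartbeat("atp_comms", now=8)
second = wd.check(now=9)   # multimedia fails again: escalate
third = wd.check(now=13)   # ATP comms silent too: controlled shutdown
```

Passing the time in as an argument keeps the policy deterministic and easy to test, which matters for code that must itself be validated.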
In all cases, the watchdog must be self-monitoring and resilient to internal failures. If, for whatever reason, it stops abnormally, it must immediately and completely reconstruct its own state by handing over to a mirror process.
Finally, a software watchdog is able to monitor for system events that are invisible to a conventional hardware watchdog. For example, a hardware watchdog can ensure that a driver is servicing the hardware, but may not detect whether other programs are correctly communicating with that driver. A software watchdog bridges this gap and takes action when it detects an internal anomaly.
Train control systems are safety-related systems. They must meet the strict dependability requirements set out in IEC 61508 and the EN 5012x group of standards. Certain OS characteristics most directly affect system dependability: architecture, features that support real-time guarantees, fault isolation, and recovery from component failures.
The discussion can be extended to include topics such as the communications stack, human-machine interface (HMI) technologies, support for multicore processing (including processor or core affinity), and the use of certified COTS components (e.g., an IEC 61508 SIL 3-certified OS kernel).
1. John Dodge, “Locomotives Pull Into The Digital Age,” Design News, May 16, 2005.
2. Paul Leroux, “Railway Communications Stay On Track With QNX,” On Q, Jan. 10, 2010.
3. BS EN 50128:2001 Incorporating corrigendum May 2010, Introduction, p. 6.
4. Ibid., Clause 9.4.5.
5. Ibid., Introduction, p. 5.
6. Chris Hobbs, “The Limits of Testing in Safe Systems,” Electronic Design, Nov. 11, 2011.
7. Michael Barr, “Introduction to Priority Inversion,” Embedded Systems Programming, Volume 15: Number 4, April 2002.
8. Some OSs also support “space partitioning,” which provides a guaranteed amount of memory for each partition.