Electronic Design

High-Availability RTOSs Deliver Five-Nines Reliability

To work, multiprocessor systems and hot-swap hardware require high-availability RTOSs.

New real-time operating-system (RTOS) enhancements make 99.999% availability and real-time application requirements achievable. Transaction processing, process control, communications switching, and air-traffic control are just a few examples of applications where downtime cannot be tolerated. Such companies as Monta Vista, OSE Systems, QNX Systems, Red Hat, Lynuxworks, and Wind River Systems have added high-availability services to the list of modules that can be incorporated into an RTOS.
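To put the five-nines figure in perspective, the short sketch below converts an availability percentage into a yearly downtime budget. The function name and the use of an average 365.25-day year are illustrative assumptions, not part of any vendor's tooling.

```python
# Convert an availability percentage into allowed downtime per year.
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for level in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(level)
        print(f"{level}% availability allows {minutes:.2f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of total downtime per year, which leaves almost no room for a manual restart.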

The technology of high-availability systems isn't new. IBM, Sun, Microsoft, and others have done it for years. Custom embedded systems have often utilized high-availability techniques through customized software instead of standardized OS support.

High-availability hardware isn't new either, but this type of hardware, such as RAID disk and tape support, is showing up in more embedded and real-time systems. Standard CompactPCI systems, like those from Force Computers, provide hot-swap board support. Likewise, network interconnects, including Ethernet and InfiniBand, give developers a choice of implementation methods. Today, off-the-shelf hardware can provide high-availability support with an off-the-shelf RTOS.

High-availability hardware systems generally feature:

  • Hot-swapping capability. This is available in computer boards like CompactPCI boards and disk and tape drives.
  • Multiprocessor links. Popular buses like InfiniBand and CompactPCI as well as networks like Ethernet include this feature.
  • A RAID (redundant array of independent disks) architecture, as found in disk and tape drives.

It's important to recognize the roles redundant hardware and hot-swapping play in a high-availability system (see "Hot-Swapping Is Only Part Of The Hardware Story," p. 44). A number of hardware technologies are available to implement high-availability systems.

Software support for high-availability systems is cropping up in a number of places (Fig. 1). Now, even an application programming interface (API) exists for CompactPCI.

Checkpointing, transaction support, and application heartbeat support are just some of the features being used with real-time systems. But the APIs aren't always standardized across vendors because each OS implements heartbeat support in a different fashion.

Checkpointing is the ability to save enough information from a process to restart it if it fails. Heartbeat support detects when a process has failed, typically by noticing that it has stopped sending its periodic signal.
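Heartbeat detection can be sketched as follows. This is a minimal illustration rather than any vendor's API: `HeartbeatMonitor`, `beat`, and `failed` are hypothetical names, and the timeout policy (a process is considered failed once its last beat is older than a fixed interval) is an assumption.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat of each process and report the silent ones."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock        # injectable so the monitor can be tested
        self.last_beat = {}       # process name -> time of last heartbeat

    def beat(self, name: str) -> None:
        """Called by (or on behalf of) a process to show it is still alive."""
        self.last_beat[name] = self.clock()

    def failed(self) -> list:
        """Return the names of processes whose last beat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_beat.items()
            if now - seen > self.timeout_s
        )
```

A supervisor would call `failed()` periodically and hand each reported name to whatever restart or failover mechanism the system provides.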

Modularity is still the key aspect of high availability in an RTOS. One example is a partitioning of high-availability services that closely matches the OS: Wind River's new VxWorks Foundation HA, which builds on the company's VxWorks AE RTOS (Fig. 2).

Other examples include Lynuxworks' Lynx/HA and Monta Vista's High Availability Framework, which add high-availability support to Linux-compatible and Linux operating systems, respectively. These additions have a modular construction similar to VxWorks Foundation HA.

Hardware may steal the limelight in numerous circuit designs, but high-availability hardware won't work without the correct software. More importantly, high-availability applications need to operate regardless of the kind of hardware available in the system. In particular, applications must continue working with other applications in the system, even if one application fails due to errant coding, a lack of resources, or other software-related problems.

In some cases, software failover support can be provided transparently. That's how many message-based systems operate.

In general, a high-availability system should have the following software services:

  • Heartbeat support for each server and each application.
  • Event management capability for change notification.
  • Alarm management for error handling.
  • Transaction capability for checkpointing and rollback/restart.
  • Clustering for server management and applications links.
  • Reliable storage support for RAIDs and for journaling file systems.

With QNX, applications communicate with each other using a messaging system that is part of the RTOS' core services. The QNX message system supports transparent message-based services independent of its new high-availability support. The QNX link manager can detect a failed application and redirect messages to an alternate application (Fig. 3).

The link manager can utilize alternate paths between applications and start up a new application if necessary. Changes are handled based on an application's description of a link. QNX uses messaging for all major services, and messages move transparently across node boundaries (Fig. 3, again). Of course, this redirection works equally well between applications on the same node.
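The redirection idea can be sketched in a few lines. This is not the QNX API: `LinkManager`, `register`, and `send` are hypothetical names, applications are modeled as plain callables, and a failed application is simulated by a handler that raises `ConnectionError`.

```python
class LinkManager:
    """Route messages on a named link, falling back to alternates on failure."""

    def __init__(self):
        self.links = {}  # link name -> ordered list of handlers (primary first)

    def register(self, link, handler):
        """Add a handler for a link; earlier registrations are preferred."""
        self.links.setdefault(link, []).append(handler)

    def send(self, link, message):
        """Deliver a message to the first handler that accepts it."""
        handlers = self.links.get(link, [])
        for handler in list(handlers):      # iterate over a copy while pruning
            try:
                return handler(message)
            except ConnectionError:
                handlers.remove(handler)    # treat as failed; try the alternate
        raise ConnectionError(f"no live handler for link {link!r}")


if __name__ == "__main__":
    def primary(msg):
        raise ConnectionError("primary is down")

    def backup(msg):
        return f"backup handled {msg}"

    manager = LinkManager()
    manager.register("billing", primary)
    manager.register("billing", backup)
    print(manager.send("billing", "request-1"))
```

The sender only names the link, so it never needs to know which application actually served the request. That is the sense in which the redirection is transparent.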

Some RTOSs add messaging capabilities as part of their high-availability services. For example, Lynuxworks Lynx/HA includes message-oriented middleware that uses unicast, broadcast, and multicast transmissions for notification of system events. Lynuxworks also includes CORBA-compatible quality-of-service options.

Web Clustering
High-availability Web-based services use redirection at a higher level. The QNX link manager is replaced by a dispatch server that forwards incoming requests to an array of servers. These servers respond directly to the source of a request.
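A dispatch server's selection logic might look like the sketch below. The article does not name a selection policy, so round-robin is an assumption here, and `Dispatcher`, `mark_down`, `mark_up`, and `pick` are hypothetical names. Network forwarding is omitted.

```python
import itertools


class Dispatcher:
    """Choose a back-end server for each request, skipping servers marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        """Stop sending requests to a server, e.g. after a missed heartbeat."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Return a known server to the rotation."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")
```

Because the chosen server replies directly to the requester, the dispatcher only has to make this choice and forward the request. It never sits in the response path.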

Red Hat and TurboLinux are two companies that provide Linux-based cluster solutions. Red Hat also delivers embedded cluster solutions. Transparent redirection is only one aspect of clustering. Nontransparent support allows tighter integration between replicated services.

IBM, Microsoft, and Sun Microsystems have extensive clustering solutions. Although these tend to be used in high-end installations, the same techniques are applicable to embedded environments.

In fact, Microsoft's Windows NT Embedded supports the same services that are available with Windows NT. Microsoft's replacement for Windows NT Embedded is even more robust when it comes to clustering technology.

APIs for this type of clustering support are OS-specific. Applications must take advantage of these APIs, and applications that work together are tightly integrated.

Exceptions, such as a failed service or application, must be handled explicitly. High-availability support typically provides services like checkpointing and transaction rollback.
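As an illustration of explicit exception handling with rollback, the sketch below restores an in-memory state dictionary when an operation fails partway through. It is a simplified stand-in for a real transaction service: the `transaction` helper is a hypothetical name, and the snapshot is kept in memory rather than in reliable storage.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(state: dict):
    """Apply changes to `state`; restore the original contents on any error."""
    snapshot = copy.deepcopy(state)   # taken before any change is made
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)        # roll back to the snapshot
        raise                         # the caller still sees the failure


if __name__ == "__main__":
    account = {"balance": 100, "history": []}
    try:
        with transaction(account) as acct:
            acct["balance"] -= 150
            acct["history"].append("withdraw 150")
            if acct["balance"] < 0:
                raise ValueError("insufficient funds")
    except ValueError as err:
        print("rolled back:", err)
    print(account)
```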

RTOS high-availability modularity allows developers to choose the kinds of services needed to support their particular requirements. This may include hardware-related support, such as hot-swap recognition, handling of device failures and environmental problems like overheating, or the use of reliable storage.

It might further be limited to event and alarm support. Even basic heartbeat monitoring can help bring a system into high-availability land if applications are written to handle faults.

Certainly, additional high-availability modules should make the programmer's job easier. For this reason, high-availability technologies from high-end systems, such as clustering, are finding their way into embedded systems.

Some high-availability technologies already exist in many RTOSs. QNX is one example. This message-based RTOS provides transparent message redirection as part of the regular RTOS implementation. Additional support addresses features typically not found in a basic RTOS, such as transaction-oriented checkpoint support.

In this case, a checkpointed task provides data and restart information as part of a checkpoint that's managed by the QNX high-availability monitor. If the task terminates or fails to respond in a set time, the monitor will start a new task.
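The checkpoint-and-restart cycle described above can be sketched as follows. This is not the QNX high-availability monitor: `save_checkpoint`, `load_checkpoint`, `run_task`, and `supervise` are hypothetical names, a JSON file stands in for the checkpoint store, and the failure is simulated with an exception.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers see either the old or the new file


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run_task(path, total, fail_at=None):
    """Sum the integers below `total`, checkpointing after every step."""
    state = load_checkpoint(path, {"next": 0, "sum": 0})
    for i in range(state["next"], total):
        if i == fail_at:
            raise RuntimeError("simulated task failure")
        state["sum"] += i
        state["next"] = i + 1
        save_checkpoint(path, state)
    return state["sum"]


def supervise(path, total, fail_at=None):
    """Run the task; if it fails, start a new one that resumes from the checkpoint."""
    try:
        return run_task(path, total, fail_at)
    except RuntimeError:
        return run_task(path, total)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(supervise(os.path.join(workdir, "task.ckpt"), 10, fail_at=5))
```

The restarted task does not repeat the completed steps. It picks up at the step recorded in the last checkpoint, which is why the checkpoint must contain both the data and the restart position.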

Using features like checkpointing becomes significantly easier with off-the-shelf components if the RTOS vendor provides support for the boards used in the system. The latest crop of high-availability add-ons, such as those available from Wind River and QNX, have the necessary support.

Meeting the five-nines requirement isn't the only reason to consider high-availability support. Simply providing a more reliable product is justification enough to consider a high-availability-enabled RTOS. The alternative is to build that support from scratch.

Yes, high-availability RTOS integration is just beginning.

Need More Information?
Force Computers
(408) 369-6000
www.forcecomputers.com

Green Hills Software Inc.
(805) 965-6044
www.ghs.com

IBM Corp.
(800) IBM-4YOU
www.ibm.com

Lane 15 Software Inc.
(512) 502-9898
www.lane15.com

Lynuxworks Inc.
(408) 979-3900
www.lynuxworks.com

Monta Vista
(408) 328-9200
www.mvista.com

Microsoft Corp.
(425) 882-8080
www.microsoft.com

OSE Systems
(408) 392-9300
www.enea.com

PCI Industrial Computer Manufacturers Group
(781) 246-9318
www.pcimg.org

Pigeon Point Systems
(831) 438-1565
www.pigeonpoint.com

QNX Systems
(800) 676-0566
www.qnx.com

Red Hat Inc.
(919) 547-0012
www.redhat.com

Sun Microsystems Inc.
(800) 786-7638
www.sun.com

TurboLinux Inc.
(650) 244-7777
www.turbolinux.com

Vieo Inc.
(512) 257-3031
www.vieo.com

Wind River Systems
(800) 545-WIND
www.windriver.com

