Hot-Swap Hardware And Software Hurdles Continue To Fall

After years of design-engineer frustration, hot-swap and live-insertion technologies are gradually evolving from expensive, special-purpose solutions to mainstream design alternatives. Thanks to standardization efforts from board vendors on both the hardware and software sides, hot-swapping designs, to at least some extent, are becoming easier and more reliable.

But obstacles still exist for full and transparent card swapping. Vendors have to surrender proven proprietary solutions for new and sometimes less-functional standardized ones. Adding features to standards becomes more difficult. It's not enough to have an operable hardware interface. System software and device drivers have to be written to dynamically reconfigure for the removal and insertion of cards.

How important is hot swapping in embedded design today? It depends on the industry. In telecommunications, the expectation of 100% uptime is the rule. So almost half of the system designs require hot swapping, according to a survey conducted by Venture Development Corp. (VDC), Natick, Mass., and presented at the 2000 Bus and Board Conference held in San Jose, Calif., this past February. In other fields, it's less critical but still quite important.

More importantly, this study expects hot-swap needs to be significantly higher in the future. Almost all industries predict that at least 25% of their applications will employ hot-swap technologies (Fig. 1).

High availability and fault tolerance are the primary reasons behind system requirements for hot swapping. As society becomes more dependent upon computers and automation, systems need as close to 100% uptime as possible. This is particularly critical in telecommunications and factory-floor operations. It's also a growing demand for systems handling data, transportation, and medical networks.

Plus, more complex systems tend to have lower reliability. Individual components are actually growing more dependable. But having more components in a system tends to reduce its overall reliability. Hot swapping compensates for what can be a shorter mean time between failures for the whole system by cutting the time needed to repair it.
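The arithmetic behind that trade-off can be sketched in a few lines. This is a minimal illustration, assuming independent components with constant failure rates where any single failure stops the system. The MTBF and repair-time figures are invented for the example and do not come from the VDC survey.

```python
def series_mtbf(component_mtbfs_hours):
    """MTBF of a system that fails when any one component fails.

    Assumes independent components with constant failure rates,
    so the failure rates (1 / MTBF) simply add.
    """
    total_failure_rate = sum(1.0 / m for m in component_mtbfs_hours)
    return 1.0 / total_failure_rate


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + mean time to repair)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: every card has a 200,000-hour MTBF.
four_cards = series_mtbf([200_000] * 4)     # about 50,000 hours
twenty_cards = series_mtbf([200_000] * 20)  # about 10,000 hours

# Hot swapping attacks the repair time instead: a 4-hour shutdown and
# restart versus a 15-minute live replacement (both assumed values).
cold_repair = availability(twenty_cards, 4.0)
hot_repair = availability(twenty_cards, 0.25)
```

With these assumed numbers, five times as many cards cut the system MTBF to a fifth, while the shorter repair time recovers most of the lost availability.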

There are two fundamental purposes for providing hot-swap features on a bus interface. The obvious reason is to repair a faulty or damaged card without having to shut down the system. Supporting hot swapping and live insertion for repair assumes that when a card or module goes bad, the system can detect the failure and take the affected module offline. It can then notify the system administrator or repair facility that the failure occurred.

Thus, it's assumed that either the card wasn't a critical component or that some redundancy enables the system to perform at least some of its functions without it. If the component is so critical that its failure is automatically going to crash the system or render it unable to perform its primary tasks, then hot swapping may not make sense.

When the card or module is replaced, the system must be able to detect the replacement, prepare the new module for system operation, and then bring it smoothly back online. This process can be either driven by the system operator or done automatically.
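The repair cycle described above can be modeled as a small state machine. The sketch below is only an illustration: the state names and the `Slot` class are hypothetical and are not taken from any hot-swap specification.

```python
from enum import Enum, auto


class SlotState(Enum):
    ONLINE = auto()
    FAILED = auto()     # fault detected on the module
    OFFLINE = auto()    # module isolated, administrator notified
    REPLACED = auto()   # new module detected in the slot
    PREPARING = auto()  # new module being configured


# Allowed transitions for the repair cycle described in the text.
TRANSITIONS = {
    SlotState.ONLINE: {SlotState.FAILED},
    SlotState.FAILED: {SlotState.OFFLINE},
    SlotState.OFFLINE: {SlotState.REPLACED},
    SlotState.REPLACED: {SlotState.PREPARING},
    SlotState.PREPARING: {SlotState.ONLINE},
}


class Slot:
    def __init__(self):
        self.state = SlotState.ONLINE
        self.notifications = []

    def advance(self, new_state):
        """Move to new_state, rejecting any step that skips the cycle."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
        if new_state is SlotState.OFFLINE:
            # Stand-in for paging the administrator or repair facility.
            self.notifications.append("module failed and was taken offline")
```

Whether the transitions are triggered by an operator or by the system itself does not change the model. Only the source of the `advance` calls differs.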

Hot Swap For Reconfiguration

Using hot swapping for system reconfiguration, on the other hand, can be done as a replacement or an enhancement. It also can be some combination of the two. The system software must be able to recognize new features that have been added by the card and take advantage of them. At the very least, this process demands updated device drivers in order to access those features.

Hot-swap capability has been a cornerstone of CompactPCI thinking since the beginning of its standards effort. That goes back to late 1996, when a formal hot-swap subcommittee of the CompactPCI group was formed. To come up with a successful standard, the group had to factor in interoperability between platform vendors and suppliers of operating systems (OSes) and other system software, as well as adapter-board manufacturers.

The connector interface would stay the same. But the subcommittee had to decide whether it should use a passive or active CompactPCI backplane for the cards (Fig. 2). This decision would determine all other design aspects of the hot-swap specification.

The passive-backplane approach could have fragmented the technology. There would be hot-swap cards and non-hot-swap cards of the same design, because bus-isolation and power-management circuitry would have to reside on the adapter cards themselves. Using an active-backplane approach would move those functions from the adapter card to the backplane. Adapter cards would then be universal and plug into either a standard backplane or a hot-swap-enabled one. The outcome was nonetheless a nod in favor of a passive-backplane design, with the isolation and power circuitry carried on the boards.

The resulting specification defines six classes of hot-swap-compliant software. It allows initial deployments with minimal requirements for live insertion and extraction. But these will evolve over time to reach increased complexity and sophistication. The specification accomplishes this goal with a two-level software hierarchy. At the top level, software is classified as being for either general or specific use. The bottom level has three hot-swap performance grades, each of which can be applied in either the specific- or general-use category.

Specific-use software lets applications be hot-swappable even if they use operating systems without explicit hot-swap support. As the name implies, though, application software developed under the specific-use category is not necessarily portable to other platforms. General-use hot-swap features are fully integrated with the OS. They produce the applications that most widely fit a general population of platforms and board drivers. These two categories define three levels of live-insertion capability: high availability, full hot swapping, and basic hot swapping.
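The six classes are simply the two categories crossed with the three levels. A short sketch makes the taxonomy explicit. The label strings are paraphrased from the text and are not the specification's formal class names.

```python
from itertools import product

CATEGORIES = ("specific use", "general use")
LEVELS = ("basic hot swap", "full hot swap", "high availability")

# Every category can be paired with every level: 2 x 3 = 6 classes.
SOFTWARE_CLASSES = [
    f"{category}, {level}" for category, level in product(CATEGORIES, LEVELS)
]

for software_class in SOFTWARE_CLASSES:
    print(software_class)
```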

All three sit on a common foundation. Full hot swapping builds upon the basic hot-swap design, while the high-availability design adds more layers and features on top of it. At the hardware level, hot swapping requires a reliable bus-isolation method and power management. These two functions are performed on the plug-in board in the basic architecture, which describes the attributes needed to unplug and plug in a board without disturbing bus activity.

The basic hot-swap model is the simplest and the least automatic. Console intervention is normally required to signal the system that a card is about to be removed or inserted. If it's being taken out, the OS must gracefully terminate any running software. It then signals the card to disconnect itself and power down. The reverse happens when a card is inserted. For that function, the card also needs to be enumerated and mapped. A CompactPCI signal informs the system that a card is requesting enumeration.

The full model further defines the method by which the operating system is told of the impending insertion or extraction of a board. A microswitch attached to the card injector/ejector signals the system that an operator is about to remove a card. It essentially functions as an early-warning signal. The whole software and hardware disconnect process follows.

The enumeration interrupt also informs the operating system of the impending event. After the OS has closed the board's functions, this interrupt signals the system operator that the board can be removed. When a new board is installed, the OS automatically configures the system software. This signaling method allows the operator to install or remove boards without reconfiguring the system at the console.
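The difference between the basic and full models comes down to what starts the removal. The disconnect sequence that follows is the same. The sketch below lays the two side by side. The function name and step descriptions are hypothetical summaries of the text, not signal names from the CompactPCI specification.

```python
def extraction_sequence(trigger):
    """Return the ordered steps for removing a board.

    trigger: "console" for the basic model (operator types a command),
             "ejector_switch" for the full model (handle microswitch opens).
    """
    if trigger == "console":
        first_step = "operator announces the removal at the console"
    elif trigger == "ejector_switch":
        first_step = "ejector microswitch opens and the OS is interrupted"
    else:
        raise ValueError(f"unknown trigger: {trigger!r}")

    # From here on, both models follow the same disconnect process.
    return [
        first_step,
        "OS terminates the software using the board",
        "OS tells the board to disconnect from the bus",
        "board powers down",
        "operator is told the board may be removed",
    ]


for step in extraction_sequence("ejector_switch"):
    print(step)
```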

The high-availability model has a hot-swap controller that gives it the greatest capacity for reconfiguring software in a running system. Software and hardware components can be reconfigured automatically under application control. In contrast, both full and basic hot-swap models require operator intervention at some point. Console commands or ejector-switch activation and board removal usually unload the driver or install a new one.

By allowing software to control the board's state, this high-availability model increases both performance and system complexity. Control lines to the CPU inform the operating system that a board is present. The OS can then apply power to the board. Next, the hardware connection layer indicates that the board is powered up. The CPU then signals the board to come out of reset and connect to the bus. Individual boards can be identified and shut down, and others can be brought up in their place.
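That software-controlled bring-up can be sketched as a slot object whose steps must happen in order. This is a toy model under stated assumptions: the class, its fields, and its method names are hypothetical and do not correspond to registers or calls in the CompactPCI hot-swap specification.

```python
class HighAvailabilitySlot:
    """Software-controlled insertion for one slot (hypothetical model)."""

    def __init__(self):
        self.present = False
        self.powered = False
        self.in_reset = True
        self.on_bus = False

    def board_inserted(self):
        # A control line tells the OS that a board is seated.
        self.present = True

    def apply_power(self):
        if not self.present:
            raise RuntimeError("no board in slot")
        self.powered = True

    def release_reset(self):
        if not self.powered:
            raise RuntimeError("board is not powered")
        self.in_reset = False

    def connect_to_bus(self):
        if self.in_reset:
            raise RuntimeError("board is still held in reset")
        self.on_bus = True

    def shut_down(self):
        # Reverse order: isolate from the bus, reset, then remove power.
        self.on_bus = False
        self.in_reset = True
        self.powered = False


def bring_up(slot):
    """Run the power, reset, and connect steps in the order the text gives."""
    slot.apply_power()
    slot.release_reset()
    slot.connect_to_bus()
    return slot.on_bus
```

Because every step checks its precondition, an application can drive the sequence itself without risking a board that is on the bus but unpowered.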

The ever-adaptable VME bus interface also is moving incrementally toward a hot-swap standard. Though it wasn't originally designed for hot swapping, VME has undergone a number of enhancements during its 20-year-plus life cycle. The market for its products remains strong. They're used in areas like military and aerospace applications, industrial controls, transportation, telecommunications, and medical systems.

The proposed VITA 1.4 American National Standard for VME64x live insertion provides a basic support framework. Boards don't yet exist to support it. However, customers seeking high availability and rapid reconfiguration are feeding its progress.

The key to hot-swap designs is the system software. Its ability to automatically detect the presence or absence of a board leads it to dynamically reconfigure the system to accommodate that change. Without this feature, there would be little hope for standardization in hot-swap hardware.

Operating-system software provides the intelligence to make hot swapping possible on running systems. To take appropriate action, the system software has to be aware of what should happen. This awareness must be built in at either the application or the operating-system level. The software will then monitor and anticipate the removal or failure of dependent software components.

From this perspective, there is no fundamental difference in the way software or hardware components are used. Card removal actually means taking out the software components running on a card or accessing the card from the OS. This extends easily to include software upgrades within a running system or unanticipated failures of system software components.

This approach is possible on mass-market desktop operating systems like Windows 2000 and Windows NT. Due to their desktop origins and large memory footprints, however, they're only sometimes used for more complex embedded systems. Still, PCMCIA cards and USB devices can be attached and removed during operation. In many cases, the OS can recognize the device and prepare it for use. Hot swapping will still be less useful than it might be otherwise, though, due to the limitations and caveats of desktop operating systems.

To recognize a card removal or insertion, an operating system must be able to detect a hardware change and quickly adapt to the new configuration. For a removal, the system administration software first disables new connections to the board's device driver. The OS waits for all existing connections to terminate or forces them to do so. It then unloads that driver. The physical disconnect that occurs is registered with the system.
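The removal steps can be sketched as a short routine. This is an illustration only: `DriverHandle`, `remove_board`, and the registry set are hypothetical names. The sketch also assumes that connections are drained before the driver is unloaded, since unloading a driver that still has open connections would strand them.

```python
class DriverHandle:
    """Toy stand-in for a loaded device driver and its open connections."""

    def __init__(self, name):
        self.name = name
        self.accepting = True
        self.connections = set()
        self.loaded = True

    def open(self, client):
        if not self.accepting:
            raise ConnectionRefusedError(f"{self.name} is shutting down")
        self.connections.add(client)

    def close(self, client):
        self.connections.discard(client)


def remove_board(driver, registry, force=True):
    """Take a board's driver out of service; return True when done."""
    driver.accepting = False           # 1. refuse new connections
    if driver.connections and not force:
        return False                   # caller waits and retries later
    driver.connections.clear()         # 2. force remaining connections closed
    driver.loaded = False              # 3. unload the driver
    registry.discard(driver.name)      # 4. register the physical disconnect
    return True
```

Calling `remove_board(driver, registry, force=False)` in a retry loop gives the patient variant, in which the OS waits for clients to close on their own.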

In the CompactPCI approach, the hot-swap software model places the operating system between an application-programming-interface (API) layer and the hot-swap drivers. The signals communicating a hot swap pass primarily between those two layers, by way of the OS.

This puts pressure on any operating system attempting to implement hot-swap features for CompactPCI or another hot-swap standard. The OS must include extensive and reliable memory protection so that the loss of one process or driver doesn't bring down the whole system.

One modern RTOS that implements memory protection is Enea OSE. It incorporates a memory-management system (MMS) that works with a PowerPC's memory-management unit (MMU) to provide separate memory spaces for running processes. It also gives them a method of interprocess communication. OSE processes can be grouped together into blocks that provide a finer degree of control over memory use. A block can have its own local memory pool. That way, if a block pool is corrupted, only the processes located within it are affected.
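The containment idea can be shown with a toy model. The sketch below is not OSE's API. The `Block` class and `affected_processes` function are hypothetical, and only illustrate how a per-block pool limits the damage from corruption.

```python
class Block:
    """A group of processes that share one local memory pool."""

    def __init__(self, name, processes):
        self.name = name
        self.processes = list(processes)
        self.pool_ok = True  # set to False when the pool is corrupted


def affected_processes(blocks):
    """Processes exposed to corruption: only those in blocks with a bad pool."""
    return sorted(
        process
        for block in blocks
        if not block.pool_ok
        for process in block.processes
    )


# Hypothetical system: one block per plug-in card's software.
net = Block("net_card", ["net_driver", "net_stack"])
disk = Block("disk_card", ["disk_driver"])
net.pool_ok = False  # simulate a corrupted pool in one block

print(affected_processes([net, disk]))
```

Grouping each card's processes into their own block would mean that a fault tied to one card stays with that card's software, which is the property hot swapping relies on.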

That interprocess-communication mechanism is consistent with OSE's philosophy of protecting executing processes. Rather than using shared memory spaces, it implements a message-passing mechanism that involves kernel calls. Memory or message ownership is never shared.
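The ownership rule can be illustrated with a small model. This is a sketch under stated assumptions, not OSE's actual interface: the `Message` and `Kernel` classes and their methods are invented for the example.

```python
from collections import deque


class Message:
    """A buffer with exactly one owner at any time."""

    def __init__(self, owner, payload):
        self.owner = owner
        self._payload = payload

    def read(self, who):
        if who != self.owner:
            raise PermissionError(f"{who} does not own this message")
        return self._payload


class Kernel:
    """Toy stand-in for the kernel calls that move messages."""

    def __init__(self):
        self.queues = {}

    def send(self, sender, receiver, message):
        if message.owner != sender:
            raise PermissionError("only the owner may send a message")
        message.owner = receiver  # ownership moves; it is never shared
        self.queues.setdefault(receiver, deque()).append(message)

    def receive(self, receiver):
        # Raises KeyError or IndexError if nothing is waiting.
        return self.queues[receiver].popleft()
```

Once `send` returns, the sender can no longer read the buffer, so a process that fails later cannot corrupt data another process is already using.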

A different approach to dynamic reconfiguration is taken by the QNX RTOS. Device drivers aren't kernel processes. Instead, they run in user space. This makes it easy to start up and kill device drivers. A watchdog program can detect the removal or insertion of a card and automatically run or kill the driver process. Unlike operating systems that link drivers into the kernel, it doesn't make the user rebuild the kernel and reboot the system when adding a driver.
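The core of such a watchdog is a comparison between the cards that are present and the drivers that are running. The sketch below shows only that decision step. It is not QNX code, and the `reconcile` function and card names are hypothetical. A real watchdog would go on to launch or terminate the driver processes.

```python
def reconcile(present_cards, running_drivers):
    """Decide which driver processes to start and which to kill.

    present_cards:   names of cards currently detected in the chassis
    running_drivers: names of cards that have a driver process running
    """
    present = set(present_cards)
    running = set(running_drivers)
    to_start = sorted(present - running)  # card inserted, no driver yet
    to_kill = sorted(running - present)   # card removed, driver left over
    return to_start, to_kill


# Hypothetical snapshot: the serial card was pulled, a disk card was added.
start, kill = reconcile(
    present_cards=["net0", "disk0"],
    running_drivers=["net0", "serial0"],
)
print("start:", start)
print("kill:", kill)
```

Because the drivers are ordinary user-space processes, acting on the two lists needs nothing more than the normal process start and kill services.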

Processor cards can be changed during operation with the QNX Neutrino kernel. Just place the processors on separate hot-swap interfaces, such as CompactPCI cards, and make sure there's a way to bootstrap the new card's kernel. Neutrino also supports symmetric multiprocessing (SMP) using Intel processors.

Hot-swap board designs are changing and developing every day. Growth in data access and exchange across computer networks and the Internet has driven up the need for high-availability and high-performance servers. Those servers are leading a development of new interfaces that will support hot swapping. Computer I/O, particularly data storage, is one area pushing hot-swap board designs ahead.

The Intelligent I/O initiative, or I²O, was among the first attempts to simplify I/O device connectivity by orders of magnitude. The actions of the operating system would be decoupled from those of the device providing the data. By separating the operating system's abstract I/O requests from the physical execution of the request, I²O made it easier to design a hot-swap interface that didn't impact the OS directly. The specification defined an I²O embedded processor, in this case the Intel i960, along with an RTOS that handled the details for card insertion and removal.

Any processor and bus interface might be applied, however, because I²O doesn't define the I/O hardware architecture. Its initial design assumed that the PCI bus was the data-transport mechanism. The bus then became the I/O bottleneck.

Other I/O architectures then entered the scene, including Next Generation I/O (NGIO) and Future I/O. Both hope to define a low-latency serial architecture for PC server I/Os at data rates of 1 Gbit/s. They envision a network of point-to-point serial connections between the various devices and the operating system, in which each device gets 100% of the bandwidth. These connections will be controlled by data switches.
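A little arithmetic shows why dedicated links are attractive. The sketch below assumes the worst case, in which every device is busy at once and a shared bus divides its bandwidth equally. It also assumes the switch can carry all links at full rate. The 32-bit, 33-MHz PCI figure is the bus's theoretical peak, and the device count is invented.

```python
def per_device_bandwidth(link_rate_bits, devices, shared):
    """Worst-case bandwidth per device when every device is busy."""
    if devices < 1:
        raise ValueError("need at least one device")
    # A shared bus splits its rate; a point-to-point link is not divided.
    return link_rate_bits / devices if shared else link_rate_bits


PCI_PEAK = 32 * 33_000_000   # 32-bit, 33-MHz PCI: 1.056 Gbit/s peak, shared
SERIAL_LINK = 1_000_000_000  # 1 Gbit/s per link, the figure quoted in the text

bus_share = per_device_bandwidth(PCI_PEAK, 8, shared=True)       # 132 Mbit/s
link_share = per_device_bandwidth(SERIAL_LINK, 8, shared=False)  # 1 Gbit/s
```

With eight busy devices, each one sees roughly an eighth of the shared bus but the whole of its own serial link.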

Next-generation PCI designs, such as PCI-X, also are vying for a role in fast I/O. They have the advantage of existing hot-swap designs. The PCI local bus has basically become the norm in all PC-based platforms. It has a strong following in applied computing as well.

This popularity rests on PCI's processor independence, low-pin-count interface, and scalability up to 64-bit I/O performance. It provides today's most popular connectivity standard for a variety of peripheral devices. Motorola, for example, supports the PCI hot-swap version 2.1 specification on cards like its CPV5350 (Fig. 3).

Even while preserving these features and its backward compatibility, PCI keeps evolving. Version 2.2 includes hot-plug capability that's implemented on the host, making most existing PCI cards capable of insertion and removal without shutting down the host platform.

Lately, at least on paper, InfiniBand has looked like the hot-swap technology of the future. InfiniBand is a channel-oriented, switched-fabric I/O architecture built on serial point-to-point links for high-performance and high-availability data access. It derives high performance from an I/O engine that's coupled directly to host memory. Shared-bus architectures are replaced by a fabric of switchable point-to-point links.

This approach removes the CPU from the I/O subsystem. The CPU can then communicate with peripherals asynchronously. The I/O channel engine is responsible for moving data to and from main memory. By functioning as a switch rather than a shared bus, the fabric enables point-to-point links to scale with improvements in the performance of CPUs, memory, and peripheral devices.

For increased reliability and a better basis for hot-swap approaches, InfiniBand supports separate fault domains for the CPU complexes and I/O units. At the same time, it handles reliable connection mechanisms, data integrity, and fault tolerance. The failure of any unit in the fabric doesn't impact the remaining nodes. The first InfiniBand-compliant products should begin to appear during 2001.

Hot swapping will never be a requirement for all embedded systems. Even with standardization efforts, a greater array of solutions, and the lower costs that come with more design alternatives, it won't become a staple. There will always be those systems that don't have the availability or live upgradability requirements to justify it.

But as costs come down and technologies become more reliable, more and more systems will be able to use hot-swap standards. Hardware interfaces are gradually being defined for recognizing the insertion and removal of cards. To start meeting these specifications, vendors are building products (see "The Route To High-Availability Networking," p. 100).

CompactPCI will clearly take the forefront in hot-swap designs, with an established and popular interface and working products that support the specification. Nonetheless, other standards are emerging, especially for mission-critical I/O applications. The result will be a solidification of hot-swap standards, better software support, and a greater variety of implementations for building high-availability and fault-tolerant systems.

Companies Mentioned In This Report
Enea OSE Systems AB, +46 (0) 8 507 140, www.ose.com
InfiniBand Trade Association Administration, (503) 291-2565, www.infinibandta.org
Intel Corp., (508) 756-8080, www.intel.com
Motorola Computer Group, (800) 759-1107, http://services.mcg.mot.com
Performance Technologies Inc., (716) 256-0200, www.pt.com
PCI Interface Card Manufacturers Group, c/o Virtual Inc., (781) 246-0500, www.virtualmgmt.org
QNX Software Systems Ltd., (613) 591-0931, www.qnx.com