PCI Scales New Heights With Switch-Fabric Interconnects

Choose a transparent-bridge mode for legacy applications, or a path-routing gateway function for advanced systems.

April 29, 2002


PCI and CompactPCI (cPCI) technology and systems are pervasive across a wide range of applications. But as these applications evolve, the limitations of both technologies are surfacing. At the same time, the investment in them can't be written off. Consequently, designers who push PCI and cPCI are running into limited scalability, a lack of Quality of Service (QoS) support, and high-availability problems.

Luckily, new switched-interconnect technologies take PCI to another level. One example is StarFabric, a scalable, universal switch fabric that addresses the unique requirements of communication engineers designing for next-generation data, voice, and video networks.

The need to scale systems is driven by telecommunication carriers' requirement to control expenses. Besides system-acquisition prices, rack-space cost and the cost to manage systems are gaining importance to carriers. All types of carriers are placing a premium on rack space, forcing system designers to increase port densities in their systems.

Similarly, as systems grow from one chassis to multichassis, the total package must look like a single unit to the management system. Current PCI-bridged solutions create both complex hierarchies and a significant increase in latency, which limits the ability to scale sufficiently. But a switched-interconnect solution eliminates this problem.

The public switched-telephone network (PSTN), based on time-division multiplexing (TDM), clearly won't go away anytime soon. Therefore, all kinds of systems must sustain both TDM and packet traffic. Today's solutions require separate TDM and packet buses, and each bus must scale along with the system.

But scaling is becoming a challenge. One approach is to combine the buses into a switched-interconnect technology, such as StarFabric, then send native TDM traffic and packet traffic over one switch fabric—with the TDM voice traffic having a higher priority. Plus, the 8-kHz timing reference can go through the fabric to retain synchronization.

In the telecom world, five-nines availability has always been the rule. But in the packet world, five-nines was solely a PowerPoint concept. As the packet and TDM worlds "converge," though, the five-nines goal is becoming mandatory. Frequently, five-nines is just table stakes, with the real availability requirement being six-nines or more. If systems with separate TDM and packet buses are ever to meet the availability requirement, they will need a redundant bus for each. That means four buses within the system.
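To put the "nines" in perspective, the short calculation below converts an availability target into allowed downtime per year. It is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes an average year of 365.25 days.

```python
# Allowed downtime per year for an availability target of N nines.
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Return the downtime budget, in minutes per year, for N nines."""
    unavailability = 10.0 ** (-nines)  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (5, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Five-nines works out to a little over five minutes of downtime per year, and six-nines to roughly half a minute.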

The cost and complexity involved is substantial. A more elegant solution is a single fully redundant switched-interconnect topology that provides the inherent QoS treatment of the TDM and packet traffic.

To meet these challenges, StarFabric technology is being designed into a plethora of communication and distributed-computing applications, ranging from voice-over-packet media gateways and optical edge routers to video servers and medical imaging platforms. All require the low cost and high scalability of a switched-interconnect technology. The built-in support for QoS, native TDM voice traffic, high availability, and loosely coupled distributed computing with efficient memory-to-memory communication makes it ideal for many compute- and data-intensive applications.

There are two initial silicon components using StarFabric technology. One is a high-throughput switch (SG1010) that supplies 30-Gbit/s switching capacity with six ports. The other is a PCI-to-StarFabric bridge (SG2010) that interfaces 64- or 32-bit PCI buses (operating at 66 or 33 MHz) to StarGen's switch fabric.

The fundamental physical-layer interconnect for the initial components is a 622-Mbit/s differential pair with a 400-mV swing. Each port consists of four of these pairs in each direction, or about 2.5 Gbits/s each way, yielding an aggregate full-duplex bandwidth of roughly 5 Gbits/s. This accommodates the bandwidth required by next-generation communications equipment. For the initial components, two of these ports can be bundled to provide a 10-Gbit/s "fat pipe" between endpoints.
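The quoted figures can be checked with a few lines of arithmetic. The sketch below assumes that "aggregate" means both directions added together, which is the reading that matches the 5-, 10-, and 30-Gbit/s numbers. The constant and function names are illustrative only.

```python
# Figures taken from the text; "aggregate" is assumed to mean both directions.
PAIR_RATE_MBITS = 622       # one differential pair
PAIRS_PER_DIRECTION = 4     # pairs per port, each way
DIRECTIONS = 2              # transmit plus receive
SWITCH_PORTS = 6            # ports on the initial switch


def port_bandwidth_gbits() -> float:
    """Aggregate (both directions) bandwidth of one port in Gbits/s."""
    return PAIR_RATE_MBITS * PAIRS_PER_DIRECTION * DIRECTIONS / 1000.0


def bundled_bandwidth_gbits(ports: int = 2) -> float:
    """Bandwidth of a 'fat pipe' built from several bundled ports."""
    return ports * port_bandwidth_gbits()


def switch_capacity_gbits() -> float:
    """Total capacity of a switch with all of its ports active."""
    return SWITCH_PORTS * port_bandwidth_gbits()


print(f"port:    {port_bandwidth_gbits():.2f} Gbits/s")
print(f"bundle:  {bundled_bandwidth_gbits():.2f} Gbits/s")
print(f"switch:  {switch_capacity_gbits():.2f} Gbits/s")
```

The results (about 4.98, 9.95, and 29.86 Gbits/s) round to the 5-, 10-, and 30-Gbit/s figures quoted for the port, the bundled pair, and the six-port switch.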

Depending on the implementation, a switched-interconnect system can comprise any number of components. This gives designers the freedom to build a variety of systems, from very small-scale systems to very large-scale systems with hundreds of endpoints.

The PCI-to-StarFabric bridge is a multifunction device. It incorporates both a familiar transparent-bridge function and a fabric native-gateway function (Fig. 1). Several dimensions of flexibility exist within StarFabric, and as engineers begin their next platform designs, they must weigh some design tradeoffs. Two stand out: transparent versus nontransparent bridging, and pure address routing versus path routing for communicating across the fabric. Both choices are enabled by the dual-function nature of the initial PCI-to-StarFabric bridge.

Legacy-Mode Applications: When employing the transparent-bridge function, the PCI-to-StarFabric bridge looks like a standard PCI-to-PCI bridge to software, drivers, the existing BIOS, and operating systems. When this function is used in all PCI-to-StarFabric bridges throughout the system, existing legacy software, including drivers and configuration code, can run without changing one bit. In this legacy mode, called address routing, the system has one flat global address space. All devices in the system are visible and accessible through their unique address ranges.

The tradeoff here is that the advanced features of StarFabric, such as path redundancy and QoS, can't be realized. But this transparent-bridge mode gives the system designer a fast time-to-market solution.

For the initialization process, one processing node must be established as the root. This can be accomplished either through strapping by the system designer, or by an election process at the hardware level. When a system is powered on, the hardware automatically initiates numerous procedures.

First, all nodes (StarFabric devices) attempt to synchronize with their link partners by initiating traffic. Once synchronization has been established, an enumeration "storm" occurs. The silicon carries this out automatically. At the end of this "storm," each node in the fabric has a unique fabric identification number and a stored path from itself back to the root node.

At this point, the BIOS and operating system begin device discovery. To the software, all PCI-to-StarFabric bridges and StarFabric switches look like PCI-to-PCI bridges. In the discovery process, a system map, or a hierarchy of PCI bus segments, is returned. Figure 2 shows an example of a simple fabric topology and the resulting PCI hierarchy established by the discovery process. At this point, the normal process of device resource allocation is performed. System communication is now possible via standard PCI address methods.
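The discovery step can be pictured as the usual depth-first PCI bus-numbering walk, in which every bridge receives a primary, secondary, and subordinate bus number. The sketch below is a simplified model, not StarGen's firmware. The `Bridge` class, the `enumerate_buses` function, and the topology (a root bridge, one switch, and two leaf bridges) are hypothetical and are not taken from Figure 2. Each StarFabric device is modeled as a single PCI-to-PCI bridge.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bridge:
    """A device that software sees as a PCI-to-PCI bridge (hypothetical model)."""
    name: str
    children: List["Bridge"] = field(default_factory=list)
    primary: int = -1       # bus on the upstream side
    secondary: int = -1     # bus directly behind the bridge
    subordinate: int = -1   # highest bus number reachable behind it


def enumerate_buses(bridge: Bridge, primary: int, next_bus: int) -> int:
    """Number buses depth-first; return the next unused bus number."""
    bridge.primary = primary
    bridge.secondary = next_bus
    next_free = next_bus + 1
    for child in bridge.children:
        next_free = enumerate_buses(child, bridge.secondary, next_free)
    bridge.subordinate = next_free - 1
    return next_free


# Hypothetical fabric: root bridge -> switch -> two leaf bridges.
leaf_a = Bridge("leaf_a")
leaf_b = Bridge("leaf_b")
switch = Bridge("switch", [leaf_a, leaf_b])
root = Bridge("root", [switch])

bus_count = enumerate_buses(root, primary=0, next_bus=1)
for b in (root, switch, leaf_a, leaf_b):
    print(b.name, b.primary, b.secondary, b.subordinate)
print("bus segments:", bus_count)
```

The subordinate number tells each bridge which range of buses lies behind it, which is what lets configuration software reach every device in the tree.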

Fortunately, it takes absolutely no changes in software to establish and utilize this system. The existing BIOS firmware and operating system are used without modification. In this example, there are now five PCI bus segments, which could each have five to eight devices in one monolithic system—all communicating through one address space.
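In a flat address space, reaching a device comes down to matching the target address against non-overlapping windows, much as a PCI-to-PCI bridge forwards a transaction when the address falls between its base and limit. The sketch below illustrates that idea only. The device names and address ranges are invented for the example.

```python
from typing import Optional

# Hypothetical, non-overlapping memory windows in one flat address space.
# Each entry is (base, limit), both inclusive.
DEVICE_WINDOWS = {
    "ethernet": (0x8000_0000, 0x8000_FFFF),
    "dsp_card": (0x8001_0000, 0x8004_FFFF),
    "disk_ctl": (0x8005_0000, 0x8005_0FFF),
}


def decode(address: int) -> Optional[str]:
    """Return the device whose window contains the address, if any."""
    for device, (base, limit) in DEVICE_WINDOWS.items():
        if base <= address <= limit:
            return device
    return None  # no device claims this address


print(decode(0x8001_2000))
print(decode(0x9000_0000))
```

Because the windows never overlap, an address alone is enough to identify the destination, which is why existing drivers need no knowledge of the fabric.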

This mode enables lots of real-world applications. One is supporting a remote chassis of PCI expansion slots for an existing server or workstation that's currently limited by the number of slots in its chassis. Because StarFabric nodes can be connected with five meters or more of CAT 5 cable, a compute chassis can be in a separate enclosure, remote from its noisy disk drives—possibly in an adjacent room.

Advanced Applications: As designers look to build larger and more diverse systems with StarFabric, like communication platforms, the pure-legacy, address-routing mode becomes a limitation. The hierarchical system map requires all traffic to flow through the root of the tree (Fig. 2, again). Also, legacy PCI address-routing methods don't take advantage of all the interconnects available between StarFabric nodes. For larger systems, traffic must first flow up to the root/host node, then down to its destination.
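The cost of routing through the root is easy to see on a small example. The sketch below compares the route that follows the tree (up to the root, then down) with the shortest route through the physical links, found by a breadth-first search. The topology, the node names, and both function names are hypothetical. The direct link between the two switches stands in for an interconnect that the tree view leaves unused.

```python
from collections import deque

# Hypothetical physical links; S1-S2 is a link the PCI tree does not use.
LINKS = {
    "R": ["S1", "S2"],
    "S1": ["R", "S2", "A"],
    "S2": ["R", "S1", "B"],
    "A": ["S1"],
    "B": ["S2"],
}
# Parent of each node in the PCI hierarchy (R is the root).
PARENT = {"S1": "R", "S2": "R", "A": "S1", "B": "S2"}


def path_to_root(node):
    """List the nodes from the given node up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_route(src, dst):
    """Route that climbs to the root, then descends to the destination."""
    up = path_to_root(src)
    down = path_to_root(dst)
    return up + down[-2::-1]  # skip the root, which already ends 'up'


def shortest_route(src, dst):
    """Breadth-first search over all physical links (assumes dst is reachable)."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for neighbor in LINKS[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    route = []
    node = dst
    while node is not None:
        route.append(node)
        node = previous[node]
    return route[::-1]


print("via root:", tree_route("A", "B"))
print("shortest:", shortest_route("A", "B"))
```

In this toy fabric the tree route takes four hops while the direct route takes three, and the gap would widen as more levels are added.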

Many developers will want to benefit from features that eradicate these limitations, and others that provide QoS, high-availability, and address/device isolation at each node for distributed-computing applications. For these reasons, it may be desirable to move to path routing and use the gateway function in the PCI-to-StarFabric bridge.

Think of the gateway function within the bridge as an embedded PCI-to-PCI bridge. With it, devices, including processors, can be "hidden" behind the bridge. Each processing element can have its own address space and control its own devices. Communication across the fabric takes place over connections set up at system initialization. Connections specify the route through the fabric from a device to each of the other devices with which it needs to communicate.

In fact, multiple paths from every source to each destination can be stored. A primary path is normally used, but in case of a problem somewhere on that path, a secondary path specification can be employed. This high-availability feature can be set to either fail-over from the primary to secondary route automatically in silicon, or to fail-over with software intervention. In either case, event notification is supplied to an event handler whenever a path status changes within the fabric.
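The failover behavior described above can be modeled in a few lines. This is a conceptual sketch only. It does not represent the silicon or any StarFabric software interface, and `Connection`, `fail_over`, and `on_primary_path_failure` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Connection:
    """A path-routed connection with a primary and a secondary route."""
    primary: List[str]
    secondary: List[str]
    auto_failover: bool = True   # True models automatic failover in silicon
    active: str = "primary"

    def route(self) -> List[str]:
        """Return the route currently in use."""
        return self.primary if self.active == "primary" else self.secondary


def fail_over(conn: Connection) -> None:
    """Switch the connection to its secondary route."""
    conn.active = "secondary"


def on_primary_path_failure(conn: Connection,
                            notify: Callable[[Connection], None]) -> None:
    """Handle a failed primary path; the event handler is always notified."""
    if conn.auto_failover:
        fail_over(conn)
    notify(conn)


events = []
automatic = Connection(["A", "S1", "B"], ["A", "S2", "B"])
on_primary_path_failure(automatic, events.append)
print(automatic.route(), len(events))
```

With `auto_failover` set to `False`, the connection stays on its primary route until software calls `fail_over`, which corresponds to the software-intervention option.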

With path routing, the system boot-up process picks up where address routing left off. Once the initial enumeration "storm" has established the unique fabric identification for each fabric node, the BIOS and operating system create the PCI hierarchy view of the system, and the boot process continues. At this point, a software component called the Fabric Programmers Library (FPL) is invoked, and the path-routed connections desired by the system designers are established. The FPL offers a wide range of capabilities.

Path routing lets designers build highly capable distributed-computing systems. It supports sophisticated schemes for handling system events and interrupts. The system can be devised to control devices and locate drivers anywhere in the system. Any processor can be set up to handle interrupts, with various processors working with interrupts from different devices.

Only a local memory read is necessary to identify an interrupt source. In this environment, the system knows exactly where these interrupts came from, and sometimes, how to deal with them. In a large-scale system, this interrupt-source identification alone provides two to three orders of magnitude lower latency for interrupt handling than a conventional system.

Moreover, developers can build priority queues for I/O interrupts—in contrast to the simple nonprioritized interrupt vectors of a conventional system. This ability to map and route interrupt priorities permits developers to implement interrupts in new ways.
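A prioritized interrupt queue of this kind can be sketched with a binary heap. The example is illustrative rather than a description of StarFabric's mechanism. The class name, the priority numbers, and the interrupt sources are all assumptions, with a lower number meaning a higher priority.

```python
import heapq
import itertools


class InterruptQueue:
    """Serve pending interrupts by priority, then in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps FIFO order

    def post(self, priority: int, source: str) -> None:
        """Queue an interrupt; a lower number means a higher priority."""
        heapq.heappush(self._heap, (priority, next(self._arrival), source))

    def next_interrupt(self) -> str:
        """Remove and return the most urgent pending interrupt source."""
        _, _, source = heapq.heappop(self._heap)
        return source

    def __len__(self) -> int:
        return len(self._heap)


queue = InterruptQueue()
queue.post(2, "disk")
queue.post(0, "tdm_framer")
queue.post(2, "ethernet")
queue.post(1, "timer")

while len(queue):
    print(queue.next_interrupt())
```

The arrival counter ensures that two interrupts of equal priority are served in the order they were posted.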

Also, in-band interrupt schemes are far more efficient than PCI's out-of-band approach. For example, if a SCSI device sends an interrupt to a processor in a PCI system, the data might not have completely cleared the bus by the time the processor fields the interrupt. But in StarFabric, interrupts can travel in-band with the data, so data associated with the interrupt can be pushed ahead of it into memory, where the processor can use it immediately. Depending on the system map, the processor may not have to read devices at all, but just perform a write that clears the interrupt.

The path-routing operating mode permits the system to attain real QoS. Seven classes of service are supported by the StarFabric protocol: asynchronous, high-priority asynchronous, isochronous, high-priority isochronous, multicast, special, and provisioning.

A credit-based flow-control algorithm moves traffic through the system only when sufficient buffer space exists at the next node. Frames never need retransmission unless an error occurs. Congestion within one class, say asynchronous, won't block other more important classes, like isochronous. The programming interface grants system designers the flexibility to reallocate the credits of every class based on the specific application.
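The essence of per-class credit-based flow control fits in a short model. The sketch below is a generic illustration and not the StarFabric protocol itself. The `CreditedLink` class, its methods, and the credit counts are hypothetical. One credit represents one free buffer at the next node for a given class of service.

```python
from collections import deque


class CreditedLink:
    """Toy model of one link with separate credits per traffic class."""

    def __init__(self, credits_per_class):
        self.credits = dict(credits_per_class)
        self.pending = {cls: deque() for cls in credits_per_class}

    def send(self, cls, frame) -> bool:
        """Transmit if the next node has buffer space for this class."""
        if self.credits[cls] > 0:
            self.credits[cls] -= 1
            return True
        self.pending[cls].append(frame)  # wait for a credit; nothing is dropped
        return False

    def return_credit(self, cls):
        """The next node freed a buffer: release a waiting frame if there is one."""
        if self.pending[cls]:
            return self.pending[cls].popleft()  # the frame uses the credit at once
        self.credits[cls] += 1
        return None


link = CreditedLink({"asynchronous": 1, "isochronous": 2})
print(link.send("asynchronous", "a1"))   # uses the only asynchronous credit
print(link.send("asynchronous", "a2"))   # queued, class is out of credits
print(link.send("isochronous", "i1"))    # unaffected by the other class
print(link.return_credit("asynchronous"))
```

Because each class draws on its own credit pool, a backlog of asynchronous frames never consumes the buffers reserved for isochronous traffic. Reallocating credits between classes amounts to changing the initial counts.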

Aside from pure-legacy mode or path-routed systems, the technology handles hybrid systems of address and path routing. This way, designers can maintain the advantages of legacy-code support for many system functions and add the benefits of path routing where appropriate.

In summary, switch-fabric technology gives system designers a wide range of options and flexibility. They can weigh the ease of design and software reuse of address routing against the robust feature enhancements of path routing. Many will take an evolutionary approach where address routing and path routing coexist, with the percentage of each varying over time.