Next-Generation Interconnects Will Drive Multiprocessing

Since the days of the early serial mainframes, parallel buses have dominated multiprocessing (MP) connections. But now, parallel MP buses will be edged out by very high-speed, pseudoserial switch fabrics. These will provide a more flexible, scalable, and reliable interconnect than standard, multidrop parallel buses. Memory-bus-to-memory-bus transactions will become packetized and will be put into specialized protocols to compete for networked bandwidth.

Switch fabrics like InfiniBand aren't only for linking system and networked servers. Expect to see them also at the processor chip level and at the board level on up, because switch fabrics aren't just the domain of system vendors. After all, it was the microprocessor chip vendors—companies like Intel and Motorola—that pushed packetized, pseudoserial bus connections. Intel launched Next Generation I/O (NGIO), the channelized, pseudoserial switch fabric that eventually mutated into InfiniBand. Furthermore, it was Motorola, with Mercury Computer Systems and others, that created RapidIO, a chip- and board-level switch fabric.

This shift to serialized, switch-fabric connectivity involves more than simply changing bus technology. For one thing, MP hardware systems will take on the characteristics of networked systems implementing protocol stacks of processing.

For another thing, connections will become more dynamic and scalable, enabling designers to easily add another compute or storage resource to their system. Plus, switch fabrics will help integrate computer and datacom/telecom systems. For the first time, both will use the same kind of serialized interconnect, differing only in the protocol stack that each implements. These emerging high-speed pseudoserial interconnect systems include:

InfiniBand: a pseudoserial switch fabric for system-area networks (SANs). It supports both switch and router functions, and it can be deployed at the intrasystem, subsystem, and chassis levels. InfiniBand supports wide MP systems, but it may not be suitable for systems that require a real-time response. The specification is now complete, and silicon and software will be out in 2001.
RapidIO: a pseudoserial switch fabric supporting processor-, board-, and chassis-level interconnects. MP vendors, like Mercury and Sky Computer, back RapidIO as the next-generation MP switch fabric for processors like Motorola's G4. This specification is under development.
StarGen: a pseudoserial, high-speed switch fabric that supports board- and chassis-level interconnects. It provides a high-speed alternate bus that serves as a high-speed virtual PCI. StarGen enables MP systems with traditional PCI to move large amounts of data between nodes. Silicon and software will be available in the first quarter of next year.
GigaBridge: a pseudoserial, high-speed bus architecture that supports board- and chassis-level interconnection at the PCI level. It provides a virtual PCI interconnect that links nodes with a PCI bus connection. GigaBridge employs a dual, counter-rotating ring to connect nodes. It adds more data bandwidth for traditional PCI-based servers and edge computers. Silicon and software will also be out in the first quarter of next year.

The essential characteristics of these new switch-fabric interconnects are scalability and flexibility. These switch fabrics are free from the limits of multidrop parallel buses. Unlike a parallel bus, a switch fabric doesn't have a fixed bandwidth or connection limits. It's made up of endpoints that connect through a fabric of interconnected node switches. These node switches are intelligent crossbars that link port-in message traffic to port-out message traffic. Additionally, each node switch with n ports has a built-in bandwidth limit of n/2 ports multiplied by the amount of bandwidth per port. Switch fabrics aren't new to MP systems. Both RACEway and SKYchannel delivered parallel switch fabrics (see "Early MP Switch Fabrics," p. 84).

But unlike a parallel bus, the fabric is extensible. If you need more bandwidth, just add more paths and switches. The fundamental limit on a switch fabric, though, is the aggregate bandwidth of its endpoints surrounding the fabric and the accumulated latency generated when moving through each switch node to the addressed endpoint.

Recognizing the move toward a networked world, developers of the InfiniBand switch fabric chose to create an interconnection technology for networked systems (Fig. 1). This switch fabric goes beyond only supporting point-to-point dynamic connections between endpoints, or host CPUs and peripherals. It also supports both switches (transfers in a local subnet) and routers (transfers between different subnets). Therefore, InfiniBand can handle local, campus, and even wide-area connections between its endpoints. It's capable of very wide MP operations. Furthermore, InfiniBand drives up to 100 meters of cable. For larger WAN-like installations, however, communications links are needed between the InfiniBand segments.

But InfiniBand does more than only provide a network switch fabric with WAN-class connectivity. It ups processor and server I/O efficiencies too. InfiniBand adds channel capability to standard server processors.

Processors Offloaded

With InfiniBand, host processors can address their local and remote I/O via I/O channels. This regularizes the I/O interface and relieves the processors of as much as 90% of the I/O processing. In effect, it adds mainframe-like I/O control to offloaded I/O processing. These channels simplify I/O processing for MP applications, providing a regular mechanism for communicating with both peripherals and CPUs. They support both messaging and high-level channel "verbs" for accessing data between InfiniBand endpoints.

InfiniBand supports point-to-point communications between endpoints. These endpoints can be host processors, which link to the switch fabric via a host-channel adapter (HCA), or target hardware that links to the switch fabric via a target-channel adapter (TCA). Between the endpoints, the fabric is made up of switch chips. InfiniBand is a first-order fabric. It supports its host computers at the memory controller level, with the HCA right up against the controller, not removed as in a peripheral connection.

Additionally, InfiniBand delivers MP-class bandwidth. Built on bidirectional LVDS differential pairs, InfiniBand connections come in widths of one, four, or 12 bidirectional links. Each of these links runs at a signaling rate of 2.5 Gbits/s, delivering 500 Mbytes/s of raw bidirectional data bandwidth per link, 2 Gbytes/s for a four-wide link, and 6 Gbytes/s for a 12-wide link. InfiniBand transactions are made up of messages, which in turn are composed of packets. Each packet carries a data payload of up to 4 kbytes.
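The bandwidth figures above can be checked with a few lines of arithmetic. The sketch below assumes 8b/10b line coding (eight data bits for every 10 signaled bits), which the article doesn't state but which reconciles the 2.5-Gbit/s signaling rate with the 500-Mbyte/s figure. The function names are illustrative only.

```python
# Figures from the text: 2.5 Gbits/s signaling per link, 4-kbyte max payload.
SIGNALING_MBITS = 2500
MAX_PAYLOAD_BYTES = 4096

def link_bandwidth_mbytes(width, bidirectional=True):
    """Raw data bandwidth in Mbytes/s for a connection `width` links wide."""
    # Assumption: 8b/10b coding, so only 8 of every 10 signaled bits are data.
    data_mbits = SIGNALING_MBITS * 8 // 10
    per_direction = data_mbits // 8 * width  # bits -> bytes
    return per_direction * (2 if bidirectional else 1)

def packets_needed(message_bytes):
    """Number of packets needed to carry a message of the given size."""
    return -(-message_bytes // MAX_PAYLOAD_BYTES)  # ceiling division

for width in (1, 4, 12):
    print(f"{width}-wide: {link_bandwidth_mbytes(width)} Mbytes/s")
```

Under that assumption, the one-, four-, and 12-wide results match the 500-Mbyte/s, 2-Gbyte/s, and 6-Gbyte/s numbers quoted above.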

For MP applications, InfiniBand supports independent memory spaces for its nodes and host processors. Coordinating multiple processors running independently requires both memory protection and secure I/O at the processor level. Each processor needs to protect its memory from unwarranted intrusion, especially from external I/O. InfiniBand provides those protections. Access to host-computer memory is through "memory regions" registered by that memory's host CPU. This memory is addressed by other host computers or peripheral targets via an assigned protection key and a virtual address. InfiniBand also implements memory windows, which grant a more restricted RDMA access to a window within a memory region.
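To make the protection scheme concrete, here is a minimal sketch of a key-plus-address check for a region and for a narrower window inside it. The class and method names are hypothetical and don't come from the InfiniBand specification. The sketch shows only the idea that an access must present the right key and fall entirely within the registered range.

```python
from dataclasses import dataclass

@dataclass
class MemoryRegion:
    base: int    # first virtual address of the registered region
    length: int  # region size in bytes
    key: int     # protection key handed to permitted remote peers

    def allows(self, key, addr, size):
        """True if the key matches and [addr, addr + size) lies in the region."""
        return (key == self.key
                and self.base <= addr
                and addr + size <= self.base + self.length)

@dataclass
class MemoryWindow:
    region: MemoryRegion
    offset: int  # start of the window, relative to the region base
    length: int  # window size in bytes
    key: int     # separate key for the window (an assumption of this sketch)

    def __post_init__(self):
        # A window must sit entirely inside its parent region.
        if self.offset < 0 or self.offset + self.length > self.region.length:
            raise ValueError("window exceeds its memory region")

    def allows(self, key, addr, size):
        start = self.region.base + self.offset
        return (key == self.key
                and start <= addr
                and addr + size <= start + self.length)

region = MemoryRegion(base=0x1000, length=0x1000, key=0xABCD)
window = MemoryWindow(region, offset=0x100, length=0x80, key=0x1234)
print(region.allows(0xABCD, 0x1800, 256))  # inside the region, right key
print(window.allows(0x1234, 0x1000, 16))   # outside the window
```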

InfiniBand is designed for robust performance. It features virtual lanes, the multiplexing of a single link connection to multiple message streams for up flow. Also included are service levels for quality-of-service (QoS) implementations, plus a Queue Pair mechanism for endpoint-to-endpoint service, path mapping, and global IPv6 addressing. InfiniBand provides five types of transport service, ranging from Reliable Datagrams and Reliable Connections to Multicast and Raw Datagrams (it can encapsulate raw packets from other protocols).

On the negative side, InfiniBand might not be suited for many real-time MP applications. Its sophisticated, packetized connections have a latency downside. As packets pass through the switch nodes, a lot of processing needs to be done at the switch nodes in order to route packets, maintain packet reliability, and support QoS. One reason why InfiniBand is so systems-oriented is that its core developers include Compaq, Hewlett-Packard, IBM, and Sun Microsystems Inc.

From Words To Packets

Packetized, switched buses will replace parallel buses at the embedded-processor chip level for many applications. We're talking about replacing the processor's parallel system bus with a high-speed, pseudoserialized connection. For example, both Motorola and Intel are working on pseudoserial, packetized buses for their high-end processors. Motorola's version has mutated into RapidIO, an emerging switch fabric for embedded systems. But Intel has yet to release details on its internal work for the PC/server platform.

Motorola's engineers wanted a high-performance bus connection that took up less space, was easier to manage, and provided easy extensibility at the chip and board level. To do this, Motorola joined development forces with Mercury Computer Systems, the developer of the RACEway parallel switch fabric and a vendor of high-end MP systems. The result of that union is RapidIO (Fig. 2).

RapidIO is a low-latency, point-to-point connection, inside-the-box switch fabric for embedded systems. For board-level systems, it furnishes the glue to fit everything together with a high-bandwidth connection. RapidIO permits designers to easily interconnect processors, memory controllers, communications processors, bus bridges, and peripherals for building MP systems. Plus, designers can use RapidIO to implement transparent PCI-to-PCI bridging, as well as build a RapidIO backplane for RapidIO-based boards. Or, RapidIO can be deployed as an adjunct bus in a backplane system, using the spare backplane I/O pins.

RapidIO has the necessary bandwidth for next-generation processing. Using LVDS differential pairs, it samples data on both edges of the clock, with clock rates projected out to 1 GHz. Currently, RapidIO delivers 2-Gbyte/s bandwidth for its 8-bit version and 4 Gbytes/s for the 16-bit version. On a board, RapidIO can drive up to 30 in. of trace conductor, enough for board-level implementations and for a backplane.

This is a classic packet-based switch system, with multiple packets making up a RapidIO transaction message. A RapidIO message can have up to 16 packets, each carrying a 256-byte payload, or up to 4 kbytes of message data. RapidIO transactions include Reads, TLB synchronization, Atomic operations, Writes, Streaming Writes, Maintenance (configuration, control, status register read and write), Doorbell (in-band interrupt), and Response (Read or Write packet response). RapidIO implements control symbol packets for packet acknowledgements, and for passing flow control and maintenance information.
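The message limits above (16 packets of 256 bytes each, or 4 kbytes in all) can be illustrated with a short segmentation routine. This is a sketch of the sizing rule only. The function name is hypothetical, and real RapidIO packets also carry headers and control fields that aren't modeled here.

```python
MAX_PAYLOAD = 256  # bytes of payload per packet, from the text
MAX_PACKETS = 16   # packets per message, from the text

def segment_message(data: bytes):
    """Split message data into per-packet payloads of at most 256 bytes."""
    if len(data) > MAX_PAYLOAD * MAX_PACKETS:
        raise ValueError("message exceeds 4 kbytes")
    return [data[i:i + MAX_PAYLOAD] for i in range(0, len(data), MAX_PAYLOAD)]

payloads = segment_message(bytes(1000))
print(len(payloads), [len(p) for p in payloads])
```

A 1000-byte message, for example, occupies four packets: three full 256-byte payloads and one of 232 bytes.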

The switch fabric defines three flow-control mechanisms. These are Retry, issued by the receiver when a packet has errors or when the receiver can't perform the request; Idle, which enables a receiver to insert "wait states" into a flow; and Credit, a mechanism by which receivers maintain a buffer pool for each transaction flow, where the sender sends packets only when the receiver has buffer storage available.
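The Credit mechanism is the easiest of the three to show in code. In the sketch below, the sender holds one credit per free receiver buffer and sends only when it has a credit in hand. The class and method names are hypothetical. How RapidIO actually returns credits on the wire isn't described in the text, so the direct function call here merely stands in for that.

```python
from collections import deque

class Receiver:
    def __init__(self, buffers):
        self.free = buffers   # free buffer slots in the pool
        self.queue = deque()  # packets waiting to be consumed

    def accept(self, packet):
        if self.free == 0:
            raise RuntimeError("sender violated credit flow control")
        self.free -= 1
        self.queue.append(packet)

    def drain(self):
        """Consume one queued packet; return the number of credits freed."""
        if not self.queue:
            return 0
        self.queue.popleft()
        self.free += 1
        return 1

class Sender:
    def __init__(self, credits):
        self.credits = credits  # starts equal to the receiver's buffer count

    def try_send(self, receiver, packet):
        """Send only if a credit is available; report whether it went out."""
        if self.credits == 0:
            return False
        self.credits -= 1
        receiver.accept(packet)
        return True

    def add_credits(self, count):
        self.credits += count

rx = Receiver(buffers=2)
tx = Sender(credits=2)
results = [tx.try_send(rx, p) for p in ("a", "b", "c")]  # third is held back
tx.add_credits(rx.drain())                               # one buffer frees up
results.append(tx.try_send(rx, "c"))
print(results)
```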

For board-level and some chassis-level interconnects, RapidIO supports a memory-mapped, distributed memory (i.e., a single, if distributed, memory model). So, all nodes can belong to the same memory space and use memory-mapped I/O. This is open to use by standard cache-snooping mechanisms for MP operations.

But because RapidIO is packet based, it supports a noncoherent, nonshared memory model as well. Therefore, it can be used for MP systems that have multiple subsystems, each with its own nonshared memory spaces (NUMAs or nonuniform memory architectures). In cases like this, a RapidIO node can access a local space only through a message-passing interface controlled by that local memory's own node or subunit.

For large, global memory systems, RapidIO implements a directory-based coherency scheme, which supports up to 16 Coherency Domains or clusters. Every cluster memory controller is responsible for tracking its memory elements with its own local directory. This references the most current copy of every shared element and where the copies are located.
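A directory of this kind can be pictured as a table that records, for each shared element, which domain holds the most current copy and which domains hold other copies. The sketch below is a generic, simplified directory and not RapidIO's actual protocol. The names are hypothetical, and the write-back on a remote read is an assumption of the sketch.

```python
MAX_DOMAINS = 16  # coherency domains supported, from the text

class Directory:
    """Hypothetical local directory kept by one cluster's memory controller."""

    def __init__(self):
        self.entries = {}  # address -> {"owner": domain or None, "sharers": set}

    def _entry(self, addr, domain):
        if not 0 <= domain < MAX_DOMAINS:
            raise ValueError("unknown coherency domain")
        return self.entries.setdefault(addr, {"owner": None, "sharers": set()})

    def read(self, addr, domain):
        """Record a new sharer; return who supplies the data (None = memory)."""
        entry = self._entry(addr, domain)
        source = entry["owner"] if entry["owner"] != domain else None
        if source is not None:
            entry["owner"] = None  # assumed: owner writes back to home memory
        entry["sharers"].add(domain)
        return source

    def write(self, addr, domain):
        """Make `domain` the owner; return the copies that must be invalidated."""
        entry = self._entry(addr, domain)
        stale = entry["sharers"] - {domain}
        entry["sharers"] = {domain}
        entry["owner"] = domain
        return stale

d = Directory()
d.read(0x40, 1)
d.read(0x40, 2)
print(d.write(0x40, 3))  # domains whose copies are now stale
print(d.read(0x40, 1))   # domain that holds the current copy
```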

Building A Virtual PCI

InfiniBand and RapidIO define their own switch fabrics, each with its own endpoint interface. Another way to add switch-fabric connectivity and bandwidth for high-performance and MP systems is to piggyback onto an existing bus interface, that is, to build an interconnection subsystem that looks like a standard bus but adds new bandwidth capacity. The vendors of these new adjunct bus connections can build to an existing standard, PCI. The adjunct bus' underlying hardware can use any bus protocol or hardware implementation as long as it appears to be a PCI connection to the connecting endpoints.

But, MP systems utilizing these new adjunct bus connections will abide by PCI's processing and addressing rules. This means that a single host controls the address space and enumeration. Still, nontransparent bridging to PCI peripherals is an option, which supports complete subsystems with their own address space. These subsystems can operate independently, but they also share some limited PCI space to pass information back and forth.

Currently, two new adjunct bus systems are emerging that can add high-speed bandwidth to PCI-based systems: StarGen and GigaBridge. Both provide PCI-like interfaces to their client systems and replace PCI with their own underlying bus technologies.

StarGen is a classic pseudoserial switch fabric that has been molded into a Virtual PCI system (Fig. 3). It provides PCI connections, which are endpoints to its switch fabric. Made up of individual Star switch chips, the switch fabric provides an adjunct bus for moving high-speed data. This underlying bus is fully PCI-compatible, supports PCI bus operations, and requires few software changes aside from extending the PCI address space to the other endpoints. StarGen interfaces to all PCI bus varieties: 33 or 66 MHz, and 32 or 64 bits.

This Virtual PCI bus implementation targets telecom/datacom gateways, front-end media-access, and edge-router markets. This is an arena where systems are typically configured and then run for a long while before undergoing reconfiguration. Most changes for putting the system together or adding major new subsystems take place here. Thus, much of the system configuration and routing is known at least when the Virtual PCI adjunct bus is added. Therefore, the StarGen design team took a very different tactic for its switch fabric.

Instead of opting for a dynamically configurable system—that is, continually making dynamic point-to-point connections and routing them—they did the opposite. Their system accomplishes much of its routing on configuration, storing its routing information in its nodes (endpoint A's routing path to endpoint B). So, the StarGen system has its routing paths figured "down cold" after startup. This enables very efficient packet transaction routing and delivery because it's going over predefined paths, as well as simplified error recovery and maintenance.
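The idea of settling routes at configuration time can be sketched as a one-time path computation followed by plain table lookups. The code below uses a breadth-first search over a made-up topology to fill the table. StarGen's actual configuration procedure isn't described in the text, so the function name, the topology, and the choice of shortest paths are all assumptions.

```python
from collections import deque

def build_routes(links, endpoints):
    """Precompute one shortest path for every ordered pair of endpoints."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    routes = {}
    for src in endpoints:
        # Breadth-first search from src, remembering each node's predecessor.
        parent = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, []):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        # Walk predecessors back from each reachable destination.
        for dst in endpoints:
            if dst != src and dst in parent:
                path, node = [], dst
                while node is not None:
                    path.append(node)
                    node = parent[node]
                routes[(src, dst)] = path[::-1]
    return routes

# Hypothetical fabric: endpoints A, B, C joined by switch nodes S1 and S2.
links = [("A", "S1"), ("S1", "S2"), ("S2", "B"), ("S1", "C")]
routes = build_routes(links, ["A", "B", "C"])
print(routes[("A", "B")])
```

Once the table is built at startup, delivering a packet requires no route discovery, only a lookup of the stored path.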

StarGen's switch fabric deploys three routing mechanisms. These are Address, which provides full PCI compatibility; Path, which preroutes for efficient operation; and Multicast, which sets up grouping for communications. The fabric supports asynchronous, isochronous, multicast, and high-priority traffic. It employs a credit-based flow control, with messages forwarded only if enough buffer space (credit) is available at the next switch. QoS is supported by a number of switch features, including buffering by traffic class, a bandwidth reservation system for real-time traffic, and best-effort delivery.

A PCI-To-PCI Bridge

Many members of the StarGen management and design team became experienced with PCI buses by working for the PCI bridging group of Digital Equipment Corp., and they have built that PCI bridging experience into StarGen. In effect, the whole StarGen endpoint-to-endpoint connection (endpoint to switch node ... to switch node ... to endpoint) is a PCI-to-PCI bridge, although this is typically a bridge from a PCI system to a nontransparent PCI device, as the far PCI connection is usually a different PCI system.

The basic StarGen switch node supports six ports. Each port drives up to 5 Gbits/s of bandwidth, or 625 Mbytes/s, with half in each direction. To be fair, however, in a typical six-port switch situation, three ports each connect to the other three ports, forming three connections through the switch. This means that overall operational bandwidth through the switch will be 15 Gbits/s or 1.875 Gbytes/s. A multicast to five ports would up node bandwidth to 9.375 Gbytes/s. Every port is made up of four LVDS, bidirectional, 622-Mbit/s differential pairs and can drive up to five meters of unshielded copper cables. Additionally, these ports are hot-plug capable.
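The per-port and through-switch figures follow from the n/2 rule described earlier: with the ports paired off, a switch carries n/2 connections at once. The short sketch below reproduces only those two figures. It doesn't attempt to model the multicast case, and the function names are illustrative.

```python
PORTS = 6       # ports per StarGen switch node, from the text
PORT_GBITS = 5  # bandwidth per port in Gbits/s, both directions combined

def gbits_to_mbytes(gbits):
    """Convert Gbits/s to Mbytes/s (decimal units)."""
    return gbits * 1000 // 8

def switch_bandwidth_gbits(ports, port_gbits):
    """Bandwidth through a switch whose ports are paired into n/2 connections."""
    return (ports // 2) * port_gbits

per_port = gbits_to_mbytes(PORT_GBITS)
through_switch = switch_bandwidth_gbits(PORTS, PORT_GBITS)
print(per_port, through_switch, gbits_to_mbytes(through_switch))
```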

StarGen's port bandwidth exceeds PCI bandwidths. To take advantage of the StarGen fabric, a PCI-endpoint implementation needs multiple switch-node layers to multiplex the slower PCI ports into the faster StarGen ports.

StarGen really shines as a backplane itself. Its ports can be mapped to the CompactPCI backplane to form a full StarGen backplane with 24 ports on the J1 through J5 connectors, and a collective potential bandwidth of 45 Gbytes/s for a very high-performance data-transfer backplane for data-gathering and MP operations.

StarGen projects that its Star switch chips will be sampling by the second quarter of next year. Packaged in a 272-ball plastic BGA, every chip dissipates approximately 2.5 W. They will sell for less than $50 each in 1000-unit lots.

Virtual PCI As A Ring

GigaBridge also implements a Virtual PCI connection subsystem, but on a very different hardware base (Fig. 4). It builds on a dual counter-rotating ring, which has built-in fail-safe redundancy. This capability permits the hardware to lock out a failing endpoint on the ring and still continue functioning. Plus, at the board level, GigaBridge also supports PCI and CompactPCI hot-swapping. The offending board can be hot-swapped out and replaced, enabling the system to recover to full operation. Alternatively, GigaBridge could be deployed as an on-board technology to interconnect high-performance elements on a board.

GigaBridge is fully PCI compatible. It provides a Virtual PCI adjunct bus to systems using PCI. It acts like a PCI bridge and is compatible with existing PCI system software. Every ring node can drive up to four PCI bus slots.

Interestingly, GigaBridge is being fielded by PLX Technology, another PCI bridge chip vendor. Originally, GigaBridge was developed as the Sebring Ring, featuring a dual counter-rotating ring. Each of these rings clocks at 400 MHz, with 16-bit LVDS pairs. The basic media bandwidth is 800 Mbytes/s per ring. But unlike a parallel bus, which can only hold one active transaction on its hardware bus, GigaBridge can accept multiple transactions due to its ring structure.

A node on a ring can insert a transaction when the ring at that point is unoccupied. That transaction moves down the ring until it reaches its destination, where it's taken off the ring medium, or else it returns to its sender. If the system is designed so that the source and destinations on the ring are fairly close together, a source-to-destination transaction won't take much of the ring's resources, leaving room on the ring for multiple transactions. (Think of a ring as a circular plastic tube with tennis-ball transactions moving between node stations, except that these can be elongated tennis balls. The space between the tennis balls is open to accepting a new transaction, or a new tennis ball.)
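The tennis-ball picture can be turned into a simple model. Treat one ring as a set of numbered hops between adjacent nodes, and let each transaction occupy the hops from its source to its destination. Transactions can then share the ring as long as their hops don't overlap. This is a modeling assumption made for illustration, not a description of GigaBridge's actual arbitration, and the function names are hypothetical.

```python
def segments(src, dst, nodes):
    """Hops used by a transaction traveling one way from src to dst."""
    hops = (dst - src) % nodes
    return {(src + k) % nodes for k in range(hops)}

def can_coexist(transactions, nodes):
    """True if no two transactions need the same hop of the ring."""
    used = set()
    for src, dst in transactions:
        needed = segments(src, dst, nodes)
        if used & needed:
            return False
        used |= needed
    return True

# Eight-node ring: four short, neighborly transfers fit at the same time.
print(can_coexist([(0, 2), (2, 4), (4, 6), (6, 0)], nodes=8))
# One long transfer blocks another that overlaps its path.
print(can_coexist([(0, 5), (3, 6)], nodes=8))
```

In this model, short source-to-destination distances leave more of the ring free, which is the condition the article gives for holding several transactions at once.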

As a result, the ring can hold multiple transactions. PLX claims that the ring can support up to four simultaneous transactions, depending on how close the source nodes are to their destination nodes on the ring. Given the nature of many telecommunications and data-communications front-end applications, this may not be a bad assumption, as these tend to be data-flow applications, moving data through the system to be processed in a regular manner. Assuming that four concurrent transactions can be inserted on each ring (eight transactions for two rings), GigaBridge can deliver a 6.4-Gbyte/s bandwidth.

GigaBridge supports connections to 32-bit, 33-MHz or 66-MHz PCI. Assuming an average of four simultaneous PCI transactions per ring, the dual counter-rotating ring can support up to 48 32-bit, 33-MHz PCI connections. But PCI is relatively inefficient, enabling even more PCI connections on the ring, depending on system latency requirements. The GigaBridge system has built-in addressing to support up to 224 PCI bus segments with up to 896 PCI slots.
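The GigaBridge numbers are easy to reproduce. The sketch below assumes a peak rate of 133 Mbytes/s for 32-bit, 33-MHz PCI, the commonly quoted figure, which the article itself doesn't state. It takes the four PCI slots per ring node from the earlier description. The constant names are illustrative.

```python
RING_MBYTES = 800        # media bandwidth per ring, from the text
RINGS = 2                # dual counter-rotating ring
CONCURRENT = 4           # simultaneous transactions per ring (PLX's claim)
PCI_MBYTES = 133         # assumed peak of 32-bit, 33-MHz PCI
PCI_SEGMENTS = 224       # addressable PCI bus segments, from the text
SLOTS_PER_SEGMENT = 4    # PCI slots each ring node can drive, from the text

aggregate_mbytes = RING_MBYTES * RINGS * CONCURRENT
pci_connections = aggregate_mbytes // PCI_MBYTES  # whole connections only
pci_slots = PCI_SEGMENTS * SLOTS_PER_SEGMENT

print(aggregate_mbytes, pci_connections, pci_slots)
```

Under those assumptions, the results agree with the 6.4-Gbyte/s, 48-connection, and 896-slot figures given in the text.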

PCI connections to the ring, called nodes, can belong to a cluster, with up to 32 nodes sharing a cluster ID. Ring nodes are able to communicate over the ring with other nodes in their cluster. The only other communication taking place on the ring is to their nearest neighbor, the next node down on the ring. The ring clusters also support up to seven domains. These are separate PCI address spaces that support a separate PCI host, each with a protected 64-bit address space. This is good news for MP applications that require separate independent processor spaces.

Companies Mentioned In This Report
Compaq (800) 345-1518 www.compaq.com
Hewlett-Packard (800) 752-0900 www.hp.com
IBM (914) 499-1900 www.ibm.com
InfiniBand Trade Association (503) 291-2565 www.infinibandta.org
Intel Corp. (408) 785-8080 www.intel.com
Mercury Computer Systems (978) 256-1300 www.mc.com
Motorola Semiconductor Products Sector (512) 895-3131 www.mot-sps.com
PLX Technology (408) 774-9060 www.plxtechnology.com
RapidIO Trade Association www.RapidIO.org
Sky Computer (978) 250-1920 www.skycomputers.com
StarGen (508) 786-9950 www.stargen.com
Sun Microsystems Inc. (650) 960-1300 www.sun.com