InfiniBand is on its way to becoming a mainstream switch fabric. The key to implementing InfiniBand is its switch nodes, the elements that form the warp and woof of the interconnection fabric. The good news is that Red Switch is fielding an eight-port switch node that delivers 160 Gbits/s of interconnection bandwidth. It's a single-chip solution that integrates the switch-node logic with the link-port serializer-deserializer (SERDES) and the physical-layer (PHY) interface (Fig. 1).
An InfiniBand fabric consists of end nodes (the servers, peripherals, and subsystems at the edge of the fabric) and the switch nodes that make up the fabric itself. The switch nodes provide the routing paths through the fabric, connecting end nodes in dynamic point-to-point links. They are the core of the fabric. Each switch node is a concurrent switching mechanism that can support multiple connections. Together, the nodes provide dynamic links between end nodes, supporting flexible high-bandwidth transactions.
InfiniBand switch nodes implement the InfiniBand link and transport routing protocols. These nodes support high-speed serial LVDS connections running at 2.5 Gbits/s (full duplex, 2.5 Gbits/s in each direction). The connections can be bundled in 1X, 4X, and 12X configurations. The HDMP-2840 chip delivers the first 32-channel switch node that supports 1X and 4X bundled connections. A 4X bundle combines four 2.5-Gbit/s channels into one 10-Gbit/s InfiniBand port. The chip supports the SMA, PMA, and BMA protocols, and enables SNMP management.
The chip is a full InfiniBand switch node implemented in hardware state machines with supporting register sets. It incorporates the SERDES and PHY interfaces (Agilent IP) for single-chip deployment. The chip's logic runs with an internal 250-MHz clock, providing about 20 logic levels between clock edges. This very sophisticated hardware design implements the InfiniBand procedural definitions of the link protocol in register-transfer-level-defined logic and state machines. The four major blocks are:
Link-I/O ports: There are 32 I/O channels, configured as eight 4X-link ports. Each 4X-port set can be configured as four 1X links or as one 4X-link bundle. Each 4X set has its own 20-kbyte input buffer and a 5-kbyte output FIFO. Basic link-level protocol processing is handled in state-machine logic at the Link-I/O port.
Arbiter: The central control for the switch node. It acts like a bus arbiter, granting access to a requesting input node to transfer packets through the dynamic crossbar to an output node. It also handles packet routing and virtual-lane (VL) assignments.
Crossbar switch: The dynamic crossbar that connects all ports to one another, providing multiple concurrent connections. It also connects the ports to a management port and a test port.
Management port: It handles InfiniBand device management functions. Also, it includes a separate bus port to an optional external CPU, plus an I2C serial port. It provides access for a local management controller and supports subnet, performance, and baseboard management packets.
The chip can be deployed as InfiniBand switch nodes that make up the fabric core. Each switch node has a throughput delay of roughly 100 ns with a 95% throughput efficiency. So the fundamental limit, aside from Local ID (LID) addressing, is the amount of latency that a system can tolerate. Systems of 16,383 nodes can be implemented. Each switch node links to other switch nodes, or to end nodes. The link protocol is the same for both links.
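The per-hop figures above lend themselves to a quick back-of-the-envelope latency check. The following is a minimal sketch, assuming only the article's roughly 100-ns per-switch delay; the function and parameter names are illustrative, not part of any RedSwitch API:

```python
SWITCH_DELAY_NS = 100  # per-hop switch-node latency cited in the article

def fabric_latency_ns(switch_hops: int, link_delay_ns: int = 0) -> int:
    """Latency contributed by the switch nodes on a path, plus any
    per-hop link propagation delay the designer wants to model."""
    return switch_hops * (SWITCH_DELAY_NS + link_delay_ns)

# A packet crossing a three-stage fabric sees ~300 ns of switch latency.
print(fabric_latency_ns(3))  # 300
```

For a large fabric, the designer's question becomes how many switch hops the end-to-end path can tolerate, exactly as the text suggests.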
Another application is using the switch for a high-performance peripheral system backplane (Fig. 2). This approach is particularly effective for storage systems made up of multiple storage subsystems, like RAID, JBOD, and others.
The 4X ports can move data at 10 Gbits/s or 1.25 Gbytes/s between the switch and any storage peripheral, a bandwidth greater than any standard bus backplane. It can support concurrent transactions between peripherals, along with an I/O connection for intrasystem access to the peripherals.
An InfiniBand-based peripheral system would integrate a host CPU with its memory, and an HDMP-2840 switch-node chip with up to three storage peripheral subsystems and an I/O subsystem (four 10-Gbit/s links). This system supports high-bandwidth data transfers from any of the peripheral subsystems to an I/O subsystem at 10 Gbits/s.
Moreover, this I/O subsystem can support multiple I/O connections, such as SCSI, Ethernet, and FibreChannel, to external systems and networks. The CPU links to the switch node and peripherals via the node's CPU interface. Using multiple nodes, designers can build a multilayer switch backplane supporting more peripherals.
Or, designers can use these switch nodes to implement line-card backplanes. A single switch node supports up to eight line-card links, each with a 10-Gbit/s (1.25-Gbyte/s) peak bandwidth. More complex backplane switch fabrics can be built using multiple switch levels.
For serial switch backplanes, InfiniBand's advantage lies in building on an existing standard, rather than on a proprietary switch backplane. Better yet, this switch-fabric technology can be deployed at a higher level to link the line-interface box to the system or to other systems.
Fundamentally, a switch node acts like a multiported switch. Inputs come in, are buffered, and if okay, are sent to an output port for transmission to the next node in the input packet's addressing path. Although this sounds simple, it's complex. InfiniBand defines a switch fabric that enables one end node to address another end node, creating a node-to-node path through the fabric to the destination end node.
On initialization, fabric paths are defined between end nodes, with the intervening nodes building node tables. These tables route a packet to the appropriate output port, and then to the next node based on the packet's LID address for the target end node.
In the switch fabric, an incoming packet's destination LID is compared to the addresses in the routing table. If it matches one, the packet is routed to the appropriate output port via the node's internal switch fabric. If not, then the packet is dropped.
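The lookup-and-forward decision just described can be sketched in a few lines. This is an illustrative model only, assuming the routing table is a simple LID-to-port map; the names are not taken from the chip's documentation:

```python
def route_packet(dest_lid: int, routing_table: dict) -> "int | None":
    """Return the output port for a destination LID, or None to drop.

    routing_table maps destination LID -> output port, as built during
    fabric initialization. A miss means the packet is dropped.
    """
    return routing_table.get(dest_lid)

table = {0x0001: 2, 0x0002: 5}      # hypothetical LID -> port entries
assert route_packet(0x0002, table) == 5
assert route_packet(0x00FF, table) is None  # unknown LID: packet dropped
```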
More complexity comes in with InfiniBand's multicast—the transmission of a single message or packet to multiple recipients (group LID). To handle this, the individual switch node converts the multicast transaction packet to multiple unicast operations—one to each output port. These packets are sent to the downstream nodes, which also transform multicast to unicast packets (except for end nodes, which accept or reject it).
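The multicast-to-unicast conversion can be modeled as a fan-out over the group's member ports. A minimal sketch, assuming the node keeps a group-LID-to-port-set table (the table layout and names are assumptions for illustration):

```python
def expand_multicast(group_lid: int, multicast_table: dict) -> list:
    """Convert one multicast packet into per-port unicast sends.

    multicast_table maps a group LID -> set of output ports subscribed
    to that group. Returns one (port, group_lid) send per member port.
    """
    ports = multicast_table.get(group_lid, set())
    return [(port, group_lid) for port in sorted(ports)]

mc_table = {0xC000: {1, 3, 6}}  # hypothetical group with three member ports
assert expand_multicast(0xC000, mc_table) == [(1, 0xC000), (3, 0xC000), (6, 0xC000)]
```

Downstream switch nodes repeat the same expansion until the packets reach end nodes, which accept or reject them.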
Each I/O-link port supports both input and output operations. These can be concurrent, as the lines are full duplex. The basic line consists of a differential pair of 2.5-Gbit/s LVDS signaling (duplex) lines. Each port includes an elastic buffer between the physical and link layers. It handles data rate problems between transmitters and receivers with different clock rates, as well as skew between differential pairs.
The input buffers have three read ports, allowing three concurrent reads per clock cycle. This enables transfers of up to 96 bits (three 32-bit words) over the crossbar switch.
The Link-I/O port implements the basic link-level protocol in hardware. It does the basic packet processing, including packet decoding, packet checks (CRC, length, VL, buffer credits, and packet operand), up-front flow control (send and receive flow-control credit requests and responses), and packet assembly.
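The packet checks listed above can be illustrated with a simplified acceptance test. This sketch uses CRC-32 as a stand-in (InfiniBand's actual VCRC/ICRC polynomials differ), and the function signature is an assumption for illustration:

```python
import zlib

def check_packet(payload: bytes, stated_len: int, vl: int, crc: int) -> bool:
    """Accept a packet only if length, VL, and CRC fields all check out.

    CRC-32 stands in for InfiniBand's CRCs; VL15 is the management lane,
    and this chip supports data VLs 0-7.
    """
    if len(payload) != stated_len:
        return False                      # length-field mismatch
    if not (0 <= vl < 8 or vl == 15):
        return False                      # VL outside the supported set
    return zlib.crc32(payload) == crc     # integrity check

pkt = b"payload"
assert check_packet(pkt, 7, 0, zlib.crc32(pkt))
assert not check_packet(pkt, 6, 0, zlib.crc32(pkt))  # wrong stated length
```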
The Arbiter controls packet transfers. It consists of two parts: the Priority Selector and the Resource Allocator. The Arbiter handles packet-transfer requests and credit-update requests, processing each request and either sending it on as a grant or enqueuing it for later action.
The highest-priority request is passed on to the Resource Allocator. If the resources are available, the allocator grants the request to transfer a packet over the crossbar to its addressed output port. If the crossbar resources aren't available, the request is enqueued to wait for resources to free up.
In the Priority Selector, the request packet's destination address is translated into an output port number for the crossbar. If the address is for a multicast group, the logic will create unicast requests for processing.
The Routing Table provides the addressing information. The Table is filled during node initialization and contains all end-node-to-end-node routing, specifying the output port to route a packet to a specific end node.
Using the request's input port, assigned output port, and Service Level (in the InfiniBand packet), the logic specifies the output VL for the request.
InfiniBand supports up to 16 VLs (VL 15 is the management lane). The chip supports eight VLs, plus VL15. The VLs provide priority encoding for packets to share a common link bundle, and to transmit multiple packets in a TDM fashion on the link bundle by VL priority.
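The VL priority scheme can be illustrated with a simplified selector. InfiniBand's real arbitration uses programmable high/low-priority weighted tables, so treat this as a stand-in model in which VL15 preempts and lower-numbered data VLs (an assumed ordering) go first:

```python
def next_vl_to_send(pending: dict) -> "int | None":
    """Pick which VL transmits next on a shared link bundle.

    pending maps VL number -> count of queued packets. VL15 (the
    management lane) preempts; otherwise the lowest-numbered data VL
    with traffic wins in this simplified sketch.
    """
    if pending.get(15, 0) > 0:
        return 15
    data_vls = [vl for vl in range(8) if pending.get(vl, 0) > 0]
    return min(data_vls) if data_vls else None

assert next_vl_to_send({3: 2, 15: 1}) == 15  # management traffic preempts
assert next_vl_to_send({3: 2, 6: 1}) == 3
```

Interleaving packets from different VLs this way is what gives the TDM-style sharing of a link bundle described above.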
Input packets that use the 1X configuration and that output via 4X-link ports have a built-in timing problem. The output port is four times faster than the input port. The problem is handled by temporarily buffering the packet in a Store-&-Forward Buffer and then processing the transfer request for the packet when it's complete. The Resource Allocator maintains a New Request Queue, Output Request Queues, and Input Request Queues. Each link-port bundle has an Output and Input Request Queue.
The Priority Selector passes the highest-priority request to the Resource Allocator, which checks for available resources to honor the request. Are the flow-control credits available for sending the packet to the next node? Are the target output port and the source read port available? If so, the Allocator grants a request to use the crossbar switch.
If either the flow control credits or the output port isn't available, then the request is enqueued in the Output Request Queue. If available, but the source's input buffer read port (each buffer has three read ports) isn't available, the request is enqueued in the Input Request Queue.
As flow-control credits and output ports become available, the Selector picks a request from the Output Request Queue, using InfiniBand's VL arbitration scheme. Similarly, as input-buffer read ports open up, the logic chooses a pending request from the Input Request Queues via a simple round-robin selection algorithm.
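The allocator's decision tree and the round-robin pick can be sketched directly from the text. The function names, return labels, and queue layout below are illustrative assumptions, not the chip's internals:

```python
def allocate(credits_ok: bool, output_port_free: bool, read_port_free: bool) -> str:
    """Mirror the Resource Allocator's policy described above:
    no credits or busy output port -> Output Request Queue;
    only the input buffer's read port busy -> Input Request Queue;
    everything free -> grant the crossbar."""
    if not credits_ok or not output_port_free:
        return "enqueue_output"
    if not read_port_free:
        return "enqueue_input"
    return "grant"

def round_robin_pick(queues: list, last: int) -> "int | None":
    """Round-robin over per-port input request queues, starting after
    index `last`; returns the index of the next non-empty queue."""
    n = len(queues)
    for i in range(1, n + 1):
        idx = (last + i) % n
        if queues[idx]:
            return idx
    return None

assert allocate(True, True, True) == "grant"
assert allocate(False, True, True) == "enqueue_output"
assert allocate(True, True, False) == "enqueue_input"
assert round_robin_pick([[], ["req"], []], 1) == 1
```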
Crossbar Switch (Hub)
The crossbar is the core switch of the node, connecting all the ports. It functions like a dynamic bus with multiple concurrent connections.
The Arbiter grants the input ports crossbar access. This access can last more than one cycle. It defines a dynamic connection that holds until the packet is rejected or until the packet transfer over the crossbar switch is completed. The crossbar switch supports up to 10 dynamic connections concurrently: one to each of the eight InfiniBand ports, one to the Management port, and one to the Functional BIST port during testing. New requests are queued until one of those 32-bit wide connections frees up.
The crossbar switch handles bad packets by discarding or truncating them, depending on the packet-transfer mode. Incoming packets in cut-through mode are checked for integrity and, if okay, passed through the crossbar. Packets with errors are truncated, and a bad-packet delimiter and VCRC field are appended to the packet. Packets with errors transferred in Store-&-Forward mode are simply truncated.
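The difference between the two error-handling policies comes down to whether bytes have already left the switch. A minimal sketch, assuming a placeholder delimiter byte (the real bad-packet delimiter is a link-level symbol, and the VCRC is omitted here):

```python
BAD_PKT_DELIM = b"\xfe"  # placeholder; the real delimiter is a link symbol

def forward(packet: bytes, error_at: "int | None", mode: str) -> bytes:
    """Model the two policies: in cut-through, bytes already sent can't
    be recalled, so the packet is truncated at the error and marked bad;
    in store-and-forward, the whole packet was checked first, so the
    errored packet is simply truncated (not marked and sent on)."""
    if error_at is None:
        return packet                          # clean packet, pass through
    if mode == "cut_through":
        return packet[:error_at] + BAD_PKT_DELIM
    return packet[:error_at]                   # store-and-forward: truncate

assert forward(b"abcdef", None, "cut_through") == b"abcdef"
assert forward(b"abcdef", 3, "cut_through") == b"abc\xfe"
```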
The HDMP-2840 supports all InfiniBand management agents, either directly in hardware, or indirectly via an external management processor. The Management block processes all InfiniBand management packets, including the Subnet Management, Performance Management, and Baseboard Management packets. As an option, these packets can be transferred to the external processor for off-chip processing.
The Management block's tasks include:
- Initialization—initializes the switch node, including setting up the routing tables and VLs.
- Decoding—decodes and dispatches packets to their owners.
- Grant Control—handles both unsolicited and solicited Grant signals from the switch Arbiter.
- Request Control—submits requests to the Arbiter for Management Port packets for packet routing.
- Flow Control Buffer—manages flow-control credits for VL0 packets.
- Management Port IAL—manages the Internal Access Loop for chip test.
Price & Availability
The HDMP-2840 samples in November, with production in the second quarter of 2002. It costs $600 in 1000-unit lots.
Red Switch, 1815 McCandless Dr., Milpitas, CA 95035; (408) 719-4888; fax (408) 719-4800; www.redswitch.com.