InfiniBand is on its way to becoming the next-generation interconnect for subsystems, systems, and networked systems. Developed by Intel and major server vendors such as Compaq, HP, and IBM, InfiniBand supports multiswitch subnet environments, with built-in routing between them. It delivers an extensible, high-reliability environment. The basic link bandwidth is 2.5 Gbits/s, and links can be one, four, or 12 lanes wide, delivering bidirectional bandwidth rates from 500 Mbytes/s to 6 Gbytes/s per link port.
This one switch fabric can serve as an interconnection mechanism for interbox I/O, subsystem I/O, networked I/O, and SANs, as well as for multiple systems. It's not a toy architecture. InfiniBand supports multiple subnets with thousands of nodes per subnet and routing between subnets. Additionally, InfiniBand implements QoS, switch-fabric, and switch-node management.
The emerging switch fabrics integrate two major technology successes of the '90s: high-speed serial or pseudoserial buses, and crossbar switches. A switch fabric like InfiniBand consists of end nodes—host computers and peripheral subsystems—that link into the fabric, which is made up of interconnected switch nodes. End-node transactions address other end nodes. Messages, consisting of one or more message packets, make up these transactions.
For transactions, InfiniBand supports point-to-point connections between end nodes. The transaction's data messages and packets are passed through the switch fabric from transmitting to receiving end nodes, and unlike a busing system, multiple end nodes can concurrently conduct transactions, delivering a high aggregate bandwidth. Plus, the switch fabric employs multiple transaction types and service levels, ranging from Reliable Connections and Datagrams to Unreliable Multicasts.
Host Channel/Target Channel Adapters
The switch fabric supports hosts and targets, each with its own type of channel adapter (CA). The host channel adapter (HCA) supports a higher-level channel interface to the switch fabric. This channel interface offloads I/O processing from the host CPU, much as the earlier I/O controllers (IOCs) did for IBM mainframes. It implements a high-level interface with verbs defining basic operations on the underlying fabric. This interface builds on the virtual interface (VI) architecture, which deploys a low-overhead I/O interface that bypasses the OS stack.
The HCA connects the fabric to the host memory. It provides remote DMA (RDMA) engines and supports memory translation and protection. On the target side, the target channel adapter (TCA) links peripherals and other subsystems to the switch fabric. This is a simpler CA that doesn't use the IB verbs.
InfiniBand Layered Architecture
InfiniBand handles multiple subnets, i.e., collections of IB end nodes and switch nodes. It features routing capability between different subnets, too. IB's architecture layers include:
- Upper-layer protocols: support transactions between IB end nodes. These are implemented as verbs in the HCA, which are called by the application programs. Transactions consist of one or more messages.
- Transport layer: responsible for basic IB messaging between nodes. These messages contain at least one message packet.
- Network layer: handles subnet-to-subnet routing based on a global ID (GID). It routes messages from one subnet to another based on global addressing.
- Link layer: carries out the intrasubnet switching between local nodes, based on the local ID (LID) address. Multiple switch nodes make up a subnet.
- Physical layer: handles the physical connection and links.
The HCA verbs provide an abstract functional interface to the underlying switch fabric. The verbs interface wasn't intended as a complete API. Instead, the OS vendors will fill out the API, building around the verbs, which include work requests to the HCA and TCA to implement transactions.
All message transactions in the switch fabric are handled via work requests sent to the channel adapters. These requests include Send, RDMA Write, RDMA Read, Atomic Operations, and Bind Memory Window. The Atomic Operations are Compare & Swap and Fetch & Add. Upon receipt, requests are placed into the channel adapter's queue pairs—pairs of Transmit and Receive queues. For each work request through the channel interface, a queue pair is created. It queues up the messages for transmission and receives incoming messages. Upon completion of the work request, the hardware generates a completion notification that's sent to the CA's completion queue (CQ).
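The queue-pair flow described above can be sketched in a few lines of Python. This is a toy model for illustration only, with made-up class and method names; it is not the actual IB verbs API. It shows the essential shape: work requests are posted to a queue pair's Transmit queue, and each completed request produces a notification on the channel adapter's completion queue.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class WorkRequest:
    opcode: str          # e.g. "SEND", "RDMA_WRITE", "RDMA_READ"
    payload: bytes = b""

@dataclass
class QueuePair:
    """A pair of Transmit (send) and Receive queues, as in an IB channel adapter."""
    send_q: deque = field(default_factory=deque)
    recv_q: deque = field(default_factory=deque)

class ChannelAdapter:
    def __init__(self):
        self.completion_q = deque()   # CQ: completions are posted here

    def post_send(self, qp: QueuePair, wr: WorkRequest):
        qp.send_q.append(wr)

    def process(self, qp: QueuePair):
        # Stand-in for the hardware draining the send queue and
        # generating one completion notification per work request.
        while qp.send_q:
            wr = qp.send_q.popleft()
            self.completion_q.append(("COMPLETED", wr.opcode))

ca = ChannelAdapter()
qp = QueuePair()
ca.post_send(qp, WorkRequest("SEND", b"hello"))
ca.post_send(qp, WorkRequest("RDMA_WRITE", b"data"))
ca.process(qp)
print(list(ca.completion_q))  # [('COMPLETED', 'SEND'), ('COMPLETED', 'RDMA_WRITE')]
```

In real hardware the consumer polls or is interrupted by the CQ rather than draining queues in software, but the posting/completion contract is the same.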
InfiniBand fabric nodes belong to one or more partitions. A node can't address another node outside of its partitions, although nodes from one or more subnets can be in a common partition. Each partition has its own partition manager. Every end node contains a partition key table to hold the p-keys for up to four end nodes that the node can address.
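The partition rule reduces to a simple membership test, which can be sketched as follows. The key values here are invented for illustration; real P_Key values and table formats come from the partition manager.

```python
# Toy check of InfiniBand partition enforcement: a node may address
# another node only if the two share at least one partition key (P_Key).

def can_address(src_pkey_table, dst_pkey_table):
    """True if the two nodes' P_Key tables share at least one key."""
    return bool(set(src_pkey_table) & set(dst_pkey_table))

node_a = [0x8001, 0x8002]   # member of partitions 1 and 2 (illustrative keys)
node_b = [0x8002]           # member of partition 2 only
node_c = [0x8003]           # member of partition 3 only

print(can_address(node_a, node_b))  # True  -- share partition 2
print(can_address(node_a, node_c))  # False -- no common partition
```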
A transaction comprises one or more messages, each made up of at least one messaging packet that includes both global and logical (local) addressing. The subnet manager assigns each InfiniBand HCA or TCA port one or more logical IDs (LIDs). This is a 16-bit local identifier, unique in the subnet.
Moreover, each HCA or TCA port has one or more GIDs, or valid 128-bit IPv6 addresses. The GID is a concatenation of a 64-bit subnet prefix, an identifier of the subnet, and an EUI-64 address, a manufacturer-assigned, globally unique ID. Also, ports may have additional EUI-64 values assigned by the subnet manager to create more GIDs.
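Since a GID is just a 64-bit subnet prefix concatenated with a 64-bit EUI-64, it can be assembled with simple shifts. The sketch below uses Python's standard `ipaddress` module; the prefix and EUI-64 values are illustrative (0xFE80... is the IB default link-local subnet prefix, but the port identifier is made up).

```python
import ipaddress

def make_gid(subnet_prefix: int, eui64: int) -> ipaddress.IPv6Address:
    """Concatenate a 64-bit subnet prefix with a 64-bit EUI-64 into a 128-bit GID."""
    assert subnet_prefix < (1 << 64) and eui64 < (1 << 64)
    return ipaddress.IPv6Address((subnet_prefix << 64) | eui64)

# Illustrative EUI-64; a real one is manufacturer-assigned and globally unique.
gid = make_gid(0xFE80_0000_0000_0000, 0x0002_C903_0000_1234)
print(gid)  # fe80::2:c903:0:1234
```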
Switch nodes are treated differently. The subnet manager assigns an LID and a GID to the switch, but not to its ports. Each switch node has a manufacturer-supplied EUI-64 address. For router nodes, the SM assigns a base unicast LID and additional (sequential) LIDs for individual router ports.
InfiniBand supports six transport services, ranging from a Reliable Connection and Datagram to Unreliable Connection and Datagram, as well as Raw IPv6 and Ethertype packets. Reliable Connections guarantee data delivery and data order. The hardware generates an ACK for every packet received and generates/checks packet sequence numbers. It rejects duplicates and detects missing packets. Errors are detected at both the requester and responder, and the requester can recover from errors, including trying alternate paths, without involving the client application. In contrast, Unreliable Connections and Datagrams don't guarantee data delivery or data order, nor implement error recovery.
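The Reliable Connection sequencing logic boils down to comparing each arriving packet sequence number (PSN) against the next expected value. The sketch below is a simplified model (no PSN wraparound, no retransmission machinery) showing how in-order packets get ACKed while duplicates and gaps are detected.

```python
# Toy model of Reliable Connection sequencing at the responder:
# ACK in-order packets, reject duplicates, flag missing packets.

class Responder:
    def __init__(self):
        self.expected_psn = 0

    def receive(self, psn):
        if psn == self.expected_psn:
            self.expected_psn += 1
            return "ACK"
        if psn < self.expected_psn:
            return "DUPLICATE"   # already seen -- reject
        return "MISSING"         # gap detected -- earlier packet lost

r = Responder()
print([r.receive(p) for p in (0, 1, 1, 3, 2)])
# ['ACK', 'ACK', 'DUPLICATE', 'MISSING', 'ACK']
```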
Service Levels, Virtual Lanes & QoS
InfiniBand uses service levels to define a QoS for a transaction's message packets. At the switch level, where the packets are moving through the fabric, the service levels define the message priority and mapping onto the switch port's virtual lanes—a mechanism that partitions the port's throughput into a TDM-like bus with multiple slots or lanes that can be assigned by the switch. Each switch port can support up to 16 virtual lanes (VL0 through VL15), with VL15 reserved for subnet management.
Basically, the incoming packets are queued up by their SL-to-VL assignments, and the queues are emptied through the port line in priority order. Either a credit-based (set by credits from receiver to transmitter) or a static injection-rate control algorithm (set by a defined rate limit) controls the packet flow between the transmitting switch port and the next receiving node in the fabric.
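The queue-and-drain behavior can be sketched as a priority dequeue. The SL-to-VL table and lane numbering below are invented for illustration (real tables are programmed by the subnet manager), and credit-based flow control is omitted for brevity.

```python
import heapq

# Toy model of virtual-lane arbitration: packets are queued by their
# SL-to-VL mapping, then the port drains the lanes in priority order.
SL_TO_VL = {0: 0, 1: 0, 2: 1, 3: 2}   # illustrative service-level-to-lane table

def drain_port(packets):
    """packets: list of (service_level, payload). Returns transmit order."""
    heap = []
    for seq, (sl, payload) in enumerate(packets):
        vl = SL_TO_VL[sl]
        # Lower VL number = higher priority; seq preserves order within a lane.
        heapq.heappush(heap, (vl, seq, payload))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

arrivals = [(3, "bulk-1"), (0, "urgent-1"), (2, "video-1"), (1, "urgent-2")]
print(drain_port(arrivals))  # ['urgent-1', 'urgent-2', 'video-1', 'bulk-1']
```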
Most parallel bus systems are memory addressed. Transaction addresses (memory or I/O space) identify the bus board to be activated. But a system switch fabric covers a larger domain, with multiple memory spaces. So, simple memory addressing can't be used to access a specific end node.
Each HCA, for example, can have its own memory space. InfiniBand handles addressing of these spaces by allowing an HCA (or TCA) to register access to blocks in its space. With permission, these blocks can then be accessed locally by the registering CA, or remotely by HCAs or TCAs linked to it via the switch fabric.
Sets of memory locations can be registered as memory regions. They provide local access rights for Local Reads and for remote access (Remote Read/Write/Atomic). When the memory region is registered, local and remote access keys and access contexts are assigned. These include:
- L_Key: local access key
- R_Key: remote RDMA access key
- Virtual address: address to location in memory region
Memory protection is extended to memory regions and queue pairs via protection domains. Each memory region belongs to one protection domain, although multiple regions can belong to a single protection domain.
The memory regions and protection domains are assigned/changed through a registration/deregistration/reregistration process that might have a performance penalty. For faster, more-dynamic memory access control, InfiniBand provides memory windows by which consumers dynamically grant and revoke remote access rights.
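The access-control side of memory registration can be sketched as below. This is a toy model with invented key generation; in real hardware the channel adapter issues the keys and performs these checks on every RDMA access.

```python
import itertools

_key_gen = itertools.count(0x1000)   # arbitrary key values for illustration

class MemoryRegion:
    """Toy memory region: registration yields an L_Key and an R_Key;
    a remote RDMA access must present the right R_Key, fall inside the
    region, and match the granted access rights."""
    def __init__(self, base, length, remote_access=("READ",)):
        self.base, self.length = base, length
        self.remote_access = set(remote_access)
        self.l_key = next(_key_gen)   # local access key
        self.r_key = next(_key_gen)   # remote RDMA access key

    def check_remote(self, r_key, addr, op):
        in_range = self.base <= addr < self.base + self.length
        return r_key == self.r_key and in_range and op in self.remote_access

mr = MemoryRegion(base=0x4000, length=0x1000, remote_access=("READ", "WRITE"))
print(mr.check_remote(mr.r_key, 0x4800, "READ"))   # True
print(mr.check_remote(mr.r_key, 0x6000, "READ"))   # False -- outside region
print(mr.check_remote(0xDEAD, 0x4800, "WRITE"))    # False -- wrong key
```

A memory window in this model would simply be a second, narrower region bound over the same buffer with its own R_Key, which can be granted and revoked without re-running the full registration.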
InfiniBand builds management functions into the switch fabric. In the general model, the manager has a messaging link to the agent of the managed object. Management occurs via Management Datagrams (MADs) sent to the managed object's agent. Subnet and General Services are part of InfiniBand Management Services, which includes Device, Baseboard, SNMP, Performance, Connection, and Fabric management.
The MADs are standard "send only" Unreliable Datagram packets with a 256-byte data payload. QP0 and VL15 are reserved for the subnet manager's use, while QP1 serves as the General Services interface. Common management methods include Get/Return Register Values, Set Register Values, Send a Message, Notify an Event Occurrence, Stop Event Notification, Forward an Event, and Acknowledge an Event Report.
The fabric manager (FM) supplies key services for initializing and configuring an IB subnet. Multiple FMs can be defined in a subnet, but only one is master at a time. The FM discovers the physical topology of the fabric, assigns logical identifiers for switch, end, and router nodes, establishes routing among the nodes, and manages changes as nodes are added or removed. An FM uses specialized MAD packets—fabric management packets (FMPs)—to monitor and control operations in switches, HCAs, and TCAs. FMP methods include Set, Get, Action, and Trap.
InfiniBand Transport Services
- Reliable Connection: RDMA Read/Write, Send, Atomic, multipacket messages, 1-to-1 QPs
- Reliable Datagram: RDMA Read/Write, Send, Atomic, 1-to-many QPs
- Unreliable Datagram: Send, no Atomic, 1-to-many QPs
- Unreliable Connection: RDMA Write, Send, 1-to-1 QPs
- Unreliable Multicast: automatic replication of packets in switches/routers, Send
- Raw Datagram: Ethertype, IPv6, and encapsulated packets