Backplane Switch-Fabric ICs Go To The Next Level

This is the last Special Report that Ray Weiss, who passed away on New Year's Eve, prepared for us. We will miss his insight, his expertise, and his friendship. (See our tribute to him in our February 4, 2002 issue, p. 7.)—ED.

Silicon switch fabrics will form the core of next-generation mid- to high-end network switches and routers. Merchant switch-fabric chips are increasingly replacing proprietary ASICs and older switch fabrics. Next-generation switches and routers will deploy in one to two years and build on emerging, high-performance backplane switch-fabric chip sets with throughputs of 160 Gbits/s to Terabits/s. They will support 2.5-Gbit/s OC-48, 10-Gbit/s OC-192, and 10-Gbit/s Ethernet, with headroom for future line speeds.

Backplane switch fabrics provide the switching needed between arrays of line cards for network switches and routers. Generally deployed as chip sets, they deliver cost-effective performance for layers 2 and 3 switching and routing. They also provide an upward migration path from OC-12's 622-Mbit/s and OC-48's 2.5-Gbit/s line speeds, to higher-bandwidth lines such as OC-192's 10 Gbits/s and future 40-Gbit/s OC-768.

A migration path is needed. It takes one to two years to design, test, and initially deploy telecom switches and routers, which in turn have field lives of seven to 10 years. Thus, there's a dichotomy between the telecom switch and router life cycle and silicon ICs that, by Moore's law, double every 18 months or so in performance and functionality. Consequently, silicon doesn't continually move into the telecom sector. Instead, silicon insertion moves in spurts and jumps where each insertion defines architectures that must serve multiple generations of line speeds.

A window is now open for current silicon insertion, providing the technology base for next-generation mid- to high-end switch and router architectures. To fill that window, new backplane switch-fabric chips, some on 0.13-µm CMOS, are coming online. Switch/router designers have a range of switch-fabric cores to choose from.

Backplane switch fabrics are a specialized form of switch fabric dedicated to a single task: connecting line cards via a switch across a virtual backplane. Traffic from one line card is moved through a switch and passed to the proper output line card, hence the term "backplane switch fabric." But multiple backplanes, and even multiple boxes, can be in a chassis for a switch, thus the term "virtual backplane." These boxes can even be separated. For example, Mindspeed's CX27300 chip set supports box interconnections of up to 30 m, and PMC-Sierra's ETT1 chip set supports up to 70-m connections with its LCS protocol.

Switches and routers switch inputs to outputs. Switches work at a lower level, usually with common subunits as in a LAN. A router classically switches between different subunit classes, like LANs to MANs, the core to MANs, and so on. Also, both switches and routers have some form of quality-of-service (QoS) flow control. It may go beyond the classic OSI layer 2 or 3 processing definitions. Switches and routers are supported by backplane switch fabrics.

Conceptually, switching and routing are simple. Just take incoming traffic, whether time-division multiplexing (TDM), ATM, IP, Frame Relay, or Gigabit Ethernet, and encapsulate it into one or more packets or cells. Then, switch it to the addressed output line card, where it's reformatted and transmitted. A classic router architecture contains multiple line cards, with onboard queue managers (QMs), that connect to a central switching mechanism (Fig. 1).

Newer switch fabrics are protocol agnostic: they handle multiple protocols, first converting them into an internal cell or packet format for switching, then converting them to the output protocol. Switches and routers thus bridge traffic between different protocols.

Line cards and a central switch compose those switch/router systems. The line cards take in line-speed data, convert it (PHY MAC, or physical-layer MAC), and feed it to a processing block made up of a network processor (IC or ASIC) or a traffic manager. The block outputs the traffic as standardized cells in a format like CSIX or SPI-4, which are fed into the switching system.

Typically, the switching system consists of a front-end/back-end QM and a switching mechanism. Generally, the line card includes the QM, which links to one or more switching cards via the backplane. Some switch fabrics have the QM on the switch card. Others, like IBM's PowerPRS Q-64G switch chip and Internet Machines' SE200, integrate the QM into the switch chip itself.

A high-speed serial bus generally connects the QM and the central switch to minimize connection pin count—critical because most backplane switch fabrics are single-level switches comprising stacked switch chips. For a line card input to be switched to every possible output port, it must have a link to each switch chip in that switch stack.

Silicon has simplified switching implementations. Most newer backplane switch fabrics have integrated the serializer/deserializer (SERDES) into their QM(s) and the central switch chips. Also, higher silicon and packaging densities have led to wider output sets and larger crossbar switch arrays. QMs typically now field 32 or more serial outputs. Switches with 32-by-32 or 64-by-64 arrays are becoming the norm.

Not all fabrics break down into QMs and switch chips. Internet Machines' SE200 and others integrate everything onto a single stackable switch chip. Some, such as IBM's PowerPRS Q-64G, move the queues to the switch element. Others do a low-level switch implementation of their chip set. For instance, Agere's PI140xx switch fabric includes the PI140XS, a standalone switch that has queues with 32 ports supporting a 40-Gbit/s switch.

Queue Management: The goal is to keep the cells flowing from ingress to egress line cards to meet QoS requirements for each kind of line traffic. Difficulties arise when more than one ingress cell addresses the same egress output. Most fabrics rely on a backpressure feedback loop to control cell flow. If the output queue is full or filling up, cell loss or stalling is avoided by simply not sending the cell. So if there's no room for the ingress cell at its addressed line-card output queue, the ingress QM doesn't send the cell.

This isn't difficult. Typically, the same elements, QMs, make up the front and back ends of the switching system. The QMs buffer the arriving ingress cells into virtual output queues (VoQs), which may be further divided into priority levels for output. On the egress side, QMs also are buffered with one or more queues. Basically, on ingress, QMs demultiplex the stream, enqueuing the generated cells. On egress, they multiplex the cells, creating a traffic stream (Fig. 2).
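As a rough illustration of the ingress side, the sketch below models a QM that sorts arriving cells into one VoQ per output port and priority level. The class name, method names, and the port and priority counts are assumptions chosen for illustration; they don't describe any vendor's chip.

```python
from collections import deque


class IngressQM:
    """Toy ingress queue manager with one VoQ per (output port, priority)."""

    def __init__(self, num_outputs=4, num_priorities=2):
        # Hypothetical sizes; real fabrics use far more ports and levels.
        self.voqs = {
            (out, prio): deque()
            for out in range(num_outputs)
            for prio in range(num_priorities)
        }

    def enqueue(self, cell, out_port, priority):
        """Demultiplex an arriving cell into the VoQ for its destination."""
        self.voqs[(out_port, priority)].append(cell)

    def head(self, out_port, priority):
        """Return the next cell waiting for this output, or None if empty."""
        queue = self.voqs[(out_port, priority)]
        return queue[0] if queue else None

    def dequeue(self, out_port, priority):
        """Remove and return the head cell once the switch accepts it."""
        return self.voqs[(out_port, priority)].popleft()


qm = IngressQM()
qm.enqueue("cell-A", out_port=2, priority=0)
qm.enqueue("cell-B", out_port=3, priority=0)
qm.enqueue("cell-C", out_port=2, priority=0)
```

Keeping a separate queue per output means cell-B, bound for output 3, isn't stuck behind the cells waiting for output 2.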

When there's capacity to receive ingress cells, they're transmitted to the switch addressed to an output port. But if the output port's buffers are full, information is fed back to the ingress QM to halt cell transmission. This feedback to control cell flow is called "backpressure" control. The output port buffer signals when it's full or filling up; that information is sent to the switch chip's schedulers or central arbiter.

When an ingress port wants to transmit a cell to a specific egress output buffer, it sends a request to the switch. If the buffer space is available, the switch grants the request and permits the transfer. Most fabrics use this basic switch Request/Grant mechanism.
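The Request/Grant exchange with backpressure can be sketched in a few lines. This is a minimal model under stated assumptions: the class names, the `send` helper, and the two-cell egress buffer are hypothetical, and a real fabric would run the exchange in hardware across many ports at once.

```python
from collections import deque


class EgressPort:
    """Egress buffer with a fixed capacity (hypothetical size)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.buffer = deque()

    def has_room(self):
        return len(self.buffer) < self.capacity


class Switch:
    """Memoryless switch that grants a request only if the egress has room."""

    def __init__(self, egress_ports):
        self.egress_ports = egress_ports

    def request(self, out_port):
        # Grant (True) or apply backpressure (False).
        return self.egress_ports[out_port].has_room()

    def transfer(self, cell, out_port):
        # The cell passes straight through to the egress buffer.
        self.egress_ports[out_port].buffer.append(cell)


def send(switch, voq, out_port):
    """Try to move the head cell of an ingress queue through the switch."""
    if not voq:
        return False
    if not switch.request(out_port):
        return False  # backpressure: the cell stays queued at ingress
    switch.transfer(voq.popleft(), out_port)
    return True


switch = Switch({0: EgressPort(capacity=2)})
voq = deque(["c1", "c2", "c3"])
results = [send(switch, voq, 0) for _ in range(3)]
```

In this run the first two requests are granted and the third is refused, so the third cell waits at ingress until the egress buffer drains.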

PetaSwitch's Pisces has refined the Request/Grant procedure by adding a required acknowledge to the sequence for finer control. It enables the requesting QM to ensure that a request went through and to turn off the request if needed. Ingress and egress queues in a QM respond to grants and receive the cell traffic from the switch (Fig. 3).

ZettaCom's Zest switch-fabric chip set adds a port multiplexer to its QM chip and switch-chip set. The Zest-MUX-250 chip acts as an expander to the Zest-IXS-250 switch, which integrates a crossbar switch with a scheduler. It expands the 32-port switch chip to support 64 line cards at full-duplex OC-192 line speeds.

Switching: Most switch fabrics build on a lossless, memoryless switch. When a cell transfer is granted, cells are sent to the switch, then switched to the proper output port. The switch doesn't enqueue the cells or store-and-forward them. Rather, it acts as a simple crossbar switch and transfers cells to their specified connections. All transfer queuing occurs in the front-end QM, and all output buffering in the back-end QM.

For larger switches, multiple switch chips are stacked, each connecting to links. Most designs tend toward distributed arbitration, with each switch chip using its own internal scheduler/arbiter. This makes it easier to scale the design and add more switch chips. Fabrics with independent distributed arbitration include Agere's PI40x, Internet Machines' SE200, Mindspeed's CX27300, Vitesse's TeraStream, and ZettaCom's Zest-250 fabrics.

Some fabrics, like IBM's PowerPRS Q-64G and PMC-Sierra's ETT1, use a common or central arbiter for finer control. Others, like the TeraChannel from Power-X Networks, do both, combining high-level common arbitration with local arbitration.

Some fabrics use switch elements that do more than provide a logical crossbar connection. They tend to use memory to enqueue ingress cells and take on many tasks of front-end QMs. These are more complex chips but require less sophisticated support chips. Memory on the switch chip lets the chips make more intelligent decisions on routing, like dealing with QoS priority levels.

However, a central shared memory, when used per switch chip or for the whole switch (multiple switch chips), must serve all switch input port transactions and all output port retrievals. Even with pipelining, that can slow overall cell throughput through a switch.

One memory-based switch, IBM's PowerPRS Q-64G, is supported by the PowerPRS C192 switch-interface chip. The C192 takes in CSIX frames and ships them to the switch over serial lines. The switch does its own internal queuing. Multiple switch chips can be stacked to create a 512-Gbit/s switch with 32 OC-192 ports, or 32-by-32 OC-192 ports (eight chips). When stacked, one chip is a master; the rest are slaves.

The IBM switch chip has 32 input and 32 output controllers—one for each port. Every chip has four self-routing subswitch elements, each 16 by 16, with a shared memory bank (1024 by 10 bytes). The bank comprises two read and two write ports. A sequencer controls the switch elements, sharing memory access between input and output ports. It grants access to two input and two output ports per cycle.

Vitesse took a different approach with its TeraStream chip set of a switch chip and fore and aft QMs. The switch chip distributes on-chip memory to crossbar crosspoint connections, where it provides unicast and multicast queues at each node. This uses less memory and puts queue action at the connection nodes for fast access and throughput.

Another memory-based switch fabric is Internet Machines' SE200 with a standalone switch. The switch chip holds both the QM and switch. One chip with 64-by-64 ports supports a 40-Gbit/s switch fabric. Chips can be stacked to deliver a 200-Gbit/s switch fabric (full-duplex measured one way). Stacked, it can handle 64 OC-192 ports. The chip supports OC-48, OC-192, and OC-768 rates, bundling 2.5-Gbit/s ports to aggregate the line speed.

The chip's 64-by-64 switch element is nonblocking and supports 64 output queues, dynamically allocated in 256 kbytes of on-chip memory, arranged in 2 kwords of 128 bytes each. When a cell arrives, it's stored in 128-byte chunks. Each chip can enqueue up to 2048 cells, if the cells are 128 bytes or less. Each output queue is a FIFO with up to 1k pointers to stored memory (Fig. 4).
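The memory figures quoted above are easy to cross-check. The short calculation below uses only the numbers in the text; the assumption that a cell larger than 128 bytes occupies several whole chunks is an inference from "stored in 128-byte chunks," not a documented detail of the chip.

```python
WORD_BYTES = 128          # storage chunk size from the text
NUM_WORDS = 2 * 1024      # "2 kwords"

total_bytes = WORD_BYTES * NUM_WORDS
total_kbytes = total_bytes // 1024


def chunks_needed(cell_bytes):
    """Number of 128-byte chunks a cell of the given size occupies."""
    return -(-cell_bytes // WORD_BYTES)  # ceiling division


def max_cells(cell_bytes):
    """How many cells of one size fit in the on-chip memory."""
    return NUM_WORDS // chunks_needed(cell_bytes)
```

With these values, `total_kbytes` comes to 256 and `max_cells(128)` to 2048, matching the figures in the text. Under the chunking assumption, a cell even one byte larger would halve the cell capacity.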

Switches and routers are demanding higher QoS levels. This includes guaranteed bandwidth for real-time traffic, prioritized delivery for high-priority traffic, reliable delivery, in-order cells, and differing levels of delivery service, requirements that go beyond the classic OSI layer 2 and 3 definitions. Many requirements depend on different levels of delivery service that can be sold to customers on an end-to-end basis. As a result, next-generation switches and routers tend to differ only in where they're deployed in the network hierarchy and what subnets/nets they link into. The same switch-fabric hardware can now serve as a switch or a router, and as a bridge between different protocols and line speeds.

QoS and Classes of Service (CoS) are typically performed by the switch fabric through priority levels. Some fabrics also provide specialized paths and links for particular traffic. The PMC-Sierra ETT1 chip set, for example, provides a separate set of queues and a switch path for TDM traffic. More priority levels and specialty groups translate into finer priority differentiation for the incoming traffic.

Most newer and emerging switch fabrics provide four to eight priority levels. Some, such as Vitesse's TeraStream, support up to 16 for even finer cell-control granularity. These levels are typically implemented as expanded VoQs to queue up ingress cells for transmission through the fabric. VoQs are implemented on a per-port basis. For example, each input port has a full set of VoQs to hold incoming cells as they await a grant from the switch. These queues are generally implemented in the switch fabric's QMs.

Ingress priority queues hold the ingress messages for transmission through the fabric. They're selected for the Request/Grant sequence by a variety of priority implementations. Most fabrics perform a two-level selection process. A programmable number of the higher-priority queues are handled in strict priority fashion. The remaining lower-priority queues support a weighted round-robin selection scheme.
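The two-level selection process can be sketched as follows. This is one simple way to combine strict priority with weighted round-robin, assuming a per-queue credit counter and weights of at least 1; the class name, the weights, and the cell labels are hypothetical, and actual fabrics may implement the weighting differently.

```python
from collections import deque


class TwoLevelSelector:
    """Strict priority for the first `num_strict` queues, WRR for the rest."""

    def __init__(self, queues, num_strict, weights):
        self.queues = queues              # index 0 is the highest priority
        self.num_strict = num_strict
        self.weights = weights            # one weight (>= 1) per WRR queue
        self.credits = list(weights)      # cells left in the current round
        self.position = 0                 # WRR queue being served

    def select(self):
        """Return the next cell to offer to the Request/Grant logic, or None."""
        # Level 1: strict priority, always served first.
        for queue in self.queues[:self.num_strict]:
            if queue:
                return queue.popleft()

        # Level 2: weighted round-robin over the remaining queues.
        wrr = self.queues[self.num_strict:]
        if not any(wrr):
            return None
        while True:
            i = self.position
            if wrr[i] and self.credits[i] > 0:
                self.credits[i] -= 1
                return wrr[i].popleft()
            # Queue is empty or out of credit: refill it and move on.
            self.credits[i] = self.weights[i]
            self.position = (i + 1) % len(wrr)


sel = TwoLevelSelector(
    queues=[
        deque(["H1", "H2"]),              # strict-priority queue
        deque(["A1", "A2", "A3"]),        # WRR queue, weight 2
        deque(["B1", "B2", "B3"]),        # WRR queue, weight 1
    ],
    num_strict=1,
    weights=[2, 1],
)
order = [sel.select() for _ in range(8)]
```

Here the strict queue drains completely before any lower-priority cell is selected. After that, queue A is offered two cells for every one from queue B for as long as both hold cells.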

Power-X's Star-2000 QM supports 16 ports with four priority levels, each with four service channels and one multicast channel. Each port has 16 separate unicast queues (four priority by four service channels) and four multicast queues. Within a port, 16 flows can be individually assigned bandwidth based on order of importance. Arbitration for the 321 VoQs (16 × 4 × 5 + 1 broadcast channel) is distributed at the central arbiter for priority and at the VM for service level.
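The queue counts in this example can be checked with simple arithmetic. The breakdown below is one reading of the figures in the text, namely that the 321 total is the per-port unicast and multicast queues across all 16 ports plus a single broadcast channel; the constant names are illustrative.

```python
PORTS = 16
PRIORITY_LEVELS = 4
SERVICE_CHANNELS = 4      # unicast service channels per priority level
MULTICAST_CHANNELS = 1    # multicast channels per priority level
BROADCAST_CHANNELS = 1

unicast_per_port = PRIORITY_LEVELS * SERVICE_CHANNELS
multicast_per_port = PRIORITY_LEVELS * MULTICAST_CHANNELS
total_voqs = PORTS * (unicast_per_port + multicast_per_port) + BROADCAST_CHANNELS
```

On this reading, each port carries 16 unicast and four multicast queues, and the total comes to 321.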

Because each VM serves as an ingress and egress port for its line card, it also implements the output port or egress queues. These generally aren't broken down into priority queues, as are ingress (VoQ) queues.

Throughput: Emerging backplane switch-fabric chip sets support higher line throughput rates and higher aggregate throughputs. For most sets, a minimal configuration supports a 40-Gbit/s throughput (full-duplex, one way). Building on that base, designers can easily achieve 160- to 320-Gbit/s throughput switches, especially with OC-192 ports. Some of these chip sets, such as Agere's PI140xx and PetaSwitch's Pisces, qualify for high-end switches/routers and have aggregate throughput rates of up to 2.5 and 5.2 Tbits/s, respectively.

But to meet the switch fabric's external I/O rates, internal system rates need higher internal bandwidths. Cell transfers within the switch fabric take on overhead ranging from simple matters, like cell header overhead, to more complex functions such as cell enqueuing and system cell flow control.

To handle overhead, most switch fabrics have internal bandwidths of 2× or more. This "speedup" depends on how many switches are stacked. Adding chips lowers contention and ups speed.

Need More Information?
Agere Systems  (800) 372-2447  www.agere.com
Applied Micro Circuits (MMC Networks)  (858) 535-4260  www.mmcnet.com
Caspian Networks  (408) 382-5200  www.caspiannetworks.com
IBM Microelectronics  (845) 892-5389  www.chips.ibm.com
Internet Machines  (818) 575-2175  www.internetmachines.com
Mindspeed Technologies  (800) 854-8099  www.mindspeed.com
PetaSwitch Technology  (408) 470-7700  www.peta-switch.com
PMC-Sierra  (408) 239-8000  www.pmc-sierra.com
SiberCore Technologies  (613) 271-8100  www.sibercore.com
Power-X Networks  (613) 724-6004  www.powerxnetworks.com
Vitesse Semiconductor  (805) 388-7452  www.vitesse.com
ZettaCom Inc.  (408) 869-7000  www.zettacom.com