Router-On-A-Chip Manages Network Traffic With Wire-Speed QoS

Stuffing 16 Gigabit Ethernet ports into a single IC is no mean feat. But doing this created the CXE-16, a so-called router or switch on a chip. Even more impressive, the IC is fabricated in 0.25-µm CMOS technology. According to SwitchCore, the Sweden-based company that developed the IC, the CXE-16 replaces upwards of 25 chips used in some router or switch designs.

The company has integrated onto a single chip all of the media-access controllers (MACs), switch engines, and memory needed for a 16-port Gigabit Ethernet router. Data coming into the receive MACs travels through a path that includes a serial-to-parallel converter, shared-buffer memory, and a parallel-to-serial converter. It exits through transmit MACs (Fig. 1).

The CXE-16 can handle 16 ports of full-line-rate gigabit traffic. All packets are queued on the transmit side to avoid head-of-line (HOL) blocking. To deal with overflows in the shared-buffer memory, the chip employs a RAMBUS interface for external RAMBUS memory.

It also contains 8k addresses of content-addressable memory (CAM). This amount is typical for the workgroup or small enterprise. Larger applications, of course, may require more CAM entries. Up to seven external CAMs can be added to these applications for a total of 232,000 addresses (Fig. 2).

To achieve the high integration level of the CXE-16, the company uses full-custom design, placing transistors on the chip in a very logical and orderly fashion. This methodology has several benefits, including full utilization of performance potential, power savings, and process opportunities.

Unlike a semi-custom design, the full-custom procedure optimizes the chip's routing or layout. The same goes for clock distribution: clock paths can be placed right next to datapaths so that, for the most part, clock and data remain synchronized.

Generally, full-custom chips are more compact. SwitchCore believes designers may use 0.25-µm full-custom technology in situations where they currently employ a 0.18-µm semi-custom technology, and overall performance will still be better.

Beyond its 16 ports of 10-/100-/1000-Mbit Ethernet, the CXE-16 hosts many features. It has the ability to filter and classify data for layers 2, 3, and 4 of the International Organization for Standardization's Open Systems Interconnection (OSI) reference model. Consequently, it guarantees both bandwidth and quality of service (QoS) based on programmable options.

Users will find that the device is very flexible. With eight queues in each output port, QoS can be split into eight groups per output. The 512-kbyte on-chip CAM can be extended by using another SwitchCore chip, the CXE-5000.

If an external buffer is used, it consists of RAMBUS memory, which transfers data at 1.6 Gbytes/s. A lot of data goes through this switch, which is why RAMBUS is used for fast buffering.

The chip also has automatic MAC address learning. It immediately learns the L2 address of any new device plugged into it. The CXE-16 boasts full 4k VLAN support as well.
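To make the idea of address learning concrete, here is a minimal Python sketch of a learning L2 forwarding table. It is not SwitchCore's implementation: the class name, the method names, and the flood-on-unknown-destination behavior (standard for learning bridges, but not described in this article) are assumptions. Only the 8k table size and the 16-port count come from the text.

```python
class LearningTable:
    """Toy L2 forwarding table: learn source MACs, look up destinations."""

    def __init__(self, capacity=8192):
        self.capacity = capacity      # 8k entries, matching the on-chip CAM size
        self.port_of = {}             # MAC address -> port number

    def learn(self, src_mac, in_port):
        # Record (or refresh) the port on which this source address was seen.
        # New addresses are dropped once the table is full (an assumption).
        if src_mac in self.port_of or len(self.port_of) < self.capacity:
            self.port_of[src_mac] = in_port

    def forward(self, src_mac, dst_mac, in_port, num_ports=16):
        """Return the list of output ports for one frame."""
        self.learn(src_mac, in_port)
        out = self.port_of.get(dst_mac)
        if out is None:
            # Unknown destination: flood to every port except the ingress one.
            return [p for p in range(num_ports) if p != in_port]
        # Known destination: filter if it lives on the ingress port.
        return [] if out == in_port else [out]
```

The first frame from a new device is enough to populate its entry, which is what "immediately learns" amounts to in this model.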

The chip's CPU interface connects "gluelessly" to a Motorola PowerPC. Because it's a 32-bit interface, it can be used with other CPUs as well, although they may need some glue logic.

The device offers port mirroring for testing and debugging, along with link aggregation for extra bandwidth on a certain link. Such aggregation is called port trunking. Spanning-tree and flow-control support also are in the CXE-16, as are statistics counters for SNMP, RMON, and SMON protocols. The physical-layer (PHY) interfaces are designed to connect gluelessly at 1-Gbit/s to a fiber PHY. The GMII and MII interfaces gluelessly attach to a copper PHY.

Low-End Applications Also

SwitchCore is also unveiling the CXE-1000 chip. It's targeted at lower-end switches and workgroup and blade applications for 10/100 Mbits/s. It has 24 ports of 10/100 Mbits/s and four gigabit ports, which can operate at 10/100/1000 Mbits/s. Basically, instead of 12 of the gigabit ports, there are 24 10-/100-Mbit/s ports.

The big advantage to using both chips together is that the software and programming interfaces are the same. Some potential customers are investigating very large, non-blocking Fast Ethernet applications. In such a design, many CXE-1000s are bolted onto a CXE-16 gigabit device, resulting in a bunch of Fast Ethernet ports to the outside world. It's a way of scaling the CXE-1000. Users can build up to 160 ports of non-blocking Ethernet.
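The article doesn't say how the 160-port figure is derived. One plausible reading, offered here only as an assumption, is that each of the CXE-16's 16 gigabit ports serves as an uplink, and that a design stays non-blocking as long as the Fast Ethernet ports behind one uplink never offer more traffic than the uplink can carry:

```python
GIGABIT_MBPS = 1000
FAST_ETHERNET_MBPS = 100
CXE16_PORTS = 16                       # gigabit ports on the CXE-16

# Assumption: non-blocking means the Fast Ethernet ports behind one gigabit
# link never offer more than that link can carry.
fe_ports_per_uplink = GIGABIT_MBPS // FAST_ETHERNET_MBPS
total_fe_ports = CXE16_PORTS * fe_ports_per_uplink
print(fe_ports_per_uplink, total_fe_ports)   # prints "10 160"
```

Under that reading, a CXE-1000 would use at most 10 of its 24 Fast Ethernet ports per gigabit uplink it connects to the CXE-16.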

Looking at the block diagram of the CXE architecture, it's clear that the datapath is on the right side. The left side is the section that does the parallel processing of the packet (Fig. 1, again). First, data comes into the MAC. After completing all of the usual MAC functions, it enters what SwitchCore calls the packet decoder. This decoder extracts the L2, L3, and L4 information that's used in the parallel processing on the other side of the chip.

The packet then moves through the serial-to-parallel converter and continues into the 128 kbytes of on-chip buffer memory. The company believes that this amount of memory is sufficient to handle the traffic of 16 Gigabit ports coming in and going out.
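For a sense of scale, the following back-of-the-envelope calculation shows how quickly 128 kbytes would fill if all 16 ports received at full line rate and nothing drained. That worst case is purely illustrative and ignores preamble and inter-frame gaps. In normal operation the transmit side empties the buffer continuously, and the external RAMBUS memory absorbs overflows.

```python
BUFFER_BYTES = 128 * 1024          # on-chip buffer, taking 1 kbyte = 1024 bytes (assumption)
PORTS = 16
LINE_RATE_BPS = 1_000_000_000      # 1 Gbit/s per port

aggregate_bps = PORTS * LINE_RATE_BPS
fill_time_us = BUFFER_BYTES * 8 / aggregate_bps * 1e6
print(f"aggregate ingress: {aggregate_bps / 1e9:.0f} Gbit/s")
print(f"time to fill the buffer with no draining: {fill_time_us:.1f} us")
```

The result is roughly 65 microseconds, so the on-chip buffer is sized for short bursts rather than sustained congestion.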

Once the header information is available, the parallel-processing side goes into action. This part of the chip contains several blocks, which include address lookup, classifier, queue engine, and prioritization.

When the packet is ready to be sent to the output, it's pulled from either the buffer or RAMBUS memory. The data is converted from parallel to serial form, re-encapsulated to match the output-port network, and passed back through the MAC. The MAC carries out its usual functions and transmits the packet out to the PHY.

To better understand what the CXE architecture does, take a closer look at the packet decoder. This part of the chip extracts all of the layer 2, 3, and 4 information from the packet (Fig. 3). The decoder isn't limited to the first 64 bytes or even 128 bytes. It will go to any depth necessary within the packet.

The packet decoder looks as far as it needs to in the first four layers. There's no limiting factor, and that includes the packet type, encapsulation, and L3 information. Plus, everything is done at wire speed.
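As an illustration of the kind of fields such a decoder pulls out, here is a small software sketch in Python. It is far narrower than the hardware described above: it assumes an Ethernet II frame, an optional single 802.1Q tag, and IPv4 carrying TCP or UDP. The function name and the returned dictionary keys are invented for this example.

```python
import struct

def decode(frame: bytes) -> dict:
    """Extract a few L2/L3/L4 fields from an Ethernet II frame carrying IPv4."""
    dst, src, ethertype = struct.unpack_from("!6s6sH", frame, 0)
    offset = 14
    info = {
        "dst_mac": ":".join(f"{b:02x}" for b in dst),
        "src_mac": ":".join(f"{b:02x}" for b in src),
        "vlan": None,
    }
    if ethertype == 0x8100:                      # 802.1Q tag present
        tci, ethertype = struct.unpack_from("!HH", frame, offset)
        info["vlan"] = tci & 0x0FFF              # low 12 bits are the VLAN ID
        info["pcp"] = tci >> 13                  # top 3 bits are the 802.1p priority
        offset += 4
    info["ethertype"] = ethertype
    if ethertype != 0x0800:                      # not IPv4: stop at layer 2
        return info
    ver_ihl, tos = struct.unpack_from("!BB", frame, offset)
    ihl = (ver_ihl & 0x0F) * 4                   # header length in bytes; options vary it
    proto = frame[offset + 9]
    src_ip, dst_ip = struct.unpack_from("!4s4s", frame, offset + 12)
    info.update(tos=tos, proto=proto,
                src_ip=".".join(map(str, src_ip)),
                dst_ip=".".join(map(str, dst_ip)))
    if proto in (6, 17):                         # TCP or UDP: ports lead the L4 header
        sport, dport = struct.unpack_from("!HH", frame, offset + ihl)
        info.update(src_port=sport, dst_port=dport)
    return info
```

Note that the layer 4 offset depends on the VLAN tag and on the IPv4 header length, which is why a decoder that only inspects a fixed number of leading bytes can miss fields.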

The company emphasizes that the classifier, prioritizer, and bandwidth distribution of the chip are its most important features. They're responsible for the chip's quality of service.

The classifier uses the information from the packet decoder to differentiate packets. It can tell whether they're IP, IPX, unicast, multicast, or broadcast. It can tell where the packet has been and where it is going. This differentiation helps ISPs, for example, to discover who is utilizing the service.
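Two of these distinctions can be read straight from the layer 2 header. The sketch below is an illustration, not the chip's logic: it classifies the destination MAC address by its broadcast value and group bit, and maps the EtherType to a protocol name. It covers Ethernet II framing only; IPX in particular can also appear in other encapsulations that this example ignores.

```python
ETHERTYPES = {0x0800: "IP", 0x8137: "IPX"}   # Ethernet II EtherType values

def cast_type(dst_mac: bytes) -> str:
    """Classify a 6-byte destination MAC address."""
    if dst_mac == b"\xff" * 6:
        return "broadcast"
    if dst_mac[0] & 0x01:                    # group bit set in the first octet
        return "multicast"
    return "unicast"

def protocol(ethertype: int) -> str:
    return ETHERTYPES.get(ethertype, "other")
```

For example, `cast_type(bytes.fromhex("01005e010101"))` returns `"multicast"`, since IPv4 multicast groups map to MAC addresses beginning with 01:00:5e.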

With layer 4 information, the classifier can pick out multimedia traffic that users may want to prioritize. The order of priorities depends on how users want to program this flexible device. The hooks are there, enabling users to do what they want. The facilities exist for them to classify the packets based on all of this information. To receive the real benefit of the QoS features on this chip, they'll have to configure it.

The first layer of the classifier houses three tables. Effectively, the classifier performs one lookup on a packet-type table, two lookups on a host-group table, and two lookups on a layer 4 table. These results are combined in a main-rule table. SwitchCore points out that the user programs all tables. After the classifier block finishes processing, it outputs a 7-bit result. This allows for 128 traffic classes, which are used for bandwidth distribution and functions like prioritization, filtering packets, and so forth.
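The lookup sequence can be sketched in a few lines of Python. Everything specific here is hypothetical: the table contents, the choice to key the two host-group lookups on source and destination address and the two layer 4 lookups on source and destination port, and the exact-match main-rule table. The article states only the number of lookups per table and the 7-bit result.

```python
# All table contents below are hypothetical examples; on the chip the user programs them.
PACKET_TYPE = {("ip", "unicast"): 1, ("ip", "multicast"): 2, ("ipx", "unicast"): 3}
HOST_GROUP = {"10.0.0.1": 1, "10.0.0.2": 2}      # address -> host group id
L4_TABLE = {80: 1, 5004: 2}                       # port -> application group id
MAIN_RULES = {
    # (type, src group, dst group, src L4 group, dst L4 group) -> traffic class
    (1, 1, 2, 0, 1): 17,
    (1, 1, 2, 2, 2): 100,
}
DEFAULT_CLASS = 0

def classify(kind, src_ip, dst_ip, src_port, dst_port):
    """Combine five first-stage lookups into one 7-bit traffic class."""
    key = (
        PACKET_TYPE.get(kind, 0),        # one packet-type lookup
        HOST_GROUP.get(src_ip, 0),       # two host-group lookups
        HOST_GROUP.get(dst_ip, 0),
        L4_TABLE.get(src_port, 0),       # two layer 4 lookups
        L4_TABLE.get(dst_port, 0),
    )
    traffic_class = MAIN_RULES.get(key, DEFAULT_CLASS)
    return traffic_class & 0x7F          # 7-bit result: 0..127
```

Reducing each field to a small group id first keeps the main-rule table compact: rules are written against groups of hosts and applications rather than against every individual address and port.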

Once the packet is classified, the prioritizer can use any of the following parameters to put the packet into one of the eight output queues: 802.1p, type-of-service (ToS) field, traffic class, IP multicast group, layer 2/3 destination address, and VLAN source address. An order of precedence can be defined among these parameters.
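A precedence order of this kind can be modeled as an ordered list of parameters, each with its own mapping to a queue. The sketch below is illustrative only. The parameter names, the mappings, and the fall-through-to-default behavior are assumptions, and only three of the six parameters are shown.

```python
# Hypothetical per-parameter maps from a field value to an output queue (0..7).
QUEUE_MAPS = {
    "ip_multicast_group": {"239.1.1.1": 7},
    "traffic_class": {100: 6, 17: 3},
    "dot1p": {p: p for p in range(8)},
}
# Parameters are consulted in this order; the first one that matches wins.
PRECEDENCE = ["ip_multicast_group", "traffic_class", "dot1p"]
DEFAULT_QUEUE = 0

def pick_queue(packet: dict) -> int:
    """Return the output queue for a packet described by its decoded fields."""
    for param in PRECEDENCE:
        queue = QUEUE_MAPS[param].get(packet.get(param))
        if queue is not None:
            return queue
    return DEFAULT_QUEUE
```

A packet in multicast group 239.1.1.1 lands in queue 7 regardless of its traffic class, while a packet with no multicast match falls through to the traffic-class mapping, and so on down the list.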

A user may want to give an IP multicast group a very high priority for something such as a videoconference setup. If a particular packet isn't in this group, the prioritizer looks to the next parameter in the precedence order, for example, the traffic class. Based on the packet's traffic class, the prioritizer also can change the packet's ToS bits to provide QoS. This may be used in a function like DiffServ.
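Rewriting the ToS byte has a side effect worth noting: the IPv4 header checksum covers that byte, so it must be updated as well. The following Python sketch shows the operation in software. The function names are invented for this example, and the article does not describe how the chip handles the checksum internally.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement checksum over the 16-bit words of an IPv4 header."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def remark_tos(header: bytes, new_tos: int) -> bytes:
    """Return a copy of an IPv4 header with a new ToS byte and a fresh checksum."""
    hdr = bytearray(header)
    hdr[1] = new_tos & 0xFF                  # byte 1 of the IPv4 header is ToS/DS
    hdr[10:12] = b"\x00\x00"                 # zero the checksum field before recomputing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header with a valid checksum sums to zero under `ipv4_checksum`, which gives a quick way to verify the result of `remark_tos`.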

The CXE architecture also guarantees bandwidth. A user can, for example, guarantee a certain service-level agreement through a particular part of the network. Since the chip can pick out the types of traffic, it's able to distribute the bandwidth based on the traffic class from the classifier.

The CXE architecture uses an algorithm, developed in-house, that's based on fair queuing. It is shown as weighted fair-hashed bandwidth distribution (WFHBD) in Figure 1.

By using that distribution technique, a user can program the equivalent of a weighted round-robin method on the output. Using this method, for example, 10 high-priority packets may be sent to the output, maybe only eight from the next priority queue, and then just one from the lowest-priority queue. Still, that low-priority queue at least gets a turn. This is in sharp contrast to strict priority, where a high-priority packet always is sent first, and a low-priority packet may never be sent. The key here is the fair way in which the chip defines the flow groups within the traffic class.
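The 10/8/1 example corresponds to a plain weighted round-robin scheduler, sketched below in Python. This is the textbook technique the paragraph compares against, not SwitchCore's in-house algorithm, whose details the article does not give. The function name and packet labels are made up for the example.

```python
from collections import deque

def weighted_round_robin(queues, weights):
    """Yield packets, visiting queues in order and taking up to `weight` from each."""
    while any(queues):                       # an empty deque is falsy
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break
                yield q.popleft()

high = deque(f"H{i}" for i in range(20))
mid = deque(f"M{i}" for i in range(20))
low = deque(f"L{i}" for i in range(5))
order = list(weighted_round_robin([high, mid, low], [10, 8, 1]))
```

In each round the scheduler sends 10 packets from `high`, 8 from `mid`, and 1 from `low`, so the low-priority queue is served once every 19 packets. Under strict priority, `low` would not be served until both other queues were empty.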

Moving on to software support, it's the company's belief that software plays an important role for the CXE family. SwitchCore is concerned about time-to-market issues and feels this will be one of the advantages of an off-the-shelf device. Available low-level software drivers basically perform device-level operations, such as register accesses.

The software is modular and, according to the company, simple to operate. The drivers were actually developed for a Motorola PowerPC, but the hooks for the processor are separate files. Users can work with a different processor if they want.

The company plans to supply the software for PowerPCs and the pSOS real-time operating system. But it also thinks that it's relatively easy to perform, for example, a port to MIPS or VxWorks. The basic idea is that because the software is very modular, users can actually pick out the pieces they want. SwitchCore also supplies the source code for the low-level software drivers, so users can get in at that level if they want.

SwitchCore anticipates that the CXE-16 will primarily be used in enterprise applications. The chip could reside on one of the blades inside an enterprise switch. Otherwise, it could function as the backplane interconnect, joining a large number of these blades together.

It also might operate, for example, in a server farm. In that case, some kind of low-profile "pizza box" style enclosure would connect it to a number of servers with Gigabit Ethernet cards. According to the company, this would probably be the first application of CXE-16s, or all-gigabit switches, going directly to servers.

PRICE AND AVAILABILITY

The CXE-16 and CXE-1000 are sampling this quarter, with production starting in the fourth quarter of this year. Both chips come in an 836-pin EBGA package. The CXE-16 is priced at $950 each and the CXE-1000 at $450 each in quantities of 1000. The CXE-5000 CAM is sampling in the third quarter of this year, with production slated for the fourth quarter. Pricing is $100 each in quantities of 1000.

SwitchCore Corp., 675 N. First St., PH3, San Jose, CA 95112; (408) 995-3850; fax (408) 995-3858; www.switchcore.com.