10-Gbit Ethernet Switching Processor Blasts Barriers To The Backbone

The incessant drive for faster, lower-cost connections to the Internet backbone has set the stage for a struggle between the local-area network (LAN) and the wide-area network (WAN) for the heart and soul of the metropolitan-area network (MAN). On one side, the WAN wants to keep the installed base of OC-192-based protocols and systems intact. On the other side, enterprise networks and Internet service providers (ISPs) wish to leverage the high-volume, low-cost, well-known Ethernet technology that forms the basis of much of their infrastructure. Doing so allows them to migrate their Ethernet systems to 10-Gbit backbone connections at minimal cost, without having to drastically modify their existing network.

While the IEEE's High-Speed Study Group (HSSG) sorts through the myriad of political and technological issues involved in defining the physical links required, semiconductor manufacturers like Allayer Communications are hard at work readying silicon solutions for the back end. These ICs will provide the necessary network switching and processing that enterprises and ISPs are demanding. Such devices will allow the migration of 10/100-Mbit Ethernet to full 1-Gbit networks, which can then be connected directly to the Internet backbone over OC-192, or switched Ethernet over the MAN (at 10 Gbits/s). This will drastically improve ISP performance. It also will speed the delivery of popular, high-content multimedia web pages now being designed, while ensuring that the enterprise has a high-speed migration path.

The latest device from Allayer, the AL1032 switching processor, combines twelve 1-Gbit Ethernet ports with one 10-Gbit port. It's designed to connect Ethernet to the OC-192 optical backbone via the XGbitMII (XGMII), an already defined media-independent interface (Fig. 1). The AL1032 is the first device for multinode network switches to provide direct access to the Internet over an OC-192 optical connection. On-board flow control ensures a steady 10-Gbit output, though in reality, network traffic tends to be bursty. This means it isn't likely that all 12 ports will be at full throttle at any given moment.
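The port mix above implies a modest worst-case oversubscription of the uplink, which is why the bursty nature of real traffic matters. A back-of-the-envelope sketch, using the port counts from the article and variable names chosen here purely for illustration:

```python
# Worst-case uplink oversubscription for the port mix described above.
GIGABIT_PORTS = 12        # 1-Gbit Ethernet ports (from the article)
GIGABIT_RATE_GBPS = 1.0   # rate of each of those ports
UPLINK_RATE_GBPS = 10.0   # single 10-Gbit port (from the article)


def oversubscription_ratio(ports: int, port_rate: float, uplink_rate: float) -> float:
    """Offered load with every port at full rate, divided by uplink capacity."""
    return (ports * port_rate) / uplink_rate


ratio = oversubscription_ratio(GIGABIT_PORTS, GIGABIT_RATE_GBPS, UPLINK_RATE_GBPS)
print(f"worst-case oversubscription: {ratio:.1f}:1")
```

Only when all 12 ports send toward the uplink at line rate does the offered load exceed what the 10-Gbit port can carry.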

Of course, the way that the AL1032 works sounds deceptively easy. Gigabit Ethernet is well defined, and the OC-192, 10-Gbit optical links to the backbone are readily available. It's just a matter of deciding which one to use. The problem, however, arises when multiple Gigabit Ethernet ports converge on a single chip to increase integration and, thus, save on cost and space. There, they must be processed, switched, and routed via a 10-Gbit port to the backbone.

The sheer volume of data coming in and going out has brought many chip designs to their knees, because much more than simple switching must take place. Packet processing requires classification via policy filters, packet identification, and tagging, while routing requires support for multicasting and virtual LANs. Other features, such as link aggregation and control, support of network management functions, and class-of-service queuing, must be provided, too. All of this has to be done at wire speed and in real time if latency issues are to be avoided.

To date, a number of approaches have been tried. But for various reasons, none has succeeded. For example, one shared-memory approach achieves the high-speed output, but the interconnections between chips that allow scaling to multiple ports aren't defined. Another approach on the market has the required modularity to scale the ports, but it can't provide the high-speed link. Allayer leverages an advanced memory technology and policy engine, along with proprietary hashing algorithms for the lookup table, to achieve the modularity and high speed that are required.

How Does It Do That?

Data coming into the AL1032's 12 ports is immediately stored in a receiver FIFO. Then, the required packet information is sent to the parser registers, which extract the address and class-of-service fields and forward them to the lookup engine (Fig. 1, again). The lookup engine does a comparison to the internal, 32-kbyte, media-access-control (MAC) address table. That's loaded at power up and updated in real time as the chip processes data.
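To make the parsing step concrete, here is a minimal software sketch of pulling the address and class-of-service fields out of a standard Ethernet header with an optional 802.1Q tag. The AL1032 does this in hardware registers; the `ParsedHeader` type and `parse_header` function are hypothetical names for illustration only.

```python
import struct
from typing import NamedTuple, Optional

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


class ParsedHeader(NamedTuple):
    dst_mac: bytes
    src_mac: bytes
    priority: Optional[int]  # 3-bit 802.1p priority, None if the frame is untagged
    vlan_id: Optional[int]   # 12-bit VLAN ID, None if the frame is untagged
    ethertype: int


def parse_header(frame: bytes) -> ParsedHeader:
    """Extract the fields a lookup engine would need from the start of a frame."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    dst, src = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    priority = vlan_id = None
    if ethertype == TPID_8021Q:
        if len(frame) < 18:
            raise ValueError("truncated 802.1Q tag")
        # Tag control information: 3 bits priority, 1 bit DEI/CFI, 12 bits VLAN ID.
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        priority = tci >> 13
        vlan_id = tci & 0x0FFF
    return ParsedHeader(dst, src, priority, vlan_id, ethertype)
```

Only the first 14 to 18 bytes of each frame are needed here, which is consistent with parsing taking place while the rest of the frame still sits in the FIFO.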

The address table and the virtual-LAN (VLAN) database determine the destination port for a given frame. Multicast MAC addresses, including IP multicasts, can also be stored and searched. The device supports both port-based and tagged (802.1q and 802.3ac) VLAN lookup. It also supports 4k VLAN addresses with the 802.1s multiple spanning-tree option, as well as 802.1v VLAN classification by protocol and port, with flexible and programmable ingress and egress checking rules for VLAN processing.

VLAN support is key because it allows groups of users to be programmed to exclude multicast traffic from other ports. This improves efficiency as the network doesn't have the overhead and security concerns of having to listen to the traffic intended for other users.
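The effect of VLAN membership on multicast and broadcast traffic can be sketched in a few lines. The membership table and the `flood_ports` helper below are hypothetical and only illustrate the idea of port-based VLAN scoping, not the AL1032's actual ingress and egress rules.

```python
from typing import Dict, Set

# Hypothetical port-based VLAN membership: VLAN ID -> set of member ports.
vlan_members: Dict[int, Set[int]] = {
    10: {0, 1, 2},
    20: {3, 4, 5},
}


def flood_ports(vlan_id: int, ingress_port: int) -> Set[int]:
    """Ports that should receive a multicast or broadcast frame."""
    members = vlan_members.get(vlan_id, set())
    if ingress_port not in members:
        return set()  # ingress check fails, so the frame is dropped
    return members - {ingress_port}  # never send a frame back out its own port
```

A broadcast arriving on port 0 in VLAN 10 reaches only ports 1 and 2, so ports 3 through 5 never see it.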

To perform address lookup in real time at these speeds, Allayer had to come up with extremely efficient hashing algorithms. According to David Wong, director of marketing at Allayer, "Everyone can put this kind of speed into a chip. The question is, 'How fast can you do the lookup?'" He goes on to explain that "the first generation of switches out there were all fast enough to switch. The problem was that they couldn't read the table fast enough."

Approaches to date have used an external content-addressable memory (CAM). Typically, that's fast because it avoids the processing that hashing algorithms require. Given a fast-enough algorithm that reduces an address to a short-enough key (such as 16 bits versus 64 bits), however, the tradeoff in processing time falls in favor of hashing. In addition, it's important to note that the algorithms are processed in hardware, using state machines to improve performance.
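Allayer's hashing algorithms are proprietary and are not described here. As a generic illustration of the idea of reducing a wide lookup value to a 16-bit key, the sketch below XOR-folds a 48-bit MAC address plus a 12-bit VLAN ID (an assumed choice of input) and resolves collisions by chaining inside each bucket. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

KEY_BITS = 16
KEY_MASK = (1 << KEY_BITS) - 1


def fold_key(mac: int, vlan_id: int = 0) -> int:
    """XOR-fold a 48-bit MAC and 12-bit VLAN ID down to a 16-bit key."""
    value = (vlan_id << 48) | mac
    key = 0
    while value:
        key ^= value & KEY_MASK
        value >>= KEY_BITS
    return key


class MacTable:
    """Hash table keyed by the folded value; full entries disambiguate collisions."""

    def __init__(self) -> None:
        self.buckets: Dict[int, List[Tuple[int, int, int]]] = {}

    def learn(self, mac: int, vlan_id: int, port: int) -> None:
        bucket = self.buckets.setdefault(fold_key(mac, vlan_id), [])
        for i, (m, v, _) in enumerate(bucket):
            if m == mac and v == vlan_id:
                bucket[i] = (mac, vlan_id, port)  # station moved: update in place
                return
        bucket.append((mac, vlan_id, port))

    def lookup(self, mac: int, vlan_id: int) -> Optional[int]:
        for m, v, port in self.buckets.get(fold_key(mac, vlan_id), []):
            if m == mac and v == vlan_id:
                return port
        return None  # unknown destination; a switch would typically flood
```

The short key indexes the table directly, and the full address is compared only against the few entries that share a bucket, which is the kind of work a hardware state machine can do in a fixed number of cycles.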

The searching and classification tables are shared among all ports in a round-robin fashion. Any header replacement is accomplished while the frame is still in the receiver FIFO. From the FIFO, the frames are sent through a serial-to-parallel converter to a 1-Mbyte shared-buffer memory for storage and switching.

The memory is wide enough to switch up to 33 million frames/s. The biggest problem with getting data in and out of memory at these speeds is that a wide memory bus is required. Because it isn't cost effective to use external memory, this must be done on-chip. According to Wong, "The available technology is either SRAM or DRAM. Unfortunately, the DRAM process has a low yield, especially for high-gate-count devices. For its part, SRAM can make use of standard-logic processes, but it comes as a four- or six-transistor cell. Four-transistor SRAM is smaller, but consumes more power, and vice-versa for six-transistor SRAM," he explains.
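The article does not say how the 33-million figure is derived. One plausible reading, offered here as an assumption, is that it corresponds to minimum-size frames arriving at line rate on all twelve 1-Gbit ports and the 10-Gbit port at once, using the standard Ethernet framing overheads:

```python
# Standard Ethernet framing constants.
MIN_FRAME_BYTES = 64       # minimum frame, including the frame check sequence
PREAMBLE_SFD_BYTES = 8     # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum gap between frames
BITS_PER_MIN_FRAME_SLOT = 8 * (MIN_FRAME_BYTES + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES)


def max_frames_per_second(rate_bps: float) -> float:
    """Peak frame rate for minimum-size frames at a given line rate."""
    return rate_bps / BITS_PER_MIN_FRAME_SLOT


# Assumption: all 12 gigabit ports and the 10-Gbit port receive at full rate.
aggregate_bps = 12 * 1e9 + 10e9
aggregate_fps = max_frames_per_second(aggregate_bps)
print(f"{aggregate_fps / 1e6:.1f} million frames/s")
```

Under that assumption the result is roughly 32.7 million frames/s, which rounds to the quoted figure.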

Taking these issues into consideration, Allayer chose to partner with Mosys, which came back with its one-transistor SRAM that Allayer embedded on-chip. One-transistor SRAM has the advantages of both DRAM and SRAM. With only one transistor, it's both small and cost effective, while it also has relatively low power consumption. Plus, it has been proven to work in such high-density designs as electronic gaming, namely Nintendo. The 1-Mbyte buffer is split into two halves: one half is given over to the 10-Gbit port, and the other half is shared among the 12 1-Gbit ports.

In order to support quality of service (QoS), each output port has four priority queues. Their assignments are based on L2 to L7 classification, the TOS/DiffServ DS field protocol, or the 802.1p priority field protocol. Each output port retrieves the frames from the shared-buffer based on queuing and sends them to the transmitting FIFO.
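A minimal sketch of four priority queues on one output port follows. The article does not state how 802.1p values map to queues or which scheduling discipline is used, so both the two-values-per-queue mapping and the strict-priority service order below are assumptions, and the class and function names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional

NUM_QUEUES = 4


def queue_for_priority(pcp: int) -> int:
    """Map an 802.1p value (0-7) onto four queues, two values per queue (assumed)."""
    if not 0 <= pcp <= 7:
        raise ValueError("802.1p priority must be in the range 0-7")
    return pcp // 2


class OutputPort:
    def __init__(self) -> None:
        self.queues: List[Deque[bytes]] = [deque() for _ in range(NUM_QUEUES)]

    def enqueue(self, frame: bytes, pcp: int) -> None:
        self.queues[queue_for_priority(pcp)].append(frame)

    def dequeue(self) -> Optional[bytes]:
        """Strict priority (assumed): always serve the highest non-empty queue."""
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None
```

With strict priority, a frame tagged 7 leaves before one tagged 3 even if it arrived later. A weighted scheme would trade some of that latency advantage for protection against starving the low queues.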

In L2 to L7, there are 128 freely programmable filters that can work on any field in the packet. Comprising up to 512 rules, the filters make it a rule-based classification scheme. Possible actions include drop, change destination, reassign priority or VLAN tag, and statistics gathering. Not one of these is easy at 10 Gbits/s. Furthermore, the classifier includes a CPU trap for certain protocol packets. This trap is for functions like address resolution or link aggregation (supported by the AL1032) that need to be handled by the CPU.
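The rule-based scheme can be pictured as masked comparisons at arbitrary byte offsets, with the first matching rule selecting an action. The `Rule` structure, the action names, and the first-match policy below are assumptions made for illustration; the article does not describe the AL1032's filter format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    offset: int                     # byte offset into the packet
    mask: bytes                     # which bits of the field to compare
    value: bytes                    # expected value under the mask
    action: str                     # e.g. "drop", "redirect", "set_priority", "trap_to_cpu"
    argument: Optional[int] = None  # new port or priority, when the action needs one

    def matches(self, packet: bytes) -> bool:
        field = packet[self.offset:self.offset + len(self.mask)]
        if len(field) < len(self.mask):
            return False  # packet too short to contain the field
        return all((f & m) == (v & m) for f, m, v in zip(field, self.mask, self.value))


def classify(packet: bytes, rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule that matches, or None if no rule applies."""
    for rule in rules:
        if rule.matches(packet):
            return rule
    return None


# Example: trap ARP (EtherType 0x0806 at offset 12 of an untagged frame) to the CPU.
example_rules = [Rule(offset=12, mask=b"\xff\xff", value=b"\x08\x06", action="trap_to_cpu")]
```

Because a rule is just an offset, a mask, and a value, the same mechanism covers L2 fields such as the EtherType as well as fields deeper in the packet.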

A key feature of the AL1032 is its ability to disable local switching independently for the receive and transmit channels. Possible permutations include local switching for the Gigabit Ethernet ports only; no local switching, with all data funneled into the 10-Gbit uplink; and local L2 switching for the 10-Gbit port.

By disabling switching altogether, the AL1032 can function solely as a high-end multiplexer in a backplane application. Security, too, can be enabled or disabled on every port. Each port's address can be preprogrammed or frozen so that only those addresses on the allowed list can access the network. An alternative is to disable a port upon detection of intruders.
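The two security behaviors described above, an allowed-address list and disabling a port when an intruder is detected, can be sketched as follows. The `SecurePort` class and its fields are hypothetical and do not reflect the chip's register interface.

```python
from typing import Iterable


class SecurePort:
    def __init__(self, allowed: Iterable[str], disable_on_intruder: bool = False) -> None:
        self.allowed = set(allowed)          # preprogrammed or frozen address list
        self.disable_on_intruder = disable_on_intruder
        self.enabled = True

    def admit(self, src_mac: str) -> bool:
        """Return True if a frame from src_mac may enter the network on this port."""
        if not self.enabled:
            return False
        if src_mac in self.allowed:
            return True
        if self.disable_on_intruder:
            self.enabled = False             # shut the port until it is re-enabled
        return False
```

In the first mode an unknown source is simply filtered. In the second, a single unknown source takes the whole port down, including traffic from addresses that were on the list.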

For overall network management, the AL1032 collects all of the management information base (MIB) statistics that are required for the Simple Network Management Protocol (SNMP). Supported MIBs include Etherlike, Bridging, RMON and RMON II, as well as SMON.

As a whole, the device is initialized and configured by an off-chip CPU, which also is responsible for search and table updates, plus management functions. The CPU has a separate, 32-bit/66-MHz PCI port with its own transmit/receive FIFOs. Those also can be employed as a fourteenth port. Alternatively, the AL1032 has 4-kbyte EEPROM support for CPU-less operation in low-cost, standalone applications.

The XGMII used for the AL1032's uplink is still being defined by the IEEE's 802.3ae HSSG, although it has pretty much been decided upon. Of more concern are the flow-control methods required to support OC-192 (which runs at 9.6 Gbits/s). At present, there are two proposals: open loop and busy idle. To ensure compatibility, the AL1032 supports both throttling schemes. On the 12-port side, compatibility ranges from the 10/100/1000 MII/GMII to the ten-bit interface (TBI).
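The article does not detail either proposal. As a rough illustration of why throttling is needed at all, the arithmetic below shows how much extra idle time a 10-Gbit/s sender would have to add after each frame so that its average rate does not exceed 9.6 Gbits/s. This is a simplified sketch of gap stretching in general, not a description of the mechanism under discussion in 802.3ae, and the function name is hypothetical.

```python
MAC_RATE_MBPS = 10_000  # rate at which the 10-Gbit port clocks data out
WAN_RATE_MBPS = 9_600   # OC-192 rate as quoted in the article


def extra_idle_bytes(frame_bytes: int) -> int:
    """Idle bytes to add after a frame so the average rate fits the slower link.

    Sending n bytes at the MAC rate takes the same time as n * WAN/MAC bytes at
    the WAN rate, so the sender must pause for n * (MAC - WAN) / WAN byte times.
    The result is rounded up using integer arithmetic.
    """
    excess = MAC_RATE_MBPS - WAN_RATE_MBPS
    return -(-frame_bytes * excess // WAN_RATE_MBPS)
```

With these rates the ratio works out to one idle byte for every 24 bytes sent, so the pause grows in proportion to frame length.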

When it comes to implementing the AL1032, a key feature is the 802.3ad port-aggregation support that was mentioned before. This allows the grouping of ports into logical fat pipes, with up to six trunks, each supporting up to 12 ports.

Up to 16 remote ports can be supported within an aggregation group. This provides plenty of options in terms of combining AL1032s for an optimum balance of performance versus flexibility. The device's combination of flexibility and performance makes it a key enabler in the drive to get data off servers and networks and into the Internet backbone (Fig. 2).
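For the port aggregation described above, frames must be spread across the member ports of a trunk without reordering the frames of any one conversation. A common way to do that is to hash the address pair, as in the sketch below. The XOR-and-modulo choice is an assumption for illustration; the article does not say how the AL1032 distributes traffic.

```python
from typing import List


def select_member(src_mac: int, dst_mac: int, members: List[int]) -> int:
    """Pick a trunk member port for a frame based on its address pair.

    The same pair always maps to the same port, which keeps each conversation
    in order while different conversations spread across the trunk.
    """
    if not members:
        raise ValueError("aggregation group has no member ports")
    return members[(src_mac ^ dst_mac) % len(members)]
```

Because XOR is symmetric, both directions of a conversation land on the same member port, and removing a failed port from `members` redistributes the traffic across the remaining links.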

The device uses a 0.18-µm CMOS process, runs off 3.3/1.8 V, and comes packaged in a 721-pin TBGA. With a power consumption in the 5- to 6-W range, it's dwarfed by the expected consumption of the physical layer.

Price & Availability

The AL1032 is sampling now, and production quantities will be available in November. Pricing is $250 each in 1000-unit quantities.

Allayer Communications Inc., 107 Bonaventura Dr., San Jose, CA 95134; Contact Claus Stetter at (408) 570-0888; fax (408) 570-0880; Internet: www.allayer.com.