Content-Addressable Memories Speed Up Network Traffic And More

Although content-addressable memories have been available for over a decade, their high cost and low capacities limited their use to highly specialized applications. Now, thanks to advances in processing technology and multilevel metallization, very dense, high-speed multimegabit CAMs can be economically fabricated. They're even tackling ever-more-demanding applications as features like ternary capability supplement standard binary decision capabilities.

CAM functionality also can be implemented using standard memories and a support circuit that emulates the CAM features. If capacity rather than speed is the key concern, CAMs can even be done entirely in software. This software-only approach, referred to as a virtual CAM, can be a means to reduce system cost. Alternative approaches to CAMs compete in the same application space. One such approach, from Agere, combines standard SDRAMs with a search algorithm written in the company's functional protocol language (FPL).

CAMs still win out through the high acceptance they've gained in the networking and database arenas, however. They can accelerate any application that requires fast searches. Of course, many other application areas also can benefit. Image, voice, and pattern recognition are several such examples. When compared to software algorithms like binary or tree-based searches, CAMs can often deliver a tenfold or better reduction in search time. That disparity shrinks a little when the algorithms are run on latest-generation RISC engines that clock 500 MHz and faster. But even at those speeds, CAMs are still a 4X to 5X improvement over software searches.

CAM Mode Of Operation

As offshoots of the basic static RAM, CAMs work in roughly the opposite manner from a conventional memory. Rather than accept an address input that pinpoints a specific location and then deliver the data from that location, binary CAMs typically use the target data word as the input. They'll signal if a matching value is contained in any location. If the data is present, the chip sends out a flag signal that tells the system that it can proceed with the next operation, for example, forwarding a data packet to a node address that matched the CAM contents. If no match signal is received, the system might reload the chip with another block of data and try again. Or, it could pursue yet another search operation.

The organization of the CAM is radically different from the straightforward address-input, data-readout architecture of the SRAM. Data is stored somewhat randomly in a CAM. It could just be a simple table that's downloaded into the memory upon system startup. Or, it might be something more complex that could be adaptively updated during system operation.

An address bus can select the location that holds the desired data. Alternatively, the data could be written into the first available location. Two status bits, typically available on most CAMs, keep track of whether the data held in the location is valid. If it's no longer valid, it can be overwritten.

Once a CAM is "loaded," the desired match is found by first loading in the match value and holding it in a comparand register. Next, a simultaneous comparison takes place between the comparand and the values stored in all active locations. In traditional binary CAMs, this operation must find a 100% match with the stored data. If that occurs, the CAM chip's Match flag is asserted to let the system know that the desired data is matched (an Ethernet node address, for example).

Some system action can be taken based on the flag's assertion. For instance, packets can be routed to the node with the address that matched the table data held in the CAM. On-chip functions, such as a priority encoder, are included to help sort out which matching location has top priority if more than one match exists. The encoder then sends the address of the best match to the host.
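The load, compare, and priority-encode sequence described above can be sketched in software. This is only a behavioral model under stated assumptions: the class name `BinaryCAM`, its methods, and the choice that the lowest matching address wins are all illustrative and not taken from any vendor's part. A real CAM compares all locations in parallel in hardware, whereas the loop here visits them one at a time.

```python
from typing import List, Optional


class BinaryCAM:
    """Toy behavioral model of a binary CAM (hypothetical, not a vendor API)."""

    def __init__(self, depth: int, width: int) -> None:
        self.width = width
        self.words: List[int] = [0] * depth
        # One flag per location, standing in for the on-chip status bits.
        self.valid: List[bool] = [False] * depth

    def write(self, value: int, address: Optional[int] = None) -> int:
        """Store at the given address, or at the first free location."""
        if address is None:
            address = self.valid.index(False)  # raises ValueError when full
        self.words[address] = value & ((1 << self.width) - 1)
        self.valid[address] = True
        return address

    def search(self, comparand: int) -> Optional[int]:
        """Compare the comparand with every valid word.

        Returns the lowest matching address (assumed priority rule),
        or None when the Match flag would stay deasserted.
        """
        for address, (word, valid) in enumerate(zip(self.words, self.valid)):
            if valid and word == comparand:
                return address
        return None


cam = BinaryCAM(depth=8, width=48)
cam.write(0x00A0C9112233)            # stored at address 0
cam.write(0x00A0C9445566)            # stored at address 1
hit = cam.search(0x00A0C9445566)     # address of the matching entry
miss = cam.search(0x00A0C9000000)    # no entry matches
```

A host would treat a non-None result as the asserted Match flag and use the returned address to fetch the forwarding information.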

The latest enhancement to the CAM is a ternary capability, in which one or more bit positions in the data word can be set as "don't care" values. The search operation can then find one or more "closest-match" candidates. In the network area, such a capability would allow the distribution of packets to multiple addresses on a LAN or to multiple sub-LANs.
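A ternary compare can be modeled by pairing each stored value with a mask that marks which bit positions matter. The sketch below is a minimal illustration, assuming a mask bit of 1 means "compare this bit" and that the entry with the most compared bits is the best match; the function name `ternary_search` and that tie-breaking rule are assumptions, and real devices may instead rely on the ordering of entries.

```python
from typing import List, Optional, Tuple


def ternary_search(entries: List[Tuple[int, int]], comparand: int) -> Optional[int]:
    """Return the index of the matching entry with the most "care" bits.

    Each entry is (value, care_mask). Bits that are 0 in care_mask are
    "don't care" positions and always match.
    """
    best_index: Optional[int] = None
    best_care_bits = -1
    for index, (value, care_mask) in enumerate(entries):
        if (comparand ^ value) & care_mask == 0:
            care_bits = bin(care_mask).count("1")
            if care_bits > best_care_bits:
                best_index, best_care_bits = index, care_bits
    return best_index


# Hypothetical 32-bit table: two prefixes and a catch-all entry.
table = [
    (0x0A000000, 0xFF000000),  # top 8 bits must match
    (0x0A010000, 0xFFFF0000),  # top 16 bits must match
    (0x00000000, 0x00000000),  # every bit is "don't care"
]

specific = ternary_search(table, 0x0A010203)   # matches entries 0, 1, 2
broader = ternary_search(table, 0x0A020304)    # matches entries 0, 2
fallback = ternary_search(table, 0xC0A80001)   # matches entry 2 only
```

Here `specific` selects entry 1, `broader` selects entry 0, and `fallback` selects entry 2, which is the behavior a longest-prefix lookup needs.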

Extra Transistors For CAMs

Although CAMs are usually based on SRAM memory cells, the CAM cells are as much as 60% larger. A typical SRAM cell contains six transistors. Four form a cross-coupled latch and the other two control read and write activity. The basic CAM cell requires 10, with the extra four devices implementing the comparison function between the bit stored in the memory cell and that in the comparand register (effectively, an exclusive NOR operation). A ternary SRAM-based CAM cell could require up to 16 transistors. In contrast, a DRAM-based CAM cell can be implemented with just six transistors, making it a promising alternative to SRAM technology.

In the past, the larger cell area and the more complex interconnect needed on the chip limited CAM chips' capacity to densities of less than 100 kbits. But CAMs don't need external address lines to find the matching data, so array depth can be extended as much as desired by just adding more chips.

That's not true when trying to expand the width, though. The search must access all storage locations simultaneously, which makes it difficult to concatenate the match lines. Of course, CAMs do typically have very wide words. Their main application has been the matching of Ethernet network addresses, which are typically 48 bits. Additional protocol bits might also require matching, so typical word requirements include 64, 72, 96, 128, or 144 bits.

One alternative to widening the word by cascading chips is to program the internal word organization. Normally, a CAM chip would be designed with a single fixed organization, for example, 4096 words of 48 bits each. But by setting a few register bits, that chip also could appear to be a 2048-word by 96-bit memory, or some intermediate value as determined by market need. Just such a variable-architecture device is available from Kawasaki LSI. In soft form, a similar capability is available in NeoCore's NeoCAM technology.
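The trade-off behind a programmable organization is simple: the number of storage bits is fixed, so each doubling of the word width halves the depth. A small sketch makes that concrete. The function name and the restriction to width doublings are assumptions made for illustration, since the intermediate organizations a given chip supports depend on its design.

```python
from typing import List, Tuple


def organizations(total_bits: int, base_width: int, max_width: int) -> List[Tuple[int, int]]:
    """List (words, width) pairs reachable by repeatedly doubling the width."""
    result = []
    width = base_width
    while width <= max_width:
        result.append((total_bits // width, width))
        width *= 2
    return result


# The 4096-word by 48-bit example from the text.
options = organizations(total_bits=4096 * 48, base_width=48, max_width=96)
```

For the example array, `options` holds the two organizations named in the text: 4096 words of 48 bits and 2048 words of 96 bits.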

With today's deep-submicron processes and the ability to use four or more levels of metal interconnections, however, several megabits can be integrated. Up to 4 Mbits are possible today, and even larger versions are expected later this year. The use of dense, embedded DRAM cells, rather than SRAM cells, can even allow still higher capacities. So far, only MOSAID has unveiled a commercial 2-Mbit CAM design based on embedded DRAM. Although its capacity isn't larger than several SRAM-based devices, the DRAM-based chip is much smaller (and thus lower in cost) than comparable-density SRAM-based CAMs.

Compared to the general-purpose memory arena, a limited number of companies offer CAMs or CAM technology. Binary CAMs are available from Kawasaki LSI, Music Semiconductors, and NetLogic Microsystems, while UTMC Microelectronic Systems offers an engine-control chip that turns a bank of synchronous static memory or synchronous DRAM into a CAM. Ternary CAMs can be obtained from Lara Technologies, MOSAID, NetLogic, and SiberCore Technologies. A NeoCore software approach delivers a virtual CAM that can be implemented with software and standard system memory, or as an ASIC controlling a memory subsystem. Lastly, Agere's fast pattern-processor chip leverages the high off-chip density possible with standard SDRAMs.

In network applications, CAM size is often determined by the size of the router table. A small local-area-network workgroup might have fewer than 256 nodes, so it would only need a CAM that can hold up to 256 entries. The bigger the group or network, the larger the number of entries needed in the network routing table. Enterprise-class routers, for example, may require 64k or 256k table entries, while database search engines could require in the millions.

Word depths are offered at 2k, 4k, or 8k by the three members of the LANCAM 1st family of 64-bit-wide CAMs from Music. One unusual aspect of those chips is that users can partition the 64-bit-wide memory field into both CAM and RAM subfields on 16-bit boundaries. The contents of the memory then can be randomly accessed through a 16-bit I/O bus or associatively accessed by using a compare operation (Fig. 1).

In LAN applications, such as a bridge, the RAM subfield might typically hold port address and aging information. That data relates to the destination or source address information held in the CAM subfield of a given location. Or, in a translation application, the CAM field could hold the dictionary entries while the RAM field holds the translations. Such an arrangement would permit near-instantaneous responses.
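The CAM/RAM split can be pictured as slicing each 64-bit word on a 16-bit boundary: the CAM part is what gets compared, and the RAM part is the associated data returned on a hit. The sketch below assumes the CAM subfield occupies the most significant segments and the RAM subfield the rest; that layout, and the names `split_word` and `lookup_association`, are illustrative assumptions rather than the device's documented behavior.

```python
from typing import List, Optional, Tuple

WORD_BITS = 64
SEGMENT_BITS = 16


def split_word(word: int, cam_segments: int) -> Tuple[int, int]:
    """Split a 64-bit word into (cam_field, ram_field) on a 16-bit boundary."""
    ram_bits = WORD_BITS - cam_segments * SEGMENT_BITS
    cam_field = word >> ram_bits
    ram_field = word & ((1 << ram_bits) - 1)
    return cam_field, ram_field


def lookup_association(table: List[int], key: int, cam_segments: int) -> Optional[int]:
    """Compare the key with each CAM subfield; return the RAM subfield on a hit."""
    for word in table:
        cam_field, ram_field = split_word(word, cam_segments)
        if cam_field == key:
            return ram_field
    return None


# Bridge-style entry: 48-bit address in the CAM subfield, 16 bits of port data in RAM.
mac = 0x00A0C9112233
port_info = 0x0005
bridge_table = [(mac << 16) | port_info]

found = lookup_association(bridge_table, mac, cam_segments=3)
not_found = lookup_association(bridge_table, 0x00A0C9000000, cam_segments=3)
```

With three of the four segments assigned to the CAM subfield, a single compare returns the port data stored alongside the address.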

Two active mask registers are associated with the CAM array at any point in time. They can be selected to mask comparisons or data writes. Register 1 has both a foreground and a background mode that allow it to support rapid context switching. The second register's contents can be shifted left or right one bit at a time.

The memory array can perform a comparison in 100 ns. That translates into a potential 10 million matches per second, without taking system overheads into account. In an actual system, the match throughput might only hit about half of the maximum value.
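The throughput figure is just the reciprocal of the compare time. As a quick check of the arithmetic, with the 50% derating taken from the "about half" estimate above (the function name and the derating constant are illustrative):

```python
def peak_matches_per_second(compare_time_ns: float) -> float:
    """One compare per compare_time_ns, with no system overhead."""
    return 1e9 / compare_time_ns


peak = peak_matches_per_second(100)   # 10 million matches/s
realistic = peak * 0.5                # roughly half of peak in an actual system
```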

Two members of the LANCAM WL series, also available from Music, offer a 32-bit I/O bus and memory depths of 16 and 32 kwords. The wider bus allows higher-speed operation. The WL series runs twice as fast as the LANCAM 1st series.

Tackling the other end of the spectrum for applications that need just a little CAM, the MU9C3640L List-XL contains 256 words (64 bits), employs a 16-bit I/O bus, and can perform a compare operation in 90 ns. Offering twice the depth and a 70-ns comparison speed, the LANCAM family's MU9C5480A/L also can process both the destination and source address within 560 ns. That's equivalent to 111 10Base-T or 11 100Base-T Ethernet ports. Other LANCAM family members pack capacities of 512, 1024, 2048, and 4096 words.

Music also created two specialty CAM-based chips, the MUAA and the MUAC routing coprocessors. The MUAA is targeted at processing MAC addresses in multiport switches and routers. It can handle up to 48 10/100 Ethernet ports or four Gigabit Ethernet ports. It also can perform Layer 4 flow recognition for quality of service at rates of up to 16.7 million packets/s. Available with word depths of 2, 4, or 8 kwords, the 80-bit-wide on-chip CAM provides two interface ports: a bidirectional 32-bit processor port and a 32-bit synchronous port with separate inputs and outputs to speed data movement.

Targeting best prefix-match searches of IPv4 addresses, the MUAC is a binary- or ternary-capable device. It can handle 20 million IPv4 packets/s, which allows the chip to support up to eight Gigabit Ethernet ports or a pair of OC-48 ATM ports at wirespeed. The DA and SA processing can be done in less than half the time of the MUAA: just 250 ns is needed. Internal comparison operations can run on a 32- or 64-bit word, with a 50-ns deterministic compare and output time possible. Up to seven mask registers are available to determine the best match results.

Others Targeting IPv4

Other new products are aiming at network IPv4 prefix matching, too. The UTCAM-Engine from UTMC controls an array of SDRAMs or synchronous SRAMs to form CAM systems capable of more than 50,000 entries (Fig. 2). The engine can be configured to support multiple tables, each with different widths and depths, within the attached memory. This ability to support multiple tables and association widths enables designers of Layer 2, 3, and 4 switches and routers to handle all address-processing requirements with one device.

Able to hit a speed of up to 100 MHz, the UTCAM-Engine can configure the memory as a single table or partition it into as many as 8191 uniquely configured tables. With such flexibility, designers can use any combination of key widths, association widths, and table sizes to optimize system performance. When performing exact matches, the engine can process 10 million packets/s. For 32-bit longest-prefix matches, the processor slows down to 4 million packets/s. The host interface allows table updates to be done in less than 1 ms. The control interface, on the other hand, accepts tables with sizes ranging from 256 to 64 million entries and key widths and association widths of up to 256 bits.

The engine brings along several features that give it some powerful search capabilities. A hierarchical search capability permits tables with different key lengths to be linked hierarchically and searched in sequence for the most significant bytes of the key. If a table overflow occurs, this capability also lets the tables be expanded dynamically.

A record-count feature can help an application determine whether a table is full and if the system must take corrective action. With its proximity feature, the engine can examine an entire table to identify the record that most closely resembles the key presented. Optionally, either the entire record (key and association pair) is returned, or just the association.
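A proximity search of this kind can be sketched as a nearest-record lookup. The article does not say how the engine measures "most closely resembles," so the sketch below assumes Hamming distance (the number of differing bits) purely for illustration; the function name and the `association_only` flag are likewise hypothetical.

```python
from typing import List, Tuple, Union

Record = Tuple[int, str]  # (key, association)


def closest_record(table: List[Record], key: int,
                   association_only: bool = False) -> Union[Record, str]:
    """Return the record whose key differs from `key` in the fewest bits.

    Assumed metric: Hamming distance. Ties go to the earliest record.
    """
    best = min(table, key=lambda record: bin(record[0] ^ key).count("1"))
    return best[1] if association_only else best


records = [
    (0b11110000, "port A"),
    (0b10101010, "port B"),
    (0b00001111, "port C"),
]

whole = closest_record(records, 0b11100000)
assoc = closest_record(records, 0b11100000, association_only=True)
```

The flag mirrors the option described above of returning either the full key and association pair or just the association.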

Taking quite a different approach, Kawasaki's longest-match engine controls an off-chip bank of EDO DRAM to hold the CAM tables and keys. Yet it can process over 4 million packets/s. Focused heavily on network applications, the KLSI KE5BLME008 employs a triple-port architecture that allows internal multitasking. Updates, downloads, and match outputs occur with a minimal latency of just 420 ns.

With the three ports, the chip can input a match value, deliver a result, and use the host CPU port for control. The longest-match engine also is available as a block of intellectual property (a core) that can be used to craft a custom silicon solution. In addition to the engine, KLSI offers several CAMs that pack configurable CAM spaces. The designer can program these spaces to tailor the chip to the system.

Staking a claim for some of the highest performance numbers to date, a 4-Mbit ternary CAM developed by NetLogic Microsystems can perform as many as 83 million address or data-comparison tasks per second. The NL877313 also packs a user-definable memory array that can be configured for data widths of 72, 144, or 288 bits. The wide entry words and large table depth promise to greatly improve the performance of address-table lookups in enterprise, campus, and Internet edge routers. Also referred to as the IPCAM-3, the ternary CAM has a wide-word mode (288 bits) that lets it readily handle version 6 packets of the Internet protocol (IPv6).

To achieve high performance, the IPCAM packs four buses: an instruction bus, a comparand bus, a results bus, and a new one, the next-available free-address (NFA) bus. Unlike the systems using basic ternary CAMs, those built around the IPCAM-3 can use the NFA bus to write data to off-chip memory while the system is writing to the CAM. Such simultaneous operation makes buffering the table data unnecessary, simplifying system design and improving throughput.

Two versions of the chip are available. One, the NL877213, boasts the 4-Mbit on-chip CAM space. The slightly larger NL77313 packs about 5 Mbits of storage. Samples of larger chips are expected next quarter. These chips supplement the company's IPCAM-2, a 1-Mbit ternary CAM that's configured as a 64-kword by 128-bit array. The chip can be used in large enterprise and campus router applications. With just one or two of the chips cascaded, the CAM system could handle from 50,000 to 60,000 addresses.

Classless interdomain routing also became the goal of designers at NetLogic. Like Kawasaki, UTMC, and Music, they created an address-processor chip. The NL77542 packs 1 Mbit and is organized as 32 kwords by 40 bits. Table updates can be done at 66 million per second when the chip is clocked at 66 MHz. A slower version that hits 50 MHz is available as well.

Leveraging its know-how in using embedded DRAM, MOSAID Technologies has created a 2-Mbit configurable CAM, the DC2144-15T. As the first member of the Class-IC family, this CAM is capable of ternary operation. It can be used in applications such as classless interdomain routing, flow analysis, advanced virtual LAN support, and high-performance packet classification. Multiple masks support various search scenarios on a single database, while the memory configurability allows the circuit to handle different databases on the same chip.

DRAM: Space Vs. Cost

Due to the use of DRAM technology, the chip area is considerably smaller than that of an SRAM-based CAM. It follows that it also will carry a significantly lower price. To deliver a sustained search throughput when running at a 66-MHz clock rate, the Class-IC chip employs a double-data-rate memory interface on the data load port. Because data can be transferred on both edges of the clock, the chip can perform searches with only half the number of pins required by competing solutions.

Associated with each CAM entry are a number of special bits that the memory uses to encode the type and validity of that entry. In this first-generation Class-IC family, bits for Empty, Skip, Permanent, and Age are available. Empty status is an obvious requirement for updating the table. Skip is important for managing pre-allocated, but empty, locations in the CAM. It also allows the user to walk through a series of multiple matches.

Age is a single-bit indicator that's updated whenever there's a referral to an entry. By using this bit, the CAM management software knows which entries are "stale" and can purge them after a specified amount of time. The Permanent bit protects an entry against this purging due to the entry's age. The chip also handles Learning and Aging functions. That way, it can support Layer 2 bridging applications in switches.
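How management software might use these bits can be sketched as a periodic sweep. The policy below is an assumption for illustration only: a hit sets the Age bit, each sweep clears it, and an entry found with the bit still clear on the next sweep is treated as stale and marked Empty unless it is Permanent. The class and function names are hypothetical and do not describe the chip's actual register interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    key: int = 0
    empty: bool = True       # location holds no usable data
    skip: bool = False       # location is passed over during searches
    permanent: bool = False  # entry is exempt from age-based purging
    age: bool = False        # set on a hit, cleared by the sweep


def lookup_and_touch(table: List[Entry], key: int) -> Optional[int]:
    """Find a live entry and mark it as recently referenced."""
    for index, entry in enumerate(table):
        if not entry.empty and not entry.skip and entry.key == key:
            entry.age = True
            return index
    return None


def aging_sweep(table: List[Entry]) -> int:
    """Purge entries not referenced since the previous sweep; return the count."""
    purged = 0
    for entry in table:
        if entry.empty or entry.permanent:
            continue
        if entry.age:
            entry.age = False      # give the entry one more interval
        else:
            entry.empty = True     # stale: free the location
            purged += 1
    return purged


table = [
    Entry(key=0x1111, empty=False),
    Entry(key=0x2222, empty=False, permanent=True),
    Entry(),
]
lookup_and_touch(table, 0x1111)     # entry 0 is now marked as referenced
first_pass = aging_sweep(table)     # nothing purged, Age bit cleared
second_pass = aging_sweep(table)    # entry 0 purged, permanent entry kept
```

Running the sweep on a timer gives the "purge after a specified amount of time" behavior, with the interval set by how often the sweep is called.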

Additional ternary CAMs are available from Lara Technologies and SiberCore Technologies, two fairly new companies that are providing support for the network industry. Able to keep pace with the NetLogic chip at 83 million searches per second (either exact or longest prefix match), Lara's LTI7010 can work with tables as large as 16,384 entries featuring entry widths as wide as 272 bits (Fig. 3). Multiple CAM chips can be cascaded to allow the construction of large tables. Up to 1 million entries are possible. Each entry is maskable on a bit-by-bit basis, permitting the user to store and compare ones, zeroes, and don't-cares in each location.

Wide-word capability—up to 256-bit words—also is offered by the SiberCAM family from SiberCore. The family has ternary longest-prefix-matching capability as well. The company claims the speed crown with its chips—up to 100 million sustained searches per second. Read and write operations don't steal search cycles, so the CAMs can deliver optimal performance.

One extra feature thrown into the SiberCAM is a low-power architecture that trims the power consumption, making it the lowest-power ternary CAM available to date. Initially, the company released two versions of its chip: one with 2 Mbits of on-chip memory, and the other with 8 Mbits of on-chip ternary CAM storage. Either can be configured in one of two different modes.

In the first, the chip is optimized for table lookup operations and employs three ports: a comparand input port, a search output port, and a non-intrusive table-management port. The second mode differs in that the chip is optimized for low pin count and the comparand and table-management ports are combined into a single port.

CAM Emulation

Using software to emulate a CAM is the first development from researchers at NeoCore. The NeoCAM virtual engine can run on a host processor and turn a bank of SDRAM into a CAM of almost any desired depth. It provides all of the capabilities and performance acceleration aspects of a traditional CAM. Plus, it adds on-the-fly capability to adjust the key width and depth.

The company offers the virtual CAM engine as a software solution, but it's also used as an ASIC solution (a chip set). Both use the same application programming interface (API). That API, part of the software-development kit, can be used to model the virtualized technology.
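The general idea of a software CAM over ordinary memory can be illustrated with a hash table. To be clear, this is not NeoCore's algorithm or API, neither of which is described here; it is a generic stand-in that uses Python's built-in dictionary, and the class name `VirtualCAM` and its methods are invented for the example. It does show why key width and depth become soft parameters once the table lives in standard memory.

```python
from typing import Dict, Optional


class VirtualCAM:
    """Hash-table stand-in for a software CAM (illustrative only)."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._table: Dict[bytes, int] = {}

    def resize(self, max_entries: int) -> None:
        """Change the depth at run time; existing entries must still fit."""
        if max_entries < len(self._table):
            raise ValueError("new depth is smaller than the current entry count")
        self.max_entries = max_entries

    def store(self, key: bytes, association: int) -> None:
        """Insert or update a key of any width."""
        if key not in self._table and len(self._table) >= self.max_entries:
            raise RuntimeError("virtual CAM is full")
        self._table[key] = association

    def search(self, key: bytes) -> Optional[int]:
        """Return the association for an exact key match, or None."""
        return self._table.get(key)


vcam = VirtualCAM(max_entries=2)
vcam.store(bytes.fromhex("00a0c9112233"), 7)   # 48-bit key
vcam.store(b"\x01" * 16, 9)                    # 128-bit key in the same table
vcam.resize(4)                                 # grow the depth on the fly
port = vcam.search(bytes.fromhex("00a0c9112233"))
```

The lookup cost depends on hashing rather than on a parallel hardware compare, which is the speed-for-capacity trade the virtual approach accepts.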

An alternative to CAMs comes from another relatively new company, Agere. Its APP1200 fast pattern processor provides a programmable engine capable of handling 2.5-Gbit/s data streams and processing over 6 million packets/s today. The company expects the roadmap it set up to lead to packet-processing speeds of beyond 50 million packets/s and wirespeed operation at up to 20 Gbits/s.

To achieve such performance levels, Agere developed a high-level functional-protocol language (FPL) that lets designers program the fast pattern processor. The processor can then recognize and classify incoming packets based on millions of data patterns. In FPL, instructions are coded like a protocol definition language, with interspersed action statements and the ability to embed routing table information directly into the protocol code.

The chip performs complex pattern or signature recognition, and operates on the packets or cells that contain those signatures. Programs written in FPL are compiled and loaded onto the host processor for execution. The code is optimized for packet processing. Programming in FPL is much simpler than with conventional languages, such as C++.

Manufacturers Of Content-Addressable Memories
Agere Inc. (512) 502-2800 www.agere.com
Kawasaki LSI (408) 570-0555 www.klsi.com
Lara Technologies Inc. (408) 519-500 www.laratech.com
MOSAID Technologies Inc. (613) 599-9539 www.mosaid.com
Music Semiconductors Inc. (408) 232-9060 www.music-ic.com
NeoCore Inc. (719) 576-9780 www.neocore.com
NetLogic Microsystems Inc. (650) 961-6676 www.netlogicmicro.com
SiberCore Technologies (613) 271-8100 www.sibercore.com
UTMC Microelectronic Systems (719) 594-8000 www.utmc.com