Content-Addressable Memory Speeds Up Lossless Compression

DESIGN VIEW is the summary of the complete DESIGN SOLUTION contributed article, which begins on Page 2.

When software-implemented compression schemes run out of steam, give your CPU a break by employing these speedy hardware-based techniques.

Companies want to reduce their WAN bandwidth requirements to minimize cost and reduce network congestion. As a result of these forces, considerable interest is bubbling up in the area of data compression as a way to reduce the number of packets that must be transmitted across networks.

Two broad categories of compression are currently in use. In lossy compression, data is intentionally discarded. As a result, the decompression of the data doesn't exactly match the original data. This method will allow for higher compression ratios and is useful if the lost data has minimal effect on the decompressed output. Typical applications for lossy compression include audio and video compression.

Lossless compression is used for applications where the original data must be fully restored following decompression. Examples of applications requiring lossless compression include network data, medical and satellite imaging, and military traffic.

Until now, compression algorithms such as the Lempel-Ziv-Welch (LZW) have been implemented in software. This provided acceptable compression performance in many older systems. But with today's increasing numbers of connections, Web servers are now a bottleneck. CPU-intensive compression algorithms can no longer scale to meet next-generation requirements. Plus, as more and more content is served up dynamically, the CPU load increases significantly.

As with many such examples that predate compression, the limitations of software-based implementations are now creating an interest in the migration of this compression technology to hardware—offloading the CPU to perform higher-level functions. However, very little dedicated compression hardware currently exists. But content-addressable memories (CAMs), traditionally targeted at high-speed lookups for packet classification, can represent a key element in the acceleration of lossless compression.

HIGHLIGHTS:
Encoding And Decoding	The LZW algorithm encodes variable-length strings of symbols as single codes. Con-struction of the dictionary used to decode is similar to that of the encoding table. But, the tables are used differently. The encoder relies on content searching, while the decoder can be implemented using conventional memory.
CAMs	Content-addressable memories (CAMs) speed the searching of content data. In a conventional device, an address is supplied and data is read from the memory. In a CAM, data is presented to the CAM and a parallel search is performed for a match.
Implementation Examples	A basic approach yields an average input bit rate of 78 Mbits/s. An enhanced, CAM-optimized scheme boosts the average input rate by 4×

Full article begins on Page 2

IT professionals continually seek ways to postpone infrastructure upgrades, looking instead to maximize the efficiency of their existing networks. In many cases, they’re further tasked with deploying additional services without increasing total available bandwidth. That last mile continues to be a significant bottleneck for content providers. Companies want to reduce their WAN bandwidth requirements to minimize cost and reduce network congestion. As a result of these forces, considerable interest is bubbling up in the area of data compression as a way to reduce the number of packets that must be transmitted across networks.

Data compression is the generally well understood process of identifying patterns within a data stream and replacing these repetitive segments with shorter symbols, thereby reducing the size of the original data stream. Decompression reverses this process, converting the symbols back to their original form.

Two broad categories of compression are currently in use. In lossy compression, data is intentionally discarded. As a result, the decompression of the data doesn’t exactly match the original data. Lossy compression allows for higher compression ratios and is useful for data where the lost data has minimal effect on the decompressed output. Typical applications for lossy compression include audio and video compression.

Until now, compression algorithms such as the Lempel-Ziv-Welch (LZW) have been implemented in software. This provided acceptable compression performance in many older systems. But with today’s increasing numbers of connections, Web servers are now a bottleneck. CPU-intensive compression algorithms can no longer scale to meet next-generation requirements. Plus, as more and more content is served up dynamically, the CPU load increases significantly.

As with many such examples that predate it, the limitations of software-based implementations are now creating an interest in the migration of this compression technology to hardware—offloading the CPU to perform higher-level functions. However, very little dedicated compression hardware currently exists. Content-addressable-memory (CAM) technology, traditionally targeted at high-speed lookups for packet classification, represents a key element in the acceleration of lossless compression.

Compression Techniques Lossless compression techniques have been available since the early 1950s. In 1952, David Huffman introduced Huffman coding, a technique based on a coding tree derived from a statistical model of the source. With Huffman coding, statistical algorithms calculate the frequency and conditional probability of data, coding the data by giving the shortest code to those elements with the highest probability of occurrence. Typically non-adaptive, these techniques are best suited to specific applications where the data is relatively consistent and predictable. Unfortunately, network traffic is neither consistent nor predictable, so Huffman coding isn’t a good option for modern communication infrastructures.

Dictionary-based schemes copy repetitive or redundant data into a lookup table (such as a CAM) and output the dictionary address as a code to replace the data. The Lempel-Ziv algorithm is an example of an encoded dictionary that replaces a continuous stream of characters or symbols with codes. Because the relationship between a code and the original symbol varies as the data varies, this approach is more adaptive to variability in the source data, making it a much better fit for internetworking applications. Larger dictionaries can be used to optimize the compression ratios. But this must be carefully balanced against performance degradation arising from limitations in the selected dictionary lookup implementation and code-word length increase.

LZW Encoding The LZW algorithm encodes variable length strings of symbols as single codes (see the Code Listing). It’s based on a progressive, grammatical analysis of incoming symbols. The coding table, or dictionary, stores new vocabulary as it’s encountered. Therefore, the encoder input is a sequence of symbols (x₀x₁x₂x₃) and the output is a sequence of indexes or code words (i₀i₁i₂i₃). These indexes correspond to positions in the dictionary (addresses in the memory).

Each time a new symbol string is presented to the encoder, it gets added to the dictionary. At system initialization, the dictionary contains a source alphabet, typically the 256 possible combinations of 8-bit symbols plus some additional control characters. When a string is presented, the encoder concatenates the first symbol with the second symbol and performs a lookup. If a match is found, the encoder then concatenates a third symbol and performs another lookup. This continues until the concatenation of subsequent symbols results in no match. Then the index corresponding to the longest existing string is transmitted and the new string is written to the dictionary.

LZW Decoding Construction of the dictionary used to decode is similar to construction of the encoding table. Once the table is created, the decoder gets the index sequence as its input and produces the corresponding source symbol string as output. Each time an index is received, the decoder examines its dictionary and translates the index into a sequence of source symbols. The decoder performs no parsing or search operations.

Although similar, the tables are used differently by the encoder and decoder. The encoder implements symbol strings as input. The dictionary is searched for the symbol string and, if a match is found, an index or address corresponding to the location of the symbol string is returned. The decoder uses indexes as input and the content of the dictionary at the location specified by the index is returned to reconstruct the original string. As such, the encoder relies on content searching where the decoder dictionary can be implemented using conventional memory techniques.

Content-Addressable Memories Content-addressable memories are devices that accelerate the searching of content data. In a conventional memory device, an address is supplied and data is read from the memory. In a CAM, the opposite is true. Data (a search key) is presented to the CAM and, in a single clock cycle, a parallel search of the memory array is performed. The device then returns a match address or index that corresponds to the physical location of the matched entry. Typically, in networking applications, the match address is then used to perform a read in an associated data table. However, the ternary masking ability (the ability to store 0, 1, and a "don’t care" state) of modern CAMs makes them extremely well suited for compression applications as well. Traditional implementations of the LZW algorithm use the following tables:

Prefix Table: stores the prefixes of the strings in the dictionary. Entries for single symbols are empty.
Appended Symbol Table: stores the last symbol of each string. For single-symbol strings, the symbol itself is stored
Index Table: supports hashing techniques for the search operation.

Implementing LZW Compression Using CAMs When implementing the LZW algorithm using CAMs, the search function is performed in parallel in hardware. The strings can be stored in sequential order and in contiguous regions without any adverse effect on search performance. Obviously, the requirement for the index table disappears in this implementation.

Data contained in the prefix table and appended symbol table are stored as entries in the CAM. The corresponding data from each table is concatenated to form a unique table comprising pairs of prefix indexes with appended symbols (Fig. 1).

It takes 16 bits to encode an entry location (prefix address) or a table size of 64K locations, and an 8-bit value is used to represent the appended symbol. Each index/symbol pair can therefore be stored using 24 bits. Most modern ternary CAMs support entries with widths that are multiples of 72 bits. So, if the CAM word width is set to 144 bits, one CAM location can store five pairs, with the unused bits set aside for management functions like dictionary maintenance. Using a commercially available 9-Mbit device, up to 320K entries can be stored in a single devicesingle device.

When the dictionary reaches maximum capacity, different approaches can be adopted to manage the situation. In the most severe case, the entire table can be purged, typically in a few clock cycles. In situations requiring a more sophisticated approach, the management bits can be used to support multistate purging to optimize this maintenance function.

Using this basic approach shows that an average input bit rate of 78 Mbits/s can be achieved on the Calgary Compression Corpus files. The most referenced corpus in the data compression field for text compression, the Calgary Compression Corpus is the de facto standard for lossless compression evaluation.

High-Speed LZW Implementation Using CAMs The previous implementation stores strings as a prefix string/appended symbol pair. While this approach is optimized from the perspective of storage efficiency, it doesn’t fully realize the benefits of CAM-based compression. Because this storage scheme requires a search for each input symbol, the number of searches needed ultimately limits overall system performance. High-speed implementations, such as the one discussed below, begin to realize all benefits of modern ternary CAM technology.

To reduce the number of searches required and increase performance, the strings can be stored explicitly. This results in some redundancy of storage, but dramatically lowers the number of searches required. This, in turn, increases the number of symbols that can be processed in a given time.

In this implementation, the algorithm takes advantage of the ability of CAMs to perform a parallel search and return the longest match. It further benefits from the CAMs’ ability to resolve potential multiple matches, as one might expect to see in an explicitly stored dictionary. To ensure that the longest match is returned, entries in the table are stored in reverse order, with the first entry stored at the last address. Subsequently, addresses are decremented as entries are written to the table. To ensure the indexes’ output by the encoder can be made compatible with traditional LZW implementations (increasing order), the encoder transmits the difference between the maximum address (0xFFFF) and the match address.

Modern CAMs can store entries of various widths. In many cases, 288 bits or even wider word widths are possible. For simplicity, the implementation presented will be limited to 144 bits (18 bytes), although the reader may wish to evaluate different dictionary widths for specific applications. The strings are stored as a succession of 8-bit symbols and "don’t care" bits. During search operations, a void mask is used to look up the whole string of entry bits. An associated data array containing the length of the entries (1 to 18 bytes) is stored externally.

Figures 2a and 2b detail the implementation of the encoding algorithm. First, the CAM must be initialized to operate in 144-bit word-width mode with a 72-bit I/O port operating in double-data-rate mode. The wide I/O port is necessary to let the controller launch a 144-bit search each clock cycle. Mask register 0 is initialized as a void mask, ensuring that no bits in the search key are masked when performing search operations. Mask register 1 is used to write the first 256 8-bit symbols to the CAM. The CAM’s memory is used in reverse order. The multiple-match with highest-priority address output capability, and the prefix property of the LZW algorithm, ensures that the longest matching strings are returned. The variable String always contains the next 18 bytes to be encoded, with low-order bytes as the most recent. String may, at the end of the encoding process, contain less than 18 bytes. To simplify the flow chart, the last bytes of the input string are shown as transmitted in plain text.

The result of a search operation is always a match. At the very least, the first byte matches the initialization area. The highest-priority match address represents the longest string matching the input data. The length of the matching entry is deduced by retrieving the data associated with the entry in the associated data memory. This architecture is preferred as it avoids the requirement of a read from CAM memory, thereby maximizing performance.

If the matched length is equal to the maximum allowed length (18 bytes), then the match address is transmitted on Index_precision bits and no update of the dictionary is performed. Where the match length is less than the maximum permitted length, a new entry consisting of the (Length+1) first symbols (high-order bytes) of String is added to the dictionary. Mask register 1 is used for the write, and all bytes beyond the (Length+1) high order bytes are set to "don’t care."

In all cases, String is shifted by the matching Length bytes to the left, allowing new bytes to be input on the right. In each step of the algorithm (one loop), Length bytes are encoded. This implies a minimum of 1 byte per search and a maximum of 18 bytes per search.

Maintenance routines in this implementation are similar to those presented earlier. Each time the number of entries in the table (0xFFFF-Address Register) reaches a power of two boundary, the precision of the indexes is augmented by one bit. An external address register is maintained, with the address register starting at the maximum value (0xFFFF) and decrementing to 0, indicating the memory is full.

Two entries in the CAM must be reserved for the End Of String (EOS) and Flush codes. The first index is encoded at the end of the input string (file). The second index notifies the encoder that a flush has occurred. Modern CAMs provide mechanisms to protect static entries such as these control codes, preventing them from being purged or erroneously included as data in a search.

Due to the nature of the LZW algorithm, the input bit rates aren’t constant, varying according to the nature of the messages being coded (statistics of the source). The table shows the average results for both the achieved input bit rates and the compression ratio on the Calgary Compression Corpus files. The dictionary isn’t flushed when it’s full; rather it’s used for the remaining encoding of the file. A 64K entry dictionary was used for this implementation.

The average observed bit rate using this implementation is approximately 315 Mbits/s and represents an increase in performance of approximately 400% over the previous implementation. Using look-ahead search techniques to effectively pipeline searches can further improve overall performance by 30% or more.

Implementing more-advanced table maintenance and full-state handling can enhance performance even more. By monitoring compression ratios, complete dictionary flushes can be replaced by partial flushes. Special status bits included in some CAMs can be used to create multiple classes of entry within the dictionary, increasing flexibility for dictionary maintenance.

The flexibility and performance available in modern CAM technology allows for high-performance compression by leveraging the parallel search capabilities inherent to this technology. Software has long been the mainstay of the compression world, and compression speeds of up to 50 to 75 Mbits/s can be achieved using a dedicated processor.

The first implementation discussed matches the performance of the best software solutions at a significantly reduced cost. The second solution provides a four- to five-fold increase in performance. Further optimization offers the promise of even higher performance. In addition, higher-performance CAMs are also becoming commercially available. As performance is a function of search speed, data compression input rates will scale along with them.

Using CAMs in data compression isn’t limited to the LZW technique. Implementations of the Huffman Decoding algorithm, the LZ787 (Gzip) algorithm, and lossless image compression techniques can be ported to CAM implementations to realize the benefits of parallel search technology.