The semiconductor industry continues to track Moore's Law, with device densities doubling roughly every 18 to 24 months. With process technologies pushing well into the deep-submicron arena, IC designers have finally reached the point where they can integrate significant densities of memory and logic together on the same chip. In doing so, they have ushered in the system-on-a-chip (SoC) era.
Integrating memory on-chip isn't a new concept, of course. Microcontroller designers have done it for years. But the emergence of multimillion-gate ASIC designs has increased the demand for a wide range of embedded memory options and applications. Over the past few years, the prospects for embedding DRAM and flash memory on-chip have received a great deal of attention. But developing a single process to maximize the performance of both memory and logic circuits has been an ongoing struggle.
In the highly anticipated embedded DRAM arena, for instance, developers have faced an inherent contradiction between the demand to maximize device and interconnect performance for logic circuits, and the need to maximize retention time and reduce cost for DRAM circuits. Integrating the two technologies into a single process capable of eliminating the expensive additional mask layers has proven more difficult than expected.
IC designers haven't faced the same process constraints with embedded SRAM. Used for many years to accelerate performance in high-end network routers and switches, embedded SRAM doesn't require extra masking steps. This is because it's based on the same process used in logic designs. Moreover, while embedded SRAM employs a larger cell size than DRAM, new technologies are emerging to help boost embedded SRAM density. So, despite recent advances in the development of embedded DRAM and flash memory processes, embedded SRAM remains the workhorse of ASIC memory designs.
The key to embedded SRAM performance is memory compiler design. As process technologies have matured from one generation to the next, though, compiler designers have faced unprecedented challenges. A memory compiler works on the basic principle that memory has a regular structure. Memories are built from four basic building blocks: the memory array, predecoder, decoder, and the column select and I/O section. The memory array is constructed by using the same memory core cell (Fig. 1). The other three building blocks are also erected from a basic leaf cell. A compiler creates a memory design by using instances of the different leaf-cell types to make up the desired memory width and depth.
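The tiling step described above can be sketched in a few lines. This is an illustrative model only: the leaf-cell names, sizes, and the flat area estimate are assumptions for the sketch, not LSI Logic's actual libraries or compiler.

```python
# Hypothetical sketch of a memory compiler's tiling step: each building
# block named above (core-cell array, decoder, column select/IO) is
# replicated from a single leaf cell to reach the requested depth and
# width. Cell names and dimensions are illustrative.
from dataclasses import dataclass

@dataclass
class LeafCell:
    name: str
    width_um: float   # physical width of one leaf-cell instance
    height_um: float  # physical height of one leaf-cell instance

def compile_memory(words: int, bits: int) -> dict:
    """Return instance counts and a rough footprint for each block."""
    core = LeafCell("core_6t", 1.0, 2.0)       # one bit of storage
    decoder = LeafCell("wl_driver", 4.0, 2.0)  # one driver per word line
    colio = LeafCell("col_io", 1.0, 8.0)       # one IO slice per column
    return {
        "array": words * bits,   # core cells tile the full array
        "decoder": words,        # one word-line driver per row
        "col_io": bits,          # one column-select/IO cell per bit
        "area_um2": (words * bits * core.width_um * core.height_um
                     + words * decoder.width_um * decoder.height_um
                     + bits * colio.width_um * colio.height_um),
    }
```

A 1-kword by 32-bit instance, for example, tiles 32,768 core cells, 1024 word-line drivers, and 32 column-IO slices.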
To address the ever-increasing demands of ASIC designs, memory compiler developers must constantly strive to improve density, performance, and power as technology moves from one generation to the next. Top performance in all of these areas is achieved when the leaf cell and memory core cell are optimized for both process technology and memory-size range. For example, LSI Logic Corp. has developed SRAM compilers optimized for different memory-size ranges and memory core cells that combine the highest driving capability and the smallest size for G12 0.18-µm technology.
Over time, new challenges have also forced compiler architects to adapt. Compiler designs are now optimized to meet the demands for a wide range of applications. Segmented or block architectures are deployed to improve performance and power consumption. SoC cores are designed with tightly coupled memories to overcome the processor-to-memory bottleneck.
As more complex system functions were integrated onto a single chip, the memory compiler had to evolve to embrace more memory subsystem features as well. Today's embedded memory designs often feature multiport, synchronous or asynchronous operation, and stringent power control (Fig. 2).
Perhaps the greatest challenge facing the embedded SRAM developer, though, is satisfying the growing demand to embed ever-larger memories on-chip. Over the past few years, the amount of embedded memory available to ASIC designers has rapidly grown from 1 Mbit in 0.35-µm process technology to 2.5 Mbits in 0.25-µm processes and, more recently, to 6 to 8 Mbits in 0.18-µm technology. That growth, in turn, has dramatically complicated the test process. To meet those requirements, current embedded memories typically feature built-in scan latches and a scan path, as well as a built-in self-test (BIST) logic wrapper.
Maintaining reasonable yields is critical to the development of cost-effective embedded memory designs. Accordingly, many ASIC manufacturers have gone a step beyond and integrated redundant rows and columns into their memory structures.
Some employ soft built-in self-repair (BISR) schemes in which the device identifies a bad row in a self-diagnostic routine and uses address mapping logic to automatically translate it to a good address space. While soft BISR strategies can improve yield and reduce cost, they present some significant limitations.
Usually, a soft BISR solution can add up to 1.5 ns to address setup time, a significant liability for high-performance designs. Developers must be aware of this liability and restrict soft BISR use to applications that can tolerate the additional time penalty. In addition, because repair is a function of the power-on condition, repeatability is an issue. Finally, designers must compensate for power-on time: a soft BISR solution takes approximately 2 ms to run BIST, identify faulty rows, and repair them before the memory can be used.
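The address-mapping step at the heart of soft BISR can be modeled as a small remap table consulted on every access, which is exactly why it adds to address setup time. The sketch below is a behavioral model under assumed conventions (spare rows sit above the main array), not a vendor implementation.

```python
# Behavioral model of soft BISR: BIST identifies bad rows at power-up,
# and a mapping table steers each bad address to a spare row. The
# lookup sits in the address path, which is the source of the setup-
# time penalty discussed above. Illustrative only.
class SoftBISR:
    def __init__(self, rows: int, spares: int):
        self.rows, self.spares = rows, spares
        self.remap = {}  # bad row address -> spare row address

    def repair(self, bad_rows):
        """Run after BIST: assign each bad row to a spare row."""
        for i, bad in enumerate(bad_rows):
            if i >= self.spares:
                raise RuntimeError("not enough spare rows to repair")
            self.remap[bad] = self.rows + i  # spares above the array

    def translate(self, addr: int) -> int:
        """Performed on every access; good addresses pass through."""
        return self.remap.get(addr, addr)
```

Because the table is rebuilt at every power-on, the mapping (and hence the repair) is not guaranteed to be identical from one session to the next.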
Recently, some ASIC vendors have migrated to a more sophisticated hard BISR concept, similar to the fuse-link and laser repair schemes employed in standard DRAM manufacturing. In these schemes, an algorithm automatically programs a fuse field that directs the fuse box to substitute a good row address as soon as the bad address arrives (Fig. 3). In a hard BISR scheme, the fuse register output is linked to the scan output of a faulty-location analysis and repair execution (FLARE) unit, and the enable signal is asserted only while the fuse data is read into the FLARE at power-up. A BISR operation mode loads information directly from the fuse bank into the FLARE register, so remapping can take place without rerunning the BIST.
Hard BISR promises to eliminate the performance penalty and repeatability issues that designers must grapple with in a soft BISR approach. But, it requires more design and engineering resources than a soft BISR solution. Furthermore, it demands that designers pay a slightly higher silicon area overhead due to employment of the fuse bank.
Moreover, as ASICs integrate ever-growing amounts of memory on-chip, they become increasingly susceptible to soft errors caused by exposure to alpha particles and cosmic rays. The rising soft-error rate (SER), in turn, has driven the integration of error-correction circuitry (ECC) into memory designs.
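To make the ECC idea concrete, the toy codec below implements a Hamming(7,4) code, the simplest code that can correct a single flipped bit per word. Production embedded-memory ECC typically uses wider SECDED codes over 32- or 64-bit words, but the syndrome-decode principle is the same; this sketch is illustrative, not a production design.

```python
# Toy Hamming(7,4) codec: 4 data bits are protected by 3 parity bits,
# and the recomputed parity (the syndrome) gives the position of any
# single flipped bit, which can then be corrected.
def encode(d):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    syndrome = 0
    for p in (1, 2, 4):                    # recompute each parity group
        parity = 0
        for pos in range(1, 8):
            if pos & p:
                parity ^= c[pos - 1]
        if parity:
            syndrome += p                  # syndrome = error position
    if syndrome:
        c[syndrome - 1] ^= 1               # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]        # extract the data bits
```

Any single soft error in the stored codeword is silently repaired on read, which is precisely the protection large embedded arrays need against particle strikes.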
Another way to mitigate some of the SoC embedded memory demands is to embed system functions into memories, rather than embedding memories into system functions. An excellent example is the content addressable memory (CAM), a proven technology that's historically been implemented in ASIC/custom design. Today, it's economically feasible as a standard product.
Used extensively to handle packet-classification and policy-enforcement tasks in data networking applications, CAMs utilize a search-table concept to provide a higher-performance alternative to software-based searching algorithms. The architecture is based on the idea of associating a mask word with each CAM word, allowing the user to mask entries on a per-bit basis. To add an auto lookup function into memory, a CAM builds an exclusive-or (EOR) function into a standard SRAM (Fig. 4).
Data in a CAM is stored in a random fashion. It can be stored at a specific address similar to RAM, in the next empty location, or written over a location with invalid information. Each location has an Empty flag bit and a valid bit to facilitate storing information into those locations. Once the CAM is loaded with application information, data can be found by comparing every memory bit with every data bit in the Comparand Register. This is made possible by the built-in EOR function of the memory cell. All bits in a stored location are connected to a Match line. If every bit is a match, then the Match line won't be pulled down. A Match flag bit is set to indicate that the information is found in the device. The search time is deterministic because a CAM searches all locations in one cycle.
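The parallel compare described above can be modeled behaviorally: XOR-ing each stored word against the comparand flags any mismatching bit, which in silicon would pull the match line low. This is a sketch of the behavior, not of the circuit.

```python
# Behavioral model of a CAM search: every valid stored word is compared
# against the comparand in parallel; a location matches only if no bit
# differs (i.e., nothing pulls its match line down). Illustrative only.
class CAM:
    def __init__(self):
        self.entries = []  # list of (word, valid-bit) pairs

    def load(self, word: int) -> int:
        """Store a word in the next location; returns its address."""
        self.entries.append((word, True))
        return len(self.entries) - 1

    def search(self, comparand: int):
        """Return the addresses of all valid matching locations."""
        matches = []
        for addr, (word, valid) in enumerate(self.entries):
            # XOR of stored and comparand bits: any 1 is a mismatch.
            if valid and (word ^ comparand) == 0:
                matches.append(addr)
        return matches
```

Note that although this software loop visits locations one by one, the hardware compares all locations simultaneously, which is why the search time is deterministic at one cycle.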
To maximize an ASIC designer's flexibility and support product-specific features and configurations, custom CAM compiler technology is now commercially available. That same technology also helps reduce development time and enable first-pass success. One family of CAM cores, the HSTLB, supports applications requiring smaller CAM densities of less than 256 entries and 72-bit word widths, while a second, the HDCAM, addresses networking applications that need larger CAMs in the 4-kword by 68-bit range.
Because custom compiler technology offers more flexibility, a number of unique features and functions can be added to support high-speed data search applications. For instance, these cores can be partitioned into CAM and SRAM, or associative and associated data fields. The intrinsic lookup capability of the CAM provides the address pointer to the SRAM. The associative data in the SRAM is accessible on the next instruction cycle for read or modify. This matches the requirements of networking system address attributes (associative data) stored with the network address.
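The CAM-to-SRAM pairing works like a hardware dictionary: the CAM match address is the pointer into the associated-data SRAM. The sketch below uses a hypothetical network-address table; the field names and widths are illustrative assumptions.

```python
# Sketch of the CAM + SRAM partitioning described above: the CAM holds
# the associative field (e.g., a network address) and the matching
# address indexes an SRAM holding the associated data (the attributes).
def lookup(cam_words, sram_data, key):
    """Return the attributes stored with `key`, or None on a miss."""
    for addr, word in enumerate(cam_words):
        if word == key:               # CAM match provides the pointer
            return sram_data[addr]    # into the associated-data SRAM
    return None

# Hypothetical table: two network addresses with their attributes.
cam_words = [0x0A00_0001, 0x0A00_0002]          # associative field
sram_data = [{"port": 3, "proto": "IP"},        # associated data
             {"port": 7, "proto": "ATM"}]
```

As the article notes, the associated data becomes available on the instruction cycle after the match, so a lookup-plus-read completes in two cycles.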
Attributes may include protocol identifiers, port identifiers, or static/dynamic entry marker bits, among others. This quick lookup and subsequent translation capability enables the ASIC designer to use the CAM for protocol translation and header processing applications. In a typical application, the designer could use this capability to convert Ethernet addresses to IP addresses, or vice versa.
Masking enhances CAM's versatile search capability. Now, CAM custom compilers come with both a bit-masking (ternary CAM) and a column-masking capability. A column enable field provides additional search control by column-masking the data CAM in byte-wide fields. Masking allows easy hierarchical, group, or subnet lookups, and it supports fast purging or modifying of list content during list maintenance.
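The two masking styles compose naturally: a bit participates in the compare only if its per-entry mask bit is set and its byte's column is enabled. The function below is a behavioral sketch under those assumed conventions (mask bit 1 = compare, 0 = don't care), not the compiler's actual implementation.

```python
# Illustrative ternary match combining both masking styles: a per-bit
# care mask stored with the entry (ternary CAM) plus a byte-wide
# column-enable field applied across the whole search.
def masked_match(stored, care_mask, comparand, col_enable, word_bytes=4):
    """True if all unmasked, column-enabled bits compare equal.
    In care_mask, 1 = bit must match; 0 = don't care."""
    col_mask = 0
    for b in range(word_bytes):
        if col_enable & (1 << b):        # expand each enable bit
            col_mask |= 0xFF << (8 * b)  # to a full byte of the word
    care = care_mask & col_mask          # bits that must compare equal
    return (stored ^ comparand) & care == 0
```

A subnet lookup, for instance, stores the network prefix with the host bits masked out, so any host address within the subnet produces a match.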
A unique Format Control CAM field (Format CAM) lets the device simultaneously store different CAM word types in the CAM. A CAM with four 1-kword quadrants and a 4-bit-wide Format CAM can simultaneously store 64 different CAM word types in the 4-kword CAM. This flexible search capability allows the designer to mix many different search data in the CAM. Some common examples include the ability to store double-wide words along with single-wide words, store and search various address header subfields like source and destination address or ports, and mix protocols such as ATM and IP.
Another key capability in networking applications is multiple-match prioritization. In this situation, the CAM automatically prioritizes the entire 1-kword quadrant search space and returns the highest priority matching entry. A parallel search of all quadrant locations identifies priority with a binary-coded output called ENCA. The CAM assigns an internal status bit or "valid bit" for each of the words to determine which entries are valid. But the valid bit must be set to enable a successful match. The set operation is automatically registered on a load command and may be reset through the reset command or the unload command.
To accelerate processing of multiple matches, a NextMatch mode resets the current highest priority matching entry, completes a reprioritization, and returns the next highest matching entry within a single cycle. This mode efficiently identifies the multiple matches without the need for successive compare, unload, and re-compare operations.
A next available output (AVAL) feature helps facilitate the management of data storage within the CAM. By prioritizing and encoding all valid bits, AVAL always provides the highest priority unused address location that's available for data storage.
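The three mechanisms above (the ENCA priority encoder, NextMatch, and AVAL) all reduce to priority encoding over different bit vectors. The following behavioral sketch follows the article's naming but is illustrative logic, not the HDCAM's actual circuitry; it assumes lower addresses carry higher priority.

```python
# Behavioral sketch of multiple-match handling: the priority encoder
# returns the lowest-address (highest-priority) match; NextMatch clears
# it and reprioritizes in one pass; AVAL reports the highest-priority
# empty (invalid) location. Assumes address 0 is highest priority.
def enca(match_lines):
    """Binary-coded address of the highest-priority match, or None."""
    for addr, m in enumerate(match_lines):
        if m:
            return addr
    return None

def next_match(match_lines):
    """Clear the current winner and return the next match."""
    first = enca(match_lines)
    if first is not None:
        match_lines[first] = False   # reset current highest-priority hit
    return enca(match_lines)

def aval(valid_bits):
    """Highest-priority unused location available for storage."""
    return enca([not v for v in valid_bits])
```

Successive NextMatch operations therefore walk through all matching entries without any unload and re-compare traffic, which is the efficiency gain the article describes.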
To minimize power consumption in large CAMs, the HDCAM architecture also is partitioned into quadrants, and a pipelined data flow is imposed on one quadrant. This enables ASIC designers to use internal circuitry to automatically stop the clock for inactive quadrants.
These unique capabilities can play a major role in the successful development of an SoC product. Recently, one of LSI Logic's customers sought to integrate a 4-kword by 68-bit CAM with special purpose hardware to control the associative search process in a network router application. The customer chose to integrate the hdcm4k68 HDCAM memory block into a device fabricated in LSI's G10 0.35-µm process.
The ASIC combined 180 kgates of logic, two single-port memories (one 512 words by 32 bits and the other 1 kword by 32 bits), and a 64-way by 24-word by 30-bit HSTLB block. The hdcm4k68 HDCAM occupied 5.8 million of the device's total 7.1 million transistors. The customer opted for a VG56 PBGA+ high-performance, low-inductance package to handle the thermal and power effects generated by the hdcm4k68's 550-mA average current requirements.
The real benefit of using custom CAM compiler technology comes when a designer needs to meet unique data search requirements. In this case, the customer was able to take advantage of bit- and column-masking, a double-wide search mode, and the successive processing of multiple matching entries to boost performance. Moreover, the HDCAM's quadrant architecture gave the customer the opportunity to implement smart power management. By partitioning the 4-kword CAM space into 1-kword quadrants, the designers were able to power down three-fourths of the CAM space in each clock cycle, significantly reducing power consumption.
More Logic Functions To Come
The next step is to extend the CAM concept by adding more logic (system) functions to memories. A typical example is adding a logic function, such as compare, to a CAM. That type of device would support a fully associative search on user-defined fields over logic functions.
A more radical extension of the embedded-memory concept is to integrate the processor function into memory. This type of architecture promises to dramatically alter traditional concepts of memory as a passive element. Instead of reading out only what has been written into it, memory with a processor function (or collaborative memory) will be able to modify data when it's accessed through a process. An excellent example of a potential collaborative memory function is data compression and decompression. Many mechanical storage devices, like tape or disk drives, now compress or decompress stored data on-the-fly.
Designers can expect memories in the near future to embed even more complex system functions. One of the hottest topics in memory research is the development of intelligent devices capable of memory-assisted computation. Researchers are already laying the groundwork for devices that will essentially eliminate the processor-to-memory bottleneck in today's systems by merging both functions on-chip in a much more fundamental manner than today's embedded memory devices. These new devices should eliminate the processor-to-memory performance gap, provide a better building block for parallel processing, and more efficiently utilize the tremendous number of transistors available on a single chip.
Enabling this new generation of intelligent memories will be continual gains in device density as process technology tracks Moore's Law. Some researchers predict devices capable of offering 1 Gbit of on-chip memory, which will support internal bandwidths approaching 100 Gbits/s. This dramatic improvement in memory/system bandwidth promises to not only revolutionize system performance, but also to open the door to the rapid development of truly reconfigurable systems.