Superfast CMOS SRAM Macros With Copper Interconnects Cut Access Times To 430 ps
When designers create a high-performance processor, the on-chip cache memory deserves a lot of attention. No matter how fast the datapath is, the device slows down if the processor can't get its data from the cache. A CMOS memory array now reaches speeds previously achievable only with gallium-arsenide technology. It does so by combining an enhanced 0.18-µm bulk CMOS process, which uses copper interconnects to reduce resistance, with a pseudostatic logic approach that avoids the delays caused by precharge and evaluation timing.
The SRAM macro cycles at 2 GHz (a 500-ps cycle time) and provides a 430-ps access time, making it one of the fastest memory blocks yet incorporated into a CMOS processor chip. It was developed by IBM at its T.J. Watson Research Center, Yorktown Heights, N.Y., in conjunction with its System 390 Division in Poughkeepsie, N.Y.
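As a quick check on those figures, the arithmetic below converts the 2-GHz rate into a cycle time and compares it with the 430-ps access time. The remaining margin is derived here purely for illustration; it is not a number quoted by IBM, and the variable names are arbitrary.

```python
# Figures quoted in the article.
CLOCK_HZ = 2e9       # 2-GHz cycle rate
ACCESS_PS = 430.0    # access time in picoseconds

# One clock period in picoseconds (1 s = 1e12 ps).
cycle_ps = 1e12 / CLOCK_HZ

# Time left in the cycle once the access completes (derived, not quoted).
margin_ps = cycle_ps - ACCESS_PS

print(f"cycle time: {cycle_ps:.0f} ps")
print(f"access uses {ACCESS_PS / cycle_ps:.0%} of the cycle, leaving {margin_ps:.0f} ps")
```

The cycle works out to 500 ps, so the 430-ps access leaves roughly 70 ps for everything else that must happen within a single-cycle lookup.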
As part of the cache memory, the SRAM macro actually serves as a directory memory in the L1 data cache. The directory memory typically holds a subset of the absolute addresses (main memory addresses) that correspond to the cache's instruction and data entries. Consequently, the performance of both the directory memory and the translation look-aside buffer is critical to achieving single-cycle directory lookup and set selection.
The directory memory array contains about 34 kbits of storage, configured in a four-way set-associative architecture. It's organized as 1024 entries. Each of the four sets holds 256 logical entries, or 8.5 kbits per set. All four sets share the same addressing and data-in circuits.
Reducing delays improves every aspect of memory performance. Also, the pseudostatic logic used for the memory-array macro helps minimize a condition known as "floating nodes." This reduces the charge-sharing noise commonly found in dynamic logic circuits. As a result, the internal memory control signals are cleaner, and memory performance can be maximized.
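The organization described above (four sets of 256 entries, 8.5 kbits per set) can be sketched in a few lines of Python. This is a behavioral illustration only, not IBM's circuit. The class and method names are hypothetical, a kbit is assumed to be 1024 bits, and the 34-bit entry width is derived from the article's totals rather than stated in it.

```python
WAYS = 4       # four-way set-associative, per the article
ROWS = 256     # logical entries per set, per the article
KBIT = 1024    # assumption: 1 kbit = 1024 bits

total_entries = WAYS * ROWS              # 1024 entries overall
bits_per_set = 8.5 * KBIT                # 8.5 kbits per set
bits_per_entry = bits_per_set / ROWS     # derived entry width
total_kbits = WAYS * bits_per_set / KBIT # should match "about 34 kbits"


class Directory:
    """Toy model of a four-way directory (hypothetical, for illustration)."""

    def __init__(self):
        # One stored address tag per (row, way); None marks an empty entry.
        self.tags = [[None] * WAYS for _ in range(ROWS)]

    def install(self, row, way, tag):
        self.tags[row][way] = tag

    def lookup(self, row, tag):
        """Return the way whose entry matches `tag`, or None on a miss.

        Hardware compares all four ways in parallel; this loop only
        models the result of that comparison.
        """
        for way, stored in enumerate(self.tags[row]):
            if stored == tag:
                return way
        return None


directory = Directory()
directory.install(row=17, way=2, tag=0xABC)
print(total_entries, bits_per_entry, total_kbits)
print(directory.lookup(17, 0xABC), directory.lookup(17, 0xDEF))
```

Under the 1024-bit assumption the totals are self-consistent: 256 entries of 34 bits give 8.5 kbits per set, and four sets give 34 kbits.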
As the address information wends its way through the memory array, it finally reaches the bit-decode circuits. The decode logic is implemented in dynamic logic to achieve the fastest decode time (see the figure). Once the decoder delivers the address, a reset-enable signal resets the bit decoder. The "true" output of the bit decoder is used to precharge the bitlines to the full supply voltage before a read operation takes place.
To minimize the time required for the decoder to perform its function, designers had to deal with several important timing constraints. For instance, there must be a large enough margin between the wordline selection and precharge mode. The bit-decode "true" signal must be high for a sufficient time, even after the wordline-enable signal goes away.
The timing between the wordline control enable (WCE) and the bit-decode true (BDT) signals is crucial to achieving fast read operations. The WCE and BDT signals are wire-NORed together to select the bitline for read or write operations. To ensure proper read operations, the WCE signal must go high before the falling edge of the BDC signal. If that doesn't happen, a glitch will propagate to the bitlines and cause improper data to appear.
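The ordering rule in the preceding paragraph, that WCE must rise before BDC falls, amounts to a simple timing check. The sketch below expresses it in Python. The function name, the optional margin parameter, and the picosecond values are all hypothetical and chosen only to exercise the check; the article gives no edge times.

```python
def read_is_glitch_free(wce_rise_ps, bdc_fall_ps, margin_ps=0.0):
    """Return True if WCE rises before BDC falls by more than `margin_ps`.

    Times are in picoseconds from an arbitrary common reference.
    `margin_ps` is a hypothetical safety margin, not a figure from the article.
    """
    return (bdc_fall_ps - wce_rise_ps) > margin_ps


# Hypothetical edge times, for illustration only.
print(read_is_glitch_free(wce_rise_ps=100, bdc_fall_ps=130))                # WCE early
print(read_is_glitch_free(wce_rise_ps=140, bdc_fall_ps=130))                # WCE late
print(read_is_glitch_free(wce_rise_ps=100, bdc_fall_ps=130, margin_ps=40))  # margin too tight
```

Only the first case satisfies the rule. In the second, WCE arrives after BDC has already fallen, which is the condition the article says lets a glitch reach the bitlines.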
Because one of the read/write signals, WCE, drives 120 bit-select circuits, the read/write drivers are skewed to ensure early arrival of the WCE signals (with respect to the bit-decode signal, BDEC) during read operations. During write operations, both WCE and BDEC signals are low. Furthermore, the true (write data true enable, WDTE) and complementary (write data complement enable, WDCE) data values pull up and pull down the bitlines, respectively, to write into the static six-transistor memory cells that are selected by the wordline.
When implemented with IBM's 0.18-µm design rules, the 34-kbit directory memory macro requires an area of just 1.05 by 0.35 mm. Seven full levels of copper interconnect, plus tungsten for some local interconnect, are used to keep the cell size small and delays as short as possible. The macro also includes a block of array built-in self-test (ABIST) logic, which simplifies testing by exercising the memory with the multiple patterns that the ABIST pattern generator can create.
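For a sense of scale, the macro dimensions above can be turned into an area and a rough storage density. The density figure is derived here, not quoted by IBM, and it is a macro-level number that covers decoders, drivers, and other periphery rather than the memory cells alone.

```python
# Figures quoted in the article.
WIDTH_MM = 1.05
HEIGHT_MM = 0.35
CAPACITY_KBITS = 34

# Macro footprint in square millimeters.
area_mm2 = WIDTH_MM * HEIGHT_MM

# Macro-level density (derived): includes periphery, not just cells.
density_kbits_per_mm2 = CAPACITY_KBITS / area_mm2

print(f"area: {area_mm2:.4f} mm^2")
print(f"density: {density_kbits_per_mm2:.1f} kbits/mm^2")
```

The footprint comes to about 0.37 mm², or roughly 92.5 kbits per square millimeter for the macro as a whole.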