For applications where performance is of primary importance, designers have traditionally chosen SRAM technology over DRAM. Although commodity DRAM offers much higher density and a lower cost per bit, it has been a slower memory technology than SRAM. In the past, many applications were prepared to pay a significant premium for SRAM performance.
With the availability of embedded DRAM processes, it's now possible to achieve performance levels approaching those of SRAM while retaining significant density advantages. Designers no longer need to choose between density and speed. An embedded DRAM macrocell architected for performance can achieve random-access operation beyond 200 MHz, with a bit density five to ten times greater than SRAM's.
While a DRAM bit cell requires only a single transistor and capacitor, an SRAM cell uses six n- and p-channel transistors, resulting in a 10:1 density advantage for DRAM. SRAM has a fundamental speed advantage because the 6T cell can drive a signal toward the output. A DRAM cell, in contrast, is a passive circuit that must be carefully sensed before a logic level can be passed on to successive stages of logic.
Figure 1 shows the basic circuitry at the core of a DRAM array. Each memory cell consists of an n-channel transistor controlled by a wordline (WL), which selectively connects a bit-storage capacitor to a bitline (BL). A logic 0 or a logic 1 is stored as a VSS or VDD level, respectively, on the storage capacitor.
During quiescent periods, the bitlines are held at a midrail potential of VDD/2. When one of the wordlines is enabled at the beginning of a memory access, the signal transferred from the accessed cell capacitor to the bitline is attenuated by roughly the ratio of cell capacitance to bitline capacitance.
Since the bitline capacitance is roughly an order of magnitude greater than the cell capacitance, the data signal on the bitlines is only a few hundred millivolts above or below VDD/2. For this reason, a bitline sense amplifier is needed to amplify the bitline voltage to a full-rail signal. The signal path after the bitline sense amplifier is very much like that of an SRAM. This requirement to sense and amplify a small signal from a passive bit cell accounts for the fundamental difference between DRAM and SRAM access-time performance.
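As a rough illustration of this charge-sharing attenuation, the bitline swing can be estimated from the capacitance ratio. The following Python sketch uses assumed illustrative values (a 30-fF cell, a 300-fF bitline, and a 2.5-V supply), not figures from any particular process:

```python
# Back-of-envelope charge-sharing estimate. All component values are
# assumptions chosen to reflect the ~10:1 bitline-to-cell ratio in the text.
VDD = 2.5          # supply voltage, volts (assumed)
C_CELL = 30e-15    # storage-cell capacitance, farads (assumed)
C_BL = 300e-15     # bitline capacitance, ~10x the cell (assumed)

def bitline_swing(v_cell, vdd=VDD, c_cell=C_CELL, c_bl=C_BL):
    """Voltage the bitline moves away from its VDD/2 precharge level
    when a cell storing v_cell is connected to it (charge sharing)."""
    v_pre = vdd / 2
    v_final = (c_cell * v_cell + c_bl * v_pre) / (c_cell + c_bl)
    return v_final - v_pre

swing_1 = bitline_swing(VDD)   # cell held a logic 1 (VDD)
swing_0 = bitline_swing(0.0)   # cell held a logic 0 (VSS)
print(f"logic 1: {swing_1 * 1e3:+.0f} mV, logic 0: {swing_0 * 1e3:+.0f} mV")
```

With these assumed values the swing is only about ±114 mV around VDD/2, which is why a regenerative sense amplifier is needed before the data can drive downstream logic.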
To complete the DRAM read cycle, the data must be written back to the memory cell. Unlike SRAM, a DRAM read is destructive. The bitline sense amplifier is a regenerative latch that amplifies the bitline potential to full rail. To restore this level on the memory-cell capacitor, the n-channel access transistor must be fully turned on. To store a VDD level in the memory cell, the wordline must be raised to a potential greater than VDD + VT, where VT is the transistor threshold voltage.
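The VDD + VT requirement can be made concrete with a small sketch; the supply and threshold voltages below are illustrative assumptions, not process data:

```python
# Why the wordline must be boosted above VDD (illustrative values only).
VDD = 2.5   # supply voltage, volts (assumed)
VT = 0.7    # n-channel access-transistor threshold, volts (assumed)

# If the wordline rose only to VDD, the access transistor would cut off
# once the cell charged to VDD - VT, leaving a degraded logic-1 level:
degraded_one = VDD - VT

# Boosting the wordline above VDD + VT keeps the transistor fully on,
# so the sense amplifier can restore the cell all the way to VDD:
min_boosted_wl = VDD + VT

print(f"unboosted cell level: {degraded_one:.1f} V")
print(f"minimum boosted wordline: {min_boosted_wl:.1f} V")
```

This is the gap that on-chip charge pumps close, and it is why the access transistor must tolerate voltages above the supply rail.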
In a modern DRAM, on-chip charge pumps and level shifters generate the required wordline voltages. DRAM devices must be able to withstand voltages higher than VDD, though, resulting in larger device dimensions and poorer transistor performance. For cost reasons, commodity DRAMs have only a single type of transistor, which must be used throughout the critical path. For a given process generation, commodity SRAM will have a performance advantage over commodity DRAM, because SRAM circuits do not have to deal with voltages higher than VDD.
Embedded DRAM processes now combine high-performance logic devices and interconnects with high-density DRAM cell structures for full system-on-a-chip (SoC) integration. A memory-cell transistor capable of withstanding wordline voltage above VDD and a logic transistor equal in performance to standard logic processes are both available on the same piece of silicon. With the thick-gate-oxide transistor employed only in the memory cell and a few selected areas in the wordline decoder, the rest of the DRAM critical path can make use of higher-speed thin-gate-oxide logic devices to boost DRAM performance.
Some of the speed differential between SRAM and DRAM is a result of the interface bottleneck, rather than any fundamental technology limitations. Standard SDRAM can achieve page-mode operation of 100 to 133 MHz. But this mode of operation can only be exploited when large chunks of data can be organized on the same page. Many applications, such as networking, require successive memory accesses to widely dispersed addresses. In this mode of operation, the row-cycle time (tRC) is the limiting factor.
Figure 2 shows selected signals at the pins of an SDRAM: the clock input (CLK), the command bus, and the DQ bidirectional data I/Os. The command bus consists of row address strobe (RAS—), column address strobe (CAS—), write enable (WE—), and the address inputs, all sampled on the rising edge of CLK. Also shown are several key internal signals in the selected memory array, including the bitline precharge control signal (PRE), the accessed wordline (WL), and a differential bitline pair (BL/BL—).
Several commands must be delivered to the memory to execute a complete read operation. An activate command accompanied by a row address turns off precharge in the selected array and raises the addressed wordline. The precharge and wordline signals must not overlap; otherwise, the data from the memory cell would be lost through the equalization device, which shorts BL to BL—.
With the wordline active, the cell transistor dumps its charge to the bitline, which can be seen on the timing diagram as a small attenuated signal. The bitline sense amplifiers for the selected array are then enabled to sense a full page of data.
Even if the full page of data is not required by the user, every sense amplifier in the selected array must be activated to restore cell data. The internal operations to activate an internal array and sense a page of data can take 15 to 20 ns. During this time, no further commands can be issued to the memory bank in question.
After several clock cycles, a read command accompanied by a column address selects one word from the page and sends it to the output. In this example, the read data is available on the DQ output 1.5 clock periods after the rising edge of the clock on which the read command was recognized by the memory. From the user's perspective, the memory read data could be latched by the memory controller device on the second rising clock edge after the read instruction was issued. In SDRAM terminology, this is known as "CAS latency 2." Internally, this read access shows up as a slight attenuation of the bitline potentials, as a databus is momentarily connected to the bitline pair selected by the column address.
To complete the SDRAM read cycle, a precharge command is entered on the clock edge immediately following the read command. This command turns off the wordline in the selected array to store full-level data in the memory cells. Then, it precharges the bitlines to VDD/2 by shorting them together in preparation for the next active cycle. The precharge typically consumes two additional clock cycles to allow the bitlines to settle to within 10 mV or so. The entire read operation takes five clock cycles.
Commodity SDRAM row-cycle time is in the 35- to 40-ns range, allowing only 25- to 30-MHz random-access operation. Although commodity SDRAM may be capable of 133-MHz operation in page mode, this level of performance cannot be realized in applications requiring random access (see "Commodity SDRAM Page Mode," p. 94).
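These rates follow directly from the reciprocal of the row-cycle time. A quick sketch using the tRC figures quoted in this article:

```python
# Random-access rate implied by a given row-cycle time (tRC).
def random_access_mhz(t_rc_ns):
    """Maximum fully random access rate, in MHz, when every access
    must pay a full row cycle of t_rc_ns nanoseconds."""
    return 1e3 / t_rc_ns

# tRC values from the article: 35-40 ns commodity SDRAM, and the
# 6.25-ns figure achieved by the fast embedded DRAM discussed later.
for t_rc in (40.0, 35.0, 6.25):
    print(f"tRC = {t_rc:5.2f} ns -> {random_access_mhz(t_rc):5.1f} MHz")
```

A 40-ns tRC caps random access at 25 MHz and 35 ns at about 28.6 MHz, regardless of how fast the page-mode interface clock runs, while a 6.25-ns tRC corresponds to 160-MHz operation.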
Optimized Embedded DRAM
With embedded DRAM processes, there is an opportunity to radically improve random cycle time and displace SRAM in many applications. In general, DRAM memory arrays are made as large as possible to minimize die area and achieve the lowest cost. The number of memory cells associated with each bitline pair in a typical commodity DRAM array is usually 256 or 512, as determined by the cell-capacitance to bitline-capacitance ratio, to provide an adequate signal for sensing (Fig. 3).
In commodity DRAMs, there typically are 2048 to 4096 bitline pairs in an array, minimizing the overhead of the WL driver. Even though wordlines are strapped in metal, the RC time constant can be more than 5 ns. As a result, several dead clock cycles have been required between the activation and read commands in SDRAM to allow for the wordline rise time and cell signal propagation to the sense amplifier. Additional cycles are required at the end of the cycle between the precharge command and a subsequent activation command to allow for the RC delay associated with the wordline falling edge and bitline equalization.
As mentioned earlier, today's embedded DRAM processes provide high-performance logic devices (in addition to the slower DRAM transistors) that can be used in all DRAM circuits with the exception of the memory cell itself. This greatly speeds up the datapath. With only a modest increase in area, the arrays can be fragmented to shorten bitlines and wordlines, substantially reducing RC delays.
Figure 4 shows a speed-optimized DRAM array structure that uses sub-wordline decoders to achieve shorter wordlines and reduced wordline RC delay. Simply cutting the length of the wordline in half reduces both resistance and capacitance by a factor of two, resulting in a fourfold improvement in RC delay. Reducing the length of the bitlines provides a similar improvement in bitline RC delay, plus the added benefit of an increased signal to the sense amplifiers, which further improves performance.
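The fourfold claim can be checked with a quick calculation. The base resistance and capacitance below are assumed illustrative values, chosen so the full-length RC lands near the 5-ns wordline delay mentioned earlier:

```python
# Sketch of the RC-scaling argument: halving a wordline halves both its
# distributed resistance and capacitance, so RC delay improves 4x.
R_FULL = 2_000.0   # total wordline resistance, ohms (assumed)
C_FULL = 2.5e-12   # total wordline capacitance, farads (assumed)

def rc_delay(fraction):
    """RC time constant of a wordline cut to `fraction` of full length.
    Resistance and capacitance both scale linearly with wire length."""
    return (R_FULL * fraction) * (C_FULL * fraction)

full = rc_delay(1.0)
half = rc_delay(0.5)
print(f"full-length RC: {full * 1e9:.2f} ns, half-length: {half * 1e9:.2f} ns")
print(f"improvement: {full / half:.0f}x")
```

Because the delay scales with the square of the length fraction, each halving of the wordline (or bitline) buys a 4x reduction in its RC delay, which is what makes the extra sub-wordline decoders worth their area.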
However, there is a cost associated with this increased array fragmentation. The fragmentation has doubled or quadrupled the number of bitline sense amplifiers for a given amount of memory and added sub-wordline decoders that did not exist in the standard array. Typical commodity DRAMs have a cell efficiency (the ratio of memory-cell area to total chip area) in the range of 50% to 60%. With the architecture optimized for fast access, a cell efficiency of 35% can be achieved (Fig. 4 again). Yet there is still an enormous area advantage over conventional 6T SRAM, which was previously the only alternative for fast random-access applications.
Finally, breaking free from the constraints of commodity memory standards, the datapath pipeline can be fully optimized for speed. Figure 5 shows the external timing and internal signals for Fast DRAM. The page mode no longer exists. This eliminates the separate activation and precharge commands, leaving only read and write commands. Activation and precharge are performed automatically as part of individual read or write commands. The internal DRAM core completes a full row cycle within one cycle of the external clock. Read data is available at the output pins with a latency of 2. Both row and column addresses are provided at the same time as the read or write command.
Bitline precharge occurs at the beginning of the cycle while the row address is being decoded. Short wordlines and bitlines with minimum RC delay let data sensing occur within a fraction of the external clock cycle. By the time the next rising edge of the clock arrives, the data at the bitline sense amplifier has been sampled by the output pipeline. This enables the array to be precharged in anticipation of the next access, achieving full random access on every clock cycle.
Applying these techniques to a 0.25-µm embedded DRAM process, Mosaid was able to achieve fully pipelined 160-MHz random-access operation (6.25-ns tRC). With a die-area penalty of less than 50% over conventional embedded DRAM architectures, but still more than five times the bit density of SRAM, fast embedded DRAM can perform random-access cycles at full ASIC internal clock rates.
As 0.18- and 0.13-µm embedded DRAM processes become available, 200- to 250-MHz operation will be achievable with only a modest area increase over conventional DRAM architectures. Fast embedded DRAM opens up a whole new range of SoC applications that demand both high density and high speed.