Wider Bandwidths Surpass Density As Driving Force For New DRAMs, CPUs

Memory density has traditionally been a key technology driver that has taken center stage in the digital technology papers at past International Solid-State Circuits Conferences (ISSCCs). But this year's presentations, which took place earlier this month in San Francisco, Calif., focused more on performance issues. Improved bandwidth on memory buses and higher-speed processor buses are key developments that will deliver higher system performance.

With that said, this year's ISSCC didn't disappoint designers who wanted to hear about the latest high-density DRAMs, SRAMs, and nonvolatile memories in Sessions 24, 11, and 2. Multigigahertz CPUs also were hot subjects at this year's conference, with close to half a dozen presentations that detailed CPUs and compute blocks running at clock speeds that exceeded 1 GHz. These were detailed in Sessions 15 and 20. Additionally, high-speed bus interfaces, with data-signaling speeds of up to 6.4 Gbits/s, were the highlight of Session 4, while Session 25 covered high-speed clocking schemes.

The one memory paper at ISSCC that pushed density to the maximum described a 4-Gbit double-data-rate synchronous DRAM developed by researchers at Samsung Electronics, Kyungki, Korea (paper 24.1). Thought to be the first chip to contain over 4 billion transistors, the memory occupies a rather healthy chip area of 645 mm², even with the use of 0.1-µm design rules.

To minimize inter-bit-line coupling noise in the large array, the memory employs a twisted open-bit-line architecture. Additionally, a gain-controlled presensing scheme and active calibration of the bit-line precharge voltage are used. Both help to improve the sense-amplifier sensitivity and sensing margins.

In a presentation in the same session, Elpida Memory Inc., Kanagawa, Japan, focused on a technology also capable of producing multigigabit DRAMs. (Elpida is a partnership company formed by Hitachi Ltd. and NEC Corp.) The researchers employed 0.13-µm design rules to develop the process that combines an open-bit-line trench-capacitor memory cell, a distributed overdriven sensing scheme that operates below 1 V, and a stacked flash-fuse structure to control the redundant memory rows and columns.

The fuse structure consists of three series flash fuses fabricated with standard CMOS transistors that don't require any additional process steps (Fig. 1). The resulting OR function can reduce the fuse failure rate by almost 10 orders of magnitude versus the traditional metal-link fuses, improving production yields. Designers verified this combination with a 256-Mbit test chip that achieved a 208-MHz cycle time in the memory array.

Details of the first 512-Mbit DRAM with a second-generation, 600-Mbit/s, double-data-rate interface (DDR2) were unveiled in a paper jointly presented by researchers from IBM Corp. and Infineon Technologies, both located in Hopewell Junction, N.Y. The improved interface allows the memory array to accept a column-address-strobe (CAS) command immediately after the row-address-strobe (RAS) command is issued. The CAS command and address are then held in an address FIFO buffer for the duration of the address latch signal.

In contrast, the first-generation DDR interface typically has a RAS-CAS delay time of about 13 ns. That requires at least four cycles of a 300-MHz clock, which reduces bus efficiency. Thanks to the elimination of the delay, the DDR2 interface can handle back-to-back RAS commands, achieving 100% bus utilization and a 600-Mbit/s data transfer rate on each data pin.
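The cycle count quoted above follows directly from the delay and the clock period. The short sketch below reproduces that arithmetic; the function names are illustrative only, and the count of idle transfer slots is a derived figure that assumes two transfers per clock on each pin, as in any double-data-rate interface.

```python
import math

def delay_in_cycles(delay_ns: float, clock_mhz: float) -> int:
    """Whole clock cycles needed to cover a fixed delay."""
    period_ns = 1000.0 / clock_mhz
    # The small guard keeps exact multiples from rounding up by one cycle.
    return math.ceil(delay_ns / period_ns - 1e-9)

def idle_transfer_slots(delay_ns: float, clock_mhz: float) -> int:
    """Per-pin transfer slots that pass during the delay (two per clock for DDR)."""
    return 2 * delay_in_cycles(delay_ns, clock_mhz)

if __name__ == "__main__":
    print(delay_in_cycles(13, 300))      # cycles spent waiting on a 300-MHz clock
    print(idle_transfer_slots(13, 300))  # slots each data pin could have used
```

Removing the wait is what lets every one of those slots carry data, which is where the 100% bus-utilization figure comes from.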

The three other DRAM papers presented in Session 24 focused on the design of embedded DRAM (eDRAM) macros. The first in that group, delivered by a joint development team from Mitsubishi Electric Corp., Hyogo, Japan, and Matsushita Electric Industrial Co. Ltd., Kyoto, Japan, targeted low power consumption. United Memories Inc., Colorado Springs, Colo., and Sony Corp., Tokyo, Japan, jointly presented the second eDRAM presentation, which took aim at high-performance graphics applications. Mitsubishi ended the session with a second eDRAM macro that the company developed to deliver a very flexible solution.

Power-Cutting eDRAM Macro

Blocks of eDRAM have been incorporated into graphics controllers and other high-performance systems where capacity and system performance were the prime considerations. But new portable systems that need the density of DRAM, yet must operate at considerably lower power levels than graphics systems, don't yet have a low-power eDRAM solution. That promises to change, though, thanks to a 32-Mbit eDRAM macro developed by Mitsubishi and Matsushita. The device trims active power to less than 200 mW when operating at 230 MHz, and standby power to just 125 mW. The companies also developed a 64-Mbit macro.

To achieve the low power drain, the chip's designers were able to back off a little on device performance because the intended market (MP3 players, MPEG devices, and other small handheld devices) wasn't as performance-critical as, for instance, the laptop graphics market. To reduce performance losses due to operating from a 1-V supply, four levels of copper interconnect and low-resistance poly-metal gates that have a resistance of only 4 Ω/square are deployed.

Targeting high-bandwidth applications required in a 3D graphics engine, the United Memories/Sony design team crafted a 16-Mbit DDR macro that can transfer data at an effective rate of 1.43 Gbits/s on each I/O line. The macro actually clocks at 714 MHz and provides 256 data inputs plus 256 separate data outputs. Its designers employed local read data drivers instead of pass gates to achieve a relatively low active power and fast reads.
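The per-line rate follows from the clock, and the same numbers give a feel for the macro's total read bandwidth. The sketch below is a back-of-the-envelope calculation only: the aggregate figure is derived here from the 256 outputs and the 714-MHz clock, it isn't quoted in the paper summary, and the function names are illustrative.

```python
def ddr_line_rate_gbits(clock_mhz: float) -> float:
    """Double data rate: two transfers per clock on each I/O line."""
    return 2 * clock_mhz / 1000.0

def aggregate_gbytes_per_s(lines: int, clock_mhz: float) -> float:
    """Peak bandwidth across all lines in one direction, in Gbytes/s."""
    return lines * ddr_line_rate_gbits(clock_mhz) / 8.0

if __name__ == "__main__":
    print(ddr_line_rate_gbits(714))          # per-line rate in Gbits/s
    print(aggregate_gbytes_per_s(256, 714))  # peak read bandwidth in Gbytes/s
```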

These drivers are composed of a three-state NMOS push-pull differential driver with a control input. Each driver has a pair of local read-data lines as inputs and drives a pair of read-data lines. The drivers cut power consumption by reducing the capacitance and signal swing on the nonprecharged local read-data lines. Speed is improved as well because the drivers act as low-impedance buffers between the read amplifier and the read-data lines.

Focusing on flexibility, researchers at Mitsubishi created an eDRAM macro that can be configured in any of 120,000 options. The macro also may be set to operate from a 1.2-V supply for clock speeds of up to 100 MHz, and from a supply of 1.8 V if operation at up to 200 MHz is desired. To ease testing, its designers implemented an on-chip tester that can test the various macros and accelerate the test time by a factor of 64. This is accomplished by testing 512 bits in parallel and then determining the repair sequence if bad bits are detected.

The ability to integrate lots of memory on processor chips has almost changed the way that we look at CPUs. On some CPUs, approximately two-thirds of the chip area is now consumed by first- and second-level SRAM caches, raising the question, is the chip now an intelligent memory, or still a CPU? Researchers at Hewlett-Packard Co., Ft. Collins, Colo., in Session 11, have taken that first approach to heart in describing the design of a PA-RISC processor as a 900-MHz, 2.25-Mbyte cache with an on-chip CPU. The chip contains almost double the amount of memory that's on the 500-MHz processor the company unveiled in 1999. This performance is obtained by using a 0.18-µm silicon-on-insulator process that includes copper interconnect for the 900-MHz operating speed.

Another paper in the SRAM session detailed a 32-kbyte SRAM four-way set-associative cache macro developed by Hitachi Ltd., Tokyo, Japan, and its U.S. division, Hitachi Semiconductor America Inc., San Jose, Calif. Although the density is modest, the unique aspect of this macro is its wide operating-voltage range—from 0.65 to 2 V. That allows the macro to be used in a wide variety of chip designs that can operate at frequencies from 120 MHz for the low-voltage options to over 1 GHz when run from 2 V. The cache performance comes from using a voltage-adapted timing-generation scheme with plural dummy cells and a lithographical-symmetric memory-cell design.

Advances in flash and other nonvolatile memory technologies, extensively covered in Session 2, show how flash memory densities have practically caught up with those of DRAMs. Highlighting that effort, designers from Samsung Electronics presented their work toward creating a 1-Gbit multilevel NAND flash memory that can operate from a 3.3-V supply. The multilevel memory cell, as in previous multilevel cells, allows the company to store two bits in every memory cell, thereby cutting in half the number of memory cells required to hold 1 Gbit. But controlling the threshold voltages to define the levels has been one of the biggest challenges in all multilevel cell designs as dimensions shrink. The memory design also includes a fuse that allows the company to turn the multilevel cell into a single-level cell and deliver a 512-Mbit memory if the thresholds aren't as stable as the multilevel cell requires.
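The density arithmetic behind the multilevel cell can be sketched in a few lines. The example below assumes a plain binary mapping of two bits onto four threshold levels; the actual level assignment and threshold voltages used in the Samsung design aren't described here, so the mapping and all names are hypothetical.

```python
GBIT = 2 ** 30

def cells_needed(capacity_bits: int, bits_per_cell: int) -> int:
    """Number of memory cells required to hold a given capacity."""
    return capacity_bits // bits_per_cell

def encode_level(bits: tuple) -> int:
    """Map a (high, low) bit pair to one of four threshold levels (0..3)."""
    high, low = bits
    return (high << 1) | low

def decode_level(level: int) -> tuple:
    """Recover the (high, low) bit pair from a sensed threshold level."""
    return (level >> 1) & 1, level & 1

if __name__ == "__main__":
    cells = cells_needed(GBIT, bits_per_cell=2)
    print(cells)                 # cells in the 1-Gbit multilevel array
    print(cells * 1 // 2 ** 20)  # Mbits if the same cells store one bit each
```

Run with one bit per cell, the same array holds half the data, which is the 512-Mbit fallback the fuse option provides.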

As the word-line pitch is scaled down in the memory array, the threshold voltage of the programmed cell can be influenced by adjacent word-line interference. To compensate for that, designers developed a voltage-ramping circuit that changes the channel potential underneath the selected cell. The result is that the string-select-line coupling is considerably reduced, which in turn drastically reduces the cell threshold-voltage shift due to word-line coupling.

Multilevel memory cells also are used in a 512-Mbit AND-type memory design developed by Hitachi Ltd. Unlike the Samsung device that operates with a 3.3-V supply, the Hitachi chip targets systems that operate from a 1.8-V supply. Because the required threshold levels are internally generated, they're actually independent of the external supply level and won't be affected by the change in the supply level. The voltage generator that boosts the voltage to program the cells will be affected, however, and designers crafted a new parallel-style charge pump that works better with low supply voltages than the traditional serial-type charge pump. By combining the two approaches, designers achieved the desired programming voltage levels (Fig. 2).

Typically when flash memories are programmed, data reads are either locked out, or the programming operation must be suspended until the read operation completes. In many applications, though, both read and write operations have to be done simultaneously to improve system efficiency. Designers at Intel Corp., Folsom, Calif., have addressed that problem with a 64-Mbit flash memory design that not only operates at 1.8 V, but also permits data reads while writing data to the flash array.

To accomplish that, designers created a multipartition architecture that allows programming or erasing in one partition while reading from another partition. Unlike previous partitioned designs that typically only had two partitions, the Intel design allows the memory to be divided into any number of partitions.

In this 64-Mbit design, 4-Mbit partitions are used. Implemented in a 0.18-µm process, the chip offers an asynchronous page-mode access time of 18 ns, and a burst mode that performs synchronous reads at up to 100 MHz with zero wait states. The multiple partitions allow two processors to interleave code operations while program operations take place in the background.
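The read-while-write rule reduces to a simple check: a read is allowed as long as it targets a partition that isn't being programmed or erased. The sketch below models that rule for a 64-Mbit array split into 4-Mbit partitions. It is a behavioral illustration only; the class, its methods, and the bit-addressed interface are hypothetical and say nothing about how the Intel chip implements the feature.

```python
class PartitionedFlash:
    """Toy model of a flash array with independently busy partitions."""

    def __init__(self, total_mbit: int = 64, partition_mbit: int = 4):
        self.partition_bits = partition_mbit * 2 ** 20
        self.num_partitions = total_mbit // partition_mbit
        self.busy = set()  # partitions currently being programmed or erased

    def partition_of(self, bit_address: int) -> int:
        index = bit_address // self.partition_bits
        if not 0 <= index < self.num_partitions:
            raise ValueError("address outside the array")
        return index

    def start_program(self, bit_address: int) -> None:
        self.busy.add(self.partition_of(bit_address))

    def finish_program(self, bit_address: int) -> None:
        self.busy.discard(self.partition_of(bit_address))

    def can_read(self, bit_address: int) -> bool:
        """Reads succeed only in partitions with no program/erase in flight."""
        return self.partition_of(bit_address) not in self.busy

if __name__ == "__main__":
    flash = PartitionedFlash()
    flash.start_program(0)               # program in partition 0
    print(flash.num_partitions)          # partitions in the array
    print(flash.can_read(0))             # same partition: blocked
    print(flash.can_read(4 * 2 ** 20))   # next partition: allowed
```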

One more flash presentation described an embedded 16-Mbit memory module capable of operating from only 1.2 V, which is one of the lowest operating voltages yet reported. Developed by Philips Research Labs, Eindhoven, The Netherlands, the NOR-style architecture also includes error-correction logic, a 128-bit-to-8-bit multiplexer, and a test-mode shift register.

To achieve 1.2-V operation, designers opted to employ a two-transistor-per-cell approach. One additional benefit is that the dual-transistor scheme doesn't suffer from over-erase problems, so designers don't have to use a complex and time-consuming erase algorithm, allowing the entire module to be erased in less than 100 ms. Page program times are short, too—just 5 ms.

The last three papers in Session 2 dealt with the relatively new nonvolatile memory technology of ferroelectric RAMs. A joint paper by Fujitsu Ltd., Tokyo, Japan, and Ramtron International Corp., Colorado Springs, Colo., highlighted a 1-Mbit FRAM design that incorporates a ferro-programmable redundancy scheme. In the second presentation, Samsung Electronics, Kiheung, Korea, showed a 4-Mbit FRAM that includes a new read scheme. This improves memory performance, allowing read-access times of 85 ns. Packing the most ferroelectric storage, an 8-Mbit chip design was also detailed by Toshiba Corp., Yokohama, Japan.

The key aspect of the joint Fujitsu/Ramtron paper is the first use of a single-transistor/single-capacitor design for the memory cell. This is a major step forward in density versus the previous designs that employ a two-transistor/two-capacitor storage cell.

To make the 1T/1C cell design production worthy, its designers added a ferro-based redundancy scheme using the same ferroelectric capacitor as the one in the memory cell. That eliminates the need to add more process steps, which might be necessary if a metal fuse or some other approach were implemented.

In Samsung's 4-Mbit design, engineers created a common-plate folded bit-line architecture in the memory array, which lowers the internal noise level without any appreciable area penalty. This compensates for the bit-line capacitance imbalance without any speed loss, improving overall memory performance.

Pushing FRAM capacities to a new high, the 8-Mbit design from Toshiba employs a novel chain-memory architecture that reduces the size of the memory array when used in conjunction with several other approaches. First detailed in 1998, the chain FRAM concept employs a cross-point type cell. The previous cross-point cells were too large for commercial use, however, so designers developed a new cell arrangement that shifts the cell in the neighboring column by one word-line pitch. This arrangement saves space because two contacts are placed in a gap formed by the bottom capacitor electrodes.

New-Generation Processors

High-performance memories are required to feed data to the new generations of high-speed processors, such as those described in Session 15. In the leadoff paper, engineers at Intel Corp., Chandler, Ariz., showed the design of a performance-scalable StrongArm processor that can operate at 800 MHz and consumes just 1.55 W when powered by a 1.65-V supply. It can also be scaled down to 200 MHz with a power consumption of only 55 mW when powered by a 0.7-V supply. Aggressive power management, gated logic, and a virtually addressed cache scheme are employed to achieve the scalability.
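The two operating points quoted for the StrongArm design can be compared against the familiar first-order rule that dynamic CMOS power scales with the square of the supply voltage times the clock frequency. The sketch below applies that rule as a rough check. It assumes constant switched capacitance and ignores leakage and any change in activity, so it is an estimate, not the designers' own model; the function names are illustrative.

```python
def scaled_dynamic_power(p_ref_w: float, v_ref: float, f_ref_mhz: float,
                         v_new: float, f_new_mhz: float) -> float:
    """First-order estimate: dynamic power scales as V^2 * f."""
    return p_ref_w * (v_new / v_ref) ** 2 * (f_new_mhz / f_ref_mhz)

def energy_per_cycle_nj(power_w: float, f_mhz: float) -> float:
    """Average energy consumed per clock cycle, in nanojoules."""
    return power_w / (f_mhz * 1e6) * 1e9

if __name__ == "__main__":
    estimate = scaled_dynamic_power(1.55, 1.65, 800, 0.7, 200)
    print(round(estimate * 1000, 1))        # estimated mW at 0.7 V, 200 MHz
    print(energy_per_cycle_nj(1.55, 800))   # nJ per cycle at the fast point
    print(energy_per_cycle_nj(0.055, 200))  # nJ per cycle at the slow point
```

The simple rule lands at roughly 70 mW, the same order as the reported 55 mW, and the energy per cycle drops by about a factor of seven between the two points.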

Pushing CPU clock speeds to beyond 1 GHz, a fourth-generation Power4 architecture processor design was unveiled by IBM Corp., Austin, Texas. The Power4 chip includes two independent processor cores, a shared L2 cache, an L3 directory, and all the logic needed to form large symmetrical multiprocessor systems. Over 170 million transistors were used to implement the Power4 chip, which combines 0.18-µm design rules, silicon-on-insulator processing, and seven levels of copper metallization.

Each of the CPU cores on the chip is an out-of-order superscalar design that includes an instruction fetch unit with its own 64-kbyte L1 instruction cache, dual fixed-point and dual floating-point execution units, a pair of load-store execution units with a dual-ported 32-kbyte L1 data cache, and additional execution support logic. Instructions can be issued to each execution unit every cycle. So, up to 200 instructions can be in various stages of execution at any time on the chip.

The unified L2 cache is organized as an eight-way set-associative memory configured with three independent cache controllers. In aggregate, 12 outstanding L2 misses can be supported by the cache. Such high performance with the dual CPUs on the chip will come at a price, though. When powered by a 1.5-V supply, the Power4 chip will consume about 115 W.

IBM also took the wraps off a first-generation processor for what's now called the Z900, formerly known as the S/390 architecture. Able to operate at 1.1 GHz, the processor can run the OS/390 operating system and be used in multiprocessing configurations. This implementation runs approximately 45% faster than previous versions.

The speed improvement stems from a combination of technology scaling, improved full-custom circuit designs, and a four-dimensional high-performance gate library, as well as new logic-synthesis and circuit-tuning algorithms. A short pipeline of just seven stages keeps the architecture relatively simple and allows the performance to stay proportional to the product of clock frequency and the number of instructions per clock, rather than just the clock frequency alone.

Also pushing the clock speeds to 1.2 GHz, designers at Compaq Computer Corp., Shrewsbury, Mass., have enhanced the Alpha microprocessor architecture to deliver a bus bandwidth of 44.8 Gbytes/s. A 1.75-Mbyte on-chip ECC-protected second-level cache that's seven-way set associative delivers a bandwidth of 19.2 Gbytes/s.

Dual Rambus memory controllers support eight Rambus channels, each running at 800 Mbits/s (400 MHz). Additionally, the processor incorporates four 6.4-Gbyte/s interprocessor communication ports and a separate I/O port that also transfers data at up to 6.4 Gbytes/s. Packing about 152 million transistors, the processor consumes even more power than the Power4 CPU—an estimated 125 W from a 1.5-V supply. Seven levels of copper interconnect are used to minimize signal delays and metal migration.

Two additional papers in the session focused on highly integrated embedded processors. The first, from MIPS Technologies Inc., Mountain View, Calif., showed off aspects of the R20K processor. It combines the MIPS64 instruction set and the MIPS-3D application-specific instruction extensions that accelerate geometry processing for 3D graphics. The processor consists of a seven-stage pipeline dual-issue core that packs two 64-bit integer execution units, plus an IEEE-754-compliant floating-point unit that supports the paired-single, single-instruction/multiple-data format.

The other embedded processor is the first implementation by Sun Microsystems Inc., Palo Alto, Calif., of the MAJC multithreaded dual 32-bit microprocessor. Actually a system on a chip, the dual-CPU solution delivers a throughput of 6 GFLOPS and 13 gigaoperations/s when clocked at 500 MHz. The CPUs share a four-way set-associative 16-kbyte data cache and an on-chip 4-Gbyte/s switch. Each CPU is a four-issue very-long-instruction-word engine with its own 16-kbyte instruction cache.

To round out the system interface, the chip includes a graphics preprocessor that supports real-time 3D geometry operations. Also on-chip is a direct Rambus controller that can transfer data at peak rates of up to 1.6 Gbytes/s when clocked at 400 MHz. Two additional high-speed parallel interfaces support data transfers of up to 4 Gbytes/s when clocked at 250 MHz. An on-chip PCI interface provides transfers at up to 264 Mbytes/s when clocked at 66 MHz.
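Dividing each quoted peak rate by its clock shows how many bytes an interface must move per clock cycle. The sketch below does only that division using the figures above; how those bytes split between bus width and transfers per clock isn't stated in the text, and the function name is illustrative.

```python
def bytes_per_clock(rate_mbytes_s: float, clock_mhz: float) -> float:
    """Bytes moved per clock cycle implied by a peak rate and a clock."""
    return rate_mbytes_s / clock_mhz

if __name__ == "__main__":
    interfaces = {
        "Rambus controller": (1600, 400),
        "parallel interface": (4000, 250),
        "PCI": (264, 66),
    }
    for name, (rate, clock) in interfaces.items():
        print(name, bytes_per_clock(rate, clock))
```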

Additional processor developments and processor subsections were highlighted in Session 20 as well. In that session, Hewlett-Packard detailed a 1-GHz implementation of its PA-RISC processor, which is a speed upgrade to the 900-MHz cache with on-chip CPU described in Session 11. Intel, though, showed off an IA32 processor in Session 20 that includes a 4-GHz integer execution unit.

Although most of the IA32 processor runs at 2 GHz, and the system bus logic and execution trace cache run at 1 GHz, the tight core containing two integer ALUs, as well as their bypass logic and schedulers, runs at 4 GHz. This 4-GHz core is kept as small as possible to minimize metal lengths and loading.

The most sensitive loops, including the ALU and first-level data-cache access, together with their innermost bypasses, are contained entirely in the high-speed core. Functions that don't need to be in the key low-latency ALU and L1 data-cache loops are placed elsewhere.

ALU operations are performed in a sequence of three fast cycles. In the first, the low-order 16 bits are computed and immediately available to feed the low 16 bits of a dependent operation in the next fast cycle. The high-order 16 bits are processed in the following fast cycle, and that computation uses the 16-bit carry just generated by the first computation. This result will be available to the next instruction, which has already done its low-order part. The ALU flags are then processed in the third fast cycle. This entire operation is referred to as a staggered Add.
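The three-step sequence can be mimicked in software to show why a dependent instruction can start after the first fast cycle. The sketch below is a functional model of a 32-bit staggered add, not a description of Intel's circuit; the particular flags computed in the third step are an assumption made for illustration.

```python
MASK16 = 0xFFFF

def staggered_add(a: int, b: int):
    """32-bit add split into low half, high half, then flags."""
    # Fast cycle 1: low-order 16 bits, ready for a dependent operation.
    low_sum = (a & MASK16) + (b & MASK16)
    low = low_sum & MASK16
    carry16 = low_sum >> 16

    # Fast cycle 2: high-order 16 bits, using the carry out of bit 15.
    high_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry16
    high = high_sum & MASK16
    carry_out = high_sum >> 16

    # Fast cycle 3: flags computed from the complete result.
    result = (high << 16) | low
    flags = {
        "zero": result == 0,
        "carry": bool(carry_out),
        "sign": bool(result >> 31),
    }
    return result, flags

if __name__ == "__main__":
    print(staggered_add(0x0000FFFF, 0x00000001))
    print(staggered_add(0xFFFFFFFF, 0x00000001))
```

A dependent add needs only the low half of its operand in its own first cycle, so it can begin one fast cycle after its producer rather than waiting for the full result.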

The remaining papers in Session 20 dealt with achieving high performance in pieces of advanced processors. For example, IBM Corp., Boeblingen, Germany, showed how it achieved 1.8-GHz performance in an instruction-window buffer. This buffer implements portions of the processor that are used for renaming, reservation station, and reorder buffering. To obtain the high speed, engineers opted for a mix of static and delayed-reset dynamic circuit macros to implement the window buffer.

Very-high-speed 64-bit ALUs were the subject of a second presentation by Intel. The company developed the ALUs as part of a high-performance 64-bit execution unit that would be employed in multiple instances as part of a high-throughput server. Based on a single-rail, radix-2, Han-Carlson adder core, the ALU includes two wide multiplexer stages and a write-back bus (Fig. 3). It performs carry-merge operations on alternate bit slices and, therefore, requires only half as many carry-merge gates as does a conventional Kogge-Stone adder. The ALU also has around 21% less active leakage, and an efficient energy-delay characteristic making it well suited for a 64-bit CMOS ALU.
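The saving in carry-merge gates can be seen by building both adder networks in software and counting the merge operations. The sketch below uses the textbook Kogge-Stone and Han-Carlson constructions. It is not Intel's circuit, which also includes the multiplexer stages and other logic, and the way merges are counted here is this example's own convention.

```python
def _generate_propagate(a, b, width):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    return g, p

def _merge(hi, lo):
    """Carry-merge: combine (G, P) of a group with the group just below it."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]

def _sum_bits(p, prefix, width):
    total = 0
    for i in range(width):
        carry_in = prefix[i - 1][0] if i else 0
        total |= (p[i] ^ carry_in) << i
    return total

def kogge_stone_add(a, b, width=64):
    """Return (sum mod 2**width, merge count); merges at every bit position."""
    g, p = _generate_propagate(a, b, width)
    prefix, merges, d = list(zip(g, p)), 0, 1
    while d < width:
        nxt = prefix[:]
        for i in range(d, width):
            nxt[i] = _merge(prefix[i], prefix[i - d])
            merges += 1
        prefix, d = nxt, d * 2
    return _sum_bits(p, prefix, width), merges

def han_carlson_add(a, b, width=64):
    """Return (sum mod 2**width, merge count); tree runs on odd bits only."""
    g, p = _generate_propagate(a, b, width)
    prefix, merges = list(zip(g, p)), 0
    for i in range(1, width, 2):          # pair each odd bit with its neighbor
        prefix[i] = _merge(prefix[i], prefix[i - 1])
        merges += 1
    d = 2
    while d < width:                      # prefix tree on alternate bit slices
        nxt = prefix[:]
        for i in range(d + 1, width, 2):
            nxt[i] = _merge(prefix[i], prefix[i - d])
            merges += 1
        prefix, d = nxt, d * 2
    for i in range(2, width, 2):          # final stage fills in the even bits
        prefix[i] = _merge(prefix[i], prefix[i - 1])
        merges += 1
    return _sum_bits(p, prefix, width), merges

if __name__ == "__main__":
    a, b = 0x123456789ABCDEF0, 0x0FEDCBA987654321
    print(kogge_stone_add(a, b))
    print(han_carlson_add(a, b))
```

At 64 bits this construction needs 321 merge cells for Kogge-Stone and 192 for Han-Carlson. Within the prefix tree itself each Han-Carlson level does about half the work, and the extra pairing and final stages account for the rest.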

One interesting challenge is taking a CMOS processor that employs aluminum interconnects and 0.35-µm design rules, and updating it to a 0.18-µm process with seven layers of copper interconnect. Designers at Compaq took up that challenge and updated one of the company's Alpha processors. The updated version can now clock at more than 1.3 GHz when powered by a 1.65-V supply. But it still consumes a respectable amount of power, 65 W, even though that value is considerably lower than that of the 0.35-µm version.

As part of the switch from aluminum to copper, designers also had to modify some of the circuit structures and selectively employ low-threshold-voltage transistors to solve some circuit-speed issues. Path candidates were determined by timing analysis, but paths that included dynamic nodes were forbidden.

Getting the processors to clock at gigahertz speeds requires solid clock-generation circuitry and superior clock-distribution networks on the chips. Presentations in Session 25 demonstrated the ability to implement on-chip PLLs with frequencies as high as 4 GHz and edge timing skews of less than 10 ps. For example, designers at SiByte Inc., Santa Clara, Calif., crafted a 4-GHz PLL that includes its own voltage regulation to achieve a 40-dB minimum power-supply rejection ratio. Peak cycle-to-cycle jitter is less than 25 ps when running at 700 MHz with a 500-mV step on the regulator's 3.3-V supply.

A joint presentation by Multigig Corp. Ltd. and North Carolina State University, Raleigh, showed how designers leveraged rings of differential lines that are driven by inverter pairs. These lines distribute a low-skew, low-jitter clock over an arbitrarily large die area. Demonstrations of 950-MHz and 3.42-GHz rings were detailed.

A multigigahertz clocking scheme developed by Intel for its Pentium 4 processor shows how two tightly synchronized PLLs can be used to generate CPU core and I/O clocks while keeping the skew to just 20 ps. The clocks are distributed to 47 domains and are adjusted for supply, loading, and on-chip variations to control the skew.