Quad-Data-Rate SRAM Subsystems Maximize System Performance

Demand for higher-speed systems is a direct result of the Internet boom. RISC CPU speeds are hitting clock rates of 500 MHz and beyond. But static-memory subsystems are still hard-pressed to keep pace, even with the appearance of double-data-rate SRAMs.

One of the quicker solutions is to implement a memory subsystem that employs the latest quad-data-rate (QDR) SRAMs. These memories provide a high-performance architecture targeted at the next generation of switches and routers, which operate at data rates above 200 MHz. Compared to existing memory solutions, QDR SRAMs are expected to greatly increase system-memory bandwidth as well as serve as the main memory for lookup tables, linked lists, and controller buffer memory.

The architecture that they employ enables chip performance to surpass that of existing solutions. Data throughputs of 11.592 Gbits/s are possible. That's about four times the performance of comparable SRAMs in today's market. The QDR architecture was jointly developed by Cypress Semiconductor, IDT, and Micron Technology.

As with any new architecture, the supporting circuits that control it and interface to it are few and far between. Programmable-logic devices, most notably field-programmable gate arrays (FPGAs), can be used to fill the breach. They implement the control and interface logic required to tie CPUs to QDR SRAMs. To see how, consider the design of a QDR SRAM memory controller that's based on a high-speed FPGA from Xilinx.

Before going into it, though, look at the use of SRAMs in data-communication systems. Initially, they were put to work as data buffers, linked-list tables, and pointer tables. The reason was that they allow fast, low-latency access to memory.

Networking applications usually obtain their high bandwidth by having extremely quick memory access. Because of that, they've always used the fastest available SRAMs. Most applications started with asynchronous SRAMs. These SRAMs operated in the 10- to 15-ns speed range.

As demands on networking systems began to increase, applications used the next-best available memory, the pipelined-burst SRAM (PBSRAM). With these SRAMs, networking applications operated with a higher bandwidth, and they simplified the interface by employing synchronous transfers rather than using asynchronous control.

PBSRAMs were optimized for PC cache applications, however, in which access to the memory is dominated by reads with very few writes. That way, wait states between reads and writes don't limit the performance of the caches. Networking applications typically have equal amounts of reads and writes to memory, so PBSRAMs could only offer limited incremental performance.

This limitation led to the development of no-bus-latency SRAMs (NoBL SRAMs). Similar to the zero-bus-turnaround (ZBT) devices offered by other SRAM suppliers, the modified architectures allowed networking applications to operate without any wait states between reads and writes.

NoBL SRAMs enable the complete use of memory bandwidth, which significantly improves the bandwidth of networking applications. The QDR architecture was developed to further improve the interface's bandwidth. It also overcomes several limitations of PBSRAMs and NoBL SRAMs.

The QDR SRAM has separate input and output ports for read and write. Although those ports share address lines, separate differential clocks exist for the input and output ports. Data can be transferred using double-data-rate (DDR) protocols on both input and output ports. Four words can be transferred on every clock cycle: two in and two out of the device (hence the name quad data rate).
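The "four words per clock" arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 100-MHz, 18-bit figures are taken from the controller example discussed later in this article.

```python
def words_per_clock(ports: int = 2, edges_per_clock: int = 2) -> int:
    """Words moved per clock: two ports (in and out), two edges each."""
    return ports * edges_per_clock


def peak_throughput_bits_per_s(clock_hz: int, width_bits: int,
                               ports: int = 2, edges_per_clock: int = 2) -> int:
    """Peak bits per second summed over the read and write ports."""
    return clock_hz * width_bits * words_per_clock(ports, edges_per_clock)


if __name__ == "__main__":
    # An 18-bit QDR interface clocked at 100 MHz.
    rate = peak_throughput_bits_per_s(100_000_000, 18)
    print(f"{rate / 1e9:.1f} Gbits/s")
```

The figure is a peak value. It assumes that a read and a write are issued on every clock cycle, which is the traffic pattern the separate ports are meant to serve.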

These SRAMs are currently available in two types: QDR2 and QDR4. The difference between them is the number of words of data that can be obtained from the memory on a single read or write. The QDR2 provides two words of data on a single read, while the QDR4 provides four. The consortium (Cypress, Micron, and IDT) will support both the QDR2 (Cypress' CY7C1302, Micron's MT54V51218E, and IDT's 71T628) and the QDR4 (Cypress' CY7C1304, Micron's MT54V51218A, and IDT's 71T648). The basic block diagram of the Cypress CY7C1302 consists of a 512-kword by 18-bit memory array with separate pins for input and output data (Fig. 1). The address lines are common for the read and write ports. Separate clocks are provided for the input and output ports.

Four clock lines can be found on the QDR SRAMs: K, —K, C, and —C. The differential K and —K clocks are used for sampling the inputs. During a read, the differential C and —C clocks drive data out from the SRAM. All transactions are initiated on the rising edge of the K clock.

The address for a read is latched on that rising edge, while the address for a write is latched in on the rising edge of the —K clock. The data- and byte-write signals for a write are latched on the rising edge of the K and the rising edge of the —K clock. QDR memories also have an option to operate only on the input clocks. This is called the single-clock mode, in which the K and —K clocks push the data out of the SRAM.

Two control signals, —RPS and —WPS, select a read or write operation on the SRAM. These are sampled on the rising edge of the K clock and used to initiate reads or writes. The timing diagram for a simple operation on the two-word-burst QDR chip shows that reads and writes also start on the rising edge of the K clock (Fig. 2). In the first clock cycle, —RPS and —WPS are low. The address for Read (A) is latched on the rising edge of the K clock, while Write (B) is latched on the rising edge of the —K clock.

The data for the write to address B is latched in the same cycle on the rising edges of the K clock (D\[B\]) and the —K clock (D\[B+1\]). The byte-writes are latched on them, too. The SRAM stores the data for the write in registers. This write is completed in a later cycle.

The read for address (A) is completed in cycle 1. On the rising edge of the C clock, the first word of data (Q\[A\]) is driven out of the SRAM. The rising edge of the —C clock pushes out the second word of data (Q\[A+1\]). A 166-MHz QDR SRAM drives out the data in just 2.5 ns.

Back-to-back cycles can be started on the CY7C1302. In clock cycle 2, a read on address C begins even before the data from the read started in the previous cycle is driven out. If a read and a write to the same address are started in the same cycle, the SRAM forwards the data from the write port to the read port and ensures that valid data is driven out on the data bus. Data coherency is guaranteed in all cycles.
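A toy behavioral model helps show what concurrent ports and data coherency mean in practice. The sketch below is not timing accurate and is not taken from any vendor model. The class and method names are hypothetical, byte-writes are left out, and it assumes that a same-address read and write in one cycle returns the newly written words.

```python
class Qdr2Model:
    """Toy cycle-level model of a two-word-burst QDR SRAM (not timing accurate)."""

    def __init__(self, locations: int = 1 << 18, width_bits: int = 18):
        self.locations = locations
        self.mask = (1 << width_bits) - 1
        self.mem = {}  # address -> (word0, word1); unwritten locations read as 0

    def _check(self, addr: int) -> None:
        if not 0 <= addr < self.locations:
            raise ValueError(f"address {addr} out of range")

    def cycle(self, read_addr=None, write_addr=None, write_data=None):
        """Run one clock cycle; return the two read words, or None if no read."""
        pending = None
        if write_addr is not None:
            self._check(write_addr)
            word0, word1 = write_data
            pending = (word0 & self.mask, word1 & self.mask)

        result = None
        if read_addr is not None:
            self._check(read_addr)
            if pending is not None and read_addr == write_addr:
                result = pending  # assumed coherency: forward the write data
            else:
                result = self.mem.get(read_addr, (0, 0))

        if pending is not None:
            self.mem[write_addr] = pending  # commit the registered write
        return result


if __name__ == "__main__":
    sram = Qdr2Model()
    sram.cycle(write_addr=5, write_data=(0x111, 0x222))
    print(sram.cycle(read_addr=5, write_addr=6, write_data=(1, 2)))
```

Because the read and write arguments are independent, one call can carry both operations, which mirrors how the two ports work side by side on every clock.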

To overcome some of those issues with NoBL and PBSRAM architectures, the QDR chips boast separate input and output data ports. These ports solve problems that have dogged all common I/O devices. For example, bus contention occurs frequently during the read/write transitions that are common in a networking environment. It happens when the SRAM drives the data on a read faster than the ASIC can take data off the bus after a write. Most common I/O devices are susceptible to this, because the same bus has to be used for transfers to and from the SRAM. As operating frequencies go up, the chances of this happening increase dramatically. Separating the data buses for read and write operations guarantees that bus contention won't occur.

Common I/O devices also suffer because the bus must be turned around between a read and a write. The flow of data between the controller and the SRAM then becomes non-uniform. Thanks to its separate I/O, the QDR SRAM permits a constant flow of data. Systems can then achieve greater throughput than those that have common I/O SRAMs.

These memories also permit migration to frequency ranges previously unavailable to earlier generations. Standard PBSRAMs and NoBL SRAMs can be made to run at very high frequencies. But operating a common I/O data path at such levels inherently limits the maximum bus frequency. Because they lack the overhead of turning data buses around on read/write transitions, QDR SRAMs can run at the native frequency of the SRAM.

The signal levels used on the I/O pins are HSTL voltage levels, which ease migration to lower voltage levels. HSTL allows low-swing, high-speed operation of the inputs and outputs. By using 166-MHz QDR SRAMs and 166-MHz-capable FPGAs, QDR memories can dramatically improve network system performance when going from a cache to a networking type of application. This comparison assumes that the interface operates at 166 MHz and that the QDR SRAM has an 18-bit interface, while the synchronous pipelined SRAM, NoBL, and DDR SRAMs have 36-bit interfaces.

Since commercial memory-controller chips aren't available yet, check out the design of a circuit that will interface a bank of four QDR chips. The chips are connected in a depth-expansion mode to a host CPU. A Xilinx Spartan-II FPGA will be used to implement the control logic for the memory interface (Fig. 3).

Each of the QDR SRAMs gets separate control signals for the read and write ports, while the address and data ports are common for all of the SRAMs. These SRAMs form a 2-Mword by 18-bit storage array. The controller generates all of the signals for the memory bank. It also supports concurrent DDR operations on all of the inputs and outputs, and allows byte-write operations on the memory bank.

Operating in the single-clock mode, the controller simplifies the memory interface. At 100 MHz, it provides a bandwidth of 7.2 Gbits/s. The controller employs a command-based interface with a 2-bit command input (01 read, 10 write, 11 read/write) and has independent read and write state machines (Fig. 4). Those state machines are shown separately to simplify the understanding of the memory controller's operation. Each of their sections runs in a pipelined fashion. The other inputs to the controller include clock (Clk), write address (Waddress), read address (Raddress), write data (Wdata), read data (Rdata), and byte-write control (BWS\[0, 1\]) signals.

From the controller's point of view, all signals are relative to the SRAM clocks. It sets up the address and data inputs within the SRAM's setup-/hold-time window requirements. The state machine provides the addresses and data on certain clock edges, while the SRAM latches in the addresses and data on those edges. The memory controller looks at the command signal (Cmd) on its input port on every clock. Depending on that command, either read, write, or read/write operations are completed on the bank of memory. The memory controller generates the Read Port Selects (—RPSs) and Write Port Selects (—WPSs) for the different SRAMs, depending on the state of the command (Cmd \[0, 1\]) inputs and the higher-order address lines.
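The select generation just described can be modeled as a small decode function. The Python sketch below rests on assumptions that the article does not spell out: bit 0 of the command means read and bit 1 means write (which matches the 01/10/11 encoding), command 00 is treated as idle, and the top two bits of a 20-bit host address pick one of the four SRAMs. All names are hypothetical.

```python
CMD_READ = 0b01        # assumed: bit 0 requests a read
CMD_WRITE = 0b10       # assumed: bit 1 requests a write
BANKS = 4              # four SRAMs in depth expansion
SRAM_ADDR_BITS = 18    # address lines shared by all SRAMs


def split_address(addr20: int):
    """Split a 20-bit host address into (bank index, 18-bit SRAM address)."""
    bank = (addr20 >> SRAM_ADDR_BITS) & (BANKS - 1)
    sram_addr = addr20 & ((1 << SRAM_ADDR_BITS) - 1)
    return bank, sram_addr


def port_selects(cmd: int, raddress: int = 0, waddress: int = 0):
    """Return (rps_n, wps_n), one active-low select per bank (0 = selected)."""
    rps_n = [1] * BANKS
    wps_n = [1] * BANKS
    if cmd & CMD_READ:
        rps_n[split_address(raddress)[0]] = 0
    if cmd & CMD_WRITE:
        wps_n[split_address(waddress)[0]] = 0
    return tuple(rps_n), tuple(wps_n)


if __name__ == "__main__":
    # Read from bank 2 and write to bank 3 in the same cycle.
    print(port_selects(0b11, raddress=(2 << 18) | 5, waddress=3 << 18))
```

Because the read and write selects are computed independently, a read/write command can target two different SRAMs, or the same one, in a single cycle.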

Traditionally, designing memory systems that operate at speeds above 100 MHz has required external chips to minimize clock skew. But some of the latest FPGAs, like the Spartan-II, incorporate multiple on-chip delay-locked loops (DLLs) that can deskew the internal global clock network. Or, they deskew clocks fed off-chip to other system components. Such on-chip DLLs eliminate the need for external clock-management devices, simplifying system design.

A pair of DLLs on the FPGA can be utilized to achieve zero clock skew between the FPGA's on-chip clock and the QDR SRAM clock (Fig. 5). In addition to clock deskewing, the Spartan-II DLLs provide features like phase adjustment, frequency division, and frequency multiplication. When working with DDR and QDR memory devices, the availability of a double-frequency clock that's phase-locked to the system clock is particularly vital. With the on-chip global clock-distribution network, high-speed synchronous I/O resources, and the programmable I/O lines on the FPGA for different signaling standards, the QDR SRAM interface achieves a data throughput of 7.2 Gbits/s.

The memory controller must support DDR transfers on all I/Os, byte-level operation on the memory bank, and concurrent reads and writes to all SRAM blocks. It also has to provide HSTL-compatible interfaces to the QDR SRAMs. At the system level, the controller must have 20 write-address lines, 20 read-address lines, 36 write-data lines, 36 read-data lines, four command signals, four byte-write signals, and one each for the clock, reset, and data-ready signals. On the QDR memory side, it has to provide 18 write-data lines, 18 read-data lines, and 18 address lines, as well as two clock lines, two byte-write lines, and one write-port and one read-port select line.

When performing a write cycle using the QDR_2 SRAM, the controller uses the rising edge of the clock at T1 to place the Write command on the bus, along with the Waddress and Wdata (Fig. 6a). The memory controller latches Waddress and Wdata according to the Write command. Based on address lines WA(19) and WA(20), the memory controller drives —WPS low for the particular bank of memory. It also drives the SRAM_Data lines on T2 and T2'. The SRAM Address is driven on T2'. As it latches that address, QDR_2 completes the write.

Similarly, for a read cycle, at T1 the system places the Read command on the bus, as well as the Read Address (Raddress) (Fig. 6b). The memory controller latches Raddress based on the Read command. Going by the two upper read-address lines, which play the role that WA(19) and WA(20) play in a write, the controller drives —RPS low for a particular bank of memory. It drives the SRAM address on T3. The QDR_2 drives the read data out on the rising edges of T4 and T4'. The controller latches the SRAM data lines on those rising edges, as well. It then drives the system read-data bus on the next cycle, along with the Read Ready signal.

Implementing The Controller

To implement the memory controller, the design can be divided into several sections. Look at the clock-generation portion (Fig. 7). In about a dozen lines of code, a relatively straightforward VHDL description can be written of the clock logic that phase-locks the internal FPGA clock to the system clock and generates a 2X system clock. Similarly, the QDR SRAM interface contains well under a dozen lines of code to implement the DDR interface (Fig. 8a). The system interface is a straightforward setup of the register read, write, and section operations (Fig. 8b).

When synthesized, the resulting logic diagram of the complete controller shows the memory interface with its three 18-bit buses and the host interface with the dual 36-bit data buses and 18-bit address buses (Fig. 9). The FPGA operates internally at 200 MHz. Externally, the buses need only operate at 100 MHz, because the DDR interface transfers data on both the leading and trailing edges. The 36-bit read data path from the host is internally split into two 18-bit sections and latched by separate registers. These registers are clocked at 200 MHz, allowing one to send or receive data on both edges of the clock.
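The width conversion between the 36-bit host side and the 18-bit memory side can be illustrated with a short Python sketch. It is a software analogy for the two 18-bit registers, not the VHDL of Fig. 8. The function names are hypothetical, and sending the low half on the first edge is an assumption about ordering.

```python
WORD_BITS = 18
WORD_MASK = (1 << WORD_BITS) - 1


def split_host_word(data36: int):
    """Split a 36-bit host word into (first, second) 18-bit SRAM words."""
    first = data36 & WORD_MASK                   # assumed: low half on first edge
    second = (data36 >> WORD_BITS) & WORD_MASK   # high half on the second edge
    return first, second


def join_sram_words(first: int, second: int) -> int:
    """Reassemble two 18-bit SRAM words into one 36-bit host word."""
    return ((second & WORD_MASK) << WORD_BITS) | (first & WORD_MASK)


if __name__ == "__main__":
    host_word = (0x2AAAA << WORD_BITS) | 0x15555
    halves = split_host_word(host_word)
    print([hex(h) for h in halves], join_sram_words(*halves) == host_word)
```

Moving one 18-bit half per edge is what lets the external bus run at 100 MHz while still carrying a full 36-bit host word every clock.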

The four on-chip DLLs available on the Spartan-II FPGA family can deskew either the internal global clock network or clocks fed off-chip to other system components. The two DLLs shown permit the controller to achieve zero clock skew between the FPGA's on-chip clock and the QDR SRAM clock.

When working with double- and quadruple-data-rate memory devices, the availability of a double-frequency clock that's phase-locked to the system clock is important. Combined with the on-chip global clock-distribution network, high-speed synchronous I/O resources, and Select-I/O flexible signaling standards, that clock lets the FPGA-to-QDR SRAM interface achieve a data throughput of 7.2 Gbits/s.

The most difficult task for this design is to meet the timing requirements for the QDR_2. All QDR_2 signals are registered in the I/O buffers and use HSTL buffers. For the write cycle, all signals must meet the QDR SRAM's setup- and hold-time requirements. That means dealing with the sum of propagation delays from the Spartan FPGA (clock to output), the board-wiring delay, and the QDR memory setup time. Those delays must total less than the cycle time of the write operation:

FPGA Tco (2.5 ns) + board Tpd (0.6 ns) + QDR SRAM Tsu (0.8 ns) = 3.9 ns

The clock-to-out and QDR setup-time values are 2.5 and 0.8 ns, respectively. Consequently, there's a good margin for board delay. The QDR memory has a hold-time requirement of 0.5 ns.

During the read cycle, data must meet the setup-and-hold time of the FPGA:

QDR SRAM Tco (2.5 ns) + board Tpd (0.6 ns) + Spartan-II Tsu (1.55 ns) = 4.65 ns

The setup-time requirement for the Spartan-II is 1.55 ns. Along with a clock-to-out timing on the QDR SRAM of 2.5 ns, this demand permits a good margin for operation at 100 MHz.
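The two timing budgets can be checked with a few lines of Python. The delay values come from the text, converted to picoseconds to keep the arithmetic exact. The choice of budget is an assumption: the sketch reports slack against both the full 10-ns period of a 100-MHz clock and the 5-ns half period that applies if new data must be captured on every clock edge. The names are hypothetical.

```python
CLOCK_PERIOD_PS = 10_000                 # 100-MHz interface clock
HALF_PERIOD_PS = CLOCK_PERIOD_PS // 2    # assumed budget for DDR transfers


def path_delay_ps(*delays_ps: int) -> int:
    """Total delay along a path: clock-to-out + board delay + setup time."""
    return sum(delays_ps)


def slack_ps(path_ps: int, budget_ps: int) -> int:
    """Positive slack means the path fits within the budget."""
    return budget_ps - path_ps


# Write path: FPGA Tco + board Tpd + QDR SRAM Tsu
WRITE_PATH_PS = path_delay_ps(2500, 600, 800)
# Read path: QDR SRAM Tco + board Tpd + Spartan-II Tsu
READ_PATH_PS = path_delay_ps(2500, 600, 1550)

if __name__ == "__main__":
    for name, path in (("write", WRITE_PATH_PS), ("read", READ_PATH_PS)):
        print(name, path, "ps; slack vs. half period:",
              slack_ps(path, HALF_PERIOD_PS), "ps; vs. full period:",
              slack_ps(path, CLOCK_PERIOD_PS), "ps")
```

Keeping the budget as a parameter makes it easy to rerun the check for a faster clock or a longer board trace.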

Implementing the controller in the FPGA requires two DLLs, two global clock buffers, and 119 I/O buffers. The design can be verified with a back-annotated simulation at 100 MHz. By using a faster Spartan-II FPGA, the interface performance can be improved even further.