Interfacing FPGAs To High-Speed DRAMs Puts Designers To The Test

FPGAs are finding greater use as core components in systems for networking, communications, storage, and high-performance computing applications requiring complex data processing.

So, it is now mandatory that FPGA vendors support high-speed, external memory interfaces. Recognizing this, today's FPGAs offer specialized features that allow them to interface directly with a variety of high-performance memory devices. We'll focus here on the design of high-speed DRAM-to-FPGA interfaces. This article describes the challenges and barriers involved with these interfaces and highlights solutions to address these obstacles.

Rest assured that designing high-speed external-memory interfaces is no simple task. Synchronous DRAMs, for example, have evolved into high-performance, high-density memories and are now being used in a host of applications. The latest DRAMs (DDR SDRAM, DDR2, and RLDRAM II) support frequencies ranging from 133 MHz (266 Mbits/s) to 400 MHz (800 Mbits/s).

Thus, designers are often confronted with the challenges of DQ-DQS phase management, tight timing constraints, signal-integrity issues, and simultaneously switching output (SSO) noise. Plus, certain board-design issues could prolong design cycles or force designers to accept reduced performance. To make matters worse, all of these hurdles become more pronounced at high frequencies.

DQ-DQS PHASE-RELATIONSHIP MANAGEMENT

DDR SDRAMs rely on a data strobe signal (DQS) to achieve high-speed operation. DQS is a non-continuous-running strobe used for clocking data on the DQ lines. It's transmitted externally along with the data signals (DQ) to ensure that they track each other with temperature and voltage changes. The DDR SDRAM uses on-chip delay-locked loops (DLLs) to output DQS relative to the corresponding DQs.

The phase relationship between the DQ and DQS signals is important for DDR SDRAM and DDR2 interfaces. When writing to the DRAM, the memory controller in the FPGA must generate a DQS signal that's center-aligned within the DQ data signals. When reading from the memory device, the DQS coming into the FPGA is edge-aligned with respect to the DQ signals (Fig. 1).

Upon receiving the DQS signal, the memory controller must phase-shift it to be center-aligned with the DQ signals. The amount of time that the DQS must be delayed is governed by board-induced skew between the DQS and DQ groups, the resulting data-valid window at the controller, and the sampling-window requirements at the controller input registers.

This is one of the most challenging requirements for DRAM controller designs. Memory-interface designers can employ one of several techniques to align the DQS to the center of the data-valid window: board trace delay on DQS, on-chip delay elements on DQS, on-chip DLLs, or phase-locked loops (PLLs).

Board Trace Delay On DQS: This is the traditional approach for aligning DQS and a related DQ group. But the technique is inefficient and proves to be a performance barrier in sophisticated systems for the following reasons:

Using the 400-Mbit/s case as an example, the nominal delay for DQS with respect to DQ is 1.25 ns (assuming that the required phase shift for center-aligning the DQS signal with the DQ signal is 90°). To achieve this delay, approximately 7 to 8 in. of trace length must be added to the DQS line (based on an approximate delay of 160 ps/in. for an FR4 laminate microstrip with a 50-Ω characteristic impedance). Not only does this complicate board layout, it also can result in increased board cost if extra signal layers are required. This is especially true when interfacing with DIMMs, since routing the additional length needed for each DQS signal can be difficult.
The required delay and resulting trace length must be accurately predetermined. This locks the interface to a specific frequency, leaving designers little flexibility. Any changes in interface frequency would require laying out the board again.
Increased trace length also results in higher loss on the DQS line. Thus, rise and fall times are compromised, limiting the maximum attainable frequency.
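
The trace-length arithmetic above can be reproduced with a short calculation. The following Python sketch is illustrative only: the function name and the extra data rates in the loop are assumptions, while the 90° shift and the 160-ps/in. figure come from the example above.

```python
def dqs_trace_length_in(data_rate_mbps, phase_deg=90.0, delay_ps_per_in=160.0):
    """Estimate the extra DQS trace length (inches) for a given phase shift.

    data_rate_mbps  -- per-pin DDR data rate in Mbits/s (two bits per clock)
    phase_deg       -- required DQS phase shift (90 degrees assumed)
    delay_ps_per_in -- propagation delay of the trace (160 ps/in. assumed)
    """
    clock_mhz = data_rate_mbps / 2.0          # DDR transfers on both clock edges
    period_ps = 1e6 / clock_mhz               # clock period in picoseconds
    delay_ps = period_ps * phase_deg / 360.0  # delay needed on DQS
    return delay_ps / delay_ps_per_in


if __name__ == "__main__":
    # Hypothetical set of data rates chosen for illustration.
    for rate in (266, 333, 400):
        print(f"{rate} Mbits/s: {dqs_trace_length_in(rate):.1f} in. of added trace")
```

For the 400-Mbit/s case, the sketch returns roughly 7.8 in., consistent with the 7 to 8 in. cited above. Lower data rates need even more trace, because the same 90° shift corresponds to a longer delay.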

On-Chip Delay Elements: This approach uses a number of delay elements connected in series to achieve a predetermined delay. The delay, and the corresponding number of delay elements required to achieve it, must be calculated for each frequency bin based on the frequency of operation. Designers can then employ a combination of coarse and fine delays to fine-tune the delay to the desired value. However, delay elements are inherently susceptible to process, voltage, and temperature (PVT) variations, which can be up to ±40%. This variation in delay decreases the effective sampling window for the controller, and it doesn't scale with frequency. This limitation makes the approach useful only for lower frequencies (133 MHz and below).
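
To get a feel for the size of the PVT problem, the following Python sketch applies the ±40% figure to a delay chain tuned for a 90° shift. It is a simplified illustration, not a timing model: the function name is hypothetical, and it assumes the whole chain varies uniformly by the stated tolerance.

```python
def delay_chain_range(clock_mhz, phase_deg=90.0, pvt_tolerance=0.40):
    """Return (nominal, minimum, maximum) chain delay in ns.

    Assumes the chain is tuned to the nominal phase shift and that PVT
    moves the entire chain by +/- pvt_tolerance (a simplification).
    """
    period_ns = 1e3 / clock_mhz
    nominal_ns = period_ns * phase_deg / 360.0
    min_ns = nominal_ns * (1.0 - pvt_tolerance)
    max_ns = nominal_ns * (1.0 + pvt_tolerance)
    return nominal_ns, min_ns, max_ns


if __name__ == "__main__":
    for clock_mhz in (133.0, 200.0):
        nominal, lo, hi = delay_chain_range(clock_mhz)
        bit_time_ns = (1e3 / clock_mhz) / 2.0  # one DDR bit time
        print(f"{clock_mhz:.0f} MHz: nominal {nominal:.2f} ns, "
              f"range {lo:.2f} to {hi:.2f} ns, bit time {bit_time_ns:.2f} ns")
```

At 200 MHz, the nominal 1.25-ns delay can land anywhere from about 0.75 ns to 1.75 ns, a 1-ns spread against a 2.5-ns bit time, before any other timing uncertainty is counted.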

On-Chip DLLs: To solve the design issues in the above two implementations, designers can utilize on-chip DLLs to introduce delay onto the DQS lines. By using a reference clock at the desired interface frequency and basing the required delay as a percent of that clock period, the DLLs can then pick the right number of delay elements to achieve the desired delay.

For example, Altera's Stratix and Stratix II FPGAs use this method to achieve the 90° DQS phase shift during the read operation. These FPGAs feature on-chip DQS phase-shift circuitry and dedicated DQS-DQ I/O groups at the top and bottom of the chip. When not interfacing with external memory, these pins can be used as general-purpose I/Os.

However, when interfacing with external memories such as DDR SDRAM, these pins must be used for DQS. Each DQS signal is associated with a group of DQ signals. DQS:DQ group ratios can be 1:4, 1:8, 1:16, 1:18, 1:32, or 1:36 when using Stratix II FPGAs and 1:8, 1:16, or 1:32 with Stratix FPGAs.

The dedicated DQS pins tie internally to a set of delay elements before being routed to the I/O input registers. The cumulative delay of these elements is controlled by the DQS phase-shift circuitry. The dedicated DQS phase-shift circuitry, which consists of a DLL and control circuitry, enables automatic, on-chip delay insertion on incoming DQS signals during a read operation. This DQS phase-shift circuitry uses a frequency reference to generate control signals for the delay elements on each of the dedicated DQS pins, allowing it to compensate for PVT variations. Further, to minimize channel-to-channel skew, the phase-shifted DQS signal is transferred to the DQ I/O elements (IOE) via a balanced clock network.
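
The idea behind DLL-controlled delay, choosing the number of delay elements as a fraction of the reference clock period, can be sketched in a few lines of Python. This is a conceptual model only and does not describe any vendor's actual circuit: the function name and the per-element delays (30, 50, and 70 ps, that is, 50 ps ±40%) are assumptions chosen for illustration.

```python
def elements_for_phase(period_ps, element_delay_ps, phase_deg=90.0):
    """Pick how many delay elements best approximate the target phase shift.

    period_ps        -- reference clock period at the interface frequency
    element_delay_ps -- actual per-element delay at the current PVT corner
    Returns (element count, achieved delay in ps).
    """
    target_ps = period_ps * phase_deg / 360.0
    count = max(1, round(target_ps / element_delay_ps))
    return count, count * element_delay_ps


if __name__ == "__main__":
    period_ps = 5000.0  # 200-MHz reference clock
    # Hypothetical per-element delays at three PVT corners.
    for corner, element_ps in (("fast", 30.0), ("typical", 50.0), ("slow", 70.0)):
        count, achieved = elements_for_phase(period_ps, element_ps)
        print(f"{corner:8s} {element_ps:4.0f} ps/element: "
              f"{count} elements, {achieved:.0f} ps")
```

Because the element count is recomputed against the reference period, the achieved delay in this model stays within half an element delay of the 1250-ps target at every corner, instead of drifting by ±40% as a fixed chain would.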

RESYNCHRONIZATION OF READ DATA TO THE SYSTEM CLOCK

Another challenging aspect of DRAM interface design is converting the read data from the DQS clock domain to the system clock domain. Read data from the DRAM is first captured in the DQS clock domain in the memory controller. This data then must be transferred to the system clock domain. To ensure that the DQ signals are captured correctly in the FPGA, designers need to determine the skew between the DQS and system clocks. Minimum and maximum timing analysis must be performed on the following to calculate the skew accurately (Fig. 2):

Delay from PLL clock output to the pin (t_PD1)
Clock board trace delay (t_PD2)
Access window of DQS from clock (t_DQSCK from the DDR memory data sheet)
DQS board trace delay (t_PD3)
Delay from the DQS pin at the FPGA to the I/O element (t_PD4)
Micro clock-to-out time for the I/O element register (t_CO1)
Delay from the I/O register to the resynchronization register (t_PD5)

To find a safe resynchronization window, designers need to calculate the minimum and maximum delay of the system by adding all of the delays (also known as the round-trip delay) listed above (Fig. 3). The resynchronization window can be obtained by using the following equation:

Resynchronization Window = minimum round-trip delay + one clock period − maximum round-trip delay − maximum micro setup and hold time of the resynchronization register
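
As a worked illustration of this equation, the Python sketch below sums the seven delay components into minimum and maximum round-trip delays and then computes the window. Every numeric value is a hypothetical placeholder, not taken from any data sheet; real values come from the FPGA timing reports, the memory data sheet, and board analysis.

```python
def resync_window_ns(min_delays_ns, max_delays_ns, clock_period_ns,
                     max_setup_hold_ns):
    """Return (min round trip, max round trip, resynchronization window)."""
    min_round_trip = sum(min_delays_ns.values())
    max_round_trip = sum(max_delays_ns.values())
    window = (min_round_trip + clock_period_ns
              - max_round_trip - max_setup_hold_ns)
    return min_round_trip, max_round_trip, window


# Hypothetical delay values in ns. t_DQSCK is a +/- figure, so its
# minimum is negative.
MIN_DELAYS = {"t_PD1": 1.0, "t_PD2": 0.4, "t_DQSCK": -0.6, "t_PD3": 0.4,
              "t_PD4": 0.8, "t_CO1": 0.2, "t_PD5": 0.5}
MAX_DELAYS = {"t_PD1": 1.6, "t_PD2": 0.5, "t_DQSCK": 0.6, "t_PD3": 0.5,
              "t_PD4": 1.3, "t_CO1": 0.4, "t_PD5": 0.9}

if __name__ == "__main__":
    lo, hi, window = resync_window_ns(MIN_DELAYS, MAX_DELAYS,
                                      clock_period_ns=5.0,
                                      max_setup_hold_ns=0.3)
    print(f"round-trip delay: {lo:.2f} to {hi:.2f} ns, window: {window:.2f} ns")
```

A positive result means a window exists in which the resynchronization register can safely sample the data. A zero or negative result means that the spread between the minimum and maximum round-trip delays, plus the register's setup and hold requirement, consumes the entire clock period.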

If the system clock edge falls outside of the resynchronization window, the designer needs to use another phase-shifted PLL output clock so that the edge will be within this window. The task of calculating the round-trip delay and estimating the clock phase for the resynchronization clock is error-prone as well as time-consuming.

Many times, designers use trial and error to find out the resynchronization clock phase. Some FPGA vendors provide design aids to reduce or eliminate this trial-and-error process. For example, Altera's memory-controller IP cores come with a round-trip delay calculator that lets designers calculate the resynchronization window for their particular system. Designers can input trace delay and other delay components specific to their system. The round-trip delay calculator will estimate the skew between the system clock and the DQS domain. If a phase-shifted output from the PLL is required, it will also specify the amount of phase shift required to capture data correctly.

Another technique for resynchronization is to use a feedback clock and an additional Read PLL, as shown in Figure 4. The board trace length for the feedback clock, FB_CLK, from the memory should match the board trace lengths of the DQ and DQS signals. The FB_CLK is connected to the DRAM CLK pin and routed back to the FPGA. The Read PLL phase-shifts the incoming clock FB_CLK so that it correctly captures the read data from the DQS domain to the system clock domain. The phase-shift amount is the sum of the ±t_DQSCK value from the DRAM; any board-trace skew between DQS, CLK, and FB_CLK traces; and the delay between the IOE register and resynchronization register.

SIGNAL-INTEGRITY AND BOARD-DESIGN CHALLENGES

Another common challenge associated with memory-interface design is maintaining signal integrity. The wide bus widths of these interfaces introduce simultaneously switching noise (SSN), which has the potential to cause bit errors. In addition, improper termination or board design can lead to poor signal quality due to crosstalk, signal attenuation, and noise. All of these factors adversely affect the system's performance and reliability. So, proper board design is key to building robust memory interfaces. Here are some basic board-layout guidelines for memory interfaces:

Match trace lengths to avoid skew between signals.
Route DQ, DQS, and CLK at least 30 mils away from other signals to avoid crosstalk.
Use one 0.1-µF capacitor per two termination resistors.
Implement precision resistors (within 1% to 2%).
Use an integrated V_TT regulator specially designed for DRAM V_TT.
Route V_REF at least 20 mm away from other signals.
Shield V_REF with V_SS on one side, and with V_DDQ on the other side.

Furthermore, SSN can be minimized by selecting the right I/O placements, using programmable power and ground pins, slowing the I/O slew rates, and selecting the right decoupling scheme. In a worst-case scenario for a one-DIMM system, as many as 81 drivers (64 data, eight ECC, and nine strobe signals) may be switching states on a memory module. An additional 28 signals may be transitioning on the controller at the same time in a pipelined access.

Traditional methods for providing decoupling involve placing capacitors in convenient locations, based on the routing of the board, and applying some predetermined ratio of capacitors to driver pins. Unfortunately, the higher switching speeds of today's DRAM may render such typical ratios less useful. The critical limiting factor in designing a decoupling system is usually not just the amount of capacitance, but also the amount of inductance in the capacitor leads and the vias attaching the capacitors to the power and ground planes. V_TT decoupling capacitors should be placed very close to the parallel pull-ups on the motherboard and connected between V_TT and ground.

It's important to strictly follow the board-design guidelines provided by memory and FPGA vendors. To ensure first-time success of the memory-interface design, thorough signal-integrity analysis must be performed on a system level. Tools like HSPICE, SPECCTRAQuest, XTK, and HyperLynx are solid options for signal-integrity analysis. Another recommendation is that designers use demonstration platforms to verify designs before porting them to their systems. This debugging stage is crucial in achieving first-time design success. FPGA vendors provide demonstration platforms and specific design guidelines for interfacing memory with their FPGAs.

TIMING CHALLENGES

High-speed memory-interface design can take a lot of time, with numerous functional and timing requirements to be met. Minimizing clock jitter, channel-to-channel skew, duty-cycle distortion, and system noise all play an integral role in increasing the available timing margin, which in turn improves system reliability under all operating conditions. In addition, the DRAM state machine must be correctly implemented, and care must be taken for proper initialization and refresh of the DRAM cells.

Designers need to perform thorough verifications to ensure that the design meets all timing and functional requirements. Four categories of timing analysis must be performed: write data timing, address and command timing, read capture using DQS, and the resynchronization of captured read data to the system clock domain. For system-level verifications, behavioral models of DRAMs can be obtained from Denali Inc., the de facto memory-model provider to the industry.

To simplify the memory-interface design process and reduce design-cycle time, it's recommended that designers use memory-controller IP cores provided by FPGA vendors or third-party companies. Today's IP cores come with an easy-to-use graphical interface. They're parameterizable, so designers can build a controller that fits their system requirements. For example, our DDR SDRAM controller core lets designers customize the controller to meet specific interface requirements, including clock speed, data bus width, number of chip selects, and memory properties.

In summary, high-speed memory interfaces are challenging to build, and designers need to consider several factors before designing these interfaces. Detailed timing analysis should be performed, and system-level verification is a must. Good-quality memory-interface support alleviates design challenges and speeds up the design process. Selecting an FPGA for memory-interface design requires a thorough understanding of the hardware features supported in the FPGA and the support structure surrounding it. Memory-controller IP cores, software and tool support, simulation models, demonstration platforms, and good documentation are all critical for memory-interface design.