Intermittent memory failures can be perplexing to debug. The root of such failures may stem from a single cause or a combination of causes, including BIOS errors, protocol errors, signal-integrity issues, hardware failure, and memory or other subsystem problems. Though some engineering teams resolve memory failures quickly, many flounder in frustration while debugging intermittent failures.
This article outlines a debug methodology for approaching intermittent memory failures. These methods are applicable to DDR, DDR2, and the SDRAM side of Fully Buffered DIMM system debug. The actual probing solution will vary depending on the connector in use, or if the memory is embedded. Examples are provided to illustrate how different root causes of memory failures were uncovered. Engineers debugging systems that don't boot or that fail memory tests repeatedly also will benefit from the debug methods outlined here.
The basic debug methodology for determining the root cause of an intermittent memory failure goes through three stages. First, determine if the failure is repeatable. Try to duplicate the conditions of the failure; doing so often provides valuable insight into the nature of the problem. Second, connect a logic analyzer to the memory bus with a probe or interposer to gain rapid insight into the timing relationships of the entire DDR2 bus, parts-per-million errors, clock quality, and protocol errors. Third, conduct parametric measurements using a high-performance scope with high-bandwidth probing. This involves probing at the receiving end of signals. To capture WRITE bursts to memory, probe at the SDRAM. To capture READ bursts from memory, probe at the memory controller.

Stage 1

When attempting to recreate failure conditions, keep in mind that the root cause of the problem can come from a subsystem or from applications that aren't directly connected to memory. Local-area-network (LAN) access, power sequences of subsystems, entering and exiting sleep modes, and power cycles can be important to consider when evaluating memory failures. Crosstalk and conflicting resources from various subsystems, modes, and cycles also have led to many intermittent memory failures.
Isolating a problem during a specific test or set of conditions makes the problem easier to evaluate. For instance, failure during a specific test could point to the software routine, or to signal-integrity issues like crosstalk or intersymbol interference. With a repeatable failure, you can take multiple measurements under the conditions of the failure.
Duplicating the conditions is often easier said than done, though. Details to consider include reviewing error logs and identifying what software was running at the time of failure. Note the BIOS, operating system, and applications running prior to the failure. Environmental variations can also affect system failures. What was the room temperature when the system failed? Check the airflow to the system.
Hardware considerations are numerous. Is the power to the system within specifications? Has a system of this same design ever passed validation tests? Do other systems fail, or is this failure unit-specific? What are the revisions on the board, DIMM, processor, or other components of the failed system? How does the failed system differ from working systems? Have there been recent component changes in manufacturing?
If conditions are repeatable, run your tests under those conditions. If not, choose a robust memory test and vary the test conditions, such as temperature and power-supply limits, in a methodical manner.

Stage 2

For debugging DDR systems, a logic analyzer complements high-speed oscilloscopes. A logic analyzer with a DDR probe, interposer, or direct attach provides rapid insight into system problem areas across the entire DDR bus. While the logic analyzer lacks the resolution and analog measurement capabilities of a scope, the ability to view all signals on the bus relative to each other offers designers immense value.
You can save time by narrowing down problem areas quickly with the logic-analyzer tools. After determining suspect signals with the analyzer, use a high-performance scope to inspect the problem in more detail. Logic-analyzer systems offer up to 64M-deep state traces, with protocol decode to translate commands for functional validation. Simultaneously with the state capture, the analyzer also records 64k-deep traces of high-resolution timing analysis across the entire DDR bus through the same connection (Timing Zoom). The viewing area of the 64k trace depth is adjustable about the trigger from 100% pre-trigger to 100% post-trigger.
High-resolution eye-diagram measurements on logic analyzers make it possible to identify parts-per-million errors. Eye measurements also supply insight at a glance across all signals sampled by the Command Clock, CK0/CK0#. (Using eye measurements on data signals is somewhat more complex, given tri-stated strobes and a shift in setup and hold times between READ and WRITE cycles.)
Global markers (up to 1024) can be set automatically from search functions. The global markers track between waveform and listing windows to allow for different views of suspect areas.
A post-process software-analysis routine can perform the same measurements as done manually with global markers on the logic analyzer. Designers can write their own programs with software-analysis options available to the logic-analyzer application. Software-analysis tools offer additional insight and automation of repetitive tasks and measurements. Thus, statistically valid results become a reality with automated measurements.
The software application can determine if one or more data bits have consistently marginal data-valid windows. Also, the location of the smallest data-valid window is recorded. For example, a particularly insightful automated measurement on the timing zoom informs us that the average data-valid window of data bursts was 2.1 ns. However, there were four 500-ps data-valid windows. Such a dramatic variation in data-valid windows is a concern even though the data-valid window could be within specification, given the resolution of the timing zoom measurement.
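A post-processing routine of this kind can be sketched in a few lines of Python. This is a minimal sketch under the assumption that the analyzer trace has been exported as per-bit lists of (start, end) valid-window intervals in nanoseconds; the function name and data format are hypothetical, not an actual analyzer API:

```python
# Sketch of an automated data-valid-window check (hypothetical data format).
# Each tuple is the (start_ns, end_ns) interval during which a data bit was
# stable around its sampling point.

def window_stats(valid_windows, floor_ns=1.0):
    """Return (average window width, list of widths narrower than floor_ns)."""
    widths = [end - start for start, end in valid_windows]
    average = sum(widths) / len(widths)
    marginal = [w for w in widths if w < floor_ns]
    return average, marginal

# Example data: mostly ~2.1-ns windows with one 0.5-ns outlier, mirroring
# the kind of variation described in the text.
trace = [(0.0, 2.1), (5.0, 7.1), (10.0, 10.5), (15.0, 17.2)]
avg, marginal = window_stats(trace, floor_ns=1.0)
print(f"average window: {avg:.2f} ns, marginal windows: {marginal}")
```

Scripting the statistics this way makes it easy to re-run the same screen over thousands of bursts and flag only the outliers for closer inspection.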
Having identified the smallest data-valid windows, you should validate the actual data-valid window with a high-performance scope. On the logic analyzer, you can more closely observe the suspect area for patterns and relationships in the situation involving the worst-case data-valid window. You will want to check which bank and address were being read from at the time of the smallest data-valid window. Was a certain data sequence apparent during each of the suspect data transmissions? The more information we uncover, the quicker we can resolve the root cause of the problem.
Measurements of interest from Timing Zoom traces, such as the one shown in Figure 1, include the clock period, refresh rate, precharge interval, and relative measurement of data-valid windows comparing different data bits. Use markers or simply mouse over a trace for a flyout of the transition width.
Also available is row-access strobe (RAS) and column-access strobe (CAS) latency, measured from a valid command (rising edge of Command Clock CK0, with CS low, during a WRITE/ READ command) to the rising edge of the first data strobe during the data burst. Furthermore, there's RAS and CAS delay, measured from a valid Active (the rising edge of Command Clock, S0 = 0, with command = Activate) to valid WRITE/CAS. All of these measurements can be automated with an advanced customization environment (e.g., the Agilent B4606A on the 16900 series logic-analyzer systems).
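As a sketch of how a measurement like RAS-to-CAS delay might be automated outside the analyzer environment, the following Python routine scans a protocol-decoded command list and reports the Activate-to-READ/WRITE delay per bank. The trace format (timestamp, command, bank) is hypothetical, standing in for whatever export format your analyzer provides:

```python
# Hypothetical decoded trace: (timestamp_ns, command, bank) tuples, as might
# be exported from a protocol-decoded state capture.

def activate_to_rw_delays(trace):
    """Measure the delay from each ACTIVATE to the next READ/WRITE on that bank."""
    last_activate = {}   # bank -> timestamp of most recent ACTIVATE
    delays = []
    for t, cmd, bank in trace:
        if cmd == "ACTIVATE":
            last_activate[bank] = t
        elif cmd in ("READ", "WRITE") and bank in last_activate:
            delays.append(t - last_activate.pop(bank))
    return delays

trace = [
    (0, "ACTIVATE", 0),
    (15, "READ", 0),      # Activate-to-READ delay = 15 ns for bank 0
    (20, "ACTIVATE", 1),
    (38, "WRITE", 1),     # Activate-to-WRITE delay = 18 ns for bank 1
]
print(activate_to_rw_delays(trace))  # -> [15, 18]
```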
A noticeable area of concern in Figure 1, identified with markers, is that S0 (Chip Select) is occasionally enabled within 250 ps of the rising edge of CK0 (Command Clock). This is a possible violation of the setup-and-hold-time specification, TSETUP/THOLD (TS/TH). To verify the setup and hold time correctly, we need to probe CK0/CK0# and Chip Select at the SDRAM with a high-speed scope and probe. Note that marginal TS/TH for any signal can lead to intermittent or consistent memory failures.
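Before the scope measurement, a quick software screen over exported edge timestamps can count how often Chip Select switches inside the setup/hold window of a clock edge. The following is a sketch only; the edge-list format and the function are hypothetical illustrations, not an analyzer feature:

```python
# Sketch of a setup/hold screen on exported edge timestamps (hypothetical
# format): check each clock rising edge against the nearest S0 transitions.
import bisect

def setup_hold_violations(clock_edges_ns, s0_edges_ns, ts_ns, th_ns):
    """Return clock edges with an S0 transition inside the TS/TH window."""
    violations = []
    for ck in clock_edges_ns:
        # Find the first S0 transition at or after (ck - TS); any transition
        # in the window [ck - TS, ck + TH) violates setup or hold.
        i = bisect.bisect_left(s0_edges_ns, ck - ts_ns)
        if i < len(s0_edges_ns) and s0_edges_ns[i] < ck + th_ns:
            violations.append(ck)
    return violations

clock = [10.0, 20.0, 30.0]
s0 = [9.75, 18.0, 31.5]   # 9.75 ns is only 250 ps before the 10.0-ns edge
print(setup_hold_violations(clock, s0, ts_ns=0.5, th_ns=0.5))  # -> [10.0]
```

A screen like this only flags candidates; the marginal edges it finds still need parametric verification at the SDRAM with the scope.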
Before we break out the scope probes to characterize the TS/TH of S0, we can use eye-diagram measurements on the logic analyzer to further evaluate marginal timing relationships. With eye scan, you can identify parts-per-million errors that would show up as speckles inside the eye. In the example shown in Figure 2, there's no evidence of parts-per-million errors. But other useful information can be gleaned.
Looking again at the Figure 2 eye measurement, CK0 is the white square wave, and S0 is the triangular wave forming an eye around the rising edge of CK0. Slow rise time on S0 might be the cause of intermittent system failures in this system. These slow edges degrade the eye and decrease TSETUP.
The system from Figure 2 requires that an oscilloscope do the final characterization of TSETUP for Chip Select. Our next example of rapid memory-system insight with a logic analyzer shows how adding colorized filtering offers the ability to perform an overview of memory access through pattern recognition, yielding rapid insight into protocol errors.
In this example, a colorized filter on the logic analyzer is set up to help locate closed-page violations, where a READ or WRITE command to a bank isn't initiated with a bank activate. Color filters were set to provide shades of red for Bank 0 (B0) and blue for Bank 1 (B1); hot pink = B0 activate and red = B0 READ; turquoise = B1 activate and light blue = B1 READ. Color filtering lets engineers use pattern recognition while viewing waveforms to recognize areas that require further investigation.
In Figure 3, B0 was activated (hot pink) prior to a series of READ commands to B0 (red). However, there's no B1 activation (turquoise) prior to the READ from B1 (light blue) on the right of the screen. This indicates a problem if the system is following the closed-page policy and only allows one open bank at a time.
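The same closed-page check can be automated rather than eyeballed. Here is a minimal Python sketch over a decoded command stream; the (timestamp, command, bank) trace format is a hypothetical stand-in for an analyzer export, and the routine permits multiple open banks, which is a simplification of the strict one-open-bank policy described above:

```python
# Sketch of a closed-page-policy check over a decoded command stream
# (hypothetical trace format: (timestamp_ns, command, bank) tuples).

def closed_page_violations(trace):
    """Flag READ/WRITE commands to a bank with no prior ACTIVATE,
    treating PRECHARGE as closing the bank again."""
    open_banks = set()
    violations = []
    for t, cmd, bank in trace:
        if cmd == "ACTIVATE":
            open_banks.add(bank)
        elif cmd == "PRECHARGE":
            open_banks.discard(bank)
        elif cmd in ("READ", "WRITE") and bank not in open_banks:
            violations.append((t, cmd, bank))
    return violations

trace = [
    (0, "ACTIVATE", 0),
    (15, "READ", 0),   # fine: bank 0 is open
    (30, "READ", 1),   # violation: bank 1 was never activated
]
print(closed_page_violations(trace))  # -> [(30, 'READ', 1)]
```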
Our last example of logic-analyzer tools also involves using eye measurements. Eye-measurement tools supply a single-voltage-threshold eye diagram of signals relative to the clock edge (referenced at 0 s), spanning -5 ns to +5 ns. They provide insight at a glance regarding clock duty cycle, noise and signal-integrity issues, data-valid windows, eye closure, and channel-to-channel skew. Eye measurements are the fastest method for calibrating the logic analyzer's sampling position.
In Figure 4, the upper screen shows eye-finder results on a system with a clean clock. From the eye-finder results, we notice that the duty cycle of CommandClk is 50%, as evidenced by the equal-size white areas (eyes) on either side of T = 0. A slender transitional area (yellow) for CommandClk at T = 0 indicates a clean clock edge.
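The duty-cycle observation can also be made numerically from exported edge timestamps. This sketch assumes alternating rising/falling edge lists in nanoseconds, a hypothetical export format:

```python
# Sketch: estimate clock duty cycle from rising- and falling-edge
# timestamps (hypothetical export format from a timing capture).

def duty_cycle(rising_ns, falling_ns):
    """Average high-time fraction: pair each rising edge with the next
    falling edge, and divide by the period to the next rising edge."""
    ratios = []
    for i, r in enumerate(rising_ns[:-1]):
        f = next(f for f in falling_ns if f > r)
        period = rising_ns[i + 1] - r
        ratios.append((f - r) / period)
    return sum(ratios) / len(ratios)

# An ideal 5-ns clock with symmetric high and low times.
rising = [0.0, 5.0, 10.0]
falling = [2.5, 7.5, 12.5]
print(f"duty cycle: {duty_cycle(rising, falling):.0%}")  # -> duty cycle: 50%
```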
The lower screen shows a DDR system with a dirty, or noisy, clock. We determine that the clock is dirty by looking at the eye-finder results. The transition area of CommandClk is wide, and the single-ended eyes sampled off of CK0 and CK0# aren't symmetrical. Asymmetrical eyes also could indicate that the logic analyzer's threshold is incorrect.

Stage 3

Once the problem is narrowed down to suspect signals, parametric measurements with a high-speed oscilloscope and probing system are often required to determine the root cause of failures. Logic analyzers don't have the resolution required to characterize a DDR system.
For DDR2 measurements, a 20-Gsample/s, 6-GHz scope with 7-GHz probes yields accurate measurements for system characterization. The fully buffered DIMM (FBD) is a newer memory technology: although it uses standard DDR2 SDRAM, it routes the traffic through an advanced memory buffer (AMB), which sends the data out on a 4.8-Gbit/s bidirectional link. Characterizing the channel side of a fully buffered DIMM therefore requires a scope with a minimum of 10 GHz of bandwidth, with 13 GHz being ideal.
Typically, characterization measurements include TS/TH, rise time, clock overshoot, frequency, and jitter. Jitter-analysis packages are particularly useful when digging into "dirty clock" problems. Eye measurements, eye masks, and eye-unfolding software provide detailed insight into signal behavior.
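As an illustration of the jitter statistics involved, the following Python sketch computes mean period, peak-to-peak period jitter, and maximum cycle-to-cycle jitter from a list of clock rising-edge timestamps. The data format is hypothetical, and a real jitter-analysis package does far more (TIE, spectral decomposition, random/deterministic separation):

```python
# Sketch of basic period and cycle-to-cycle jitter statistics computed from
# clock rising-edge timestamps (hypothetical exported data, times in ns).

def jitter_stats(edges_ns):
    periods = [b - a for a, b in zip(edges_ns, edges_ns[1:])]
    c2c = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return {
        "mean_period": sum(periods) / len(periods),
        "pk_pk_period_jitter": max(periods) - min(periods),
        "max_cycle_to_cycle": max(c2c),
    }

# A nominally 5-ns clock with one early edge at t = 14.8 ns.
edges = [0.0, 5.0, 10.0, 14.8, 20.0]
stats = jitter_stats(edges)
print(stats)
```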
Probe placement is critical for making accurate parametric measurements for signal characterization. Probe READ data and strobes at the memory controller, and probe WRITE data and strobes at the SDRAM.
Figure 5 is an eye measurement of DQS0 relative to the rising and falling edges of DQS5 at T = 0. The measurement was taken at an interposer in the DIMM slot. Note that the eye for WRITE strobes is large and well shaped. The probe location on the interposer is close enough to the SDRAM so that the signal is clear of reflections.
The READ strobes are degraded by reflections at the interposer. The eye is adequate for relative measurements of pulse width, but that position on the bus is inadequate for actual characterization of the READ traffic. Figure 5 also illustrates the importance of probe placement: the READ signal's amplitude is distorted when viewed at the interposer, bearing only a compressed resemblance to the actual eye at the memory controller. For an accurate view of the READ data as seen by the memory controller, miniature scope probe tips must be placed at the memory controller.
Many memory-technology leaders validate and debug high-speed memory systems using the tools and techniques described in this article. Engineers who embrace these time-saving tools reap the rewards of faster debug and greater insight into system performance.