Use DSOs To Catch Elusive Bugs In High-Speed Digital Electronics

High-speed digital systems normally require system characterization and debugging, two of the most time-consuming parts of product development. Unpredictable system-integration problems can easily cause schedules to slip, which has created a need for the test and measurement features of today's digital storage oscilloscopes (DSOs). A DSO's signal capture, display, and analysis capabilities can be used to debug and characterize signals effectively, simplifying both processes and reducing time to market.

When selecting a scope for debugging and characterizing digital systems, it is important to consider the scope's ability to not only capture and faithfully display the signals of interest, but also to analyze them. The latter aids in solving problems by speeding up the debug and characterization process.

Modern DSOs provide a broad range of capture, display, and analysis capabilities. These include a variety of trigger types, color-graded and persistence (analog-like) displays, automated parameter measurements, histograms, trending, pass/fail testing, and sequential capture with time stamps. For effective signal analysis, it is also important to consider the DSO's processing speed and memory.

Two examples demonstrate the use of DSOs to debug and characterize digital systems. Both traditional and newer techniques are available as a result of advances in the capture, display, and analysis of digital signals using DSOs.

The test results of an early production run of an Intel 386DX-based controller board indicate many instruction-cache diagnostic failures. Some boards successfully pass the cache diagnostic test and subsequently fail when running the application program, while others always fail. All boards run properly when their caches are disabled. Due to the diagnostic program's limitations and the nature of the failures, it is not possible to determine the reason for the failures.

A simple debug program that exercises the bus is run on a defective board. The debug program loops successfully, and intermittent failures do not occur. An analog or digital scope may be used to observe the cache control signals: write enable (—WE), chip enable (—CE), and output enable (—OE). On all nine cache chips, the control signals appear normal in shape, duration, and level.

Using a DSO, the basic bus cycle is confirmed and found to be operating normally. Figure 1 shows (from top to bottom) the 66.7-MHz CPU clock, the address/data bus stable flag (—ADS), —OE, and the cache —CE. A DSO is used because it can be more effective than an analog scope in viewing nonrepetitive signals.

The bus logic appears to behave normally. This points to an address or data problem. Two critical timing paths are associated with the cache controller: one for the cache address latch enable, and one for the SRAM output enable (Fig. 2).

For simplicity, a persistence display is used to look for anomalies. Figure 3 shows the bus clock and one of the cache data lines, displayed using an LC574 DSO in color-graded persistence mode. In this display mode, the most common events are indicated by the hottest color—that is, red. As the figure demonstrates, persistence displays can be difficult to interpret when viewing bus signals. The data bit displays complex behavior, including undershoot, "runts," and edges with different rise times (which makes the clock appear to jitter in the display).
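The idea behind a color-graded persistence display can be sketched in a few lines of Python. This is an illustrative model only, not the LC574's actual algorithm: each acquired sweep increments a hit counter for every (sample index, voltage bin) cell it touches, and the most frequently hit cells are mapped to the hottest color. The function names, bin count, color ramp, and sample data are all assumptions made for the example.

```python
from collections import Counter

def accumulate_persistence(sweeps, v_min, v_max, v_bins):
    """Count how often each (sample index, voltage bin) cell is hit."""
    hits = Counter()
    span = v_max - v_min
    for sweep in sweeps:
        for i, v in enumerate(sweep):
            b = int((v - v_min) / span * v_bins)
            b = min(max(b, 0), v_bins - 1)  # clamp out-of-range samples
            hits[(i, b)] += 1
    return hits

def color_grade(hits, ramp=("blue", "green", "yellow", "red")):
    """Map hit counts onto a color ramp; the most common cells get the last color."""
    peak = max(hits.values())
    graded = {}
    for cell, n in hits.items():
        idx = min(int((n - 1) / peak * len(ramp)), len(ramp) - 1)
        graded[cell] = ramp[idx]
    return graded

# Four made-up sweeps of a rising data line (volts); the third has a slower edge.
sweeps = [
    [0.0, 0.1, 3.2, 3.3],
    [0.0, 0.2, 3.1, 3.3],
    [0.0, 0.1, 1.6, 3.3],
    [0.0, 0.1, 3.2, 3.3],
]
hits = accumulate_persistence(sweeps, v_min=-0.5, v_max=3.5, v_bins=8)
graded = color_grade(hits)
for cell in sorted(graded):
    print(cell, hits[cell], graded[cell])
```

The rare slow edge lands in a cell of its own and is drawn in the coldest color, which is why infrequent events remain visible but are easy to overlook among the legal bus activity.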

At first glance, it would appear that there are a variety of timing problems. Closer examination, however, reveals that the display actually shows normal system behavior. The runts appearing near the rising edge of the data line are approximately one CPU clock in duration—a perfectly legal condition, considering a complete bus cycle requires two CPU clocks. The undershoot, while undesirable, is well within expected values.

The CPU data bus is multiplexed, making it impossible to tell which device is driving it at any given time. Variations in propagation time from one device driver to another can easily explain the observed 2 ns to 3 ns variation in fall time. The larger rise time variations are most likely caused by timing differences between cache and CPU cycles.

In general, viewing digital signals by using a persistence display, without the benefit of a qualifying trigger, provides little useful information. In this case, observing the cache data and address lines, using a persistence display while triggering on the cache output enable, proves to be more effective.

Figure 4 shows (from top to bottom) the clock, data line D6, data line D4, and —OE. Using the scope's relative time cursors, a 3-ns delay is measured between the crossings of the two data lines.

Additional measurements indicate that the falling edge of data line D4 is slower than all of the other cache data lines. The slow edge may be the cause of the failure, since the cache's SRAM chip requires a maximum —OE-to-data delay of 10 ns to meet the microprocessor's setup time spec.

The accumulated results lead to a conclusion: there is a cache-read problem, due to a defective SRAM. Replacing the SRAM, however, does not solve the problem. This should have been expected, because more than one board is failing in the same way.

The signals are viewed again using single-shot acquisition. The timing is analyzed using the scope's automated measurement parameters.

The Δt@lev parameter is used to measure the delay from the falling edge of the cache-output enable to both the rising and falling edges of D4 and D6 (Figs. 5 and 6). The Δt@lev parameter can accurately measure time delays between two signals by independently selecting a reference level, edge (positive, negative, or first edge), and hysteresis setting for each signal.
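A measurement of this kind can be approximated offline on captured samples. The Python sketch below is not LeCroy's implementation; it assumes two waveforms sampled on a common time base, finds the first qualifying level crossing on the reference, then the first qualifying crossing on the second signal after it, and reports the difference. The function names, the linear interpolation between samples, and the waveform values are all hypothetical.

```python
def first_crossing(times, volts, level, edge="first", hysteresis=0.0, after=None):
    """Interpolated time of the first qualifying crossing of `level`, or None.

    edge is "pos", "neg", or "first" (either direction). A crossing counts only
    if the signal has first gone beyond the opposite hysteresis threshold.
    """
    lo = level - hysteresis / 2.0
    hi = level + hysteresis / 2.0
    armed_pos = armed_neg = False
    for i in range(1, len(volts)):
        t0, t1 = times[i - 1], times[i]
        v0, v1 = volts[i - 1], volts[i]
        if v0 <= lo:
            armed_pos = True
        if v0 >= hi:
            armed_neg = True
        rising = armed_pos and v0 < level <= v1
        falling = armed_neg and v0 > level >= v1
        if rising:
            armed_pos = False
        if falling:
            armed_neg = False
        if (rising and edge in ("pos", "first")) or (falling and edge in ("neg", "first")):
            t = t0 + (level - v0) / (v1 - v0) * (t1 - t0)
            if after is None or t > after:
                return t
    return None

def delta_t_at_level(times, ref, ref_level, ref_edge, sig, sig_level, sig_edge,
                     ref_hyst=0.0, sig_hyst=0.0):
    """Delay from the reference crossing to the next crossing on the signal."""
    t_ref = first_crossing(times, ref, ref_level, ref_edge, ref_hyst)
    if t_ref is None:
        return None
    t_sig = first_crossing(times, sig, sig_level, sig_edge, sig_hyst, after=t_ref)
    return None if t_sig is None else t_sig - t_ref

# Made-up samples every 2 ns (volts); not the values from Figs. 5 and 6.
times = [0, 2, 4, 6, 8, 10, 12, 14, 16]
oe = [3.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
d6 = [3.0, 3.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0]
d4 = [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 1.0, 0.0, 0.0]

delay_d6 = delta_t_at_level(times, oe, 1.5, "neg", d6, 1.5, "neg", 0.5, 0.5)
delay_d4 = delta_t_at_level(times, oe, 1.5, "neg", d4, 1.5, "neg", 0.5, 0.5)
print(delay_d6, delay_d4, delay_d4 - delay_d6)
```

Setting the level, edge, and hysteresis separately for each signal is what lets the same routine time a falling —OE against either edge of a data line.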

Automated parameter measurements are performed on the signal region lying between the cursors. This provides the capability of measuring the timing on selected cycles.

If the cursors are moved from one cache-read cycle to the next, the measurements show a small difference in rise time between two data bits. The falling edges of the data bits are of great interest, because they exhibit a significant difference of more than 3 ns. Oddly enough, it is also apparent that there is no undershoot on D4 during cache reads.

It often proves helpful to quantify, and develop an understanding of, the repetitive nature of the problem. Modern scopes provide powerful statistical analysis capabilities. In addition, they have the processing power to collect automated parameter measurement statistics on selected parameters over many sweeps, as shown in Figure 7.

Over time, the average difference in fall time between the two data bits is approximately 3.8 ns (11.82 - 8.04). The standard deviation of the measured delay is relatively low, indicating that the cause of the problem is very stable.
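The same arithmetic can be reproduced on exported measurement data. In the Python sketch below, only the two averages (11.82 ns and 8.04 ns) come from the measurements above; the individual per-sweep readings are invented so that they average to those values, and the `summarize` helper is a hypothetical stand-in for the scope's statistics display.

```python
import statistics

def summarize(values):
    """Return the kind of statistics a scope reports for one parameter."""
    return {
        "sweeps": len(values),
        "average": statistics.mean(values),
        "sdev": statistics.stdev(values),
        "low": min(values),
        "high": max(values),
    }

# Invented per-sweep readings in ns, chosen to average 11.82 and 8.04.
d4_ns = [11.80, 11.85, 11.79, 11.84, 11.82]
d6_ns = [8.02, 8.05, 8.06, 8.03, 8.04]

d4 = summarize(d4_ns)
d6 = summarize(d6_ns)
skew_ns = d4["average"] - d6["average"]

print(f"D4 average {d4['average']:.2f} ns, sdev {d4['sdev']:.3f} ns")
print(f"D6 average {d6['average']:.2f} ns, sdev {d6['sdev']:.3f} ns")
print(f"difference {skew_ns:.2f} ns")
```

A standard deviation that is small compared with the difference between the averages indicates a fixed, repeatable offset rather than random jitter, which points toward a physical cause.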

Based on the visual, parametric, and statistical results collected, it appears that the problem is due to the pc board itself. Sure enough, a check of the trace for D4, between the SRAM and the CPU, indicates an impedance greater than 200 Ω! This high value explains both the extended fall time and the reduced undershoot. A jumper wire is used to bypass the suspected trace, and the failure is eliminated.

A microprocessor-based system, similar to the one in the first example, fails intermittently and generates sporadic DRAM parity errors. The failures appear to be unrelated to the operating mode, and the memory diagnostic indicates seemingly random single- and multiple-bit errors.

The DRAM data and address buses are examined at the various failure points indicated by the diagnostic. A logic analyzer is used to verify important control signals, while observing the buses. Yet, there is no indication of any logic-related problems. Repetitive reading and writing to the failing locations does not provide any additional insight.

The evidence thus far leads to the conclusion that the failure may be noise-related. To gain additional insight, the multiplexed address bus is examined using a DSO, which can better measure the characteristics of the non-repetitive bus signals.

An LC574 scope, with LeCroy's Smart Trigger capability, is used for further investigation. The interval trigger mode is selected to trigger the DSO when the interval between negative edges on the address bus is less than or equal to 55 ns—slightly less than two bus clocks. Address signals will not normally change states in less than two clock cycles unless there is a problem, which will trigger the scope. With this trigger setup, each address line is probed until one triggers the scope—indicating a potentially problematic address line.
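The trigger condition itself is simple to state in code. The Python sketch below models it on a list of falling-edge times rather than on live hardware; it is not how the scope's trigger circuit works internally. The function name and the edge times are hypothetical, while the 55-ns limit and the 33-MHz bus clock are taken from this example.

```python
def interval_trigger(edge_times_ns, max_interval_ns):
    """Return the times of edges that arrive within max_interval_ns of the previous edge."""
    return [
        later
        for earlier, later in zip(edge_times_ns, edge_times_ns[1:])
        if later - earlier <= max_interval_ns
    ]

BUS_CLOCK_MHZ = 33.0
two_clocks_ns = 2 * 1000.0 / BUS_CLOCK_MHZ  # about 60.6 ns

# Made-up falling-edge times on one address line, in ns.
falling_edges_ns = [0.0, 120.0, 240.0, 270.0, 390.0]
triggers = interval_trigger(falling_edges_ns, max_interval_ns=55.0)

print(f"two bus clocks = {two_clocks_ns:.1f} ns")
print("trigger at:", triggers)
```

Only the edge at 270 ns follows its predecessor by 55 ns or less, so it is the only one that would fire the trigger; normal address activity, spaced at two bus clocks or more, is ignored.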

Figure 8 shows an Analog Persistence display of (from top to bottom) the 33-MHz bus clock, the low-asserted column address strobe (—CAS), address bit A1, and address bit A0.

The unusual perturbations on address line A0 (shown in trace 4) appear to be a problem. The perturbations, or bumps, occur when —CAS is asserted. As shown by the amplitude cursor, they appear to exceed 2.0 V in some cases. Bit A1 (shown in trace 3) exhibits proper behavior.

LeCroy's Analog Persistence mode displays the events occurring most frequently with the highest trace intensity, as well as those occurring least frequently with the least intensity—similar to an analog scope display. The problem is easily seen with this approach, but the cause is not yet understood. Whether the observed glitches are a result of transmission-line effects, ringback, coupling, or another factor must still be determined.

The appearance of glitches prompts the decision to use a glitch trigger, with suitable values for pulse duration (<12.5 ns) and level (0.5 V). While it is difficult to maintain a stable display by triggering in this manner, the use of the sequence mode makes it possible to capture a number of signal segments for closer examination.
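A glitch trigger of this kind is essentially a pulse-width qualifier. The Python sketch below applies the same two settings (a 0.5-V level and a 12.5-ns maximum width) to sampled data. It assumes positive-going pulses and linear interpolation between samples; the function names and the waveform are hypothetical and do not describe the scope's hardware.

```python
def crossing_time(t0, v0, t1, v1, level):
    """Linearly interpolate the time at which the signal reaches `level`."""
    return t0 + (level - v0) / (v1 - v0) * (t1 - t0)

def find_glitches(times, volts, level, max_width):
    """Return (start, width) for each positive pulse above `level` narrower than max_width."""
    glitches = []
    start = None
    for i in range(1, len(volts)):
        t0, t1 = times[i - 1], times[i]
        v0, v1 = volts[i - 1], volts[i]
        if v0 < level <= v1:
            # Rising through the level: a pulse begins.
            start = crossing_time(t0, v0, t1, v1, level)
        elif v0 >= level > v1 and start is not None:
            # Falling back through the level: the pulse ends.
            width = crossing_time(t0, v0, t1, v1, level) - start
            if width < max_width:
                glitches.append((start, width))
            start = None
    return glitches

# Made-up samples every 5 ns (volts): one narrow bump, then a normal-width pulse.
times = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
volts = [0.0, 0.0, 1.0, 0.0, 0.0, 3.0, 3.0, 3.0, 3.0, 0.0, 0.0]

glitches = find_glitches(times, volts, level=0.5, max_width=12.5)
print(glitches)
```

Because only the narrow bump qualifies, the trigger fires at irregular times, which is consistent with the difficulty of holding a stable display in this mode.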

Sequence mode, along with this DSO's long record length, allows up to 2000 segments to be captured in single-shot mode, minimizing acquisition dead time. The data, as well as the trigger time for each segment, is stored, providing valuable debug information. Each segment contains glitches that satisfy the glitch trigger condition (Fig. 9).

Zoom trace A is used to closely examine each of the captured segments to determine which ones contain the "problem" glitch. Note that the peak of the highlighted glitch is over 2 V.

The DSO's status display indicates the time at which each of the segments was acquired (Fig. 10). From the times displayed, a correlation is immediately apparent. Many of the segments occur at relative times that are multiples of 16 ms, the DRAM refresh period. Based on these results, the DRAM controller is carefully evaluated and replaced, eliminating the problem.
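With many segments, the correlation can be checked numerically instead of by eye. The Python sketch below takes a list of segment trigger times, references them to the first segment, and reports which fall within a tolerance of a multiple of the 16-ms refresh period. The time stamps, the tolerance, and the function names are invented for illustration and are not the values shown in Fig. 10.

```python
def near_multiple(t, period, tol):
    """True if t lies within tol of an integer multiple of period."""
    r = t % period
    return min(r, period - r) <= tol

def refresh_correlation(stamps_ms, period_ms, tol_ms):
    """Return (fraction, matches) for segment times relative to the first segment."""
    relative = [t - stamps_ms[0] for t in stamps_ms[1:]]
    matches = [t for t in relative if near_multiple(t, period_ms, tol_ms)]
    return len(matches) / len(relative), matches

# Made-up segment trigger times in ms.
stamps_ms = [2.10, 18.10, 27.45, 34.11, 66.09, 98.10]

fraction, matches = refresh_correlation(stamps_ms, period_ms=16.0, tol_ms=0.05)
print(f"{len(matches)} of {len(stamps_ms) - 1} segments align with the refresh period")
```

A high fraction of aligned segments supports the link to refresh activity, while the stragglers are a reminder that not every captured glitch need share the same cause.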

Most scopes with adequate bandwidth may be used to debug simple timing problems. Debugging nonrepetitive bus operations requires the single-shot acquisition capability of a DSO, with sufficient sample rate to view signal details.

Both color-graded and persistence displays can be used effectively to view anomalies on digital buses. It would be difficult, however, to characterize glitches and other infrequent or unique events without the powerful triggering capability of a modern DSO. Also necessary are automated parameter measurements, parameter statistics, and time stamping.