Today’s digital systems, from the video game console to the complex switching elements in a communications network, rely on serial bus technology to do their job. Not surprisingly, a host of application-specific serial buses has emerged.
Serial ATA handles communications between chipsets and disk drives. HDMI manages data going from digital audio/visual (A/V) sources to display devices. PCI Express (PCIe), designed to connect peripheral devices in the PC environment, is now showing up in a wide range of applications not served by other specialized interfaces.
In a given electronic system, it is not unusual to find all of these buses coexisting along with several parallel buses. This trend has intensified the need and demand for cross-bus troubleshooting solutions that offer a simple, integrated way to simultaneously view the logical interactions between several different buses.
A variety of solutions exists. One approach pairs a standard-specific protocol analyzer with a logic analyzer. The former takes care of the serial acquisition while the logic analyzer captures parallel bus data that may pertain to the troubleshooting issue at hand. This approach does not provide the capability to perform cross-bus analysis in an integrated platform. However, using a logic analyzer with a bus support package that includes an external interface to convert serial data into the parallel data used by the logic analyzer does offer cross-bus analysis capabilities in an integrated platform.
Increasingly, designers are turning to solutions that integrate both serial and parallel acquisition modules into the logic analyzer mainframe. This allows a mix of PCIe serial and parallel acquisition modules within a single system. With the addition of this serial capability, these instruments can capture, cross-trigger, and display time-correlated parallel and serial data as well as analog waveforms from an oscilloscope on the same logic analyzer screen. This capability is designed to simplify digital troubleshooting.
The underlying architecture of a PCIe serial link is well established. Often embedded as an element within an FPGA, a PCIe transmitter with a serializer-deserializer (SERDES) sends 8-bit/10-bit encoded information to a receiver elsewhere in the system. Transmission impedances, bit rates, and clock characteristics explicitly specified in the PCI Express Base Specification allow interoperability among PCIe components from diverse manufacturers.
Problems with the PCIe link may have either digital or analog origins. Errors on the PCIe link, such as cyclic redundancy check (CRC) or disparity, could be a result of the analog effects of the signal. In these cases, the first step in troubleshooting is to take a snapshot of the analog waveforms at the time of the error.
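To make the notion of a disparity error concrete, here is a minimal sketch of a symbol-level running-disparity check on 10-bit values. It relies only on the general 8b/10b property that every code word carries either equal numbers of ones and zeros or an imbalance of two, and that unbalanced code words must alternate in sign. The function names and the sample bit patterns are illustrative assumptions, not part of any analyzer's software, and the check is deliberately simplified: a real decoder tracks disparity per sub-block and also validates each symbol against the code tables.

```python
def symbol_disparity(symbol: int) -> int:
    """Return (number of ones - number of zeros) for a 10-bit symbol."""
    if not 0 <= symbol < (1 << 10):
        raise ValueError("expected a 10-bit value")
    ones = bin(symbol).count("1")
    return ones - (10 - ones)


def first_disparity_error(symbols, running=-1):
    """Return the index of the first symbol that breaks running disparity.

    `running` is the assumed starting running disparity (-1 or +1).
    Returns None when no violation is found.
    """
    for index, symbol in enumerate(symbols):
        d = symbol_disparity(symbol)
        if d not in (-2, 0, 2):
            return index  # no valid code word is this unbalanced
        if d != 0:
            # An unbalanced symbol must pull running disparity to the other sign.
            if (d > 0) == (running > 0):
                return index
            running = -running
    return None


# Illustrative bit patterns (hypothetical capture data).
good_stream = [0b0011111010, 0b1100000101, 0b1010101010, 0b0011111010]
bad_stream = [0b0011111010, 0b0011111010]  # two ones-heavy symbols in a row

print(first_disparity_error(good_stream))
print(first_disparity_error(bad_stream))
```

A check like this flags only where the digital rule was broken; whether the cause is analog or digital still has to be established from the time-correlated waveforms.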
A logic analyzer equipped with parallel and serial modules and the capability to import analog waveforms provides a comprehensive platform for cross-bus analysis. Figure 1 shows the digital waveform from a PCIe link and an analog waveform from a PCIe channel on the link. The cursor marks the location where errors began occurring on the PCIe link.
In this example, the logic analyzer triggered when errors such as CRC or disparity started occurring on the link. Upon triggering, the logic analyzer cross-triggered a real-time oscilloscope monitoring a single channel of the link.
The analog waveform depicts a single lane of the PCIe link represented in the digital waveform display above it. The two waveforms are time-correlated.
Theoretically, it is possible to hand-decode the actual binary waveform data to confirm this. Looking at the analog waveform in the vicinity of the cursor, it is apparent that the error is not caused by analog-domain problems such as runt pulses or glitches.
Thanks to the accurate time correlation of the two views, you may conclude that the error on the link does not stem from an underlying analog problem. However, you may choose to further validate that jitter is not the problem by measuring it on the channel using jitter analysis software.
If there had been such a problem, the next troubleshooting steps would rely on an oscilloscope cross-triggered by the logic analyzer to track down the root cause. But the findings in Figure 1 imply a digitally based issue deriving from a timing problem or other digital conflicts. The logic analyzer is well suited to this job.
Designers often include built-in debug ports for PCIe silicon. This parallel output delivers real-time data summarizing the internal states of the PCIe device. With debug ports at the transmitter and the receiver, developers can monitor the health of the link and localize many types of problems to either the transmit or the receive side of the link.
A simple state machine that might be found within a PCIe serial receiver would include states such as Idle, Recovery, Transmit, and Overflow. Legal state transitions are Idle to Transmit, Transmit to Overflow or Idle, Overflow to Recovery, and Recovery to Idle.
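The state machine described above can be sketched as a transition table and used to scan a captured debug-port trace for illegal jumps. This is a minimal illustration, not the analyzer's own trigger logic: the table assumes Idle to Transmit is legal (a transfer has to start from somewhere) alongside the Transmit, Overflow, and Recovery transitions listed above, and the timestamps and trace contents are hypothetical.

```python
# Legal next states for each state of the example receiver state machine.
LEGAL_TRANSITIONS = {
    "Idle": {"Transmit"},
    "Transmit": {"Idle", "Overflow"},
    "Overflow": {"Recovery"},
    "Recovery": {"Idle"},
}


def illegal_transitions(samples):
    """Scan (timestamp, state) samples and return the illegal state changes.

    Each violation is reported as (timestamp, previous_state, new_state).
    Repeated samples of the same state are not treated as transitions.
    """
    violations = []
    previous = None
    for timestamp, state in samples:
        if previous is not None and state != previous:
            if state not in LEGAL_TRANSITIONS.get(previous, set()):
                violations.append((timestamp, previous, state))
        previous = state
    return violations


# Hypothetical debug-port capture: timestamps are arbitrary units.
trace = [
    (0, "Idle"),
    (10, "Transmit"),
    (20, "Idle"),
    (30, "Overflow"),   # illegal: Idle may only go to Transmit
    (40, "Recovery"),
    (50, "Idle"),
]

for timestamp, old, new in illegal_transitions(trace):
    print(f"t={timestamp}: illegal transition {old} -> {new}")
```

Keeping the rules in a table means the same scan works unchanged if the debug port is later reprogrammed to expose a different set of states.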
A test setup for the PCIe serial link would include the transmitter and receiver debug ports. Assuming that this is a troubleshooting routine designed to locate the origin of garbled data appearing on the serial link, the debug ports would be connected to a parallel acquisition module; the PCIe link would connect to a serial module on the logic analyzer.
Figure 2 is a screen image from the logic analyzer acquisition. This view adds the parallel data stream captured from the receiver’s debug port. The new logic analyzer waveform trace includes the hexadecimal values shown in the state machine diagram (bottom waveform).
Look closely at both waveforms at the point where the red cursor line crosses them. Here the link enters the Overflow (001) state. Something has gone wrong. The state machine has jumped directly from Idle to Overflow, which is impossible if it is cycling properly through its states.
All three traces in Figure 2 are time-correlated thanks to the tightly integrated serial and parallel acquisition modules operating within the same logic analyzer mainframe. In some cases, the serial bus transition may lag behind the debug port output due to latency; that is, the time required for the serial buffer to flush its contents after the state has changed. In such instances, the timing differential visible in the cross-bus view will reflect this latency accurately.
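Because both traces share one time base, the latency described above can be estimated by pairing each debug-port state change with the first change on the serial trace that follows it. The sketch below assumes the two lists of timestamps have already been exported from the acquisition; the function name, the export step, and the numbers are hypothetical.

```python
from bisect import bisect_left


def serial_latencies(debug_edges, serial_edges):
    """Pair each debug-port edge with the next serial-trace edge.

    Both arguments are timestamps on the same time base. Returns one
    latency per debug edge, or None when no later serial edge exists.
    """
    serial_sorted = sorted(serial_edges)
    latencies = []
    for edge in debug_edges:
        position = bisect_left(serial_sorted, edge)
        if position < len(serial_sorted):
            latencies.append(serial_sorted[position] - edge)
        else:
            latencies.append(None)
    return latencies


# Hypothetical timestamps in nanoseconds.
debug_port_changes = [100, 400]
serial_trace_changes = [130, 425]

print(serial_latencies(debug_port_changes, serial_trace_changes))
```

A latency that stays roughly constant across transitions is consistent with buffer flush time; one that varies widely would be worth a closer look.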
In Figure 2, the yellow portion of the serial trace coincides with the 001 state on the state machine trace. The blue portion of the serial trace’s timing matches up correctly with the E81 Idle state on the debug port. The link is operational and communicating but it is not following its intended routine.
Because the serial data errors coincide with the Overflow state on the debug port and because the serial data is driven by the SERDES, it is reasonable to assume that the problem is timing-related and originates within the SERDES. At this point, there may be several potential troubleshooting strategies influenced by architectural considerations or other debug findings.
Most commonly, serial link features are incorporated into an FPGA. An FPGA is designed to transform itself into functional elements defined by the programmer. This transformation is known as synthesis, since it literally synthesizes the desired functions using its internal gates. Knowing this, the astute designer will troubleshoot the error first by double-checking the FPGA synthesis results to ensure the timing of each state-machine transition is correctly implemented.
If that doesn’t reveal the problem’s source, a second pragmatic step is to route other signals to the debug connector to trace the device’s behavior. For example, after evaluating the Current State data as shown in Figure 2, the FPGA might be reprogrammed to deliver the Next State data to the debug port. This could reveal issues that are not seen in the Current State, and, of course, there are more states that can be investigated beyond that.
Monitoring Three Buses at Once
Debugging a PC motherboard is an example of an environment that requires cross-bus troubleshooting capabilities. It is a complex, sophisticated electronic design. Diverse high-speed serial and parallel buses transport signals among IC components, between onboard subsystems, and to peripherals and storage media. A problem on any one of these buses can manifest itself as an error on an entirely different bus.
In the simplest terms, the logic analyzer must trigger on an error occurring on one bus while viewing the error’s origins in a different subsystem or bus and its consequences on a third. During the development of a motherboard, interactions and dependencies among buses not directly connected can reveal much about the stability of the emerging product.
Consider the following example: A prototype for a motherboard has arrived after fabrication. During the design validation process, it frequently encounters problems in its routine functional exercises. In the worst case, the device freezes and must be rebooted. At other times, the device seems to operate normally, but the display is garbled and unintelligible.
To track down the problem, the processor, the double data rate 3 (DDR3), and PCIe buses on the motherboard would be connected to a logic analyzer for simultaneous serial and parallel acquisition. Interposers would be used as probing attachments that plug into existing board-mounted connectors to extract the desired signals.
A test is created that incorporates a series of READ and WRITE operations, among others. The test proceeds as follows:
• The CPU issues a WRITE command and sends data 13FF to a particular address location (00100000) in the DDR3 SDRAM memory. The instruction passes from the CPU through the processor bus to the chipset and ultimately to the DDR3 SDRAM.
• The graphics card issues a READ instruction to the same address. The command goes over the PCIe bus, through the chipset, and to the DDR3 SDRAM.
• The CPU issues a second WRITE command and sends new data to the same location in the DDR3 SDRAM.
Since no other instruction should have modified the data before the graphics card READ, the result of the query should be 13FF, exactly the same data that was written during the first cycle.
But the result is 13EF. The PCIe graphics interface does not act as expected, and an error occurs. What could be causing this problem?
Concurrent monitoring of all three of the buses involved in the transaction proves to be an easy way to track down the problem. The PCIe, DDR3 SDRAM, and processor bus interposers connect to serial and parallel acquisition modules within a single logic analyzer mainframe.
Looking at the transactions from the PCIe bus in their deserialized form, as delivered by the PCIe acquisition module and shown in Figure 3, it becomes clear that the PCIe graphics card is indeed receiving the incorrect 13EF data word. It is appropriately reporting flawed data. Neither the graphics card nor the PCIe bus is the source of the problem.
The next step is to look at transactions on the DDR3 SDRAM bus. Because they produce a display very similar to Figure 3, they are not shown here. A READ operation confirms that the correct address was written. That brings the processor into question. Did it send the data it was supposed to send? Monitoring the processor bus establishes that the correct data was written to memory.
All three buses appear to be doing their jobs correctly. The data is being sent to the desired memory location as commanded by the CPU. The only remaining possibility is a timing conflict of some kind.
One potential suspect is the READ/WRITE timing. Yet the previous steps have established that the CPU is issuing the WRITE at the expected time.
When timing and synchronization problems are suspected, the logic analyzer’s capability to view correlated traces from all three buses is a time-saver. Looking at the memory bus reveals that the READ is preceded by, rather than followed by, the second WRITE cycle, as shown in Figure 4.
The PCIe card receives data stored one operation later than intended. The circled numerals on this timing acquisition correspond to the following steps:
1. Row Open
2 and 3. First Writes
4 and 5. Second Writes
6. Read (should have occurred between steps 3 and 4)
The time-correlated view of the READ, WRITE, and data values on the respective buses uncovers a classic problem: The chipset, designed to act as a traffic director, is not timing the graphics card’s READ request correctly. The READ fails to access the memory after the first CPU WRITE cycle as intended. The chipset is the source of this problem.
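The ordering fault can be reproduced in a few lines by replaying time-stamped memory-bus events. The sketch below is a simplified model under stated assumptions: events are plain tuples rather than real analyzer exports, the timestamps are invented, and the second WRITE is assumed to carry 13EF, which the article implies but does not state outright.

```python
import heapq

ADDRESS = 0x00100000


def merge_by_time(*captures):
    """Merge per-bus event lists (each sorted by timestamp) into one timeline."""
    return list(heapq.merge(*captures, key=lambda event: event[0]))


def replay_memory(timeline):
    """Replay DDR3 events and return (timestamp, address, data) for each READ."""
    memory = {}
    reads = []
    for timestamp, bus, operation, address, data in timeline:
        if bus != "DDR3":
            continue  # only the memory bus changes stored contents
        if operation == "WRITE":
            memory[address] = data
        elif operation == "READ":
            reads.append((timestamp, address, memory.get(address)))
    return reads


# Hypothetical captures: (timestamp, bus, operation, address, data).
processor_bus = [(10, "CPU", "WRITE", ADDRESS, 0x13FF),
                 (30, "CPU", "WRITE", ADDRESS, 0x13EF)]
pcie_bus = [(20, "PCIe", "READ", ADDRESS, None)]

intended_ddr3 = [(12, "DDR3", "WRITE", ADDRESS, 0x13FF),
                 (22, "DDR3", "READ", ADDRESS, None),
                 (32, "DDR3", "WRITE", ADDRESS, 0x13EF)]
observed_ddr3 = [(12, "DDR3", "WRITE", ADDRESS, 0x13FF),
                 (32, "DDR3", "WRITE", ADDRESS, 0x13EF),
                 (35, "DDR3", "READ", ADDRESS, None)]  # READ arrives late

for label, ddr3 in (("intended", intended_ddr3), ("observed", observed_ddr3)):
    timeline = merge_by_time(processor_bus, pcie_bus, ddr3)
    for timestamp, address, data in replay_memory(timeline):
        print(f"{label}: READ at t={timestamp} returns {data:04X}")
```

In the merged timeline, the PCIe READ request appears between the two CPU WRITEs while the memory-bus READ lands after the second WRITE, which is the same discrepancy the time-correlated display exposes.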
Frequently, tracing a system problem involves much more than just following a glitch back to its source in some logic element. An error on one bus may have its origins—and its impacts—on multiple buses in the system.
With the advent of integrated instruments that bring time-correlated serial, parallel, and even analog events into view on a logic analyzer screen, designers have a new tool for troubleshooting. Cross-bus analysis makes it possible to see simultaneous interactions throughout the system, speeding efforts not just to track down errors but also to identify their root causes.
About the Author
Sarah Boen is a product marketing professional specializing in serial applications in Tektronix’s logic analyzer product line. Her nine-year tenure at Tektronix has included serving as a logic analyzer product marketing manager and product planner, a program manager, and a software design engineer. She received an M.B.A. and a B.S.C.S. from the University of Portland. Tektronix, 14200 S.W. Karl Braun Dr., Beaverton, OR 97077, 800-835-9433, e-mail: [email protected]