Don't Be Afraid Of Debugging Symmetric Multiprocessing Systems

The day when you'll be responsible for designing a system involving symmetric multiprocessing (SMP) may be closer than you think. Multiprocessor systems are clearly becoming more commonplace. Knowing up front which hooks should be designed into your system, and what it will take to get your design through the debug and verification phase, will go a long way toward making your efforts in this new and exciting area a success.

Unquestionably, SMP systems can be intimidating at first glance, with endless complexities and possibilities for failure. Fortunately, traditional debugging tools are still quite useful for multiprocessing systems. If you know how to use them properly, you can track and discover the root causes of most difficult problems.

Understanding your debugging needs before you begin an SMP design can greatly improve your effectiveness. If you know which tools you'll need for debugging, and which hooks must be designed into your hardware to use them effectively, you will be able to predict schedules and budgets more accurately, and possibly even cut down on debugging time.

For example, your logic analysis system will need all of the features used in a single-processor system, as well as processor-specific probes designed for multiprocessing systems. These probes have new features designed to track transactions and to find problems with interrupt handling, I/O and memory access, data corruption, and cache coherency. Deep acquisition memory and time correlation between multiple analysis systems are also required for a multiprocessor debugging setup. Advanced methods of displaying and manipulating the information are useful for handling the large amount of data acquired in the process.

Processor-specific run-control tools also have the ability to operate symmetric multiprocessing systems. The SMP run-control tool needs to be able to control how the processors run, stop, and step in a way that will give the most information with the least amount of interference and intrusion. The ability to set breakpoints on each processor using registers is required, and the ability to read or write memory through a specific processor in the system can help solve coherency problems.

In answering some of the tough questions that come up in debugging SMP systems, this article concentrates on the shared-memory multiprocessor design most recently used in the Pentium Pro architecture. In this design, all processors share a common bus with other agents, using arbitration to decide who owns the bus at any time. Physical memory and access to I/O are identical for each processor, although each processor has its own first- and second-level cache. If the processors didn't have their own caches, the system bus would quickly become overloaded with memory requests. Unfortunately, the caches also mean that some information never reaches the bus, and the only way to see it there is to turn caching off. However, turning off caching can hide some of the problems you are trying to debug (fig. 1).

Setting Up For Debug

How do you set up a logic analyzer and run-control system for debugging a multiprocessing system? Only a single logic analyzer and probe are needed for the system bus because it is shared by all the processors in shared-memory SMP systems. This setup is the same as the one used for debugging single-processor systems. But, additional logic analyzers may be needed for probing additional buses for I/O or memory.

The logic analyzer probe attaches at a single processor socket. Because the bus on a symmetric multiprocessing system is shared, most information that you might want to collect from any processor will be available at the logic analyzer probe, no matter where it's physically attached. This simplifies the connection to the system bus and minimizes the additional electrical load. SMP processors will have some signals that aren't shared, so you may need to select a specific processor if you are trying to sample these signals.

Only one run-control system is required, because the processors share a single debug port. A debug port needs to be designed into the target system so the run-control setup can communicate with it. If you can't design in a debug port, you may be able to purchase interposer cards that have debug ports on them. In multiprocessing systems that use JTAG for their debug ports, all the processors and other devices on the board are connected into a single chain that can be accessed through the debug port. It is important that your connections are correct. Otherwise, damage may result when the processor probe is attached to the system.

The run-control software needs to know which devices are on the debug port, and in what order, to interpret the data correctly. This information is known as a "scan chain." In some run-control systems, the scan chain is determined automatically by sending identification queries to all locations on the chain that can contain an agent. If you allow the run-control unit to "autodetect" a scan chain, make sure you know that the operation is safe. Some devices may not be built to the correct specifications of the scan chain, and could be damaged by the commands used while autodetecting. Make sure that you know all of the devices on your board, so that you can confirm your run-control system can detect them without damaging them. On most modern Intel-designed boards, the only devices on the scan chain are the processors, which are safe to autodetect. Again, if you did not design the board and you aren't certain, contact the board's designer or manufacturer and ask what your scan chain is.
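The scan-chain check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `ChainDevice` record, the function names, and the instruction-register lengths are hypothetical placeholders, not values taken from any real board or run-control product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainDevice:
    name: str            # label for the device, in scan-chain order
    ir_length: int       # instruction-register length in bits (placeholder values)
    safe_to_probe: bool  # True only if the device is known to tolerate ID queries


def total_ir_length(chain):
    """Sum of instruction-register lengths, needed to shift through the whole chain."""
    return sum(device.ir_length for device in chain)


def unsafe_devices(chain):
    """Names of devices that should not be exposed to autodetect queries."""
    return [device.name for device in chain if not device.safe_to_probe]


def autodetect_allowed(chain):
    """Autodetect is acceptable only when every known device is marked safe."""
    return not unsafe_devices(chain)


# Hypothetical board: two processors plus one device of unknown behavior.
chain = [
    ChainDevice("cpu0", 6, True),
    ChainDevice("cpu1", 6, True),
    ChainDevice("bridge", 8, False),
]
print(total_ir_length(chain), unsafe_devices(chain), autodetect_allowed(chain))
```

Writing the chain down explicitly, in order, is the point of the exercise: if any entry cannot be marked safe with confidence, enter the scan chain by hand instead of autodetecting it.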

Some multiprocessing systems may have more than one group of processors, each on a different bus. These groups can have as many processors as their bus can support, plus one agent on each bus which handles the communication with other buses. When setting up a debugging system here, you will need multiple analyzers and run-control units, one hooked into each group. On Hewlett-Packard systems, for example, you can use BNC cables to link the two run-control systems together (connecting the Trigger Out of one to the Break In of the other) and break the entire system. You can either use BNC cables to attach two separate analysis systems together, or use two separate cards to create a cross-trigger to correlate traces on both buses (fig. 2).

Running And Stepping

What does it mean to "Run Until" and "Step" in an SMP system? In a single-processor system, Run Until and Step are used to control the execution of the system so you can see what changes small parts of code make to the system. Step advances a program one instruction at a time, and is usually supported in the processor itself. Run Until advances the program to a predetermined point, usually set as a breakpoint in the processor. Your run-control system can be used to Step, or to set breakpoints as locations to Run Until. In some processors that don't have the Step function built in, a simulated Step can be made by setting a breakpoint on the instruction after the current one.

In a symmetric multiprocessing system, both functions are more complex. Ignoring for the moment the problem of finding which processor your program is running on, what do you do with the other processors while the processor you have selected is stepping? There are two usable possibilities: Run them or Stop them.

Running the processors while stepping the selected processor is the closest you can come to the normal operation of the system. In many cases, this is the only option available. For example, other processors may be handling an interrupt request; writing to memory that is vital for your primary processor's operations; or even performing a critical, uninterruptible operation. On the Pentium Pro system, the boot processor will mark other processors dead if they haven't notified it that they have started by a specific time. Stopping those processors while stepping the boot processor will cause them to be marked dead. In any of these cases (and many others), out-of-sync processors could either crash the system, or hide the problem you are trying to find.

There are some cases where it is better to stop the other processors. By stopping them, you can ensure that only the target processor is performing operations. There are a few common situations when this is useful. One is when an operating system runs a single process on a single processor. You can keep track of a particular program's activities without another processor modifying its state. Additionally, you can use this mode when you need to examine the activity of a single processor for a few steps. One caveat, though: your system may be unstable after operating in this mode.

No matter what you do while you are stepping a processor in an SMP system, it is possible that your Step will cause the processors to go out-of-sync. Figure 3 shows a four-processor system that was stopped when T=1. Processor one was then stepped while the others were run. Processors two and four were in the middle of executing an instruction when processor one completed its Step, and since they can't be stopped in the middle of an instruction, they continued until they could stop. When the system was run again, a snapshot at T=4 shows that the processors never got back in synchronization. In some operating systems, the timing of the processors is crucial. A processor that is ahead or behind where it should be could miss a vital piece of information, or send needed data too late. Luckily, most operating systems aren't that picky about the synchronization of the processors.
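The drift described here can be illustrated with a small model. The sketch below is a simplified assumption, not a description of any real run-control tool: each processor is reduced to a list of instruction durations in clocks, and a stop request takes effect only at the next instruction boundary. The function names and the durations are hypothetical.

```python
from itertools import accumulate


def stop_time(instruction_clocks, stop_request):
    """Return the first instruction boundary at or after the stop request."""
    # Boundaries fall at clock 0 and after each completed instruction.
    for boundary in accumulate(instruction_clocks, initial=0):
        if boundary >= stop_request:
            return boundary
    raise ValueError("stop request falls after the last modeled instruction")


def stop_skew(processors, stop_request):
    """Clocks by which each processor overruns the earliest one to stop."""
    times = {cpu: stop_time(clocks, stop_request)
             for cpu, clocks in processors.items()}
    earliest = min(times.values())
    return {cpu: t - earliest for cpu, t in times.items()}


# Hypothetical four-processor system; p2 and p4 are mid-instruction at clock 3.
processors = {
    "p1": [1, 1, 1, 1, 1, 1],
    "p2": [2, 3, 2],
    "p3": [1, 1, 1, 1, 1, 1],
    "p4": [4, 4],
}
print(stop_skew(processors, stop_request=3))
```

In this model the processors that were between instructions stop immediately, while the others run on to their next boundary, which is the source of the skew.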

The Run Until problem is more complex. The goal of Run Until is to stop the execution of a process when it reaches a certain point. Unfortunately, it is difficult to tell which processor will execute the process. You can set either a hardware or software breakpoint on each of the processors where you want to stop, and set your run-control tool to break all of the processors when any of them break. At this point, you can run all of the processors. When the system stops, one of the processors will be stopped at the specified location, and the others will be stopped where they would be during normal operation.
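As a rough illustration of this strategy, the sketch below models Run Until on a toy system. It assumes an idealized machine in which every processor executes one address per tick and all processors halt on the tick in which any of them reaches the breakpoint. The `run_until` function and the address traces are hypothetical and do not correspond to any run-control tool's interface.

```python
def run_until(traces, breakpoint_addr):
    """Run all processors together until any one reaches breakpoint_addr.

    traces maps a processor name to the addresses it would execute, one per
    tick. Returns (processor that hit the breakpoint or None, last address
    executed by each processor).
    """
    last_pc = {cpu: None for cpu in traces}
    ticks = max((len(trace) for trace in traces.values()), default=0)
    for tick in range(ticks):
        hit = None
        for cpu, trace in traces.items():
            if tick < len(trace):
                last_pc[cpu] = trace[tick]
                # The same breakpoint is armed on every processor.
                if trace[tick] == breakpoint_addr and hit is None:
                    hit = cpu
        if hit is not None:
            return hit, last_pc  # break all processors when any one breaks
    return None, last_pc


traces = {
    "cpu0": [0x100, 0x104, 0x108],
    "cpu1": [0x200, 0x300, 0x304],
    "cpu2": [0x400, 0x404, 0x408],
}
hit, where = run_until(traces, 0x300)
print(hit, {cpu: hex(pc) for cpu, pc in where.items()})
```

The caller never has to predict which processor will run the code of interest: the returned name identifies it after the fact, and the other entries show where the remaining processors were stopped.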

The single-processor concepts of Step and Run Until do not translate perfectly into SMP, but with the proper tools you should be able to adapt these ideas to work in your symmetric multiprocessing system.

Watching The Processor

How do I observe activity for a given processor? Unlike single-processor system buses of the past, most SMP buses are organized using transactions. A processor interacts differently with its surrounding components when transactions are used instead of cycles. For example, in a simple processor's read cycle, an address is transmitted, and the memory system returns the requested data. This cycle can complete in as few as two system clocks. SMP transactions can be much more complex. The same transaction on an SMP system starts with the processor grabbing the bus through arbitration, initiating the read request with an address, checking for cache coherency through snooping, receiving the data, and finally waiting for an indication that the transaction has completed successfully. Error detection and correction may require additional time.

The key difference between single-processor and SMP buses is that transaction-oriented buses tend to have features that allow outstanding transactions to be heavily overlapped. A four-way SMP system bus might allow eight or more transactions to be outstanding at any given time. In contrast, single-processor cycle-oriented buses only allow one cycle to execute at a time. These systems often allow the last cycle to complete on the first clock of the new cycle. These bus design optimizations are potentially complex, but pale in comparison with their SMP transaction-oriented cousins.

Why all the complexity? The reason is simple: bus bandwidth. When just one processor is on the bus, its only competition is with a DMA agent or, maybe, a smart I/O agent. In SMP systems, a single processor has to compete with other processors and agents, which is what makes transaction-oriented buses necessary.

Now, with all that said, you should be able to appreciate the difficulty in tracking a specific processor, let alone a specific transaction, on the system bus. Fortunately, there are ways to handle this. Agents on the SMP system bus, such as the processors, the memory controller, and the I/O controller, have to be able to unravel the bus. Logic analyzer probes and special processor-specific software can unravel the SMP bus in the same way.

Where and how the unraveling occurs affects what kind of measurements you can make. There are two options for unraveling the bus: do it in hardware or do it in software. The hardware approach's main advantage is that triggering on a specific agent, transaction, or sequence of such events is simple. Unfortunately, this approach is expensive and time-consuming to develop. Triggering with software is more difficult and limited, but the solutions are often more flexible and available sooner.

Let's explore the software-tool approach because it is cheaper, and appears long before hardware tools are available. It should be noted that even mainly software-based tools require processor-specific hardware. The probes themselves can be very tricky to design due to system-bus loading constraints. Extra signals and hardware tracking often are needed to unravel the bus with postprocessing software. Luckily, these extra signals can make other useful measurements possible.

The software approach uses the captured trace of bus activity and generated signals to unravel the transactions and display them as aligned transactions. Figure 4 shows a waveform display of the raw bus activity (a) and a listing display of the unraveled transaction (b).

The waveform view shows the bus exactly as it occurred in time relative to the system clock. This is useful for understanding and debugging low-level bus issues like transactions that terminate early or not at all.

The listing view shows complete transactions in an "unraveled" format. Unraveling means that the listing's postprocessing software assembles the raw data into complete transactions. This is useful for observing and debugging high-level issues like functional and addressing errors, and data-corruption problems.
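To make the idea of unraveling concrete, here is a minimal sketch of the postprocessing step. It assumes a heavily simplified bus: every transaction passes through the same four phases exactly once, and each phase completes in the order the requests were issued. The phase names and the `unravel` function are illustrative assumptions, and real probes rely on extra generated signals to handle cases this model ignores.

```python
from collections import deque


def unravel(states):
    """Group interleaved bus states into per-transaction records.

    states is an iterable of (clock, phase, value) tuples in capture order,
    with phase one of "request", "snoop", "data", or "response".
    """
    transactions = []
    # One FIFO per later phase: the oldest transaction still waiting for it.
    waiting = {"snoop": deque(), "data": deque(), "response": deque()}
    for clock, phase, value in states:
        if phase == "request":
            txn = {"request": (clock, value)}
            transactions.append(txn)
            for queue in waiting.values():
                queue.append(txn)
        else:
            # Assumes in-order phases; raises IndexError on an orphan phase.
            txn = waiting[phase].popleft()
            txn[phase] = (clock, value)
    return transactions


# Two overlapped reads: the second request is issued before the first completes.
raw = [
    (1, "request", 0x1000), (3, "request", 0x2000),
    (5, "snoop", "clean"), (6, "data", 0xAA),
    (7, "snoop", "clean"), (8, "response", "ok"),
    (9, "data", 0xBB), (10, "response", "ok"),
]
for txn in unravel(raw):
    print(txn)
```

Each printed record corresponds to one line of the listing view, while the `raw` list corresponds to what the waveform view shows in time order.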

Triggering on simple events, like a memory write to a certain location or an I/O read from a certain device, is simple to set up with the software-based approach because transaction address and type information are uniformly given in an early phase of most transactions. Triggering on specific data values for certain transaction types can be much more difficult, if it is possible at all. For these types of measurements, the approach described later under "How can I look at auxiliary buses?" is much more straightforward.

Load Balancing

How can I track load balancing for the different processors and quantify system performance? A useful set of measurements involves statistical data about the target SMP system. The information about how different processors are loaded, how much bus bandwidth is idle, or how a change to the system affects the system performance can be obtained through a logic analyzer and probe, in conjunction with system-performance-analysis software.

Statistical data can be gathered by capturing states that provide processor-identification and cycle-type information (fig. 5). System performance analysis, specifically processor utilization distribution, can be used to observe distributions of bus events (fig. 5a). It also can be used to quantify system-performance variation based on hardware changes. For example, it could show what happens when a replacement I/O device is accessed.

Figure 5b provides a similar view of the different transactions occurring on the system bus. This information can be used to identify performance and functional problems. For example, if 95% of the transactions are data writes, a software problem might exist, or at least the software will need to be tweaked for performance. A hardware problem may be present if 95% of the cycles are retry transactions.

The measurements shown here were obtained by using the logic analyzer and analysis probe to capture only the phase of each transaction where the cycle type, or processor ID was present.

In the first case, the logic analysis system generated a signal, #bqual, to detect the Request B phase of each transaction, where the processor ID information is valid. The trigger system was set up to capture states only when #bqual was asserted, so only processor-identification information was gathered. (For discussions on Request Phase A, B, and other Pentium-Pro-specific data, refer to Pentium Pro Processor System Architecture, by Tom Shanley, Addison-Wesley Developers Press, ISBN: 0-201-47953-2.) Applying symbolic information to the appropriate combinations of these signals provided a higher-level view of the information. Using a system-performance-analysis (SPA) tool with the captured data gives the result shown.

The second display in Figure 5 was obtained in a similar way. The trigger specification was altered to capture only the Request A phase of each transaction, instead of the Request B phase. The Request A phase provides enough cycle type information to obtain the second display. More statistical accuracy can be obtained by running these measurements repetitively and accumulating data with each run. Even greater accuracy can be reached by utilizing deep analyzer traces.
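The accumulation step can be sketched as follows. This is an illustration only: it assumes each acquisition has already been reduced to a list of symbolic labels (processor IDs here, but cycle types work the same way), and the function names and sample data are hypothetical rather than part of any SPA tool.

```python
from collections import Counter


def accumulate_runs(runs):
    """Add up label counts across repeated acquisitions."""
    totals = Counter()
    for run in runs:
        totals.update(run)
    return totals


def distribution(counts):
    """Convert accumulated counts into percentages of all captured states."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Two short hypothetical acquisitions of processor IDs.
runs = [
    ["P0", "P1", "P0", "P2"],
    ["P0", "P3", "P1", "P0"],
]
for label, percent in sorted(distribution(accumulate_runs(runs)).items()):
    print(f"{label}: {percent:.1f}%")
```

Because the counts are simply summed, each additional run or deeper trace enlarges the sample behind the same percentages.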

Tracking Data Corruption

How can I look at auxiliary (e.g., I/O, PCI, EISA, UMB) buses to track data-corruption problems? Monitoring auxiliary buses in an SMP system is critical to effectively debugging such problems as when the data is correct at one point in the system, but corrupted as it moves to another bus. Using a logic analyzer to observe data on one bus, and then on a subsequent bus, is a great way to track down these data corruption problems.

The PCI and system buses make good examples for this point. Suppose the data from an I/O write in your system ends up at the target I/O device incorrectly. The problem can be tracked by observing the transaction on the system bus and the PCI bus. Each bus is probed and connected to its own analyzer. Then, the measurement is time-correlated so that the resulting displays show the activity on both buses as intermingled.
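Time correlation amounts to merging two timestamped traces into one listing. The sketch below assumes each analyzer exports its trace as a time-ordered list of (timestamp in nanoseconds, description) pairs, and that any fixed offset between the two time bases is known. The `correlate` function, the `skew_ns` parameter, and the sample entries are hypothetical.

```python
import heapq
from operator import itemgetter


def correlate(system_trace, pci_trace, skew_ns=0):
    """Merge two time-ordered traces into one listing tagged by bus.

    skew_ns is added to the PCI timestamps to align the two time bases.
    """
    system_tagged = ((t, "system", text) for t, text in system_trace)
    pci_tagged = ((t + skew_ns, "pci", text) for t, text in pci_trace)
    # Merge on the timestamp only; on a tie the system-bus entry comes first.
    return list(heapq.merge(system_tagged, pci_tagged, key=itemgetter(0)))


system_trace = [(100, "I/O write data=0x55"), (400, "memory read")]
pci_trace = [(250, "I/O write data=0x54")]
for timestamp, bus, text in correlate(system_trace, pci_trace):
    print(f"{timestamp:>6} ns  {bus:<6}  {text}")
```

In the merged listing, the write as seen on the system bus appears directly above the same write as seen on the PCI bus, so a mismatch in the data stands out.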

A trigger is set up on the system bus analyzer for the system bus I/O write. A similar trigger is set up on the analyzer connected to the PCI bus. Observing the acquisition should verify the problem and identify the trouble bit(s). An oscilloscope can be connected to the problem signals and triggered by the Arm Out signal of the logic analyzer so that the analog characteristics of these data lines can be investigated.
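Once both values are captured, identifying the trouble bits is a bitwise comparison. A minimal sketch, assuming the data written on the system bus and the data seen on the PCI bus are available as integers (the function name and the default 32-bit width are assumptions):

```python
def trouble_bits(sent, received, width=32):
    """Return the bit positions at which the two captured values differ."""
    diff = (sent ^ received) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]


# Hypothetical capture: 0x55 left the system bus, 0x54 arrived on PCI.
print(trouble_bits(0x55, 0x54))
```

The positions returned are the data lines worth probing with the oscilloscope.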