Build Debug And Trace Systems For Multicore SoCs

Savvy cost/functionality tradeoff decisions can lead to effective and efficient debug and trace for complex, multicore SoCs.

William Orme

Aug. 14, 2008


Embedded designers put microprocessors in everyday products like cars, phones, cameras, TVs, music players, and printers, as well as the communications infrastructure, which the general public doesn’t get to see. They know how important it is for their products to work—and work preferably better than their competitors’ products.

But the systems-on-a-chip (SoC) behind them continue to grow in complexity, making that simple goal harder to achieve, particularly with the rise of multicore systems. Getting these systems to work well means giving engineers throughout the design and test cycle visibility into what their systems are doing. At the modeling stage, visibility is provided in the modeling tool. Once you move to a physical implementation, though, the designer must include specific mechanisms to provide visibility.

The choice of mechanisms should respond directly to the needs of the engineers responsible for hardware bring-up, low-level system software, real-time operating-system (RTOS) and OS porting, application development, system integration, performance optimization, production test, in-field maintenance, returns failure analysis, and other functions. Although their respective tools may handle and present the data in different ways, they all rely on getting debug and trace data from the target SoC.

TRADEOFF DECISIONS The easy answer is to fit everything and give full visibility to everything happening on-chip in real time. Most processors offer good debug and trace capabilities (Embedded Trace Macrocells for ARM processors, PDTrace for MIPS, Nexus trace for PowerPC, and several DSPs), as do the interconnect fabrics. Also, custom debug capabilities can be added to custom cores.

These capabilities can be integrated at the system level together with systems such as the ARM CoreSight Architecture (Fig. 1), the Infineon TriCore Multi-Core Debug Solution (MCDS), or the MIPS/FS2 Multi-core Embedded Debug (MED). But the costs of such debug systems in IP design time or licensing fees, silicon area, pins, and tools may need strong justification to fit into tight budgets.

RUN-CONTROL DEBUG Almost all SoC designs will need to enable basic run-control debug, where the core can be halted at any instruction or data access and the system state can be examined and changed if required. Traditionally, this uses the JTAG port. However, the number of pins can now be reduced to two (one bidirectional data pin plus an externally provided clock, overlaid on TMS and TCK) using technology such as ARM Serial Wire Debug or Texas Instruments’ Spy-Bi-Wire in the MSP430.

Where boundary-scan test isn’t employed, or separate debug and test JTAG ports are implemented, a two-pin debug interface can save two to five pins (TDO, TDI, nTRST, nSRST, and RTCK). Where boundary-scan test is employed, the redundant pins can be reassigned when they aren’t in test mode. If they are reassigned as trace-port pins, the trace port won’t even cast a “test shadow.”

Multicore SoCs that place cores in multiple clock and power domains (mainly for energy management) should replace a traditional JTAG daisy chain with a system that can maintain debug communications between the debug tool and the target, despite any individual core being powered down or in sleep mode.

The CoreSight Debug Access Port (DAP) is an example of a bridge between the external debug clock and multiple domains for cores in the SoC (Fig. 2). It also can maintain debug communications with any core at the highest frequency supported, rather than the slowest frequency of all cores on a JTAG daisy chain.
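The difference between a daisy chain and a bridging access port can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Core` class, the function names, and the clock figures are hypothetical and are not taken from any CoreSight specification.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    max_debug_mhz: float  # hypothetical: fastest debug clock this core tolerates
    powered: bool = True


def daisy_chain_clock_mhz(cores):
    """Serial chain: every core sits in the scan path, so the chain runs at
    the slowest core's rate and breaks if any core is powered down."""
    if not all(c.powered for c in cores):
        return None  # scan path is broken
    return min(c.max_debug_mhz for c in cores)


def access_port_clock_mhz(cores, target):
    """Bridged access: each core is reached independently of the others."""
    core = next(c for c in cores if c.name == target)
    return core.max_debug_mhz if core.powered else None


cores = [Core("cpu0", 50.0), Core("dsp", 10.0), Core("cpu1", 50.0, powered=False)]
print(daisy_chain_clock_mhz(cores))          # None: cpu1 is powered down
print(access_port_clock_mhz(cores, "cpu0"))  # 50.0
```

In this model, one sleeping core silences the whole chain, while the bridged arrangement still reaches every powered core at that core's own best rate.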

For designs requiring ultra-fast code download or access to memory-mapped peripheral registers while the core is running, the ASIC designer should connect a direct memory access (DMA) from the DAP to the system interconnect so the debug tool can become a bus master on the system bus.

For remote debug of in-field products or large batch testing in which a debug tool seat per device under test is unrealistic, the designer can also connect the DAP into a processor’s peripheral map. This permits the target resident software to set up its own debug and trace configurations.

A common criterion for embedded-system debugging is the ability to debug from reset and through partial power cycles, requiring careful design of power domains and reset signals. Critically, reset of the debug control register should be separated from that of the functional (non-debug) system. Power-down can be handled in different ways when debugging, such as ignoring power-down signals or putting the debug logic in different power domains that aren’t powered down.

The ability to stop and start all cores synchronously is extremely valuable for multicore systems that have inter-process communication or shared memory. To ensure that this synchronization is within a few cycles, a cross-trigger matrix should be fitted (Fig. 3).

The configuration registers of the cross-trigger interface enable the developer to select the required cross-triggering behavior, e.g., which cores are halted on a breakpoint hit on another core. If, on the other hand, the cores have widely separated and non-interfering tasks, it may be sufficient to synchronize core stops and starts with the debug tools.

Inevitably, synchronizing through the tools will lead to hundreds of cycles of skid between cores stopping. The synchronous starting of cores can be achieved with either a cross-triggering mechanism or via the test access port (TAP) controller of each core.
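The configurable halt routing described above can be pictured as a small table that maps a breakpoint on one core to the set of cores that should stop with it. The sketch below is a simplified software model, not the register layout of any real cross-trigger interface; the class and method names are hypothetical.

```python
class CrossTriggerModel:
    """Toy model: a per-source routing table decides which cores halt together."""

    def __init__(self, cores):
        self.cores = list(cores)
        # halt_on[src] holds the cores that must halt when src hits a breakpoint
        self.halt_on = {c: set() for c in self.cores}
        self.halted = set()

    def route(self, source, targets):
        """Configure which cores halt when `source` hits a breakpoint."""
        self.halt_on[source] = set(targets)

    def breakpoint_hit(self, source):
        """Halt the source core and every core routed from it."""
        self.halted.add(source)
        self.halted |= self.halt_on[source]
        return sorted(self.halted)


ctm = CrossTriggerModel(["cpu0", "cpu1", "dsp"])
ctm.route("cpu0", ["cpu1"])        # cpu0 and cpu1 share memory; dsp is independent
print(ctm.breakpoint_hit("cpu0"))  # ['cpu0', 'cpu1']
```

Here the DSP keeps running when cpu0 stops, which matches the case of widely separated, non-interfering tasks.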

Fitting multiple debug ports, one for each core, has obvious silicon and pin overheads. It also leaves the synchronization and power-down issues to be managed by the tools. This approach only has merit for completely different cores with completely different tool chains, where the re-engineering costs of sharing a single debug port with a single JTAG emulator box are substantially higher than the costs of duplicating debug ports and debug tool seats.

It may suffice if two separate systems co-reside on the same piece of silicon and debugging both systems simultaneously is rarely needed. An example might be an MCU plus a dedicated DSP or data engine, where the DSP or data engine isn’t reprogrammed by applications but runs a set of fixed functions developed independently.


HOW TO SIZE YOUR TRACE SUBSYSTEM After run-control debug, trace (the passive recording of system execution as it happens) is the next most important debug feature. It’s obligatory in hard real-time, electromechanical systems where halting the control system isn’t an option, such as hard-disk drives and engine/motor-control systems. It’s also highly beneficial for debugging any system that reacts with another system (e.g., the real world) in a data-dependent or asynchronous manner, covering just about any complex embedded system.

Trace allows for the capture of errant corner cases that couldn’t be covered by system-validation pre-tapeout. Three other very important use cases for trace are performance optimization of an application, efficiency of software and system development, and accountability (hard evidence as to the cause, and thus responsibility for a product failure).

Choosing the level of trace has the largest impact on the cost of implementing the on-chip debug system. The good news is that the cost per CPU for multicore SoCs can actually be reduced. So, designers must ask who is going to use the trace data, as well as what tool they will use with it.

SOFTWARE TRACE Software executing on the cores generates the simplest and cheapest form of trace. Traditionally, this data was written to an area of system memory, while a separate task emptied the buffer and sent the data back to the debug tools via any available communication channel, such as a serial port. More commonly, it was sent over JTAG, since the debugger is typically connected to the target via the JTAG port, which doesn’t interfere with functional I/O ports.

Recent optimizations on this approach write to a peripheral like the CoreSight Instrumentation Trace Macrocell (ITM) (Fig. 4) or the MIPI System Trace Module, which streams the trace data direct to a trace buffer, with the benefit of minimizing and making deterministic the number of cycles taken to instrument the code. The MIPI System Trace Module also provides a higher-bandwidth channel to allow for more instrumentation points and enables very deep off-chip buffers.

The biggest drawbacks of this approach are the intrusiveness on the application execution time and the limited trace bandwidth. However, it’s a good approach when all target resident software and the debug tools to interpret the trace data are written with this mechanism in mind.
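The traditional scheme, in which instrumented code writes events to memory and a separate task drains them, can be sketched as a bounded buffer. This is a minimal illustration on a workstation, not target firmware; the class name, the event fields, and the overwrite-oldest policy are all assumptions made for the example.

```python
from collections import deque


class SoftwareTraceBuffer:
    """Bounded event buffer: instrumentation logs into it, a drain task empties it."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)  # oldest event is overwritten when full
        self.dropped = 0

    def log(self, timestamp, thread_id, message):
        """Called from instrumented code; counts events lost to overflow."""
        if len(self.events) == self.events.maxlen:
            self.dropped += 1
        self.events.append((timestamp, thread_id, message))

    def drain(self, max_events):
        """Called by the drain task; sends at most max_events per invocation."""
        out = []
        while self.events and len(out) < max_events:
            out.append(self.events.popleft())
        return out


buf = SoftwareTraceBuffer(capacity=2)
buf.log(100, 1, "enter isr")
buf.log(105, 2, "queue full")
buf.log(110, 1, "exit isr")
print(buf.dropped)    # 1
print(buf.drain(10))  # [(105, 2, 'queue full'), (110, 1, 'exit isr')]
```

The dropped-event counter makes the bandwidth limit visible: when the drain channel cannot keep up with the instrumentation rate, history is lost.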

For multiprocessing systems, instrumentation trace has the advantage of understanding its own context, e.g., which thread am I? It also can add a higher-level semantic that’s extremely useful to a software application developer.

In addition, the processor has access to performance monitor registers, which provide valuable system-performance profiling data such as cycles executed, branches taken/mispredicted, and cache hits/misses. Given the relatively low implementation costs and high potential benefits, instrumentation trace is an obvious candidate to fit in any multicore SoC.

HARDWARE TRACE When more detail is required or code instrumentation isn’t adopted, hardware trace like ARM Embedded Trace Macrocells (ETM) is popular. ETM’s uptake among licensees of ARM11 and Cortex families of processors is greater than 90%.

Hardware trace (logic that watches the address, data, and control signals within the SoC, compresses the information, and emits it to a trace buffer) can be subdivided into three main categories: program/instruction trace, data trace, and bus (or interconnect fabric) trace. Each of these categories has different usage models and different costs.

Program trace is highly valuable for both hardware and software debugging, and it is the main source of data for many profiling tools. The implementation costs of program-only trace macrocells can be quite small. The ARM Cortex-M3 processor has a program trace ETM of approximately 7 kgates, and the data compresses well, requiring only about 1 bit/instruction/CPU. So the bandwidth requirements for a trace port aren’t too high, even for a four-CPU multicore SoC with a 500-MHz to 1-GHz CPU clock.

Where on-chip trace buffers are implemented, a 4k RAM can hold more than 30,000 lines of assembler code execution. That’s a lot of code for an embedded developer to review. Furthermore, profiling tools like the ARM RealView Profiler, Green Hills Software’s TimeMachine, Lauterbach’s TRACE32, and Real-time Trace Reconstruction (RTR) from iSystems can continuously process program trace data in real time for cores up to 400 MHz for runs of several hours, or even days, if required. Adding cycle-accurate instruction trace, useful for close correlation of the interaction of multiple processors, increases the bandwidth to about 4 bits/instruction, which substantially increases the required frequency and width of a trace port.
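These figures can be checked with back-of-the-envelope arithmetic. The sketch below assumes that the "4k RAM" is 4 kbytes and that each core retires one instruction per cycle; both are assumptions made for the example, and the function names are hypothetical.

```python
def instructions_held(buffer_bytes, bits_per_instruction):
    """Number of traced instructions that fit in an on-chip trace buffer."""
    return int(buffer_bytes * 8 / bits_per_instruction)


def port_bandwidth_mbps(num_cores, clock_mhz, bits_per_instruction,
                        instructions_per_cycle=1.0):
    """Sustained trace bandwidth in Mbits/s (assumes a steady instruction rate)."""
    return num_cores * clock_mhz * instructions_per_cycle * bits_per_instruction


print(instructions_held(4 * 1024, 1))   # 32768 instructions at 1 bit/instruction
print(port_bandwidth_mbps(4, 500, 1))   # 2000.0 Mbits/s, four cores at 500 MHz
print(port_bandwidth_mbps(4, 1000, 4))  # 16000.0 Mbits/s, cycle-accurate at 1 GHz
```

Under these assumptions, the 4-kbyte buffer holds a little over 30,000 instructions, and moving from about 1 to about 4 bits/instruction quadruples the bandwidth the trace port must sustain.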

But some classes of bug need to see the data (data addresses and/or data values). The data drives many process-control algorithms, so watching parameters over time is important. Some difficult-to-replicate system bugs are the result of data-coherency errors in hardware, system configuration, or software.

Several debug tools contain a powerful feature where all debugger windows, including processor register values, can be recreated from the data trace. As a result, a programmer can step forward (or backward) through code actually executed in real time in the real environment, showing its real misbehavior.

Unfortunately, the cost of implementing data trace is the highest of all. Trace macrocells need to be larger. Data is more difficult to compress. (Data trace from an ARM ETM typically requires one to two bytes/instruction.) Trace buffers need to be larger. And, trace ports must be faster.

Yet the upside of higher SoC-integration levels is that the gates can be squeezed into ever-smaller areas, so even high-performance multicore systems can have data-trace capabilities if required. Multiple on-chip trace buffers can be implemented, or trace ports using high-speed physical layers (PHYs) now can support multiple gigabit lanes. Today’s technology supports up to six lanes at 6 Gbits/s, which is enough for full, cycle-accurate simultaneous program and data trace of three ARM cores running at about 600 MHz.
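The six-lane claim can be sanity-checked against the per-instruction figures quoted earlier (about 4 bits/instruction for cycle-accurate program trace, plus one to two bytes/instruction for data trace). The sketch assumes one instruction per cycle per core and ignores any lane-encoding or protocol overhead; the function names are hypothetical.

```python
def port_capacity_mbps(lanes, mbps_per_lane):
    """Raw capacity of a multi-lane trace port in Mbits/s."""
    return lanes * mbps_per_lane


def trace_demand_mbps(num_cores, clock_mhz, program_bits, data_bytes):
    """Combined program and data trace demand, at one instruction per cycle."""
    return num_cores * clock_mhz * (program_bits + 8 * data_bytes)


capacity = port_capacity_mbps(lanes=6, mbps_per_lane=6000)
for data_bytes in (1, 2):
    demand = trace_demand_mbps(3, 600, program_bits=4, data_bytes=data_bytes)
    print(data_bytes, demand, demand <= capacity)
# 1 21600 True
# 2 36000 True
```

With two bytes of data trace per instruction, three 600-MHz cores need 36 Gbits/s, which is exactly the raw capacity of six 6-Gbit/s lanes, so the quoted configuration sits at the upper edge of what the port can carry under these assumptions.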

Sizing the trace port is another key task for the ASIC designer and another tradeoff decision derived from the cost of implementation versus the level of trace functionality. For multicore SoCs, the best approach may be a combination of solutions. For example, by fitting three parallel trace funnels, any subset of trace data may be sent to one of three destinations: a very high bandwidth interface to on-chip trace buffers, a medium bandwidth trace port to a very deep off-chip buffer, or a very narrow (even single pin) interface for continuous monitoring.

This delivers a trace solution that can provide for almost any usage case, from hardware fault analysis, where the instruction-by-instruction code and data are recorded over a period of thousands of cycles, through software debug and profiling of multicore code over trillions of instructions, to a high-level, application-generated trace available even from the most connectivity-challenged end product.

MULTIPLE TRACE SOURCES As with debug ports, fitting multiple trace ports—one for each core—has obvious silicon and pin overheads. One solution is to use a CoreSight trace funnel that combines multiple, asynchronous, heterogeneous trace streams into one for output via a single trace port or trace buffer (Fig. 4, again).

This provides better visibility, a higher-bandwidth port, or deeper buffer for a single-core use. It also reduces the implementation overhead (area, pins, and tools) substantially when simultaneous trace of multiple cores is required. Furthermore, it’s an ideal mechanism for extending support for any source of trace data.

Other sources of trace data may come from trace macrocells for DSPs, bus monitors, embedded logic analyzers, or in-silicon validation logic (e.g., synthesized assertions). The essence of the system is to provide a data path from trace data generation to a file on the developer’s workstation for use by the tool that configures and displays the data for the respective trace generator.
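The funnel idea, in which several independent trace streams are merged into one and later separated again for each source's own tool, can be sketched as follows. This is a toy round-robin model with a source ID attached to every packet; it is not the CoreSight frame format, and the function names and packet contents are hypothetical.

```python
from itertools import zip_longest


def funnel(streams):
    """Merge per-source packet lists into one list of (source_id, packet),
    visiting the sources round-robin so that no stream starves."""
    merged = []
    for row in zip_longest(*streams.values()):
        for source_id, packet in zip(streams.keys(), row):
            if packet is not None:  # this source had no packet in this round
                merged.append((source_id, packet))
    return merged


def demux(merged, source_id):
    """Recover one source's packets for the tool that understands them."""
    return [packet for sid, packet in merged if sid == source_id]


streams = {
    1: [b"cpu0-a", b"cpu0-b"],  # e.g., a program trace macrocell
    2: [b"dsp-a"],              # e.g., a DSP trace macrocell
    3: [b"bus-a", b"bus-b"],    # e.g., a bus monitor
}
merged = funnel(streams)
print(merged[:3])        # [(1, b'cpu0-a'), (2, b'dsp-a'), (3, b'bus-a')]
print(demux(merged, 3))  # [b'bus-a', b'bus-b']
```

Because every packet carries its source ID, a new trace generator only needs a new ID and a tool on the workstation that understands its packets; the port and buffer are shared.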