Combat Integration's Dark Side With New Development Tools

DESIGN VIEW is the summary of the complete DESIGN SOLUTION contributed article, which begins on Page 2.

Heisenberg stated in his studies that the observer is no longer external and neutral, but rather part of the environment being observed. In other words, the mere act of measurement alters the observation. A similar perplexing situation exists when developing products with the latest microelectronic devices.

The increasing chip-integration levels that dominate today's electronic designs propel the problem. Though designers save size, power, and cost with shrinking geometries, they struggle with what's termed "vanishing visibility." In other words, higher integration levels tend to hide a chip's inner operations.

Conventional development tools can no longer handle the design and debug chores for products that use the newest digital signal processors and microcontrollers. A new trend is under way, however, that involves integrating on-chip debug facilities in an effort to reverse the loss of visibility. Also helping the visibility cause is a new class of tools from chip suppliers and tool vendors.

Nonetheless, every increase in system clock rates threatens the latest debug approaches. This article delves into visibility challenges facing designers, available tools, on-chip technologies that provide visibility, and what the future may hold.

HIGHLIGHTS:
Chip Suppliers' Visibility Challenges: Chip makers face four key challenges in trying to deliver low-cost, visibility-enabled debug solutions: system complexity, hostile applications, debug bandwidth, and applications diversity.
Developers' Visibility Tool Chest: Chip vendors now supply a tool chest that targets the two main classes of visibility problems encountered by developers: algorithm or data-related problems, and program-flow or control-related problems.
On-Chip Technologies Supply Visibility: Four on-chip debug technologies together provide increased visibility in today's chips: trace, triggering, data I/O, and pin management.
What The Future Holds: In the not-too-distant future, it should be possible to obtain real-time trace with no gaps by using a clock or two, roughly a pin per 100 MHz of CPU clock frequency, and employing conventional single-ended buffers for trace-export transmission.
Sidebar: The History Behind Vanishing Visibility: System visibility has dropped steadily over time, but integration continues to increase. Only recently has visibility rebounded to some extent.


When scientists started to work at the subatomic level, they ran into a perplexing problem. As Heisenberg stated in his studies of quantum physics, the observer is no longer external and neutral, but rather part of the environment being observed. In other words, the mere act of measurement alters the observation. A similar perplexing situation exists when developing products with today’s microelectronic devices. It’s driving the creation of a new class of development tools.

Spurring on this evolution are the increasing chip-integration levels that dominate today’s electronic designs. Chip designers may enjoy the limelight with the size, power, and cost savings created by shrinking geometries. But software developers battle the "dark side" of the chip-integration story: "vanishing visibility." As if dealing with increased system complexity weren’t tough enough, these folks live in a murky development world where higher integration levels hide the inner operations of a chip.

Forced to use conventional debug techniques, they find that the mere process of monitoring chip activity can alter system timing, and thus the system itself. Naturally, the combined weight of these obstacles impedes timely project completion. Conventional development tools can no longer handle the design and debug chores for products that use the latest digital signal processors (DSPs) and microcontrollers (MCUs).

Fortunately, a trend is now underway to reverse this loss of visibility by integrating on-chip debug facilities. Recently, chip suppliers and tool vendors have responded to developers’ cries for help with a new class of tools representing yet another step toward the historical goal of making a system’s inner workings visible. Storm clouds are on the horizon, though, as every increase in system clock rates threatens the latest debug approaches. This article discusses:

The history behind vanishing visibility
Chip suppliers’ visibility challenges
Developers’ visibility tool chest
On-chip technologies providing visibility
What the future holds—are you prepared?

If you’re one of those developers who always gets it right the first time, you can probably stop here. But beware, someone else will cause you grief, so put things into perspective by taking a look at the state of debug over roughly four decades (see "The History Behind Vanishing Visibility").

Chip Suppliers’ Visibility Challenges

Chip makers are under considerable pressure to provide both low-cost and visibility-enabled debug solutions. This calls for maximizing I/O bandwidth on a limited number of debug pins. At the same time, nonintrusive instrumentation must be supplied at full speed. In delivering this solution, chip makers face four key challenges:

System complexity
Hostile applications
Debug bandwidth
Applications diversity

System Complexity: Compared to just a decade ago, it’s utterly amazing how many functions we can pack onto one chip. Not only are devices incorporating more of the functionality previously assigned to separate peripheral chips, they’re adding large memory caches and even multiple CPU cores with sophisticated interfaces between them. On top of that, analog functions also abound in today’s chips.

Because today’s architectures don’t always make traditional buses available at the chip boundary, one can’t look "at the edge" of a chip to see everything that’s going on inside. Consequently, the value of adding a logic analyzer to the chip pins has diminished dramatically in this environment.

Some chip subsystems bring their own special complications. Cache memory, for instance, can increase system performance but also obscure device actions. Some CPU cores today run at well over a gigahertz, with external interfaces running at rates as high as 250 MHz. Today’s high-performance systems have several cache levels on-chip to keep the CPU busy, using external memory as a slow buffer that’s accessed only as needed. Traditional debug technologies rely on monitoring the external memory bus to see what the processor is doing.

You can appreciate how this situation cripples the ability to monitor real-time activity when the external I/O is running at perhaps a quarter of the CPU speed. In addition, consider the complications that arise when a chip has two or more dissimilar cores, such as a DSP and an MCU. Now the tool chain must deal with two instruction sets and a large number of extra buses and signals.

Hostile Applications: Of course, the enormous flexibility that comes with putting so many different functions on a chip makes these devices popular building blocks in all sorts of systems. Unfortunately, some of these applications may be thought of as "debug hostile." An excellent example is a handheld consumer device like a cell phone. The chip(s) it uses must be as small as possible, meaning that there are no free pins to devote to providing debug data. Historically, little space has been allocated for on-chip debug logic.

But with developers struggling to get their applications working, traditional thinking has changed. Today, the "chip cost is everything" mentality that was widely held several years ago is no longer the mantra. Time-to-market benefits provided by on-chip tools are now being given serious consideration.

At the other extreme are problems that arise in very large embedded systems, such as in the telecom infrastructure. When you’re trying to troubleshoot a system that consists of multiple racks filled with densely populated cards, how do you gain access to the specific DSP that you suspect might have some debug information of interest? Even if you could physically access the correct card, how would you get to the edge of the chip if it were in a physical environment where there’s no access to the leads? Consequently, designers must be willing to allocate the board space to gain access to visibility information. In this class of application, trading channel density for time-to-market is the name of the game.

Debug Bandwidth: Higher integration brings yet a third class of problems, this time related to clock speeds. Systems are now more parallel and running faster than ever. CPU clock rates are climbing into the gigahertz range with the amount of debug data roughly proportional to the clock rate. Unfortunately, exposing internal bus activity directly at the I/O pins is an exercise in futility because conventional I/O toggle rates can’t keep up with internal bus rates. Thus, much of the debug information generated must be encoded and compressed to create a manageable data volume.

Even after encoding and compression, some chip debug facilities may send many megabytes of debug data to a trace recorder just to describe less than a second of program execution. With this information volume, users will need a sophisticated data-collection and post-processing mechanism to analyze and display the information stream.

Applications Diversity: The fourth issue concerns diversity in the development environment, whereby chip suppliers want tools to operate across a varied set of applications. Some applications are very cost-sensitive, which limits the amount of debug logic that can be put on-chip. Others are pin-limited, so debug pins come at a premium. Yet other applications are so time-to-market driven that strong on-chip debug facilities are necessary. Clearly, these suppliers must make compromises.

Achieving the proper balance is a challenge, as the cost-benefit relationship also depends on the skills of the system developers, maturity of the application, and the affordability of an on-chip debug solution. These constraints all affect the decision to spend gates and pins for debug.

Developers’ Visibility Tool Chest

The visibility solutions offered by chip vendors are shaping the architecture of next-generation debug tools, as well as the way developers approach their work. These vendors now supply a visibility tool chest that targets the two major classes of problems typically encountered by developers: algorithm or data-related problems, and program-flow or control-related problems. These two headaches are attacked using very different capabilities within the tool chest.

Algorithm Or Data-Related Problems: With the data-related problem class, one knows that an algorithm is stable, but its behavior may be nonoptimal. In this case, one looks at a system software component as a transfer function and determines the exact relationship of the inputs to the outputs. In some instances, developers would like to view data continuously streaming off an application and examine the particular data values pertinent to the application’s success.

Chip suppliers have responded to this need with real-time data-transfer technology. This technology appeared in early second-generation tools roughly 10 years ago, with some vendors only offering it recently. It supports bidirectional data transfers between the development tools and target system. Some of these solutions even provide enough bandwidth to stream application-generated video through the emulator for display on the host.

Also, evaluating algorithm performance is getting tougher. Assume, for instance, that you’re designing an embedded system that performs some video processing. A DSP won’t necessarily provide convenient external taps with which you can monitor the video flow at progressive stages in the processing. But when you’re developing an algorithm, you want the ability to monitor the quality of the video output stream at various stages.

To address such situations, chip vendors added tools for high-speed data transfer into and out of the chip. To see a demo of this high-speed data transfer in a video application, go to www.ti.com/rtemulationarticle. The speeds of vendors’ links are sufficient to handle most applications. The collection and export, or reception and use, of data are generally built permanently into the application, and are turned on when needed. For a tools-to-chip transfer, the on-chip debug logic must import the data, and the application software must disposition the data. For chip-to-tools transfers, the reverse happens. Figure 1 shows the data rates required to monitor various applications.

The technology first appeared with the introduction of Texas Instruments’ (TI) scan-based debug tools. This toolset introduced the world to scan-based emulation and delivered a real-time data exchange supporting audio-bandwidth needs. Using a standard JTAG scan connection, developers can transfer debug data to or from one or more TI DSPs and/or ARM microprocessor CPUs on the same or different chips.

Recently, the introduction of the XDS560 series emulators raised developer expectations for this technology. This emulator and TI’s DSPs support a high-speed data-transport link called High-Speed Real-Time Data eXchange (HS-RTDX) at greater than 2 Mbytes/s. The approach uses a direct emulator-to-chip connection, bypassing the JTAG scan path. Other chip vendors have recently embraced this type of technology, offering limited versions of their own. With today’s integration levels, this capability is very inexpensive on the DSP because it doesn’t require very much logic, and doesn’t significantly raise cost.
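To put the 2-Mbyte/s figure in perspective, the short sketch below checks whether a few data streams would fit through such a link. Only the link rate comes from the text. The stream parameters (sample rates, frame sizes, pixel formats) are illustrative assumptions, and the function names are hypothetical.

```python
# Rough check of whether a data stream fits a debug data link.
# The 2 Mbytes/s figure is the lower bound quoted in the text; the stream
# parameters further down are assumptions chosen only for illustration.

LINK_BYTES_PER_S = 2_000_000


def stream_rate(samples_per_s: float, bytes_per_sample: float, channels: int = 1) -> float:
    """Return the sustained data rate of a stream in bytes per second."""
    return samples_per_s * bytes_per_sample * channels


def fits(rate_bytes_per_s: float, link_bytes_per_s: float = LINK_BYTES_PER_S) -> bool:
    """True if the stream can be exported continuously over the link."""
    return rate_bytes_per_s <= link_bytes_per_s


# Assumed example streams:
# 16-bit stereo audio sampled at 48 kHz.
audio = stream_rate(48_000, 2, channels=2)
# QCIF video (176 x 144 pixels, 1.5 bytes per pixel) at 15 frames/s.
small_video = stream_rate(176 * 144 * 15, 1.5)
# 720 x 480 video (1.5 bytes per pixel) at 30 frames/s.
large_video = stream_rate(720 * 480 * 30, 1.5)

for name, rate in [("audio", audio), ("small video", small_video), ("large video", large_video)]:
    print(f"{name}: {rate:,.0f} bytes/s, fits link: {fits(rate)}")
```

Under these assumptions, audio and small-format video fit comfortably, while full-resolution uncompressed video would need either a faster link or decimation before export.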

In practice, developers insert calls in their application to open and enable an output channel. Then they log data of interest into application memory. A driver manages the export of collected data to the emulator or host via a standard communication link—for either real-time concurrent analysis or post-acquisition analysis. Developers can insert many collection points into the application, making this type of visibility solution both inexpensive and flexible.

Although this approach is quite appealing, it has two drawbacks: storing code and data requires memory, and it adds CPU cycles when code executes. The developer controls the size of data buffers and the number of collection points, and therefore the intrusiveness. Because the code has very deterministic behavior, a developer can easily plan for the MIPS (millions of instructions per second) impact. Despite these drawbacks, this approach is quite useful when the developer knows what data needs to be collected or dispositioned. Its strength is that it can efficiently instrument a number of points in the program with minimal on-chip debug hardware.
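The instrumentation pattern and its cost can be modeled in a few lines. The sketch below is a host-side model, not any vendor's API: `LogChannel`, `drain`, and `mips_overhead` are hypothetical names, and the per-call instruction count is an assumed figure. It shows the two knobs the developer controls, buffer size and the number of logging calls, and how the MIPS impact follows from them.

```python
from collections import deque


class LogChannel:
    """Hypothetical model of an application-side output channel."""

    def __init__(self, capacity: int):
        self.capacity = capacity      # buffer size chosen by the developer
        self.buffer = deque()
        self.enabled = False
        self.dropped = 0              # samples lost because the buffer was full

    def enable(self) -> None:
        self.enabled = True

    def log(self, value) -> None:
        """Collection point inserted into the application code."""
        if not self.enabled:
            return                    # disabled channels cost almost nothing
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return
        self.buffer.append(value)

    def drain(self, max_items: int) -> list:
        """Driver side: hand up to max_items samples to the emulator link."""
        count = min(max_items, len(self.buffer))
        return [self.buffer.popleft() for _ in range(count)]


def mips_overhead(calls_per_s: float, instructions_per_call: float) -> float:
    """Deterministic CPU cost of the logging calls, in MIPS."""
    return calls_per_s * instructions_per_call / 1e6


channel = LogChannel(capacity=2)
channel.log(1)                        # ignored: channel not yet enabled
channel.enable()
for sample in (10, 20, 30):           # third sample overflows the buffer
    channel.log(sample)
exported = channel.drain(max_items=8)

# Assumed load: 8,000 logging calls per second at 50 instructions each.
overhead = mips_overhead(8_000, 50)
```

Because each call costs a fixed number of instructions, the overhead scales linearly with the call rate, which is what makes it easy to budget in advance.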

Program Flow Or Control-Related Problems: With the control-related problem class, the program is unstable due to its design, memory-system behavior, and missed real-time deadlines, among other factors. Also, the developer typically wants to track these issues down without the instrumentation changing system operation.

In this case, recording execution history is the developer’s choice, especially when the recording occurs over an extended period of time. This means that the execution history can’t be stored on-chip as the storage requirements are excessive. Instead, program-execution history is exported off-chip to large buffers within an emulator or logic analyzer. Using this execution history makes it easier to locate classic code-runaway and deadline problems. Plus, profiling aids can identify code hot spots and other performance-related issues.

Finding stability problems or missed deadlines requires a record of the program flow for extended periods. For instance, a vocoder might miss a frame every now and then. Simply determining that this problem exists is difficult enough. But determining why it happens can be a real challenge.

Following the program flow exposes these failures. But herein lies a problem: one needs to monitor a program address bus operating at very high speeds, and outputting the program counter value for every instruction simply isn’t practical. A simple calculation reveals that a CPU with a 32-bit program counter would generate four bytes of trace data for every instruction executed. In fact, great pains must be taken to encode/compress the instruction-execution view to minimize the data volume and permit its export using a reasonable number of pins.

Even after encoding and compression, tracing program flow and timing generates 100 to 1000 times more data than the data-driven mode. If program flow and timing are monitored for seconds at a time, the amount of data necessary to export and record jumps from mere megabytes to gigabytes!
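A back-of-the-envelope calculation shows how quickly the numbers grow. In the sketch below, the 32-bit program counter comes from the text, while the instruction rate and the compression ratio are assumptions picked only to illustrate the scale. The function names are hypothetical.

```python
def raw_pc_trace_rate(instructions_per_s: int, pc_bits: int = 32) -> int:
    """Bytes per second needed to export the full PC for every instruction."""
    return instructions_per_s * (pc_bits // 8)


def recorded_bytes(raw_rate: int, compression: int, seconds: int) -> int:
    """Bytes the trace recorder must store after compression."""
    return raw_rate * seconds // compression


# Assumptions for illustration only.
INSTRUCTIONS_PER_S = 500_000_000   # one instruction per cycle at 500 MHz
ASSUMED_COMPRESSION = 20           # 20:1 program-flow compression

raw = raw_pc_trace_rate(INSTRUCTIONS_PER_S)
ten_seconds = recorded_bytes(raw, ASSUMED_COMPRESSION, seconds=10)

print(f"raw PC trace: {raw / 1e9:.1f} Gbytes/s")
print(f"10 s recording at {ASSUMED_COMPRESSION}:1: {ten_seconds / 1e9:.1f} Gbytes")
```

With these assumed figures, the uncompressed stream is 2 Gbytes/s, and even a 20:1 reduction leaves a gigabyte to record for ten seconds of execution.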

It takes a long time to detect some problems within a system, so it could take the recording of lots of data to isolate the toughest system problems. Because the exported data is heavily encoded, the host system is used to reconstruct what’s going on in the target system and make this information available to the developer.
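One simple way to picture the encode-on-chip, decode-on-host split is to export only program-flow discontinuities and let the host fill in the sequential instructions between them. The sketch below is an illustration under assumptions, not the scheme used by any particular chip: it assumes a fixed 4-byte instruction size, and the `encode` and `decode` names are hypothetical.

```python
STEP = 4  # assumed fixed instruction size in bytes


def encode(pcs: list[int]) -> tuple[int, list[tuple[int, int]], int]:
    """Target side: reduce a PC sequence to its discontinuities.

    Returns the starting PC, a list of (sequential_count, branch_target)
    records, and the number of sequential instructions after the last branch.
    """
    if not pcs:
        raise ValueError("empty trace")
    records = []
    run = 0  # sequential instructions since the last discontinuity
    for prev, cur in zip(pcs, pcs[1:]):
        if cur == prev + STEP:
            run += 1
        else:
            records.append((run, cur))
            run = 0
    return pcs[0], records, run


def decode(start: int, records: list[tuple[int, int]], tail: int) -> list[int]:
    """Host side: rebuild the full PC sequence from the compact records."""
    pcs = [start]
    for run, target in records:
        for _ in range(run):
            pcs.append(pcs[-1] + STEP)
        pcs.append(target)
    for _ in range(tail):
        pcs.append(pcs[-1] + STEP)
    return pcs


# Straight-line code, a call to 0x200, then a return to 0x10C.
executed = [0x100, 0x104, 0x108, 0x200, 0x204, 0x10C, 0x110]
start, records, tail = encode(executed)
rebuilt = decode(start, records, tail)
```

Seven program-counter values shrink to a start address and two records here. Real schemes go much further, but the division of labor is the same: the chip exports as little as it can, and the host does the reconstruction.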

On-Chip Technologies Supply Visibility

Four on-chip debug technologies together provide increased visibility in today’s chips:

Trace
Triggering
Data I/O
Pin management

The resulting on-chip debug architecture is similar to the one shown in Figure 2.

The trace and triggering portions of this solution are implemented like a logic analyzer, with the analyzer’s triggering and collection section on-chip and its recording buffer in the emulator, external to the chip. However, the on-chip triggering and trace combination can rival CPUs in complexity. In some cases, their irregular implementation makes them more complicated than CPUs and other system components. The data I/O portion is really a DMA-driven serial port of some sort, while a pin manager assigns the desired functions to pins at run-time.

Trace: Some general-purpose-processor (GPP) manufacturers already offer exported trace for lower-frequency CPUs. A number of these GPP companies have gone back to the drawing boards to create a solution for high-performance CPUs. DSP suppliers are also entering the fray, mimicking and substantially improving on the MCU offerings.

To gain some insight into the trace problem, let’s explore the drastically simplified view of the trace collection, export, and decode process shown in Figure 3. Our processor has a 32-bit program counter and two data buses—each with a 32-bit address and 64 bits of data. Assuming all buses are active each cycle, it’s possible that 224 bits of trace information may be generated each clock cycle. If 10 pins delivering 200 Mbits/s per pin are dedicated to trace export, a 62:1 nonlossy compression would have to be deployed if the CPU is running at 500 MHz. Because memory references don’t encode and compress well, achieving this compression efficiency is impossible. So, filters must be used to restrict the memory-reference information to only that which is determined valuable.
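The arithmetic behind this example is easy to reproduce. The sketch below uses only the bus widths, pin count, pin rate, and CPU frequency given above, with hypothetical function names. The raw ratio it computes, 56:1, is in the same range as the 62:1 quoted. The difference presumably reflects export overhead that this simple model leaves out.

```python
def trace_bits_per_cycle(pc_bits: int, data_buses: int, addr_bits: int, data_bits: int) -> int:
    """Worst-case trace bits generated per clock with every bus active."""
    return pc_bits + data_buses * (addr_bits + data_bits)


def required_compression(bits_per_cycle: int, cpu_hz: float,
                         pins: int, bits_per_s_per_pin: float) -> float:
    """Ratio of generated trace bandwidth to available export bandwidth."""
    generated = bits_per_cycle * cpu_hz
    exportable = pins * bits_per_s_per_pin
    return generated / exportable


bits = trace_bits_per_cycle(pc_bits=32, data_buses=2, addr_bits=32, data_bits=64)
ratio = required_compression(bits, cpu_hz=500e6, pins=10, bits_per_s_per_pin=200e6)

print(f"{bits} bits/cycle, raw compression needed: {ratio:.0f}:1")
```

Halving the CPU frequency or doubling the pin count halves the required ratio, which is why pin count and clock rate dominate the discussion that follows.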

Triggering: Triggering provides a way to selectively trace (filter) program flow, timing, and memory references. Traditional on-chip triggering consisted of simple comparators used for breakpoints. Today’s advanced capabilities add counters, event detectors, and state sequencers. These parts all reside on the chip, alongside the embedded buses and signals. One real-world implementation of such selective triggering is the advanced-event-trigger (AET) capabilities built into TI’s DSPs. Over the past several years, much richer AET triggering and sequencing capabilities have been implemented. This also holds true for other chip suppliers.

Trace and Triggering Combination: As powerful as triggering and trace are by themselves, they have weaknesses. For instance, what if a write that corrupts memory is detected with a trigger, while the sequence of events leading up to the bad write isn’t visible? Why the write happened may not be obvious.

Trace without filters has weaknesses as well. The main problem here is that the information volume can overwhelm either temporary buffers on chip, or humans trying to interpret the volumes of data. A combination of triggers and trace makes it possible to look at program flow and associated memory references on very specific conditions, thereby reducing the amount of exported data to a manageable level.
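As a rough software model of this filtering, the sketch below passes only the bus events that satisfy every qualifier, much as an on-chip comparator would gate what reaches the trace port. The event fields, addresses, and helper names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BusEvent:
    cycle: int
    kind: str       # "read" or "write"
    address: int
    value: int


Qualifier = Callable[[BusEvent], bool]


def in_range(low: int, high: int) -> Qualifier:
    """Address-range comparator: true for addresses in [low, high]."""
    return lambda event: low <= event.address <= high


def is_kind(kind: str) -> Qualifier:
    """Access-type comparator."""
    return lambda event: event.kind == kind


def filter_trace(events: list[BusEvent], qualifiers: list[Qualifier]) -> list[BusEvent]:
    """Keep only events that satisfy every qualifier."""
    return [event for event in events if all(q(event) for q in qualifiers)]


events = [
    BusEvent(1, "read", 0x8000, 7),
    BusEvent(2, "write", 0x2000, 1),
    BusEvent(3, "write", 0x8004, 9),
    BusEvent(4, "write", 0x9000, 3),
]

# Export only writes that land in the buffer at 0x8000-0x80FF.
exported = filter_trace(events, [is_kind("write"), in_range(0x8000, 0x80FF)])
```

Only one of the four events is exported in this example. On a real chip the same decision is made in hardware, in real time, before the data ever reaches a pin.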

Triggers on failures can stop trace recording. Triggers can also use a multilevel state machine to detect a complex sequence of events that involve either hardware or software misbehavior. Such flexibility proves especially useful late in the design cycle when complex or intermittent problem sequences arise. The co-location of debug triggers and critical system buses/signals determines whether information is important or not, in real time. Such a capability lets a developer pinpoint the origin of code problems in hours instead of months.
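The sketch below models both ideas together: a multilevel sequencer that fires only after its stages match in order, and a circular trace buffer that stops recording when the trigger fires, preserving the events that led up to it. It is a simplified software analogy with hypothetical names and event formats, not a description of any vendor's trigger logic.

```python
from collections import deque
from typing import Callable, Iterable

Event = tuple[str, int]            # (kind, address), an assumed event format
Stage = Callable[[Event], bool]


class Sequencer:
    """Multilevel state machine: fires once every stage has matched in order."""

    def __init__(self, stages: list[Stage]):
        self.stages = list(stages)
        self.state = 0

    def step(self, event: Event) -> bool:
        """Advance on a matching event; return True once the trigger has fired."""
        if self.state < len(self.stages) and self.stages[self.state](event):
            self.state += 1
        return self.state == len(self.stages)


def capture_until_trigger(events: Iterable[Event], sequencer: Sequencer, depth: int) -> list[Event]:
    """Record into a circular buffer and stop when the sequencer fires."""
    history = deque(maxlen=depth)   # oldest entries are overwritten
    for event in events:
        history.append(event)
        if sequencer.step(event):
            break                   # freeze the buffer at the trigger point
    return list(history)


# Trigger on a write to 0x8000, but only after the routine at 0x400 has run.
stages = [
    lambda e: e == ("exec", 0x400),
    lambda e: e == ("write", 0x8000),
]
events = [
    ("exec", 0x100),
    ("write", 0x8000),   # ignored: the first stage has not matched yet
    ("exec", 0x400),
    ("exec", 0x404),
    ("write", 0x8000),   # trigger fires here
    ("exec", 0x408),     # never recorded
]
captured = capture_until_trigger(events, Sequencer(stages), depth=3)
```

The captured window ends at the offending write and contains the instructions just before it, which is the context a trigger alone cannot supply.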

Data I/O: As stated previously, there are two forms of data I/O: DMA-driven data transfers through the scan interface (JTAG), and high-speed versions with a DMA-driven bidirectional serial port. In both cases, the chip vendor supplies a driver, and one or both of these capabilities come as standard fare, encapsulated with the CPU core.

Pin Management: Although trace, triggering, and some forms of data I/O all require pins to operate, not all capabilities are needed at once. A pin manager is included to distribute the available debug pins to the capabilities that need them the most, with trace being especially pin-hungry.

The number of pins used for trace export is very important. Each pin dedicated to trace must be operated at the highest bandwidth possible. Even with selective triggering and data compression, trace takes between 4 and 20 times as many pins as data I/O, which requires just one pin. You can consider the CPU frequency as a multiplier of the pin needs because the debug information increases proportionally to the CPU frequency. Also, the quality of the debug architecture affects the pin count. Less efficient encoding and compression schemes certainly use more pins, and not all trace protocols are created equal.

Pins are precious, so gates are spent to improve the encoding and compression, thus reducing pin consumption. Leading technologies use single-ended I/Os to export trace data at the highest rate practical per pin, with the number of pins required directly proportional to the encoded and compressed information volume. They also allow for the addition of trace pins in small increments, with debug port pins programmable as trace, data export, or other functions (e.g., input or output triggers).
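A pin manager's job can be pictured as a small allocation problem. The sketch below is one possible policy under assumptions, not a documented algorithm: each function states a minimum and maximum pin count, minimums are granted in priority order, and leftover pins are then handed out in small increments. The function names and the request format are hypothetical, and the 4-to-20 range for trace echoes the figures above.

```python
def assign_pins(available: int, requests: list[tuple[str, int, int]]) -> dict[str, int]:
    """Distribute debug pins among functions.

    requests holds (function, min_pins, max_pins) tuples in priority order.
    """
    assignment = {name: 0 for name, _, _ in requests}
    remaining = available
    granted = set()

    # First pass: satisfy minimum needs in priority order.
    for name, min_pins, _ in requests:
        if min_pins <= remaining:
            assignment[name] = min_pins
            remaining -= min_pins
            granted.add(name)

    # Second pass: give spare pins to functions that can use more.
    for name, _, max_pins in requests:
        if name in granted and remaining > 0:
            extra = min(max_pins - assignment[name], remaining)
            assignment[name] += extra
            remaining -= extra

    return assignment


requests = [
    ("trace", 4, 20),        # pin-hungry: 4 to 20 pins
    ("data_io", 1, 1),       # a single pin
    ("trigger_out", 1, 1),
]
eight_pin_port = assign_pins(8, requests)
four_pin_port = assign_pins(4, requests)
```

With eight pins, every function gets its minimum and trace absorbs the remainder. With only four, trace takes them all and the other functions wait until the developer reprograms the port.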

What The Future Holds

Unquestionably, the "chip-cost-is-everything" era has ended. The plight of software developers and missed project deadlines have awakened program managers. They’re now aware of the need for a stronger toolset. On top of that, hardware developers must relinquish some of their control over the chip features.

Fortunately, shrinking geometries and low-cost packaging technologies are making it ever more affordable to include on-chip debugging capability. Consequently, this required technology is becoming more palatable for everyone. Today’s combined triggering and trace solutions consume 100 to 200 kgates. This certainly makes the technology affordable compared to months of schedule delays. We all know that "if you can’t see it, you can’t fix it." A few encounters with this law of nature secure the case for on-chip debug.

In the not-too-distant future, it should be possible to obtain continuous real-time trace with no gaps by using a clock or two, roughly a pin per 100 MHz of CPU clock frequency, and employing conventional single-ended I/O buffers for trace-export transmission. Some technology in development today may deliver continuous visibility well into the several-gigahertz clock range. Given that trace export generally transmits the data through the target board, the emulation header, and as much as 6 to 12 inches of cable to get to the emulator, these numbers are truly impressive.
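The pin-count rule of thumb translates directly into a one-line estimate. In the sketch below, the 100 MHz-per-pin ratio and the "clock or two" come from the text, while the function name and the choice of two clock pins as the default are assumptions.

```python
import math


def trace_port_pins(cpu_mhz: float, mhz_per_pin: float = 100.0, clock_pins: int = 2) -> int:
    """Estimate trace-port pins: one data pin per 100 MHz plus clock pins."""
    return math.ceil(cpu_mhz / mhz_per_pin) + clock_pins


for mhz in (250, 600, 1000):
    print(f"{mhz} MHz CPU: about {trace_port_pins(mhz)} pins")
```

The estimate grows linearly with clock frequency, so a designer can budget debug pins as soon as the target CPU speed is known.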

Now that you know what the benchmark is, you know what your competitors have at their disposal. Deciding how much debug to deploy is like purchasing insurance. You never know for sure in advance how much you'll need. But the penalty for not having enough can range from annoying to disastrous.

How much are you willing to invest in chip and debug tools to ensure your project will be successful? Using an emulator-friendly data-exchange port on the chip incurs the lowest hardware cost, essentially the minimal "insurance" policy. But it doesn't do much to tackle system crashes and timing problems. The more expensive, higher-performance coverage of triggering and trace technology may be needed to provide the right visibility for finding your subtle, annoying bugs, offering better "insurance" protection for successful product development. Always remember too, regarding product bugs, if you can't find them, you can't fix them!