Is SoC technology passing by DSPs? Steve Leibson explains why he thinks that's the case.
The movie "Goodbye, Mr. Chips" tracks British schoolteacher Charles Edward ("Chips") Chipping's career from his early professional days, teaching the classics at fictitious Brookfield school, to his old age. Predictably, at the end of the movie, "Chips" passes on after a life well lived. Like Mr. Chips, digital signal processors (DSPs) have already entered their old age, at least as on-chip processor cores. They've served the industry well, but SoC technology has passed them by.
Digital signal processing is now mainstream technology, so it seems heretical to be declaring the end of the road for DSP cores. All sorts of media processing—ranging from voice to music, to still images, to video—require DSP functions.
DSPs first appeared as chips in the early 1980s because general-purpose processors of the day could not deliver sufficient performance for contemporary signal-processing tasks. Early DSP architectures were shaped by the algorithms they were created to run, and every feature in a DSP accelerates some computation in a signal-processing algorithm.
Early general-purpose processors lacked hardware multipliers, which consumed a relatively large number of gates for the time. Yet signal-processing algorithms such as finite- and infinite-impulse-response (FIR and IIR) filtering are full of multiplications followed by accumulation of the multiplication products. As a result, DSPs have incorporated hardware multipliers and MAC (multiplier/accumulator) units ever since Texas Instruments introduced the first commercially successful DSP (the TMS32010) in 1982.
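The multiply-accumulate pattern that motivated hardware MAC units is easy to see in a FIR filter's inner loop. Here is a minimal C sketch of a hypothetical 4-tap filter (the function name and tap count are illustrative, not from any particular DSP's library); each loop iteration is exactly the multiply-then-accumulate step a MAC unit executes in a single cycle:

```c
#include <assert.h>

/* Hypothetical 4-tap FIR filter: each output sample is the sum of
   multiply-accumulate (MAC) operations over the taps. On a DSP with
   a hardware MAC unit, each iteration's multiply and add complete
   in one cycle. */
static int fir4(const int *x, const int *h)
{
    int acc = 0;                  /* accumulator register */
    for (int i = 0; i < 4; i++)
        acc += x[i] * h[i];       /* one MAC operation per tap */
    return acc;
}
```

Without a hardware multiplier, each `x[i] * h[i]` would itself expand into a multi-cycle shift-and-add sequence, which is why early general-purpose processors were so slow at this loop.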
Today, thanks to the relentless advance of Moore's Law, MAC units just aren't that large relative to other blocks placed on a chip. Configurable processor cores like Tensilica's Xtensa family have optional MAC units.
More units, fewer cycles
The high computational requirements of signal-processing algorithms spurred DSP designers to add parallel, independent execution units. Parallel execution units such as ALUs, shifters, and address-generation units allow a DSP to execute the inner loops of algorithms in fewer cycles.
Although many general-purpose processor cores don't have multiple, parallel execution units, configurable processors can add them. This flexibility is one of the advantages of the malleable configurable-processor architecture. In fact, if a fused instruction that performs an addition, a shift, and a next-address calculation will speed an inner loop, that instruction can easily be added to a configurable processor's instruction set and software-development tool chain.
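To make the fused-instruction idea concrete, here is a C sketch of a hypothetical fused operation of the kind just described (the name `fused_step` and its argument list are assumptions for illustration, not a real TIE instruction). On a plain RISC core the three sub-operations are three separate instructions; a configurable processor could collapse them into one:

```c
#include <assert.h>

/* Hypothetical fused operation: add, shift, and next-address update
   in a single step. Each commented line below would be a separate
   instruction on a conventional scalar core. */
static int fused_step(int acc, int sample, int shift, unsigned *addr)
{
    int sum = acc + sample;    /* 1: addition                 */
    sum >>= shift;             /* 2: arithmetic shift         */
    (*addr)++;                 /* 3: next-address calculation */
    return sum;
}
```

An inner loop built from this operation would retire one fused instruction per iteration instead of three scalar instructions, which is where the cycle savings come from.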
Fast memory access
High-speed computations aren't the only high-demand operations conducted during the execution of signal-processing algorithms. High-speed computation units need a stream of operands, and the results of DSP operations create a corresponding result stream. Early DSP designers adopted non-standard memory architectures that could perform multiple memory accesses per cycle to increase memory bandwidth. The most widely adopted approaches are Harvard architectures (separate memories for instructions and operands) and the XY memory architecture.
Signal-processing algorithms frequently include loops with predictable memory-access patterns. Specialised DSP address-generation units can exploit this predictability using specialised addressing modes, such as indirect addressing with post-increment, circular, and bit-reversed addressing that efficiently index the operands stored in memory. These addressing modes accelerate a wide range of signal-processing algorithms, including FIR filtering and the fast Fourier transform (FFT).
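The two most distinctive of these addressing modes can be emulated in a few lines of C. This is a sketch of what a DSP's address-generation unit computes in hardware, in parallel with the arithmetic (the function names are my own, chosen for illustration):

```c
#include <assert.h>

/* Circular (modulo) post-increment: the index wraps inside a
   buffer of length n -- the access pattern of a FIR delay line. */
static unsigned circ_next(unsigned idx, unsigned n)
{
    return (idx + 1) % n;
}

/* Bit-reversed addressing over log2(n) address bits -- the
   operand ordering required by a radix-2 FFT. */
static unsigned bit_reverse(unsigned idx, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++)
        r = (r << 1) | ((idx >> b) & 1u);
    return r;
}
```

On a processor without these modes, every operand access in the loop pays for this index arithmetic in extra instructions; with them, the update is free.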
Configurable processor cores offer all of the memory-architecture options developed for DSPs, with the requisite address-generation units to accelerate algorithm execution. Therefore, these features also are no longer unique to DSPs.
Performance at the pinnacle
It's often possible to independently execute the same operation on multiple data words within the inner loop of a signal-processing algorithm using SIMD (single-instruction, multiple-data) execution units. DSPs now feature SIMD execution units with multiple adders, multipliers, or MACs. For algorithms where SIMD execution is useful, the parallelism can be quite high: a 4- or 8-way SIMD unit can effectively accelerate an inner loop by a factor of four or eight, respectively. Like the other features discussed above, SIMD execution units are no longer unique to DSPs.
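A scalar C emulation makes the SIMD speedup easy to see. The sketch below (a hypothetical 4-lane add, not any specific instruction set's intrinsic) applies one operation to four data lanes; on real SIMD hardware the loop body is a single instruction, so a loop over N samples runs in roughly N/4 iterations:

```c
#include <assert.h>

/* Emulation of a 4-way SIMD add: one "instruction" applies the
   same operation to four independent data lanes. In hardware the
   four adds happen in parallel in a single cycle. */
static void simd4_add(const int *a, const int *b, int *out)
{
    for (int lane = 0; lane < 4; lane++)  /* parallel lanes in HW */
        out[lane] = a[lane] + b[lane];
}
```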
High-performance DSPs have become VLIW (very-long instruction word) machines that issue multiple independent operations to their parallel execution units each cycle. VLIW processors require wider instruction words, with perhaps 32 or 64 bits (or more) per instruction instead of 16.
The added ability to execute multiple independent operations per clock cycle needn't incur code bloat. For example, Tensilica's Xtensa LX processor core has a VLIW-like feature called FLIX (flexible-length instruction extensions) that adds 32- or 64-bit multi-issue operation bundles to the processor's existing 24/16-bit native instruction set. The compiler selects FLIX instructions if they're more efficient than the equivalent sequence of native instructions, which greatly accelerates loops. In control code (all signal-processing algorithms are laced with such code), parallelism is generally not helpful, so the compiler selects the processor's narrower native instructions.
So what's the difference?
Automated compiler selection of appropriate instructions opens this discussion to a major difference between DSPs and augmented configurable processor cores. In general, the DSP's highly specialised, irregular, and complicated instruction sets; small register files; and irregular memory architectures make them poor compiler targets. The resulting code is relatively inefficient because the compiler must translate from C to the DSP's irregular instruction set and small register complement. Because signal-processing algorithms generally require tight, efficient code to meet performance goals—especially in their inner loops where the bulk of the processing is performed—DSP code is often written by hand in assembly language.
Conversely, the general-purpose configurable processor is a good compiler target. Configurable processors excel at executing control code. DSP enhancements are used within the signal-processing algorithm's inner loops, where the compiler can best harness these specialised instructions. When properly implemented, even these DSP-enhancing features are easily and efficiently employed by compiled code. DSP-enhanced configurable processors offer the performance benefits of DSPs with the added benefit of remaining good compiler targets, which reduces the overall coding burden on the SoC development team.
In summary, DSP cores no longer offer the SoC design team any performance advantages over configurable processor architectures. All of the DSP architects' good ideas have become a configurable processor's optional abilities. At the same time, configurable processors retain their ability to execute control code and they remain better compiler targets. Like Mr. Chips, DSPs have led the way to a variety of performance-enhancing architectural features, but their time to serve as on-chip processors has passed.
What do you think?
Do you agree or disagree with the statement that DSP cores no longer offer the SoC design team any performance advantages over configurable processor architectures? Let me know. E-mail me at [email protected]