Denser, Faster Chips Deliver Knockout DSP Performance

The high-performance DSP market resembles boxing. It's stratified, with ranks much like the featherweight, lightweight, welterweight, and heavyweight classes. You'll find multiple champions, each with the best high-performance solution for its specific stratum.

Audio DSPs range from low-cost consumer solutions to 24-bit, high-precision chips that target professional audio systems. Video solutions start with low-resolution quarter-CIF content processing and span up to HD widescreen. High performance is in the eyes of the application, and every category offers "best-in-class" DSPs.

Best-in-class solutions can combine a number of factors, such as clock speed, overall throughput, on-chip integrated features, and power consumption. It all depends on the application's needs. That said, all designers should follow a couple of good rules of thumb.

First, faster is usually better. At higher speeds, the DSP engine can run more iterations of an algorithm or run a more complex algorithm in a fixed amount of time. Higher speed also provides a safety margin for algorithms to grow in complexity or precision, enhancing the application.
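To put a number on that headroom, a quick sketch can estimate how many processor cycles are available between consecutive input samples at a given clock rate. The 48-kHz sample rate and the function name below are assumptions chosen for illustration, not figures tied to any particular DSP.

```python
def cycles_per_sample(clock_hz: int, sample_rate_hz: int) -> int:
    """Whole processor cycles available between two consecutive input samples."""
    return clock_hz // sample_rate_hz


# Hypothetical audio stream sampled at 48 kHz (assumed for illustration).
SAMPLE_RATE_HZ = 48_000

for clock_hz in (500_000_000, 1_000_000_000):
    budget = cycles_per_sample(clock_hz, SAMPLE_RATE_HZ)
    print(f"{clock_hz // 1_000_000} MHz -> {budget} cycles per sample")
```

Doubling the clock roughly doubles the per-sample cycle budget, which is the margin an algorithm can grow into.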

Second, higher levels of integration are generally a good thing. Complex algorithms can take advantage of more on-chip memory, and integrating multiple math units adds parallel performance.

Several DSPs have hit 1-GHz clock speeds. Even higher speeds are possible as process technology advances to 65- and 45-nm regimes. The smaller features also open the door to higher levels of system integration on the DSP.

Chips aren't just compute engines anymore. They also contain abundant amounts of on-chip memory, flash storage to hold programs or data, Ethernet ports for network communications, and other features that expand their role into the control plane.

WHAT'S OUT THERE? One DSP clocking in at 1 GHz hails from Texas Instruments. The TMS320C6455 packs 2 Mbytes of on-chip RAM, Serial RapidIO ports for high-speed data transfers (25-Gbit/s aggregate throughput), a 1-Gbit/s Ethernet media-access controller, a 66-MHz PCI bus, and a double-data-rate external memory interface (Fig. 1).

Built on the C64x+ very-long-instruction-word (VLIW) core, an enhanced design with new instructions, the C6455 boosts cycle efficiency by about 20%. Code for the new core is also 20% to 30% more compact than code for the previous-generation C64x core. As a result, the C6455 delivers an overall performance improvement of two to 12 times. An enhanced multiplier-accumulator can execute up to eight 16- by 16-bit multiplications per cycle for improved math throughput of 8 GMACs/s at 1 GHz.
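The 8-GMAC/s figure follows directly from the multiplier count and the clock rate. As a rough sketch (the function name is hypothetical, and the result is a theoretical peak that ignores memory stalls and loop overhead):

```python
def peak_macs_per_second(macs_per_cycle: int, clock_hz: int) -> int:
    """Theoretical peak multiply-accumulate rate: MACs issued per cycle times clock."""
    return macs_per_cycle * clock_hz


# Figures quoted above: eight 16- by 16-bit multiplies per cycle at 1 GHz.
c6455_peak = peak_macs_per_second(macs_per_cycle=8, clock_hz=1_000_000_000)
print(f"{c6455_peak / 1e9:.1f} GMACs/s")
```

Sustained throughput in a real application would typically fall below this peak.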

Shoehorning in as much as 3 Mbytes of on-chip RAM, the TigerSHARC DSPs from Analog Devices can deliver throughputs of up to 4800 million fixed-point MACs/s or 3600 million floating-point MACs/s when running at 600 MHz. The processor architecture consists of two computational blocks that can operate independently, in parallel, or as a single-instruction/multiple-data (SIMD) engine (Fig. 2).

Two compute instructions per computational block can be issued every cycle, instructing the ALU, multiplier, or shifter to perform independent, simultaneous operations. Each computational block contains four elements: an ALU, a multiplier, a 64-bit shifter, and a 32-bit register file. The TS201S also includes a communications logic unit that increases the number of complex multiplies per cycle performed by the chip.
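One way to picture the SIMD mode is as a single MAC instruction broadcast to both computational blocks, each working on its own slice of the data. The following is a simplified Python model, not TigerSHARC code. The class and function names are invented for illustration, and the even/odd split of the vectors is just one possible data layout.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ComputeBlock:
    """Toy model of one computational block with a single accumulator."""
    acc: int = 0

    def mac(self, x: int, y: int) -> None:
        # Multiply-accumulate: acc += x * y
        self.acc += x * y


def simd_mac(blocks: Sequence[ComputeBlock],
             operands: Sequence[Tuple[int, int]]) -> None:
    """Issue the same MAC instruction to every block, one operand pair each."""
    for block, (x, y) in zip(blocks, operands):
        block.mac(x, y)


def simd_dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Dot product with even-indexed terms on block 0, odd-indexed on block 1."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("vectors must have equal, even length")
    blocks = [ComputeBlock(), ComputeBlock()]
    for i in range(0, len(a), 2):
        simd_mac(blocks, [(a[i], b[i]), (a[i + 1], b[i + 1])])
    # Combine the two partial sums held in the blocks' accumulators.
    return sum(block.acc for block in blocks)


print(simd_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

In this model, each loop iteration stands for one issued instruction, so a two-block SIMD engine finishes the dot product in half as many issues as a single block would need.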

Freescale Semiconductor's high-throughput offerings, the MSC8122 and 8126 DSPs, clock at 400 to 500 MHz. Even so, their integer throughput extends to 8 GMACs/s. These chips contain four instances of the StarCore SC140 DSP core, and each packs a compute block that contains a data-arithmetic unit, an address generation unit (AGU), and a program sequencer (Fig. 3).

Inside the data-arithmetic unit is a 16-word by 40-bit register file, a bit-field unit for bidirectional shifting, and four ALUs, each consisting of a 16-bit multiplier and accumulator. Across the chip's four cores, that makes a total of 16 MACs. When clocked at 500 MHz, these MACs deliver the 8-GMAC/s throughput (16 MACs × 500 MHz). The AGU contains address registers and logic to address data operands in memory.

Other on-chip resources include 1.43 Mbytes of SRAM, a 10/100-Mbit Ethernet controller, a 32/64-bit SDRAM memory controller, a 32- or 64-bit host port, and four time-division multiplexed interfaces (each able to support 256 channels for connectivity to T1/E1, MVIP, and H110, as well as a bandwidth of up to 6 Mbits/s per TDM interface).

The MSC8126 also packs some application-specific DSP blocks to optimize the chip for communications applications: a turbo-coding coprocessor and a Viterbi coprocessor help accelerate wireless baseband processing. Thanks to these extra blocks, the MSC8126 can provide up to 80 complete symbol-rate channels of 3GPP AMR voice at 12.2 kbits/s, or 20 3GPP data channels at 384 kbits/s (including symbol-rate and chip-rate assist functions).

Analog Devices, Freescale Semiconductor, and Texas Instruments all offer application-targeted versions of these DSPs, with resources optimized for applications such as Voice-over-Internet Protocol (VoIP), video processing, and wireless basestations. So when it comes to selecting a general-purpose or application-optimized off-the-shelf DSP solution, choices abound.

THE JOYS OF LICENSING In addition to off-the-shelf products, designers can roll their own custom DSP chip by licensing cores from 3DSP, AMI Semiconductor (which absorbed DspFactory), CEVA-DSP, LSI Logic (ZSP cores), StarCore, and other suppliers of intellectual property (IP).

General-purpose microprocessor cores targeted at embedded applications are getting DSP smart, too. Cores from ARC, ARM, MIPS, Tensilica, and other IP vendors include DSP support in the form of hardware multiplier-accumulators and specialized DSP instructions. The specialized instructions can considerably improve the throughput of algorithms that might otherwise not be practical if executed using a generic instruction set.
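As an illustration of what such a specialized instruction buys, consider a saturating multiply-accumulate, a common DSP primitive. With a generic instruction set, the multiply, add, and two range checks are separate steps, while a DSP-aware core can fold them into one operation. The Python model below only shows the arithmetic. The 16-bit operand and 40-bit accumulator widths, and the function name, are assumptions for this sketch rather than the behavior of any specific core named above.

```python
ACC_BITS = 40  # assumed accumulator width for this sketch
ACC_MAX = (1 << (ACC_BITS - 1)) - 1
ACC_MIN = -(1 << (ACC_BITS - 1))


def saturating_mac(acc: int, x: int, y: int) -> int:
    """Return acc + x*y for signed 16-bit x and y, clamped to a signed 40-bit range."""
    if not (-32768 <= x <= 32767 and -32768 <= y <= 32767):
        raise ValueError("operands must fit in signed 16 bits")
    result = acc + x * y
    # Saturate instead of wrapping around on overflow.
    return max(ACC_MIN, min(ACC_MAX, result))


print(saturating_mac(0, 3, 4))
print(saturating_mac(ACC_MAX, 1, 1) == ACC_MAX)
```

Clamping rather than wrapping keeps an overflowing filter output pinned at full scale instead of flipping sign, which is why DSP instruction sets often provide it in hardware.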

Stretch Inc. took one of Tensilica's cores and tightly integrated programmable logic into the datapath. The programmable logic lets designers implement hardware-optimized instructions for DSP and other applications (see "FPGA Flexibility Lights A Fire Under DSPs" at www.elecdesign.com, ED Online 10675). With its DSP instruction enhancements, the Stretch solution performs comparably to, or even better than, dedicated DSP cores from CEVA-DSP and other companies.

HIT THE LAB With the ability to integrate millions of gates on a chip, designers can explore new architectures to further enhance DSP performance. In some areas, the basic Harvard architecture has already given way to VLIW approaches. In turn, these approaches leverage the ability to integrate multiple on-chip compute blocks (several multiplier-accumulators, full arithmetic and logic units, barrel shifters, etc.) and control them in parallel. Major companies going this route include Analog Devices, Philips Semiconductors (the Nexperia series), and Texas Instruments.

Designers also can experiment with even more highly parallel architectures that squeeze multiple DSP and compute engines onto a single piece of silicon. Over the last few years, more than two dozen companies have sprung up solely to exploit SIMD and multiple-instruction/multiple-data (MIMD) architectural implementations.

Many of these companies haven't been successful. But Cradle Technologies, Elixent, Freescale (whose reconfigurable compute fabric is based on the Morpho Technologies compute fabric), and a few other companies (e.g., Silicon Hive and ClearSpeed Technology) have demonstrated some exceptional compute throughputs of 30 to 50 GFLOPS.

While designers can craft their own DSP ASIC or adapt an algorithm to run on a commercial SIMD or MIMD solution, they also can implement a custom solution on an FPGA. Just about all of the latest-generation FPGAs include dedicated multiplier-accumulator blocks that can be configured into large compute arrays to tackle algorithms such as fast Fourier transforms (see "FPGA Flexibility Lights A Fire Under DSPs," again). Other filter schemes can exploit the large number of multiply-accumulate operations possible in the FPGA fabric.
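To see why filters map so naturally onto arrays of multiplier-accumulators, consider a direct-form FIR filter, where every output sample costs one MAC per tap. The sketch below models that in Python and estimates how many MAC blocks would be needed to sustain a given sample rate. The tap count, sample rate, and block clock are hypothetical, and the estimate assumes each block completes one MAC per clock.

```python
from typing import List, Sequence


def fir_filter(samples: Sequence[int], taps: Sequence[int]) -> List[int]:
    """Direct-form FIR filter: each output is one multiply-accumulate per tap."""
    history = [0] * len(taps)  # most recent sample first
    outputs = []
    for sample in samples:
        history = [sample] + history[:-1]
        acc = 0
        for coeff, value in zip(taps, history):
            acc += coeff * value  # one MAC
        outputs.append(acc)
    return outputs


def mac_blocks_needed(num_taps: int, sample_rate_hz: int, block_clock_hz: int) -> int:
    """Minimum MAC blocks to keep up, assuming one MAC per block per clock."""
    macs_per_second = num_taps * sample_rate_hz
    return -(-macs_per_second // block_clock_hz)  # ceiling division


# Impulse response check: feeding a unit impulse returns the taps themselves.
print(fir_filter([1, 0, 0, 0], [1, 2, 3]))

# Hypothetical case: 64 taps at 100 Msamples/s with blocks clocked at 400 MHz.
print(mac_blocks_needed(64, 100_000_000, 400_000_000))
```

Because the MACs within one output sample are independent multiplies feeding a sum, they can be spread across as many hardware blocks as the fabric provides.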

FPGAs can supplement the DSP chip, offloading the most compute-intensive portions of the algorithms. Designers could then use a lower-cost, lower-performance DSP as the main DSP in the system. Alternatively, the ability to embed soft processor cores on the FPGA logic fabric, as well as use embedded DSP resources, allows a single FPGA to replace the DSP chips and even a control-plane RISC processor.

NEED MORE INFORMATION?

3DSP Corp.
www.3dsp.com

AMI Semiconductor Inc.
www.amis.com

Analog Devices Inc.
www.analog.com

ARC International
www.arc.com

ARM Ltd.
www.arm.com

Audiocodes Ltd.
www.audiocodes.com

Berkeley Design Technology Inc.
www.bdti.com

CEVA-DSP
www.ceva-dsp.com

ClearSpeed Technology plc.
www.clearspeed.com

Cradle Technologies
www.cradle.com

DSP Architectures
www.dsparchitectures.com

DspFactory (now part of AMI Semiconductor Inc.)
www.amis.com

Elixent Ltd.
www.elixent.com

Eonic B.V.
www.eonic.com

Forward Concepts Inc.
www.fwdconcepts.com

Freescale Semiconductor Inc.
www.freescale.com

Hyperstone AG
www.hyperstone.com

Improv Systems Inc.
www.improvsys.com

Infineon Technologies AG
www.infineon.com

LSI Logic Inc.
www.lsil.com

MIPS Inc.
www.mips.com

Morpho Technologies Inc.
www.morpho.com

Octasic
www.octasic.com

Philips Semiconductors
www.semiconductors.philips.com

Renesas Technology Corp.
www.renesas.com

Silicon Hive
www.siliconhive.com

StarCore LLC
www.starcore-dsp.com/products

Stretch Inc.
www.stretchinc.com

Tensilica Inc.
www.tensilica.com

Texas Instruments Inc.
www.ti.com

Zoran Corp.
www.zoran.com