Harness Today's DSPs: Propel Tomorrow's Designs

Resource-rich configurable processors perform billions of operations/s to handle the most demanding DSP algorithms.

Dec. 18, 2003

Designers always crave greater computational throughput in their DSP applications. More throughput equals richer DSP functionality, whether it's performing more exacting calculations to deliver better filtering or imaging or handling multiple tasks to eliminate additional components. To eke out as much performance as possible, new DSP architectures are coming equipped with configurable arrays of compute engines and blocks of memory.

In contrast, many existing commodity DSP chips employ some form of Harvard or very-long-instruction-word (VLIW) architectures, resulting in a general-purpose fixed-architecture solution. Most of their performance comes from raw clock speed and the use of multiple multiplier-accumulators, which usually operate as a single-instruction/multiple-data (SIMD) compute array. But today's chips, with clock speeds of 600 MHz and higher, have reached a performance plateau of several gigaoperations per second (GOPS).

Fixed-architecture array processors, such as the vector-accelerator on the AltiVec PowerPC processor from Motorola and Intel's digital media processors (the MXP5800 and MXP5400), are another option. These devices are well suited to deal with large arrays of sequential data. Yet their fixed architectures don't offer the flexibility of configurable processor arrays, which allow software to control the processor interconnections and data flow. FPGAs, at the other end of the spectrum, offer total flexibility but at the expense of individual-element performance. (For more, DRILL DEEPER 6980 at www.elecdesign.com to see the "Configure Your Own Custom DSP Solution" sidebar.)

Over the last few years, the ability to create large arrays of processing elements via software has improved dramatically as designers employ more advanced process technologies. By tailoring the array resources to the algorithm via software, you'll see an aggregate compute throughput at least an order of magnitude higher than current Harvard and VLIW architectures—and at the same or even lower clock speeds. This will provide room for perhaps yet another order-of-magnitude increase in performance as clock rates rise.

Applications range widely for these configurable DSP chips: software-defined radios, flexible cellular basestations, antenna beam-forming control, and high-throughput image- and voice-processing systems. To facilitate their development and implementation, though, software tools must be easy to use and robust enough to handle complex algorithms.

In this emerging area, many companies, mostly on the smaller side, offer a broad choice of architectural approaches that deliver throughputs of 20 GOPS and higher without pushing silicon clock speeds beyond 500 MHz. Some of the offerings come as intellectual property (IP), which designers can incorporate into an ASIC solution. Also, a few companies have "standard" silicon products that designers can use as an OEM product or in a prototype for a "proof-of-concept" implementation before embedding the IP into a custom solution.

FLEXIBLE ARRAY PROCESSING

PACT XPP Technologies is one of the first to come up with a configurable array solution. The company dubbed its architecture the "extreme processing platform." Though PACT expects to license the technology to companies that want to embed it in a custom chip, its XPP64-A silicon architecture was developed to show the capabilities. Included on-chip are 64 ALUs/processing array elements (ALU-PAEs), 16 RAM-PAEs, four I/O interface ports, a configuration manager with a 1.4-Mbit cache memory that can hold several configurations, and built-in debugging support via a JTAG IEEE 1149 interface (Fig. 1).

The configuration manager is a specialized microcontroller that supervises the array's configuration. Its operating system manages the array resources and allows several configurations to be loaded onto the array. Configuration sequencing is performed on shared resources without deadlocks.

In all, about 51 million transistors are interconnected using six levels of copper. The combined throughput of the 64 ALU-PAEs hits 4096 million multiply-accumulates (MACs) per second when clocked at only 64 MHz, leaving plenty of room for improved performance as the clock frequency increases.

The application software is defined by dynamic reconfiguration of operations and connections within the processor array. This eliminates the overhead associated with program sequencers and decoding logic. Each ALU-PAE block contains an eight-by-eight array of compute elements. Every element is made up of three sub-blocks.

A two-input, two-output ALU performs the main computations. A Back register provides routing paths in the vertical direction and includes a simpler ALU that can be used for addition, barrel shifts, and normalization tasks. Finally, a Forward register also provides routing paths in the vertical direction. A specialized ALU in this register offers data-stream control, such as multiplexing and swapping.

The RAM-PAEs are similar to the ALU-PAEs, except that the main ALU is replaced by a dual-port 512-word by 24-bit storage array that can also double as a FIFO memory. A packet-based communications scheme is used between the ALU-PAEs and the RAM-PAEs. The RAM generates a data packet after an address packet is received at the read input. Writing to the RAM requires two packets, one with the address and the other containing the data word to be written.
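The packet exchange described above can be illustrated with a small software model. This is only a sketch: the class and method names are hypothetical, and it assumes the two write packets arrive in sequence (address first, then data), which the article does not specify. Only the 512-word by 24-bit size comes from the text.

```python
class RamPae:
    """Toy model of a RAM-PAE: 512 words of 24 bits, driven by packets."""

    WORDS = 512
    WORD_MASK = (1 << 24) - 1  # 24-bit data words

    def __init__(self):
        self.mem = [0] * self.WORDS
        self._pending_write_addr = None  # address packet awaiting its data packet

    def read_packet(self, addr):
        """An address packet at the read input yields one data packet."""
        return self.mem[addr % self.WORDS]

    def write_packet(self, value):
        """Writes take two packets (assumed order: address, then data word)."""
        if self._pending_write_addr is None:
            self._pending_write_addr = value % self.WORDS
            return
        self.mem[self._pending_write_addr] = value & self.WORD_MASK
        self._pending_write_addr = None


ram = RamPae()
ram.write_packet(7)          # address packet
ram.write_packet(0xABCDEF)   # data packet completes the write
data = ram.read_packet(7)    # address packet in, data packet out
```

A read costs one packet in and one out, while a write costs two packets in, which is why the read and write sides are modeled as separate entry points.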

In between the rows of ALU-PAEs are the data channels. These constitute a communications network that allows point-to-point and point-to-multipoint connections from outputs to inputs of ALU-PAEs, RAM-PAEs, and the I/O ports.

In one particular setup, PACT is partnering with QuickLogic Corp. to combine its configurable array with QuickLogic's QuickMIPS highly integrated system-on-a-chip platform. The result, an XPP prototyping platform, targets network infrastructure and digital consumer applications. Combining QuickMIPS and XPP delivers a high-performance and flexible system platform that can adapt to changing communication protocols and application demands.

Many other architectures developed by the roster of companies in this arena use a variation of the same basic architectural theme: All of the chips basically contain an array of compute engines, some data memory, and a control processor. The "magic" is in the way the blocks can be interconnected and controlled.

One approach, developed by Morpho Technologies and now licensed by Motorola, employs an array of reconfigurable DSP engines to form what it calls a reconfigurable compute fabric. The main building block of the MS1 architecture consists of 16 reconfigurable processors that are interconnected to a 16-bit datapath and a pipelined multiplier-accumulator. A context memory that can host from 32 to 512 context planes and a frame buffer holding up to 2 Mbytes are also part of each MS1 compute fabric.

Each processor cell contains an ALU, a MAC, and an optional complex correlator. The cells are coordinated by a 32-bit RISC processor that executes control algorithms developed by the company. These algorithms are available as part of the software libraries Morpho offers. Initially targeted at communications applications, the compute fabric can handle WCDMA, MPEG-4 encoding and decoding, and many other compute-intensive algorithms.

Designers at Motorola have taken the basic technology from Morpho and developed a commercial device, the MRC6011. Two on-chip compute blocks contain three reconfigurable-compute-fabric (RCF) cores each. (View a diagram of the MRC6011 as part of ED Online 6979 at www.elecdesign.com.) Each core, in turn, packs an array of 16 processor cells. To distinguish each core, a unique, software-accessible ID register is assigned to each RCF. This lets the software selectively assign calculations to a specific RCF.

In addition to the compute blocks, designers included an optimized RISC processor for efficient C-code compilation. Other peripherals include a two-channel input buffer; a large frame buffer with eight address generation units; a special-purpose complex correlation unit to support spreading, complex scrambling, and complex correlation on 8- and 4-bit samples; a direct-memory-access (DMA) controller; and a hardware reset capability for all RCF cores.

When internally clocked at 250 MHz, the MRC6011 delivers an aggregate peak performance of 24 Giga complex correlations/s, with a sample resolution of 8 bits for the I and Q inputs. If 4-bit samples are used, the throughput doubles to 48 Giga complex correlations/s. Such high throughputs will enable the processor to handle applications like baseband processing for 3G basestations, broadband wireless access systems, wireless LANs, and signal processing for advanced features (e.g., adaptive antennas and multi-user detection).
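The quoted peak figures are consistent with the cell counts given earlier, as a quick back-of-the-envelope check suggests. The sketch below assumes each processor cell completes one complex correlation per clock on 8-bit samples and two per clock on 4-bit samples; the article states only the totals, so those per-cell rates are an inference, and the function name is illustrative.

```python
CLOCK_HZ = 250e6          # internal clock quoted in the text
COMPUTE_BLOCKS = 2        # on-chip compute blocks
CORES_PER_BLOCK = 3       # RCF cores per compute block
CELLS_PER_CORE = 16       # processor cells per RCF core


def peak_correlations_per_s(correlations_per_cell_per_cycle):
    """Peak rate if every cell is busy on every clock cycle."""
    cells = COMPUTE_BLOCKS * CORES_PER_BLOCK * CELLS_PER_CORE
    return cells * correlations_per_cell_per_cycle * CLOCK_HZ


rate_8bit = peak_correlations_per_s(1)   # assumed: 1 per cell per cycle
rate_4bit = peak_correlations_per_s(2)   # assumed: 2 per cell per cycle
print(rate_8bit / 1e9, rate_4bit / 1e9)  # prints 24.0 48.0
```

Under these assumptions, 96 cells at 250 MHz account for the 24 Giga complex correlations/s figure, and the 4-bit mode for twice that.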

ADAPTIVE FLEXIBILITY

Another company demonstrating an adaptive compute architecture, QuickSilver Technology, crafted an architecture called an adaptive computing matrix (ACM). It allows hardware functions to share the ACM core both spatially and temporally. In this scheme, software representing the particular function (a DCT, an echo canceller, a Huffman decoder, etc.) is loaded and the function is configured in the ACM. The data to be processed is then run through the configured block. Afterward, new code can be loaded to change the function and new data run through, and so on.
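The load, run, and reload cycle can be pictured with a toy software analogy. Nothing here reflects QuickSilver's actual tools or programming model: the class, its methods, and the two placeholder functions are all hypothetical stand-ins for real blocks such as a DCT or a Huffman decoder.

```python
class AdaptiveFabric:
    """Toy stand-in for a reconfigurable core that hosts one function at a time."""

    def __init__(self):
        self._function = None

    def configure(self, function):
        # Loading new "code" replaces whatever function was configured before.
        self._function = function

    def run(self, samples):
        if self._function is None:
            raise RuntimeError("fabric has not been configured")
        return [self._function(s) for s in samples]


def scale_by_two(x):      # placeholder for a real block such as a DCT
    return 2 * x


def clip_to_byte(x):      # placeholder for a second, unrelated block
    return max(0, min(255, x))


fabric = AdaptiveFabric()
results = []
for function, data in [(scale_by_two, [1, 2, 3]), (clip_to_byte, [-5, 100, 300])]:
    fabric.configure(function)         # configure the function in the fabric
    results.append(fabric.run(data))   # then stream the data through it
```

The same fabric object serves both functions in turn, which is the temporal half of the sharing; spatial sharing would correspond to several such regions configured side by side.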

The ACM's basic architecture consists of an array of related but functionally distinct nodes clustered in groups of four. Within the group and from group to group, the nodes are interconnected by a scalable, homogeneous communications fabric. Four compute nodes reside in each cluster. First, an arithmetic node implements different linear, variable-width arithmetic functions, selectable on a clock-cycle-by-clock-cycle basis. A bit-manipulation node implements different variable-width bit-manipulation functions, also selectable on a cycle-by-cycle basis. A finite-state-machine node implements different high-speed complex state machines, also configurable on a cycle-by-cycle basis. Lastly, a scalar node implements different complex control sequences. Also, configurable I/O nodes on-chip can be used to implement different interfaces (such as buses) to tie the chip into an external system.

The array of nodes is highly scalable. But because the different nodes are optimized for different subfunctions, developing an overall performance number for the ACM isn't appropriate. Yet in one test implementation, designers were able to implement most of the critical functions for WCDMA and cdma2000 software-defined radios. Every 52 µs, the ACM "builds" the hardware, runs the application, and tears down the hardware, all under software control.

Employing more complex configurable blocks, designers at Cradle Technologies developed a shared-memory multiple-instruction/multiple-data compute subsystem that uses a single 32-bit address space for all register and memory elements. The processing subsystem contains four RISC-like processing engines (PEs), eight DSPs, and a memory-transfer engine (MTE) that incorporates four memory-transfer controllers (DMA engines for background data movement) (Fig. 2).

The processors are synchronized through the use of 32 semaphore registers within each grouping. One PE and two DSP blocks form a functional block known as a media stream processor (MSP). Thus, one full subsystem holds four MSPs plus the MTE, the caches, and a bus arbiter.

At a clock speed of 220 MHz, each DSP can deliver a throughput of 3530 million MACs/s, or 7 GOPS when processing 9-bit data. The DSP engine itself is a 32-bit processor with 128 registers and a local program memory of 312 20-bit instructions. The PE block is also a 32-bit processor with 16-bit instructions and 32 32-bit registers. The RISC-like instruction set consists of both integer and IEEE 754 floating-point instructions. In all, the Cradle chip can deliver an aggregate throughput of 11 GFLOPS and 28 GMACs/s (see "Scalable Compute System Cranks Out 11 GFLOPS," Electronic Design, June 23, p. 32).
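The per-DSP and aggregate MAC figures can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text (eight DSPs, 220 MHz, 3530 million MACs/s per DSP); the variable names are illustrative, and the floating-point figure is not derived here because the article gives no per-PE rate.

```python
DSP_MACS_PER_S = 3530e6   # per-DSP throughput quoted in the text
CLOCK_HZ = 220e6          # clock speed quoted in the text
NUM_DSPS = 8              # DSPs in one processing subsystem

# MACs each DSP must complete per clock cycle to reach the quoted rate.
macs_per_cycle = DSP_MACS_PER_S / CLOCK_HZ

# Aggregate MAC rate of all eight DSPs, in GMACs/s.
aggregate_gmacs = NUM_DSPS * DSP_MACS_PER_S / 1e9

print(round(macs_per_cycle, 2), round(aggregate_gmacs, 2))
```

This works out to roughly 16 MACs per cycle per DSP and a little over 28 GMACs/s for the eight DSPs together, in line with the aggregate figure quoted above.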

MANY APPROACHES FOR HIGH PERFORMANCE

In addition to the four basic architectures discussed so far, over a dozen more architectures are vying for a piece of the high-performance market. Silicon Hive, a spin-out of Philips Research, is developing synthesizable reconfigurable cores. The basic architecture is a hierarchical array, which at its lowest level consists of complex processing/storage elements (PSEs) that contain a number of register files, execution units, and local memory. Configurable interconnects within the elements allow the resources to be set up to resemble a VLIW processing engine or a flow-through computational datapath.

Silicon Hive has started sampling two implementations. One consists of a single cell that packs multiple PSEs and is aimed at processing frame data from channel and source codecs. The other implementation is a stream accelerator that includes an array of several cells, with each cell containing a single PSE. This version targets high sample-rate conversion applications, such as those found in 3G handsets and basestations, baseband processing in terrestrial and satellite radio, and other applications.

Taking a different approach to the reconfigurable solution, designers at MathStar crafted a field-programmable object array (FPOA). A silicon object is a 16-bit medium-grained function, such as an ALU, a multiplier-accumulator, a pattern-matching content-addressable memory, register files, and still other blocks. Each object has its own program and data memories and operates without the aid of global control. MathStar plans to implement the array in a 130-nm process that can clock at 1 GHz, yielding an aggregate compute throughput of tens to hundreds of gigaoperations/s.

The FPOA architecture creates an array made up of hundreds of individual objects that are loosely coupled using 16-bit datapaths and control buses. The objects can be independently configured, and multiple objects can be combined to form larger datapaths. The control path is bit-wise granular. Communication between objects is primarily nearest-neighbor, but the company has a proprietary "party-line" communications scheme that allows objects to communicate with more distant objects. The objects can change the communication patterns on a per-clock basis, which enables the array's function to change from clock to clock.

Able to deliver tens of billions of MACs per second, the PEP3G processing element from Cogent ChipWare employs a RISC-style instruction set with datapaths geared toward digital signal processing. When multiple PEP3G processors are combined in an array to form an Ivy Cluster, the combined processing power hits tens of gigaoperations per second. The Ivy Cluster array interconnection technology allows the company to efficiently create configurable array processors. It combines the best of SIMD and MIMD paradigms to achieve extremely high throughput levels.

Another programmable alternative is the DAP/DNA reconfigurable processor from IPFlex, which contains a massive array of 32-bit processors called the DNA matrix. The 144 processors in the matrix are dynamically reconfigurable in just one clock cycle and perform data-processing operations in parallel to achieve extremely high throughput levels. To control the matrix, the company uses a custom-developed RISC processor that runs at 100 MHz and can change the matrix configuration on a cycle-by-cycle basis.

Also entering the reconfigurable race is Elixent's engine, called the D-Fabrix array, along with a sample implementation called the DFA1000. The basic D-Fabrix array consists of an array of 4-bit ALUs, registers, and memory blocks that can be combined to support variable data word widths. The ALUs are positioned in the style of a chessboard, alternating with adjacent "switchboxes" that control the signal routing.
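One way to picture how 4-bit ALUs combine into wider words is a ripple-carry chain, sketched below. This is a generic textbook technique used for illustration, not a description of how D-Fabrix actually links its ALUs, which the article does not detail. The function names are hypothetical.

```python
def alu4_add(a, b, carry_in):
    """One 4-bit ALU slice: returns (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def wide_add(a, b, width_bits):
    """Add two unsigned words by chaining 4-bit slices through their carries."""
    if width_bits % 4:
        raise ValueError("width must be a multiple of 4 bits")
    result, carry = 0, 0
    for nibble in range(width_bits // 4):
        shift = 4 * nibble
        s, carry = alu4_add(a >> shift, b >> shift, carry)
        result |= s << shift
    return result, carry


sum16, carry16 = wide_add(0x1234, 0x0FFF, 16)  # four slices form a 16-bit adder
```

Changing `width_bits` to 8 or 32 reuses the same slice two or eight times, which is the sense in which small ALUs can support variable data word widths.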

The DFA1000 chip is a preconfigured version that includes various system peripherals to ease its integration into a host system. The DF1-1024 array can deliver sufficient throughput to perform UMTS Viterbi processing on about 1024 voice channels, perform an 8-by-8 DCT at 400 Mpixels/s or run JPEG encoding at 200 Mpixels/s (two arrays in parallel), or execute a fifth-order CIC filter at 400 Msamples/s.

THE LITTLE ENGINES THAT COULD

The dynamic instruction set processor from GateChange Technologies employs a 32-by-32 array of pipelined reconfigurable processing elements. With this processor, designers can dynamically tailor the architecture and instruction resources by creating optimal-length instruction words. The words are part of a virtual instruction set that's added to the instruction set of the on-chip ARM7TDMI controller. The virtual instructions can be of any word width, from a single bit to thousands of bits.

Each processing element is a small arithmetic unit that can perform an 8-bit logic operation or a 4-bit multiplication. To support the 32-by-32 array of processing elements, 32 blocks of SRAM (each 2 kwords by 8 bits) provide the local data storage for the computations. A test chip based on the architecture, the 2KL1024, implements the dynamic instruction set and the full 32-by-32 processor array. Four high-speed serial I/O ports supply additional data-transfer interfaces. Some of the applications in the line of sight for the dynamic instruction set processor include large database searches, compares, matching, or various security applications (e.g., encryption or decryption). Also targeted are biometric applications such as fingerprint, hand, or palmprint recognition, video processing, and so on.

Another attempt at highly configurable signal processing is the PicoArray from PicoChip. The scalable, multiprocessor baseband IC integrates hundreds of processing elements into a single array that can deliver a throughput of 30 GMACs. The PicoArray PC101 combines an array of 16-bit processors, each with its own arithmetic units, processing elements, and both program and data memories. The processors are programmed individually during device initialization. The company estimates that each 16-bit processor has control capability close to that of an ARM9 CPU and DSP performance close to that of a TI C54xx series device.

Although the PicoArray is reconfigurable, it's not meant for applications that require cycle-by-cycle updates. Rather, it's intended for applications in which a reconfiguration request may take place every few hours or days. The company developed extensive code libraries that handle many communications functions.

Two additional processors, one from ChipWrights and the other from Morphics, are more fixed-architecture vector engines. The ChipWrights approach employs eight parallel datapaths and a central serial datapath, as well as a four-bank on-chip memory that's interleaved on a 32-bit basis and shared between the various datapaths.

The eight parallel datapaths implement vector operations, and they all perform the same operation on different data (SIMD). Unlike traditional vector architectures, however, each datapath has its own register file. Thus, each can be envisioned as operating by itself. Then, programmers don't have to think in parallel to use the engines. Rather, they can just concentrate on one datapath at a time.
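The "one datapath at a time" programming style can be sketched in software. The example below is an analogy only: the kernel, the register names `acc` and `gain`, and the dictionary used as a register file are all hypothetical, and ChipWrights' actual instruction set and tools are not shown in the article. Only the count of eight datapaths comes from the text.

```python
NUM_LANES = 8  # eight parallel datapaths, per the text


def lane_kernel(registers, sample):
    """Code written for a single datapath: accumulate a scaled sample."""
    registers["acc"] += registers["gain"] * sample
    return registers["acc"]


def simd_step(lane_registers, samples):
    """Apply the same kernel to every lane, each with its own register file."""
    return [lane_kernel(regs, x) for regs, x in zip(lane_registers, samples)]


# Each lane keeps private state, so the kernel never refers to other lanes.
lanes = [{"acc": 0, "gain": lane + 1} for lane in range(NUM_LANES)]
first = simd_step(lanes, [1] * NUM_LANES)
second = simd_step(lanes, [1] * NUM_LANES)
```

The programmer writes and reasons about `lane_kernel` alone; the replication across eight lanes is mechanical because no lane's registers are visible to another.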

The Morphics approach uses a programmable distributed dataflow architecture optimized for 3G baseband processing. Though it achieves a high throughput, its fixed architecture limits the flexibility. The first chip from the company performs all baseband receive and transmit channel processing required on a channel card between the digital antenna interface and the channel codec function, for up to 64 mobile phone lines. A control processor is used alongside the 3G-BP chip on the channel card. It performs the network termination and hosts the layer 1 software that manages the processing resources on the 3G-BP.

See associated web-only figure

Need More Information?

ChipWrights
John Redford (617) 928-0100
www.chipwrights.com

Cogent ChipWare Inc.
Richard Hobson (604) 291-8395
www.cogentchipware.com

Context Inc.
John Scheiwe (352) 343-0661
www.contextdrl.com

Cradle Technologies Inc.
Phil Casini (408) 210-3600
www.cradle.com

Elixent Ltd.
Tony Stansfield (44) 117-917-5770
www.elixent.com

Forward Concepts Inc.
Will Strauss (480) 968-3759
www.fwdconcepts.com

GateChange Technologies
Ren Jenkings (610) 419-4700
www.gatechange.com

IBM Corp.
www.ibm.com

Intel Corp.
www.intel.com/go/imageprocessing

IPFlex Inc.
Jun Nakai (81) 3-5436-3861
www.ipflex.com

MathStar Inc.
Timothy Rhodes (952) 746-2225
www.mathstar.com

Morphics Technology Inc.
Ravi Sabramanian (408) 369-7227
www.morphics.com

Morpho Technologies
Todd Nash (949) 475-0626
www.morphotech.com

Motorola Inc.
www.motorola.com

PACT XPP Technologies Inc.
Ron Mabry (408) 392-3756
www.pactcorp.com

PicoChip Designs Ltd.
Rodger Sykes 44 (0) 1225-469744
www.picochip.com

QuickSilver Technology Inc.
Ralph Haines (408) 574-3300
www.quicksilver.com

Silicon Hive
etra Doelman (31) 4027-42533
www.siliconhive.com

Synergetic Computing Systems
www.synputer.com

About the Author

Dave Bursky

Technologist

Dave Bursky is the founder of New Ideas in Communications, a publication website featuring the blog column Chipnastics – the Art and Science of Chip Design. He is also president of PRN Engineering, a technical writing and market consulting company. Prior to these organizations, he spent about a dozen years as a contributing editor to Chip Design magazine. Concurrent with Chip Design, he was also the technical editorial manager at Maxim Integrated Products, and prior to Maxim, Dave spent over 35 years working as an engineer for the U.S. Army Electronics Command and an editor with Electronic Design Magazine.