Many computationally intensive DSP applications can take advantage of a specialized compute array tuned for the task at hand. Such an array can run rings around a commercial DSP chip, or it can supplement the DSP by handling the time-critical computations.
Such solutions, of course, can readily be implemented as part of a custom-designed, high-volume chip. But when volume doesn't justify an ASIC implementation, FPGAs offer flexible hardware-based alternatives to accelerate complex algorithms.
All FPGA suppliers, including Actel, Altera, Lattice, QuickLogic, and Xilinx, offer libraries of DSP functions that can be configured on the FPGA logic fabric. The most popular library element, the multiplier or multiplier-accumulator (MAC), appears in almost every DSP algorithm. A MAC implemented with configurable logic cells typically tops out at clock rates of 100 to 150 MHz (up to 150 million single-cycle 8- or 16-bit multiply-accumulates/s).
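To make the MAC operation concrete, here is a minimal behavioral sketch in Python of a 16-bit multiply-accumulate and the dot product built from it. The helper names (`to_signed`, `mac`, `dot_product`) and the 40-bit accumulator width are assumptions chosen for illustration; they do not describe any vendor's library element.

```python
def to_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(acc, a, b, operand_bits=16, acc_bits=40):
    """One multiply-accumulate step: acc + a * b.

    Operands are treated as signed operand_bits-wide values. The
    accumulator width (40 bits here) is an assumed figure; it leaves
    guard bits above the 32-bit product so sums can grow.
    """
    a = to_signed(a, operand_bits)
    b = to_signed(b, operand_bits)
    return to_signed(acc + a * b, acc_bits)


def dot_product(samples, coeffs):
    """Accumulate sample * coefficient products, one MAC per 'cycle'."""
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = mac(acc, x, c)
    return acc
```

In hardware, each pass through the loop corresponds to one clock cycle of a single MAC, which is why the MAC's clock rate sets the throughput of filters, correlators, and transforms built this way.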
With the large number of gates available on the latest generations of FPGAs, multiple MACs can be implemented on one chip. So designers can create a computational array that makes short work of large problems. Still, such arrays are very area-inefficient compared to dedicated MACs. That's why Altera, QuickLogic, and Xilinx have all embedded dedicated, DSP-optimized, computational blocks on various FPGA families.
At the high end, Altera's Stratix II family includes as many as 96 DSP blocks on its largest FPGA, the EP2S180. Each block can implement eight full-precision 9- by 9-bit multipliers, four 18- by 18-bit multipliers, or one 36- by 36-bit multiplier. Therefore, the chip can provide a total of 96 36-bit, 384 18-bit, or 768 9-bit multipliers that run at up to 450 MHz (Fig. a).
When clocked at full speed, the 18-bit multipliers deliver an aggregate throughput of over 172 GMACs/s (384 multipliers at 450 MHz). For 9-bit multipliers, the throughput reaches about 345 GMACs/s. That's the highest level achieved for an off-the-shelf solution.
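The aggregate figures follow from multiplying the number of multipliers by the clock rate. The short Python sketch below works through that arithmetic using the block count and clock speed quoted for the EP2S180; the function name `aggregate_gmacs` is hypothetical, and the calculation assumes every multiplier completes one operation on every clock, so it gives a peak rather than a sustained rate.

```python
def aggregate_gmacs(num_multipliers, clock_mhz):
    """Peak throughput in GMACs/s, assuming one MAC per multiplier per clock."""
    return num_multipliers * clock_mhz / 1000.0


DSP_BLOCKS = 96   # DSP blocks on the largest device, per the text
CLOCK_MHZ = 450   # maximum multiplier clock rate, per the text

# Multipliers available per DSP block at each operand width.
for width, per_block in ((36, 1), (18, 4), (9, 8)):
    count = DSP_BLOCKS * per_block
    rate = aggregate_gmacs(count, CLOCK_MHZ)
    print(f"{width}-bit: {count} multipliers, {rate:.1f} GMACs/s peak")
```

The same function applies to any of the devices discussed here, given their multiplier counts and clock rates.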
The blocks include shift registers to better implement DSP functions such as finite and infinite impulse response filters and other algorithms. Four main operating modes are possible with the DSP block: multiplier, multiplier-accumulator, two multipliers and an adder, or four multipliers and an adder.
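As a rough illustration of how a shift register and the four-multipliers-and-an-adder mode combine into a finite impulse response filter, consider the behavioral Python sketch below. It is a software model only: the class name `Fir4` and its methods are hypothetical, word widths and pipelining are not modeled, and it does not reflect how the vendor tools configure the block.

```python
from collections import deque


class Fir4:
    """Four-tap FIR: a shift register feeding four multipliers and an adder."""

    def __init__(self, coeffs):
        if len(coeffs) != 4:
            raise ValueError("this sketch models exactly four taps")
        self.coeffs = list(coeffs)
        # Shift register holding the four most recent samples, newest first.
        self.taps = deque([0] * 4, maxlen=4)

    def step(self, sample):
        """Shift in one sample and return one filter output."""
        # With maxlen set, appendleft drops the oldest sample off the right.
        self.taps.appendleft(sample)
        # Four multiplications that would run in parallel in hardware.
        products = [c * x for c, x in zip(self.coeffs, self.taps)]
        # Two-level adder tree summing the four products.
        return (products[0] + products[1]) + (products[2] + products[3])


def filter_samples(coeffs, samples):
    """Run a sequence of samples through a fresh four-tap filter."""
    fir = Fir4(coeffs)
    return [fir.step(s) for s in samples]
```

Feeding the model a unit impulse, for example `filter_samples([1, 2, 3, 4], [1, 0, 0, 0, 0])`, returns the coefficients in order followed by zero, which is a quick sanity check on the tap ordering. Longer filters would chain several such groups and sum their outputs.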
The Xilinx Virtex-4 family incorporates XtremeDSP building blocks in all three platform FPGA families: the LX, SX, and FX. The SX platforms contain the largest number of 18-bit multipliers, with as many as 512 on the XC4VSX55. Each XtremeDSP block can implement an 18- by 18-bit two's-complement multiplier running at 500 MHz while consuming just 57 µW/MHz, or about 15% of the power consumed by the company's previous-generation DSP blocks (Fig. b).
With all 512 multipliers running at 500 MHz, the aggregate throughput of the XtremeDSP blocks hits 256 GMACs/s. That meets the needs of many compute-intensive applications. The blocks support over 40 dynamically controlled operation modes, including multiplier, multiplier-accumulator, multiplier-adder/subtractor, three-input adder, barrel shifter, wide bus multiplexer, and wide counter.
Dedicated multiplier blocks also can be found on Altera's Stratix and Cyclone II families of FPGAs; QuickLogic's QL6250 Eclipse-E FPGA; and Xilinx's Virtex II, IIPro, and Spartan series. Clock speeds for the 88 18-bit Stratix series multipliers can hit 333 MHz. For the largest Cyclone II FPGA, 150 18-bit (or 300 9-bit) multipliers can operate at up to 250 MHz.
In the Virtex II and IIPro families, the 444 18-bit multipliers can run at more than 300 MHz. Speeds can hit 325 MHz for the maximum of 36 18-bit multipliers in the Spartan series. The Eclipse-E FPGA's 10 8-bit MACs aren't quite as fast, topping out at 100 MHz.
Stretch Inc. offers a new approach to compute engines. The company combined a 32-bit Tensilica processor core with programmable logic (the equivalent of 300k to 500k ASIC gates) tightly integrated into the datapath. With the programmable logic in place, the processor's instruction set can be extended with application-specific instructions, implemented in hardware, that execute DSP algorithms more efficiently (Fig. c).
Up to 64 16-bit multipliers and 400 16-bit ALUs can be simultaneously implemented in the programmable fabric. Designers can isolate the computational kernels into C-language functions. Then, the Stretch compilers reduce the body of the C-functions to a configuration pattern for the logic fabric. The compiler replaces the function call in the executable file with a new instruction, then maps the function's parameters to a 128-bit-wide register file.
The programmable fabric can be reloaded with different configuration patterns hundreds to thousands of times per second while the processor runs the application code. This lets the processor execute multiple complex instructions that speed the execution of performance-critical DSP algorithms.