Reconfigurable Architectures Chart A New Course For DSPs

The latest crop of DSP cores promises flexibility to cope with the changing requirements of evolving third- and fourth-generation wireless standards.

Ashok Bindra

Aug. 5, 2002



Ever-increasing computational complexity and evolving wireless and consumer standards are forcing developers of monolithic digital signal processing (DSP) chips to look beyond the traditional fixed architectures that have served the industry adequately for over a quarter of a century. While the fixed architectures continue to progress in capability, the MIPS and MOPS appetite of the emerging 3G and 4G wireless applications is soaring faster than Moore's Law.

Hence, as conventional fixed DSPs run out of steam, they're being combined with expensive application-specific ICs (ASICs) to supply the additional processing horsepower needed. Also, in many cases, multiple fixed DSP cores are being packed in parallel on one chip to address the huge processing requirements of these applications. Besides adding further cost and consuming more power, these parallel processing chips are difficult to program.

To overcome these limitations and offer a flexible, cost-effective solution, many new entrants to the DSP market are extolling the virtues of configurable and reconfigurable DSP designs. This latest breed of DSP architectures promises greater flexibility to quickly adapt to numerous and fast-changing standards. Plus, they claim to achieve higher performance without adding silicon area, cost, design time, or power consumption. In essence, because the architecture isn't rigid, the reconfigurable DSP lets the developer tailor the hardware for a specific task, achieving the right size and cost for the target application. Moreover, the same platform can be reused for other applications.

Because development tools are a critical part of this solution—in fact, they're true enablers—the newcomers also ensure that the tools are robust and tightly linked to the devices' flexible architectures. While providing an intuitive, integrated development environment for the designers, the manufacturers ensure affordability as well.

Reconfiguring The Architecture: Some of the new configurable DSP architectures are reconfigurable too—that is, developers can modify their landscape on the fly, depending on the incoming data stream. This capability permits dynamic reconfigurability of the architecture as demanded by the application. Proponents of such chips are proclaiming an era of "chip-on-demand," wherein new algorithms can be accommodated on-chip in real time via software. This eliminates the cumbersome job of fitting the latest algorithms and protocols into existing rigid hardware.

Toward that end, Chameleon Systems had developed a reconfigurable communications processor (RCP) that could be reconfigured for different processing algorithms in one clock cycle. But a softening in the 3G marketplace, combined with the difficulty of using new tools and programming a new architecture, forced Chameleon to rethink its reconfigurable DSP strategy.

Chameleon designers are revising the architecture to create a chip that can address a much broader range of applications. Plus, the supplier is preparing a new, more user-friendly suite of tools for traditional DSP designers. Thus, the company is dropping the term reconfigurability for the new architecture and going with a more traditional name, the streaming data processor (SDP).

Though the SDP will include a reconfigurable processing fabric, it will be substantially altered, the company says. Unlike the older RCP, the new chip won't have the ARM RISC core, and it will support a much higher clock rate. Additionally, it will be implemented in a 0.13-µm CMOS process to meet the signal processing needs of a much broader market. Further details await the release of SDP sometime in the first quarter of 2003.

While Chameleon is in the redesign mode, QuickSilver Technologies is in the test mode. This proponent of reconfigurability, which prefers to call its architecture an adaptive computing machine or ACM, has realized its first silicon test chip. According to vice president of marketing John Watson, the chip demonstrates that the ACM hardware can adapt dynamically to the software.

In fact, the tests indicate that it outperforms a hardwired, fixed-function ASIC in processing compute-intensive cdma2000 algorithms, like system acquisition, rake finger, and set maintenance. For example, the ASIC's nominal time for searching 2¹⁵ phase offsets in a basic multipath search algorithm is 3.4 seconds. The ACM test chip took just one second at a 25-MHz clock speed to perform the same number of searches in a cdma2000 handset. Likewise, the device accomplishes over 57,000 adaptations per second in rake-finger operation to cycle through all operations in this application every 52 µs (Fig. 1). In the set-maintenance application, the chip is almost three times faster than an ASIC, claims QuickSilver.
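A multipath search of this kind amounts to correlating the received chips against a locally generated pseudo-noise (PN) code at every candidate phase offset and keeping the strongest match. The Python sketch below illustrates the idea on a toy 31-chip sequence rather than the 2¹⁵ offsets cited above. The function names, the LFSR recurrence, and the noise-free channel are assumptions made for illustration; they do not describe QuickSilver's or any ASIC's implementation.

```python
def pn_sequence(length=31):
    """Generate +/-1 chips from a 5-stage LFSR (recurrence a[n] = a[n-3] XOR a[n-5])."""
    bits = [1, 0, 0, 0, 0]  # illustrative nonzero seed
    while len(bits) < length:
        bits.append(bits[-3] ^ bits[-5])
    return [1 - 2 * b for b in bits[:length]]  # map bit 0 -> +1, bit 1 -> -1


def delay(chips, offset):
    """Circularly delay a chip sequence, standing in for an unknown propagation delay."""
    n = len(chips)
    return [chips[(i - offset) % n] for i in range(n)]


def search_offsets(received, reference):
    """Correlate at every phase offset; return (best_offset, peak_correlation)."""
    n = len(reference)
    best_offset, best_corr = 0, float("-inf")
    for offset in range(n):
        corr = sum(received[i] * reference[(i - offset) % n] for i in range(n))
        if corr > best_corr:
            best_offset, best_corr = offset, corr
    return best_offset, best_corr


if __name__ == "__main__":
    reference = pn_sequence()
    received = delay(reference, 7)              # hypothetical 7-chip delay
    print(search_offsets(received, reference))  # (7, 31)
```

A real searcher repeats this over far more offsets and over noisy samples. Taking the figures quoted above at face value, 2¹⁵ offsets in one second at 25 MHz works out to roughly 760 clock cycles per offset.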

To demonstrate the adaptability of the architecture, each test algorithm was downloaded via software to the ACM fabric in real time. In this approach, the algorithm is broken into thousands of subsets, which are then dynamically mapped onto the hardware on the fly, Watson says. "This on-demand hardware results in the most efficient use of hardware in terms of cost, size [real estate], performance, and power consumption," he asserts. Although QuickSilver claims the ACM chip offers substantial improvement in power consumption over fixed architectures, the initial device has yet to be tested for power consumption.

Each computation unit inside the ACM engine is called a node. The test chip uses four ACM nodes, and the ACM chip employs a high-level derivative of C++ for programming. Details of the underlying ACM architecture aren't expected to be revealed until sometime next year.

Although QuickSilver's ACM test chip is based on 0.18-µm CMOS, the final commercial version, slated for release in the first quarter of 2003, is expected to be implemented in TSMC's 0.13-µm CMOS process. Also, the commercial ACM version will host 32 nodes on-chip. QuickSilver plans to license its ACM technology to other users.

Meanwhile, Morpho Technologies is in quiet mode. To fill the void left by FPGAs and ASICs, the supplier has developed a reconfigurable single-instruction, multiple-data (SIMD) DSP array, labeled reconfigurable DSP, or simply rDSP. According to Morpho, rDSP will enable 3G handsets to adapt on the fly regardless of native standards. At present, Morpho won't divulge any details on the technology.

Chameleon, QuickSilver, and Morpho aren't the only participants in this race. Other upstarts touting reconfigurable capabilities include Elixent, a spinoff of Hewlett-Packard (HP); PACT Corp.; and RadioScape.

Elixent Ltd. has crafted a reconfigurable signal processing (RSP) technology for various compute-intensive mobile handsets and infrastructure systems, as well as digital imaging. The underlying RSP technology was first developed at HP's Research Labs for printers. Elixent has polished the technology to extend its reach to other embedded applications.

The key element in the RSP is a reconfigurable array of 4-bit arithmetic logic units (ALUs), registers, and embedded RAM, connected via switch boxes to a routing network (Fig. 2). Each ALU can be configured to send data to or receive data from any of its eight surrounding ALUs, a scheme that facilitates flexible interconnectivity. The ALUs in this structure, called a D-fabrix processing array, can be programmed statically or dynamically via 4-bit instructions.
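One way to see how narrow ALUs can serve wider data is to chain them through their carries. The Python sketch below models a 16-bit addition built from four 4-bit slices. The carry-chaining scheme and the names `alu4_add` and `add16` are illustrative assumptions, not a description of how Elixent's array is actually wired.

```python
def alu4_add(a, b, carry_in):
    """Model one 4-bit ALU slice: return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def add16(x, y):
    """Add two 16-bit values by chaining four hypothetical 4-bit slices."""
    result, carry = 0, 0
    for nibble in range(4):
        shift = 4 * nibble
        a = (x >> shift) & 0xF
        b = (y >> shift) & 0xF
        s, carry = alu4_add(a, b, carry)  # carry ripples to the next slice
        result |= s << shift
    return result, carry


if __name__ == "__main__":
    total, carry_out = add16(0x1234, 0x0FFF)
    print(hex(total), carry_out)  # 0x2233 0
```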

Networks of statically programmed ALUs can be configured into synchronous signal processing pipelines that can keep hundreds or even thousands of ALUs busy on each cycle. As a result, this structure yields massive instruction-level parallelism, generating a peak performance of up to 400 16-bit MOPS at 100 MHz for a 16-ALU array.

Because the reconfiguration time is on the order of tens of microseconds, the architecture permits a massive level of silicon reuse. "It's a fine-grained architecture wherein algorithms map efficiently into the architecture," notes Mike Buchanan, vice president of marketing at Elixent. As the structure can be changed to suit the task at hand, it permits the use of different algorithms at different times.

The first implementation of Elixent's D-fabrix processing array, the DFA1000 RSP, is scalable, allowing 128 to 2048 arrays. The architecture also includes a set of peripherals to facilitate its integration into system-on-a-chip (SoC) designs. Peripherals like local high-speed RAM (directly accessible by the D-fabrix array), two high-speed data I/O ports (each 32 bits wide), and the 32-bit AMBA bus interface are available. The I/O ports can be configured as two 16-bit ports for audio applications or four 8-bit ports for imaging use.

Simulation results indicate that a DFA1000 with a 1024-ALU array can deliver 40.8 billion operations per second (BOPS) at a 150-MHz clock and 16-bit operations.
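The peak-throughput figures quoted for the D-fabrix array can be sanity-checked with a simple model. The sketch below assumes that a 16-bit operation occupies four 4-bit ALUs and that every such group completes one operation per clock. Both the model and the function name `peak_mops` are assumptions for illustration, not Elixent's own accounting.

```python
def peak_mops(num_alus, clock_mhz, alu_bits=4, op_bits=16):
    """Peak millions of operations per second under the assumed model."""
    alus_per_op = -(-op_bits // alu_bits)     # ceiling division
    ops_per_cycle = num_alus // alus_per_op   # whole groups only
    return ops_per_cycle * clock_mhz


if __name__ == "__main__":
    print(peak_mops(16, 100))    # 400
    print(peak_mops(1024, 150))  # 38400
```

Under these assumptions the 16-ALU case reproduces the 400-MOPS figure. The 1024-ALU case comes out at 38.4 BOPS, a little under the 40.8 BOPS reported from simulation, so the simulated figure presumably counts something this simple model leaves out.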

To demonstrate the methodology, Elixent has also readied test silicon with a 512-ALU array. For evaluation purposes, it developed a demo board for the 0.18-µm CMOS-based chip. However, Elixent is following the business model of DSP IP suppliers and is only interested in licensing its IP to users.

Realizing that the tools are key to the success of the architecture, Elixent aligned with several third parties to create a C-based tool chain that addresses both behavioral and functional input levels. Included is a place-and-route stage that's analogous to resource allocation in a very-long-instruction-word (VLIW) processor. Hence, the place-and-route tool allocates the array resources to the functions within the algorithm. Because the granularity of the RSP is much finer, it allows a far better match between algorithms and resources, explains Buchanan.

Likewise, Germany's PACT Corp. has crafted a reconfigurable array of processing elements in an extremely parallel format so that the chip can be configured on the fly to perform in several classical ways. These include single-threaded pipeline processing, multithreading, multitasking, and multiprocessor DSP operations.

The first derivative of PACT's eXtreme processor platform (XPP) consists of 128 32-bit processing array elements (PAEs) that offer over 50-BOPS performance. The PAEs and corresponding I/Os are segmented into two processing array cluster (PAC) blocks. A supervising configuration manager (SCM) directs configuration handling through local configuration managers, which in turn provide the necessary connectivity and processing for an algorithm. The flow of data packets is handled within the PACs. While the proof of concept was demonstrated in 0.25-µm CMOS, the XPU128 will be implemented in both 0.18- and 0.15-µm design rules.

PACT believes that XPP takes DSPs to the next level in performance. The technology can be combined with a traditional DSP core in SoC solutions for 3G and 4G wireless basestations, as well as other compute-intensive high-bandwidth applications, like media streams, data mining, simulation, and CAD. Consequently, PACT plans to license XPP IP for use as an algorithmic coprocessor for leading RISC CPU and DSP cores employed in SoCs.

Because XPP is designed to offer both modularity and scalability, PACT designers hope to integrate over 1000 PAEs on one die before the end of this decade. Right now, PACT is readying a version with more than 400 BOPS. Concurrently, the company has created an integrated XPP development suite, which includes a compiler and mapper for the native mapping language (NML), a simulator, and an interactive visualization and debugging tool.

By their very nature, FPGAs also lend themselves to reconfigurability. Exploiting this feature, FPGA makers like Xilinx have launched a major initiative to broaden the devices' role in DSPs. But due to size, power consumption, and cost, the FPGAs are primarily targeting basestation applications where quantities required are limited.

User Configurability: While some developers see the need for dynamic reconfiguration of the DSP architecture, others have adopted a user-configurable strategy. Unlike on-the-fly reconfiguration, configurable designs offer a way to extend the instruction set, scale the computational units up or down, and modify memory resources to generate a custom processor. With the right tools, the configured DSP can be quickly optimized for specific tasks. In short, the user can tailor the hardware for the end application. But once the architecture is configured at the time of design and implemented in silicon, it is fixed.

Many suppliers of configurable DSPs exist, such as Improv Systems, 3DSP, RadioScape, and Adelante Technologies. Like those in the reconfigurable camp, these players have adopted a licensing business model. They aim to provide their respective IPs to semiconductor makers and suppliers of SoC chips. So, the configurable cores aren't here to displace conventional fixed-DSP chips, but rather to complement them as coprocessors or accelerators in system-level solutions.

Improv Systems has developed a programmable system architecture (PSA) based on its configurable VLIW core, Jazz. A key feature of Jazz is that it permits a designer to add custom instructions or execution units via the PSA composer toolsuite. In essence, the Jazz processor incorporates a collection of single-cycle computation units (CUs) that perform a specific set of opcodes. Each predefined CU comprises a 32-bit ALU, a 32- by 32-bit multiplier with 64-bit accumulator, a 64-bit shifter, a 16-bit counter, and a byte swap unit. The hardware task queue functions enable rapid queuing and context switching, while the memory interface unit (MIU) includes multiple ports to read and write memory (Fig. 3).

Because many applications need flexibility and control, VLIW devices have become the choice processors. VLIW provides parallel execution of operations with the degree of parallelism determined at compile time, says Victor Berman, Improv's director of marketing. Configurable VLIW processors also offer advantages in the area of memory accesses by allowing the number of memory interface units (concurrent memory operations) to be configured. With appropriate tool support, this can be accomplished without additional logic design, and processor performance can be significantly increased through a simple configuration change, he continues.

Another significant configuration option is the ability to add parallel datapath elements, like ALUs and multiply-accumulate cells (MACs). In many signal-processing applications, this capability alone can significantly increase performance. For these applications, further performance can be gained by simply modifying the processor configuration with additional parallel MAC units. Beyond performance gains, this type of modification to the processor's configuration has the advantage of not requiring logic design for a new operation.
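The effect of adding parallel datapath elements can be illustrated with a toy compile-time scheduler that packs operations into VLIW bundles, one bundle per cycle. This is a minimal sketch under stated assumptions: single-cycle operations, a greedy in-order packing policy, and hypothetical names (`schedule`, the `"mac"` and `"alu"` unit types). It is not Improv's compiler or the Jazz instruction format.

```python
def schedule(ops, units):
    """Pack ops into bundles.

    ops:   list of (name, unit_type, deps) in program order.
    units: dict mapping unit_type to the number of units configured.
    Returns a list of bundles, each a list of op names issued in one cycle.
    """
    done_cycle = {}   # op name -> cycle in which it was issued
    bundles = []
    pending = list(ops)
    while pending:
        cycle = len(bundles)
        free = dict(units)
        bundle, remaining = [], []
        for name, unit, deps in pending:
            # A result is usable only in a cycle after the one that produced it.
            ready = all(d in done_cycle and done_cycle[d] < cycle for d in deps)
            if ready and free.get(unit, 0) > 0:
                free[unit] -= 1
                bundle.append(name)
                done_cycle[name] = cycle
            else:
                remaining.append((name, unit, deps))
        if not bundle:
            raise ValueError("no operation can issue; check units and dependencies")
        bundles.append(bundle)
        pending = remaining
    return bundles


# Four independent MACs whose results are summed by a small adder tree.
OPS = [
    ("m0", "mac", []), ("m1", "mac", []), ("m2", "mac", []), ("m3", "mac", []),
    ("a0", "alu", ["m0", "m1"]), ("a1", "alu", ["m2", "m3"]),
    ("a2", "alu", ["a0", "a1"]),
]

if __name__ == "__main__":
    print(len(schedule(OPS, {"mac": 1, "alu": 1})))  # 6 cycles
    print(len(schedule(OPS, {"mac": 4, "alu": 1})))  # 4 cycles
```

In this toy case, moving from one MAC unit to four shortens the schedule from six cycles to four without defining any new operation, which is the kind of gain a configuration change alone can deliver.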

When running the telecommunications suite of EEMBC, the Jazz processor scored an overall telemark of 8.0, compared to 6.8 for the nearest competitor. That means Jazz is 20% faster than any other processor that has been benchmarked with the EEMBC suite, says Improv. The Jazz PSA platform is fully supported by a Java-based development system with an advanced compilation tool.

The compilation system provides partitioning, memory allocation, code generation, and optimization for a system-level solution. Based on this platform, Improv has developed a complete solutions kit, called Crescendo, for broadcast and mobile media applications. It contains all optimized hardware and software required, including a reference design and source code, to enable a customized solution.

To expand its portfolio into the DSP arena, licensable software solutions provider RadioScape has acquired U.K.-based DSP developer Systolix Ltd. Together, the partners have developed a synthesizable DSP core that's configurable and scalable. Employing a multiprocessor architecture, RadioScape's PulseDSP is based on an array of bit-serial MACs, which can be configured via a microprocessor.

The architecture is called systolic because each cell processes data and passes the results on to the next cell in synchronization with a common clock. Also, the architecture permits hundreds of MACs to be configured as a field-programmable processor array (FPPA) to process data concurrently (Fig. 4). Consequently, PulseDSP's performance is up to 200 gigaMACS at 16-bit operations when implemented in 0.15-µm CMOS, according to RadioScape.
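The systolic idea, in which every cell does a multiply-accumulate and hands its result to its neighbor on each clock tick, can be modeled in a few lines. The Python sketch below arranges MAC cells as a transposed-form FIR filter. It is a word-level simplification with the input sample broadcast to all cells; the class and function names are hypothetical, and it does not model PulseDSP's bit-serial cells or its configuration mechanism.

```python
class MacCell:
    """One cell: a fixed coefficient plus a registered partial sum."""

    def __init__(self, coeff):
        self.coeff = coeff
        self.acc = 0  # value the downstream neighbor sees on the next clock


def clock_array(cells, sample):
    """Advance all cells by one clock tick and return the last cell's output."""
    # Snapshot the upstream registers first so all cells update in lockstep.
    upstream = [0] + [cell.acc for cell in cells[:-1]]
    for cell, partial in zip(cells, upstream):
        cell.acc = cell.coeff * sample + partial  # multiply-accumulate
    return cells[-1].acc


def systolic_fir(coeffs, samples):
    """Filter samples through a chain of MAC cells, one sample per clock."""
    cells = [MacCell(c) for c in reversed(coeffs)]
    return [clock_array(cells, x) for x in samples]


if __name__ == "__main__":
    # An impulse in reproduces the coefficients at the output.
    print(systolic_fir([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```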

Targeting next-generation basestations and multimode wireless handsets, the PulseDSP is supported by a system design tool flow that generates accurate hardware description language (HDL) and C models of the complete core.

Also making gains on this front are 3DSP and Adelante Technologies. While 3DSP continues to complement its configurable DSP cores with optimized applications software, it also has crafted configurable and extendible DSP cores for different wireless handsets and speech processing subsystems.

Adelante's Saturn DSP core offers a dual Harvard architecture that includes two 16-bit multipliers, four ALUs, two address calculation units, a barrel shifter, a program control unit, a hardware loop control unit, a saturation and shift unit, and a bit-manipulation unit. It executes 420 million MACs per second at a 210-MHz clock, consuming only 0.25 mW/MHz and occupying 0.5 mm² of silicon.

The chip's instructions are optimized for execution of wireless and speech applications. To further exploit the core's resources, the architecture permits an additional 256 application-specific 96-bit VLIW instructions. Furthermore, the core hardware can be expanded with application-specific execution units to accelerate repetitive signal processing tasks like Viterbi butterflies or FFTs. As part of its subsystem strategy, Adelante also furnishes application-specific coprocessors with tight links to the Saturn core.
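To show why such kernels are natural targets for dedicated execution units, the sketch below writes out the two inner operations in Python: the add-compare-select step at the heart of a Viterbi butterfly and a radix-2 FFT butterfly. Each is only a handful of arithmetic operations, but it is repeated a very large number of times per block of data. The function names and signatures are illustrative assumptions and do not reflect Adelante's instruction set or execution units.

```python
def add_compare_select(metric_a, metric_b, branch_a, branch_b):
    """Viterbi ACS: extend two candidate paths and keep the cheaper one.

    Returns (surviving path metric, decision bit), where the decision bit is
    0 if the path through predecessor A survives and 1 otherwise.
    """
    candidate_a = metric_a + branch_a
    candidate_b = metric_b + branch_b
    if candidate_a <= candidate_b:
        return candidate_a, 0
    return candidate_b, 1


def fft_butterfly(a, b, twiddle):
    """Radix-2 FFT butterfly: combine two values using one twiddle factor."""
    product = twiddle * b
    return a + product, a - product


if __name__ == "__main__":
    print(add_compare_select(3, 1, 2, 5))  # (5, 0)
    print(fft_butterfly(1, 2, 1))          # (3, -1)
```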

Interestingly, the developers of configurable and reconfigurable DSP cores are going after a market that's fiercely competitive and slowly emerging. Market research firm Forward Concepts estimates it at $273 million for this year, with a compound annual growth rate of 23.9% for the next few years.

Need More Information?

Adelante Technologies
(310) 540-6541
www.adelantetech.com

Chameleon Systems
(408) 240-3300
www.chameleonsystems.com

Elixent Ltd.
+44 117 917 5770
www.elixent.com

Forward Concepts
(480) 968-3759
www.fwdconcepts.com

Improv Systems
(978) 927-0555
www.improvsys.com

Morpho Technologies
(949) 475-0626
www.morphotech.com

PACT Corp.
(408) 392-3756
www.pactcorp.com

QuickSilver Technologies Inc.
(408) 574-3351
www.quicksilvertech.com

RadioScape
(650) 632-4514
www.radioscape.com

3DSP Corp.
(949) 435-0600
www.3dsp.com