Scalable Bus-Based Architecture Handles Multiple DSP Cores On-Chip

Emerging communication networks and broadband Internet access terminals have a voracious appetite for performance. They need powerful DSP engines that can handle billions of operations per second. Although the processing capability of DSP chips has increased significantly over the last ten years, the demand for performance has increased even faster.

Consequently, designers are exploring new architectures that exploit multiple DSP cores on the same chip to provide a several-fold improvement in performance over conventional single-DSP designs. To meet these challenges, researchers at Lucent Technologies' Bell Labs have developed a new approach to building powerful DSP engines that can provide processing that is 16 times faster than single-core solutions.

The Bell Labs architecture, known as Daytona, is a scalable bus-based platform that allows multiple DSP cores to be integrated on the same silicon. As a result, these cores can share the communications and memory resources available on the chip. Daytona supplies an aggressive memory hierarchy that minimizes bus traffic and meets the performance requirements of large multichannel signal processing applications with the smallest possible memory footprint, according to Joseph Williams, a member of the technical staff at Bell Labs. He says that it also provides an environment that simplifies application development for DSP-based systems-on-a-chip (SoCs).

Aimed at improving instruction-level parallelism on-chip, Daytona is designed to support a wide array of cores, Williams says. The Daytona processors are identical and interchangeable, which eases the demands on software. With proper static and dynamic scheduling, an existing RTOS can be used to partition multiple tasks among the processors. Toward that end, Lucent is also developing software that would address the programming aspect of the solution.
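Because the processors are identical, a static partition can treat them as interchangeable bins. The sketch below is one common heuristic (greedy, longest task first), offered only as an illustration; it is not described as Lucent's scheduler, and the task names, the cost figures, and the `partition_tasks` function are all hypothetical.

```python
import heapq

def partition_tasks(task_costs, num_pes):
    """Greedy longest-first static assignment of tasks to identical PEs.

    task_costs maps a task name to an estimated cost (hypothetical units).
    Returns a dict mapping each PE index to the list of tasks placed on it.
    """
    # Min-heap of (accumulated load, PE index); the least-loaded PE pops first.
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    assignment = {pe: [] for pe in range(num_pes)}
    # Place the most expensive tasks first so small ones can fill the gaps.
    for name, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(name)
        heapq.heappush(heap, (load + cost, pe))
    return assignment

# Hypothetical per-channel workloads for a four-PE chip.
tasks = {"chan0": 5, "chan1": 4, "chan2": 3, "chan3": 3, "chan4": 2, "chan5": 1}
assignment = partition_tasks(tasks, num_pes=4)
for pe, names in assignment.items():
    print(pe, names, sum(tasks[n] for n in names))
```

A dynamic scheduler would make the same kind of decision at run time as tasks arrive, which is possible only because any task can run on any PE.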

To evaluate the Daytona approach, the architecture was first implemented with four programmable processing elements (PEs) connected to a high-performance, 32-bit-address, 128-bit-data split-transaction bus (STBus). Each PE in this chip implements a controller that manages the flow of data between the PEs and the shared memory, while the I/O controller handles the data flow on and off the chip (see the figure). At a 100-MHz clock frequency, this four-PE chip performs 1.6 billion 16-bit MAC operations per second while consuming 4 W from a 3.3-V supply. It is implemented in a 0.25-µm CMOS process.
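The 1.6-billion figure can be checked with simple arithmetic. The sketch below assumes that the 64-bit SIMD coprocessor in each PE completes one 16-bit MAC per lane per cycle; that per-cycle rate is an assumption made for illustration, and `peak_macs_per_second` is a hypothetical helper.

```python
def peak_macs_per_second(num_pes, clock_hz, simd_bits=64, operand_bits=16):
    """Peak MAC rate, assuming one MAC per SIMD lane per clock cycle."""
    lanes_per_pe = simd_bits // operand_bits  # 64 // 16 = 4 lanes
    return num_pes * lanes_per_pe * clock_hz

first_chip = peak_macs_per_second(num_pes=4, clock_hz=100_000_000)
print(f"{first_chip / 1e9:.1f} billion 16-bit MACs per second")  # prints 1.6 billion ...
```

Under that assumption, four PEs times four lanes times 100 MHz reproduces the quoted peak rate.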

Each PE comprises a 32-bit reduced instruction-set computing (RISC) core with a SPARC V8 instruction-set architecture that is tightly coupled to a 64-bit single-instruction, multiple-data (SIMD) coprocessor, identified as the reduced-precision vector unit (RVU). Together, they operate as a two-issue long-instruction-word (LIW) machine. In addition, each PE includes 8 kbytes of L1 cache configured as 16 banks of 512 bytes. Each bank can be reconfigured dynamically as instruction cache, data cache, or local buffer under software control. Consequently, the cache configuration can be changed within an application via system calls, says Williams.
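To make the bank arrangement concrete, here is a toy model of one PE's L1 memory. It is only a sketch: the class and method names are hypothetical, the starting split is arbitrary, and the real reconfiguration is done through system calls whose interface the article does not describe.

```python
from enum import Enum

NUM_BANKS = 16
BANK_BYTES = 512  # 16 banks x 512 bytes = 8 kbytes

class BankRole(Enum):
    INSTRUCTION_CACHE = "icache"
    DATA_CACHE = "dcache"
    LOCAL_BUFFER = "buffer"

class L1Config:
    """Toy model of the per-PE L1 banks (hypothetical names throughout)."""

    def __init__(self):
        # Arbitrary starting split: half instruction cache, half data cache.
        self.roles = [BankRole.INSTRUCTION_CACHE] * 8 + [BankRole.DATA_CACHE] * 8

    def set_role(self, bank, role):
        """Reassign a single bank, standing in for a reconfiguration call."""
        if not 0 <= bank < NUM_BANKS:
            raise IndexError(f"bank {bank} out of range")
        self.roles[bank] = role

    def bytes_for(self, role):
        """Total capacity currently assigned to the given role."""
        return BANK_BYTES * sum(1 for r in self.roles if r is role)

cfg = L1Config()
# Convert the last four banks into a 2-kbyte local buffer.
for bank in range(12, 16):
    cfg.set_role(bank, BankRole.LOCAL_BUFFER)
for role in BankRole:
    print(role.value, cfg.bytes_for(role))
```

An application with a small inner loop and a large working set might shift banks from instruction cache to data cache or local buffer in the same way.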

For embedded-software development, the PE also includes a hardware debug system. Since the STBus is cache coherent, memory accesses have a fixed latency, which makes real-time execution predictable. To reduce clock uncertainty, each PE incorporates a delay-locked loop (DLL) that aligns the PE clock with the global clock.

Pushing the processing envelope, researchers are now developing a second Daytona chip with 32 PEs operating at 200 MHz. The goal here is to achieve 16 times the performance of the first implementation, while demonstrating that the architecture can deliver massively parallel bus-based systems with unprecedented performance. Additionally, the chip's layout has been optimized to keep the power consumption low.
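The 16-fold target follows from the two scaling factors if performance is assumed to grow linearly with both PE count and clock rate. That linear scaling is an idealization that ignores bus and memory contention, and `ideal_speedup` is a hypothetical helper used only to show the arithmetic.

```python
def ideal_speedup(base_pes, base_clock_mhz, new_pes, new_clock_mhz):
    """Speedup under ideal linear scaling in PE count and clock frequency."""
    return (new_pes / base_pes) * (new_clock_mhz / base_clock_mhz)

# First chip: 4 PEs at 100 MHz. Second chip: 32 PEs at 200 MHz.
speedup = ideal_speedup(4, 100, 32, 200)  # 8.0 * 2.0
print(f"ideal speedup: {speedup:.0f}x")   # prints "ideal speedup: 16x"
```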

Early results presented at last month's ISSCC in San Francisco indicate that the second chip consumes about 12 to 16 W at 3.3 V. On a per-PE basis, the first implementation dissipated more power because its focus was on performance. Power consumption was secondary, according to Bryan Ackland, who heads the DSP and VLSI systems research department at Bell Labs. The second, 32-PE chip, however, was optimized for minimal power consumption.
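Dividing the quoted totals by the number of PEs gives a rough per-PE comparison of the two chips. This is a back-of-the-envelope sketch: it spreads the whole chip's power evenly over the PEs, which ignores the bus, memory, and I/O controllers, and `watts_per_pe` is a hypothetical helper.

```python
def watts_per_pe(total_watts, num_pes):
    """Average chip power per PE, ignoring non-PE blocks."""
    return total_watts / num_pes

first = watts_per_pe(4.0, 4)          # first chip: 4 W, 4 PEs
second_low = watts_per_pe(12.0, 32)   # second chip, low estimate
second_high = watts_per_pe(16.0, 32)  # second chip, high estimate
print(f"first chip:  {first:.3f} W per PE")
print(f"second chip: {second_low:.3f} to {second_high:.3f} W per PE")
```

By this measure the first chip averages 1 W per PE, while the second averages between 0.375 and 0.5 W per PE.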