Advanced VLIW Architectures Unleash Raw DSP Horsepower

A new wave of DSPs boasts a tenfold improvement in signal processing while slashing power to a new low.

May 15, 2000


Emerging broadband wireless basestations and handheld phone services, as well as other consumer multimedia systems, are demanding more processing horsepower from programmable DSPs. Simultaneously, the power-consumption and operating-voltage requirements for these applications are dropping.

Serving the insatiable appetite of these forthcoming systems, where voice, video, audio, and data are all converging, requires a dramatic improvement in performance. The present computational levels of a few hundred million instructions per second (MIPS) or a few hundred million floating-point operations per second (MFLOPS) are no longer adequate. Future applications will need an order-of-magnitude improvement in performance. It's not surprising, then, that designers are calling for several billion instructions per second (BIPS) and several billion floating-point operations per second (GFLOPS) from a single DSP engine.

Toward that goal, major DSP suppliers have released a wave of DSPs that signals a new era in performance. They've accomplished this by substantially revamping their existing very-long-instruction-word (VLIW) cores and crafting variations of superscalar structures. Some companies have even combined the best of the VLIW and superscalar worlds to push the performance bar to the next level. While these advanced VLIW or highly parallel superscalar DSP architectures promise to deliver better than a tenfold improvement in processing, power consumption also has been reduced to a record low.

Interestingly, as these DSPs proliferate into a wide range of applications, the market pie is getting fatter and the competition is getting stiffer. Companies are looking at a market for programmable DSPs of all kinds that will hit $6 billion this year and surge at an average annual growth rate of over 34% in the future, according to market analyst Will Strauss of Forward Concepts in Tempe, Ariz. Also, more newcomers are throwing their hats into the ring. As time-to-market becomes a critical factor in this race, the traditional and new players alike will support their architectures with efficient high-level-language C compilers and integrated development environments.

Indeed, the architectures are tailored to be compiler-friendly, and the compilers are tuned to tap every register on the chip. The new architectures are backward-compatible as well, so developers can reuse valuable software code and engineering, shorten development time, and cut overall system cost.

Leveraging its advanced VLIW architecture, Texas Instruments Inc. has revamped its VelociTI platform to create a new 16-bit fixed-point DSP core known as the C64x. Offering a tenfold improvement over the flagship C62x DSP core, the VelociTI.2-based C64x boasts a clocking speed of up to 1.1 GHz and processing performance near 9 BIPS.

That kind of processing prowess is being aimed at third-generation (3G) wireless basestations and xDSL modems. For feature-rich portable and personal products, TI has released a superset of the popular 16-bit C54x integer core. The C55x dual-MAC-based core is tailored for ultra-low power consumption while doubling the number of instructions per clock cycle. With the ability to clock at 400 MHz, it can deliver performance up to 800 MIPS (Fig. 1). By comparison, the previous-generation C54x runs at 200 MHz.

The new C55x core is architected to cut power consumption down to 0.05 mW/MIPS, which is six times lower than its predecessor. Also, it offers a scalable word length that reduces code size by 30% for optimal memory use. It emphasizes the power efficiency that will be needed in forthcoming feature-rich next-generation wireless handsets, which intend to "roll voice, data, and streaming video into one single product," says Mark Mattson, marketing manager for TI's C5000 platform. "Since it is backward-code-compatible with the C54x, it will provide an easy upgrade path for builders of next-generation cellular phones and other consumer devices like digital audio players and digital cameras."
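
The headline C55x numbers follow from simple arithmetic. The sketch below assumes that peak MIPS is the clock rate times the instructions issued per cycle (two, which is consistent with 400 MHz yielding 800 MIPS), and that core power scales linearly with MIPS at the quoted 0.05 mW/MIPS. The function names are illustrative and do not come from TI.

```python
def peak_mips(clock_mhz, instructions_per_cycle):
    """Peak throughput in MIPS, assuming every issue slot is filled each cycle."""
    return clock_mhz * instructions_per_cycle


def core_power_mw(mips, mw_per_mips):
    """Core power in mW, assuming power scales linearly with delivered MIPS."""
    return mips * mw_per_mips


# Figures quoted in the article for the C55x core.
c55x_mips = peak_mips(clock_mhz=400, instructions_per_cycle=2)
c55x_power = core_power_mw(c55x_mips, mw_per_mips=0.05)

print(f"C55x peak throughput: {c55x_mips} MIPS")
print(f"C55x core power at peak: about {c55x_power:.0f} mW")
```

Under these assumptions, a core running flat out at 800 MIPS would draw roughly 40 mW. Real workloads rarely fill every issue slot, so this is an upper bound on throughput rather than a measured figure.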

Advanced power-management techniques implemented on chip automatically power down inactive peripherals, memory, and core functional units to minimize consumption and maximize power efficiency. Designers can customize the power management to their specific application via user-configurable idle domains. This feature gives the designer up to 64 configurable combinations of power management for the CPU, cache, peripherals, DMA controller, clock generator, and the external memory interface. The C55x core also now features wider bus widths and more buses to obtain much higher data throughput on and off the chip.
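
The figure of 64 combinations matches what you would get if each of the six named domains could be idled independently, since 2^6 = 64. The sketch below enumerates those combinations. Treating the domains as independent on/off switches is an assumption made for illustration, and the names are labels taken from the article, not register or API names.

```python
from itertools import product

# The six domains named in the article. Treating each one as an
# independent on/off switch is an assumption made for this sketch.
IDLE_DOMAINS = ("CPU", "cache", "peripherals", "DMA", "clock generator", "EMIF")


def idle_configurations(domains=IDLE_DOMAINS):
    """Return every idle/active combination as a dict of domain -> is_idle."""
    return [
        dict(zip(domains, states))
        for states in product((False, True), repeat=len(domains))
    ]


configs = idle_configurations()
print(len(configs))  # 2**6 = 64

# Example: the configurations in which only the CPU stays active.
cpu_only = [
    c for c in configs
    if not c["CPU"] and all(idle for name, idle in c.items() if name != "CPU")
]
print(cpu_only)
```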

To accomplish faster data reads and writes, the core incorporates three data read buses, two data write buses, a 32-bit program bus, and six 24-bit address buses. Unlike the C54x, which uses a 16-bit external memory interface bus, the C55x employs a 32-bit version to speed up the data flow. It also provides a number of memory options, such as synchronous burst RAM, synchronous DRAM, ROM, and flash.

Likewise, the C64x DSP packs more features than the previous-generation C62x. It includes twice as many on-chip registers, level-2 cache, ten special-purpose instructions to enhance parallelism, multiple data types to perform more operations per clock cycle, improved orthogonality, 25% code-size reduction, and clever logic techniques to boost speed without penalizing power.

While the extensions support quad 8-bit and dual 16-bit operations, the wider 64-bit load/store data paths produce much higher throughput. The C64x core offers two complete sets of compute resources. Each set comprises four units, labeled L, D, S, and M. The L, D, and S units conduct basic integer arithmetic operations. The M unit performs multiple 16- or 8-bit multiplications, Galois multiplications, and special operations like bit shuffling, shifting, and rotations. The ten special-purpose instructions accelerate key tasks within digital communications, imaging, and video applications. Some of these instructions simplify computations in error-correction codes. Others improve motion-estimation algorithms and data density.
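
Two of the ideas in this paragraph, packed sub-word arithmetic and Galois-field multiplication, can be modeled in a few lines. The sketch below is a software model only and does not reproduce TI's instruction semantics. The names `packed_add` and `gf256_mul` are hypothetical, the lane arithmetic is assumed to wrap rather than saturate, and the field polynomial 0x11D is an assumption, chosen because it is common in Reed-Solomon codes.

```python
def packed_add(x, y, lane_bits):
    """Add two 32-bit words lane by lane (lane_bits = 8 or 16).

    Each lane wraps around on overflow and no carry crosses a lane
    boundary, which is what lets one operation do the work of
    four 8-bit or two 16-bit additions.
    """
    mask = (1 << lane_bits) - 1
    out = 0
    for shift in range(0, 32, lane_bits):
        lane = ((x >> shift) & mask) + ((y >> shift) & mask)
        out |= (lane & mask) << shift
    return out


def gf256_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8) with shift-and-XOR, reducing by poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:        # degree-8 term appeared: reduce
            a ^= poly
        b >>= 1
    return result


# Dual 16-bit add: the low lane overflows without disturbing the high lane.
print(hex(packed_add(0x0001FFFF, 0x00010001, 16)))  # 0x20000
# Quad 8-bit add on the same principle.
print(hex(packed_add(0xFF01FF01, 0x01010101, 8)))   # 0x20002
# Galois-field product, the kind of step used in error-correction codes.
print(hex(gf256_mul(0x80, 0x02)))                   # 0x1d
```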

All of these improvements mean that applications that were unthinkable before are now possible. For instance, TI says that the core can implement up to 32 full-rate DSL channels on a single chip or hundreds of voice-over-IP (VoIP) lines. Additionally, for wireless systems, the core could form the basis of an ASIC or standard solution that could handle up to 64 voice/data channels or high-quality video transmission to personal terminal devices. In the industrial arena, the raw horsepower and high throughput could be applied to provide a fivefold improvement in 3D imaging or a tenfold enhancement in machine vision, TI says. With consumer applications, the firm believes that the core could be applied to provide MPEG-2 audio and video decoding functions, as well as picture formatting in an HDTV receiver.

Although standard solutions may come later in the year, Nokia Mobile Phones is already developing custom devices based on the C55x core for handset applications. The initial versions of these cores will be implemented in a 0.15-µm CMOS process, with plans to quickly migrate to 0.12 µm. The first derivatives of the C55x are expected to be released this spring. Also, the early C55x designs will operate at 1.5 and 0.9 V.

With migration toward finer geometries, the power-supply requirements will be shrunk to 0.75 to 0.7 V. Slated for sampling sometime this summer, the C64x will initially run at 700 to 800 MHz. Specific ASIC solutions derived from this core are expected to be released in the second half of this year.

Both of the new cores are software-compatible with previous generations and supported by the eXpressDSP, TI's integrated development environment (IDE) launched last fall. TI has even refurbished the IDE's major ingredients, such as the Code Composer Studio and the real-time DSP/BIOS kernel. Code Composer Studio 1.2 comes with visual linking and profile-based compilation capabilities, letting users graphically optimize code size and performance tradeoffs.
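
The tradeoff that profile-based compilation exposes can be sketched as a small search: each function can be built several ways, each with its own code size and cycle count, and the tool picks the combination that is fastest within a size budget or smallest within a cycle budget. The option table below is hypothetical data, and the exhaustive search is an illustrative stand-in, not TI's algorithm.

```python
from itertools import product

# Hypothetical build options per function: (code size in bytes, cycles).
OPTIONS = {
    "fir":     [(120, 900), (200, 600), (340, 450)],
    "viterbi": [(300, 2000), (520, 1400)],
    "control": [(80, 300), (110, 280)],
}


def all_builds(options):
    """Yield (total_size, total_cycles, choice_per_function) for every combination."""
    names = list(options)
    for choice in product(*(options[name] for name in names)):
        size = sum(s for s, _ in choice)
        cycles = sum(c for _, c in choice)
        yield size, cycles, dict(zip(names, choice))


def fastest_within(options, max_size):
    """Best speed for a given code-size budget, or None if nothing fits."""
    fits = [b for b in all_builds(options) if b[0] <= max_size]
    return min(fits, key=lambda b: b[1]) if fits else None


def smallest_within(options, max_cycles):
    """Best code size for a given cycle budget, or None if nothing fits."""
    fits = [b for b in all_builds(options) if b[1] <= max_cycles]
    return min(fits, key=lambda b: b[0]) if fits else None


print(fastest_within(OPTIONS, max_size=700))
print(smallest_within(OPTIONS, max_cycles=2400))
```

Exhaustive search is fine for three functions but grows exponentially with the number of functions, so a production tool would need a smarter strategy.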

The extended DSP/BIOS II kernel features a multitasking scheduler, I/O control, real-time analysis, and real-time data exchange (RTDX), giving DSP developers flexibility, scalability, and ease of implementation. "Integrated with the Code Composer Studio 1.2, the BIOS II kernel allows designers to abstract via an API," says DSP/BIOS II product manager Dan Davis. "Optimized for the C64x and C55x architectures, the DSP/BIOS II kernel requires minimal memory."

"In essence, the eXpressDSP is a completely uniform development environment for both C5000 and C6000 DSP platforms," notes Rich Scales, product manager for TI's Compiler Technology. The new extensions include profile-based compilation (PBC) for the C6000 platform, along with a Visual Linker to the Code Composer for the C5000 DSPs. Presently unique to C6000 DSPs, the PBC permits users to graphically select the optimum combination of code size and speed for the intended application. Consequently, it automates the evaluation of multiple options for each software function to provide optimum performance for a given code size, or the best code size for a given performance level.

Support for C++ is another addition to the C6000 DSP C compiler. Though C++ doesn't improve efficiency, it adds a higher level of abstraction. "We view it as a front-end piece. The compilation efficiency will depend on how good the C compiler is," Scales says. As a result, TI has focused on improving the out-of-the-box C code performance as well as optimizing it for a specific architecture.

Plans are under way to extend C++ support to the C5000 platform in the near future. Meanwhile, users of C5000 DSPs can enjoy the benefits of the Visual Linker component of Code Composer Studio 1.2. The linker is the piece of the tool chain that places the code in memory. Because each DSP has its own memory map, the Visual Linker displays that map graphically to simplify system memory allocation, letting the user see where the code resides and how fast it is performing.

TI isn't alone in promoting compiler-friendly architectures and C++ for DSP programming. More and more DSP designers are leaning toward C. Nowadays, every small and large DSP provider is coming into the market battlefield armed with an efficient C compiler. It has become a critical weapon in their development arsenals. Lately, they've been adding an object-oriented high-level programming language to it, too.

Analog Devices Inc. is another major contender that sees C++ as a natural progression to aggressively shorten time-to-market. "Our implementation of C++ takes this a step further by providing easier access to specialized features in DSP architectures," explains Geoff Millard, compiler manager for DSP tools at ADI. "As memory sizes and program complexity increase, DSP software developers are beginning to run into issues that non-DSP C programmers ran into several years ago—most notably, a lack of data encapsulation."

Millard adds that "because C++ is an extension to C, C++ has become the de facto programming language for many software projects in non-DSP applications. ADI expects DSP applications will follow suit. This is a vehicle for portable-language enhancement, and it facilitates code portability and reusability. There is no inherent C++ penalty, as it offers the same speed and code compactness of C."

Internal tests indicate that the compiled code for an innermost loop of an FFT performs equally well for either source form (see the code listing). This object-oriented capability is fully integrated within ADI's VisualDSP development environment, and it complies with the embedded C++ standard. It also has been tweaked to support both fixed- and floating-point DSP architectures offered by ADI. These include SHARC, TigerSHARC, and ADSP-218x/219x families. While the beta version of the C++ compiler has been unveiled for SHARC and TigerSHARC members initially, the fixed-point ADSP-218x/219x series is slated to get C++ support in the fourth quarter of this year.
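
For readers unfamiliar with what an FFT innermost loop contains, the sketch below shows a radix-2 decimation-in-time FFT whose inner loop is the classic butterfly. It is not ADI's listing and says nothing about the C versus C++ comparison. It is written in Python purely for illustration, and the function names are hypothetical.

```python
import cmath


def fft_radix2(samples):
    """Radix-2 decimation-in-time FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(s) for s in samples]

    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                # Innermost loop: one butterfly is one complex multiply,
                # one add, and one subtract.
                top = a[start + k]
                bottom = a[start + k + half] * w
                a[start + k] = top + bottom
                a[start + k + half] = top - bottom
                w *= w_step
        size *= 2
    return a


def dft_direct(samples):
    """O(n^2) reference DFT used only to check the FFT result."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


data = [1, 2, 3, 4, 0, -1, -2, -3]
worst = max(abs(f - d) for f, d in zip(fft_radix2(data), dft_direct(data)))
print(f"largest deviation from the direct DFT: {worst:.2e}")
```

The butterfly maps naturally onto a DSP's multiply-accumulate hardware, which is why this loop is a common compiler benchmark.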

Speaking of ADI's newest TigerSHARC (see "Extreme Levels of Parallelism Escalate DSP Horsepower," Electronic Design, Nov. 22, 1999, p. 71), the TS001 is the first implementation of this static superscalar core with VLIW benefits. With its unique ability to process 8-, 16-, and 32-bit fixed- and floating-point operations from a single engine, this highly integrated chip is targeting telecom-infrastructure applications. Alongside the TigerSHARC core, the chip integrates 6 Mbits of SRAM, four bidirectional link ports at 150-Mbyte/s transfer rates per port, a 64-bit external port with a 600-Mbyte/s data movement rate, 14 DMA channels, and 128 registers.

"Historically, programmers were constrained to one data type. Now, with the TS001, algorithms can be written in different formats to achieve the best tradeoff of speed and accuracy," says product line manager Gerry McGuire. "It provides the right combination of memory, throughput, and DSP horsepower to enable optimal system performance."

System applications may require even more processing. In that case, the links permit the designer to connect TS001 DSPs in multiprocessing configurations. ADI adds that the TigerSHARC architecture is compiler-friendly. Also, it's supported by an efficient C compiler that can deliver nearly 70% efficiency compared to hand-assembled code.

Capable of executing 1.2 billion 16-bit fixed-point MACs/s or 300 million 32-bit floating-point MACs/s, the TS001 is implemented in 0.25-µm CMOS. The initial version runs at a 150-MHz clock speed. While the core operates at 2.5 V, the I/Os are tailored for a 3.3-V supply. Faster versions running up to 250 MHz are in the works. Sampling now, the 150-MHz TS001 should go into production by year's end.
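
Dividing the quoted MAC rates by the clock shows how much parallelism each data type gets per cycle. The sketch below does that arithmetic for the 150-MHz TS001. The 250-MHz projection assumes the per-cycle MAC counts stay the same in the faster versions, which the article does not state.

```python
def macs_per_cycle(macs_per_second, clock_hz):
    """Average multiply-accumulates completed per clock cycle."""
    return macs_per_second / clock_hz


CLOCK_HZ = 150_000_000  # initial TS001 clock from the article

fixed16 = macs_per_cycle(1_200_000_000, CLOCK_HZ)  # 16-bit fixed point
float32 = macs_per_cycle(300_000_000, CLOCK_HZ)    # 32-bit floating point
print(fixed16, float32)  # 8.0 2.0

# Assumption: the same per-cycle parallelism carries over to 250 MHz.
FAST_CLOCK_HZ = 250_000_000
projected_fixed16 = fixed16 * FAST_CLOCK_HZ
print(f"projected 16-bit rate at 250 MHz: {projected_fixed16 / 1e9:.1f} billion MACs/s")
```

The four-to-one ratio between the two rates is what you would expect when narrower operands let more of them share the same data path.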

Also in this race to deliver solutions to basestation sockets, modem banks, and Internet telephony is Motorola Inc. Exploiting the 16-bit compiler-friendly SC140 DSP core, built by StarCore of Atlanta, Ga., Motorola has readied the MSC8101 network processor, the first implementation of that core. The SC140 itself is the result of a joint development effort between Lucent Technologies' Microelectronics Group and Motorola's Semiconductor Sector.

Similarly, Lucent Technologies also is preparing solutions derived from the SC140 core. As of press time, though, details of Lucent's derivation of the SC140 were unavailable.

Researchers at Lucent's Bell Labs have constructed a scalable bus-based platform for multiprocessing on the same chip. To evaluate this approach, four 100-MHz processing elements (PEs) and a global resource controller are connected to a 32-bit address and a 128-bit data split-transaction bus to perform 1.6 billion 16-bit MACs/s.

Employing system expertise and existing intellectual properties (IPs), Motorola has wrapped this highly parallel and scalable core with unique peripherals and functions. These include a communications processor module (CPM), an enhanced filter coprocessor (EFC), a PowerPC bus interface, a memory extension port, 512 kbits of SRAM, a 16-channel DMA engine, a serial interface unit, a programmable interrupt controller, and an emulation controller (Fig. 2).

The CPM, a RISC protocol-processing engine, supports direct connection to high-speed packetized backbone networks. At the same time, the EFC performs filtering tasks like echo cancellation. "The EFC coprocessor provides an additional 300 million MACs on top of the core's 1200 million MACs at a 300-MHz clock," explains Dave Baczewski, strategic marketing manager for Motorola's Wireless Systems Group. "Thanks to on-chip CPM and SRAM, the MSC8101's protocol algorithms can be dynamically updated to stay abreast with evolving standards and user needs." Benchmarks generated for Viterbi decoder and control code indicate the SC140's superior execution speed and program memory use (see the table).

Designed for wireless and wireline infrastructure equipment, the SC140 was fabricated in the company's HiperMOS (HiP-6) process. This process boasts 0.13-µm CMOS feature sizes and copper interconnects. Of the total 500-mW dissipation, the SC140 core accounts for half while running from a 1.5-V supply. As with the other designs, the I/Os operate at a 3.3-V supply. Motorola plans to sample the MSC8101 in the third quarter of this year, with production slated for the second quarter of 2001. It will be packaged in a 332-ball pad BGA.

Other versions of the MSC8101 are in the works for emerging applications. Motorola's roadmap calls for higher-speed models based on HiP-7 or 0.1- to 0.08-µm geometries. At these sizes, the core will run at 1.2 V and further cut power usage.

Development support includes an optimized C compiler that maximizes the use of parallelism and takes full advantage of the SC140's multiple execution units. In effect, it closes the gap between a high-level language compiler and hand-coded assembly language. Internal benchmarks on an enhanced full-rate vocoder for GSM indicate that the C compiler for the SC140 demonstrates high cycle performance and code density. Recently, Embedded Power Corp. unveiled a real-time operating system, known as the RTXC, for the SC140-based solutions.

According to Motorola, this compiler also will simplify migration from other architectures like the 16-bit fixed-point 56300 family. Fundamentally, the SC140 is a variable-length execution set (VLES) architecture with explicitly parallel instruction computing (EPIC). "It combines the best of both VLIW and superscalar architectures," explains Scott Beach, development engineer at StarCore.

While high-end DSPs are migrating toward C, the older 16-bit fixed-point generations like the 56300 continue to bank on assembly code. To further streamline existing assembly language routines, The MathWorks Inc. and Motorola have jointly developed the DSP Developer's Kit. This tool lets users verify the behavior of assembly language routines. By inserting such programs into The MathWorks' MATLAB and Simulink system-level environments, engineers can validate, modify, and test their assembly code while catching coding errors up front. They also can verify the operation of the software on cycle-accurate 56300 and 56600 simulators.
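
The verification idea described here, checking a low-level fixed-point routine against a system-level model, can be sketched without the kit itself. The example below compares a floating-point FIR reference with a bit-accurate Q15 integer model and reports the worst-case difference. It is written in Python rather than MATLAB, does not use any Developer's Kit API, and all function names and test data are hypothetical.

```python
Q15_ONE = 1 << 15


def to_q15(value):
    """Quantize a float in [-1, 1) to a saturated 16-bit Q15 integer."""
    return max(-Q15_ONE, min(Q15_ONE - 1, int(round(value * Q15_ONE))))


def fir_reference(samples, coeffs):
    """Floating-point reference model of a FIR filter."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


def fir_q15(samples_q15, coeffs_q15):
    """Stand-in for a fixed-point routine: integer MACs, then round and saturate."""
    out = []
    for n in range(len(samples_q15)):
        acc = 0
        for k, c in enumerate(coeffs_q15):
            if n - k >= 0:
                acc += c * samples_q15[n - k]   # wide accumulator, no overflow
        y = (acc + (1 << 14)) >> 15             # round back to Q15
        out.append(max(-Q15_ONE, min(Q15_ONE - 1, y)))
    return out


def max_error(samples, coeffs):
    """Largest absolute difference between the reference and the Q15 model."""
    ref = fir_reference(samples, coeffs)
    fixed = fir_q15([to_q15(s) for s in samples], [to_q15(c) for c in coeffs])
    return max(abs(r - f / Q15_ONE) for r, f in zip(ref, fixed))


SAMPLES = [0.5, -0.25, 0.125, 0.75]
COEFFS = [0.25, 0.5, 0.25]

error = max_error(SAMPLES, COEFFS)
print("within one LSB of the reference:", error < 2 ** -15)
```

A tolerance of one least-significant bit is a common acceptance threshold for this kind of check, though the right bound depends on the filter length and the application.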

"Prior to this tool, DSP programmers experienced difficulty analyzing the behavior of their assembly code," notes Anne Mascarin, DSP market segment manager at The MathWorks. "Consequently, it was difficult for engineers to determine whether the assembly code was performing the functions for which it was designed."

To facilitate FPGA adoption in mainstream DSP applications, The MathWorks has entered a strategic alliance with Xilinx Inc. Its mission is to give designers a way to develop high-performance DSP systems on FPGAs using The MathWorks' system design and verification tools. In reality, the two partners have been working quietly for two years to create a solution that automatically translates system-level design into FPGA implementation. "It will open up the usage of FPGAs to the DSP designer community," says Per Holmberg, product marketing manager at Xilinx.

"Today, the price gap between high-density FPGAs and ASICs or single-chip DSPs is negligible," Holmberg adds. "They have evolved to become production parts. And, they can significantly reduce development time. This tool will enable designers to build and verify an entire DSP system and then automatically generate an HDL representation compatible with Xilinx FPGA implementation (Fig. 3). It will automatically map the DSP design to the Xilinx LogicCORE building blocks for optimal implementation and lowest silicon cost."

Traditionally, DSP developers have chosen between an ASIC/ASSP and a programmable DSP processor. An FPGA combines the performance and system integration of ASICs/ASSPs with the reprogrammability and time-to-market benefits of processors. The Xilinx system generator for The MathWorks' Simulink interface is expected to be released in the third quarter of this year.

Unlike traditional bus-oriented architectures, the BOPS Inc. design takes a cluster-switch approach. With three levels of parallelism involving indirect VLIW, SIMD, and MIMD on top of a cluster switch, BOPS has generated three versions of a synthesizable Verilog DSP core for system-on-a-chip (SoC) solutions. Adopting a licensing strategy, the design house has crafted single-, dual-, and quad-processor versions for array processing in DSP applications.

BOPS provides a complete tool chain that includes a GNU C compiler. The tool chain also allows direct conversion of MATLAB models into BOPS-optimized assembly code without going through C. For those who prefer C, BOPS also provides a parallelizing C compiler that exercises parallelism at three different levels for generating compact code. Vivsis, a subsidiary of Mitsubishi, and SiByte, a MIPS-architecture startup, license BOPS' proprietary cores. In fact, SiByte is developing SoCs for Internet appliances. Concurrently, BOPS is looking to improve the fundamental core with the addition of the PCI bus, a 64-bit DRAM bus, and a 32-bit MIPS interface.

Others following the licensing path for growth include LSI Logic, Infineon, Massana, and 3DSP. Philips Semiconductors and Hitachi Semiconductor are eying emerging sockets in this fray as well. To strengthen its position in high-growth wireless infrastructure and networking applications, LSI Logic has expanded its CoreWare ASIC library with ZSP's 400-MIPS superscalar DSP engine. For its ZSP core, the company has garnered support from Broadcom Corp., Brecis Communications, and TollBridge Technologies—a developer of IP-based multiline voice solutions. Brecis intends to employ the ZSP400 core in VoIP solutions.

To further add vitality to the ZSP400 core, LSI has added an addressing-mode register for fast Fourier transforms, eight more shadow registers for context switching and low-latency interrupts, and two extra loop registers for minimizing the code. Plus, it has increased the memory size to address large programs and split the memory into data and program banks. Stressing that tools are important, LSI's software tools manager Prasad Kalluri adds that the associated compiler has been streamlined with a better machine description for 10% faster code generation. A new cycle-accurate simulator replaces the older model.

Though its configurable long-instruction-word CARMEL core is not as powerful as other VLIW-based cores, Infineon continues to refine it with additional bells and whistles to lure system-level DSP designers. The latest CARMEL model, the DSP20xx, comes with a PowerPlug accelerator that lets developers configure the instruction set and modify the core. As a result, the new accelerator can implement computation-intensive features like multiple data rates and complex modulation schemes without compromising power dissipation and system costs. Designed to operate at frequencies up to 300 MHz, the latest CARMEL is expected to be available in the fourth quarter.

Traditionally, VoIP implementations and cable-modem applications have demanded separate DSP and RISC microcontrollers. DSPs efficiently perform telephony middleware tasks, and RISCs execute control tasks adequately. In turn, Hitachi Semiconductor is considering packing multiple 133-MIPS SH3-DSP cores on a single die.

"A DSP has a lot of deterministic requirements," says Peter Carbone, manager for Hitachi's microprocessors and microcontrollers. "There is uncertainty with a RISC+DSP hybrid structure like the SH3-DSP. By using multiple cores, the device can be architected to do more DSP tasks on one, and put control tasks on another." Hitachi is evaluating such an approach with two cores on a single chip using a 0.18-µm CMOS process. Standard multiple core-based devices are planned for release in 2001, according to Hitachi.

To improve the performance of the 133-MIPS SH3-DSP, its developers are optimizing the chip's instructions and enhancing its speed. Simultaneously, a better C compiler is in the works that will generate assembly code for the SH3-DSP. In addition, to speed up development of VoIP, the company has ported a media-gateway control protocol to the SH3-DSP and readied a VoIP reference design with telephony middleware.

Suppliers Of DSP Chips And Cores

Analog Devices Inc., (781) 329-4700, www.analog.com
BOPS Inc., (650) 330-8407, www.bops.com
Hitachi Semiconductor (America) Inc., (800) 285-1601, www.hitachi.com
Infineon Technologies Inc., (408) 501-6880, www.infineon.com
LSI Logic Inc., (408) 433-6359, www.lsilogic.com
Lucent Technologies, (800) 372-2447, www.lucent.com
Massana Inc., (408) 871-1415, www.massana.com
Motorola Inc., (512) 933-6300, www.motorola.com
Philips Semiconductors, (408) 991-3518, www.semiconductors.philips.com
STMicroelectronics Inc., (781) 861-2650, www.st.com
Texas Instruments Inc., (972) 644-5580, www.ti.com
3DSP Corp., (949) 260-0156, www.3dsp.com
Xilinx Inc., (800) 255-7778, www.xilinx.com