Modern DSP Chips Serve Up Variations On A Theme

Digital signal processors (DSPs) earn their living by doing certain analog jobs better than analog circuitry. In some cases, where analog circuits can’t even be considered for a task due to cost or complexity reasons, DSPs are still a viable choice and in many cases perform those tasks effortlessly.

That’s because DSPs are very good and very fast at arithmetic operations such as addition and multiplication. Clever mathematicians and engineers exploit this fact by creating algorithms to tackle complex signal-processing tasks using mainly those two mathematical operators.
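To see how far multiplication and addition alone can go, consider a finite-impulse-response (FIR) filter, a staple of signal processing. The sketch below is illustrative only: the function name and the sample data are made up for this example, and a real DSP would run the inner loop on its hardware multiply-accumulate (MAC) units rather than in an interpreted language.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter: each output is a sum of products,
    i.e. one multiply-accumulate (MAC) per filter tap."""
    history = [0] * len(coeffs)      # delay line, newest sample first
    outputs = []
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample in
        acc = 0
        for h, s in zip(coeffs, history):
            acc += h * s              # the MAC operation
        outputs.append(acc)
    return outputs


# Four-tap moving average applied to a short ramp (hypothetical data).
smoothed = fir_filter([4, 8, 12, 16], [0.25, 0.25, 0.25, 0.25])
print(smoothed)
```

Every output sample costs one multiply and one add per tap, which is why MAC throughput is the headline figure for most DSP cores.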

Today’s DSP chips are much more than just a pretty processing engine. Also integrated on these chips are memory subsystems, high-speed interfaces, I/Os, and more. These elements are included with the idea of increasing overall performance, lowering power consumption, and targeting particular processing tasks.

To better understand the various DSP chip options available and how different parts of the device fit together as a whole, it’s helpful to examine several representative DSPs on the market today. We’ll take a look at examples of single-core, single-core plus microcontroller, and multicore DSP chips.

SINGLE-CORE DSP CHIPS It’s natural to think that DSP chips have a single DSP core. Take, for instance, Texas Instruments’ TMS320C6452 (Fig. 1). A member of the TMS320C64x+ family of high-performance fixed-point DSPs, the chip targets process-intensive multichannel telecom infrastructure and medical imaging systems. The DSP core is just a part of the chip’s design, though. The rest of the chip comprises memory, I/Os, and other functional blocks.

The C6452 DSP integrates on-chip memory organized as a two-level memory system. The level 1 (L1) program and data memories are 32 kbytes each. This memory can be configured as mapped RAM, cache, or some combination of the two.

When configured as cache, the L1 program (L1P) is a direct mapped cache whereas L1 data (L1D) is a two-way set associative cache. The level 2 (L2) memory is shared between program and data space. L2 memory can also be configured as mapped RAM, cache, or some combination of the two. Designers can use the on-chip memory to add differentiating features to their projects.
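The practical difference between a direct-mapped and a two-way set-associative cache shows up in how addresses map to cache sets. The following sketch assumes the full 32 kbytes of each L1 memory is configured as cache, and it assumes line sizes of 32 bytes for L1P and 64 bytes for L1D. Those line sizes are not given in this article, so treat them as placeholders and check the device documentation.

```python
def cache_set_index(address, cache_bytes, line_bytes, ways):
    """Return the set that a byte address maps to."""
    num_sets = cache_bytes // (line_bytes * ways)
    return (address // line_bytes) % num_sets


L1_BYTES = 32 * 1024
L1P_LINE = 32   # assumed line size, not stated in the article
L1D_LINE = 64   # assumed line size, not stated in the article

# Direct mapped (1 way): two code addresses 32 kbytes apart share a set,
# and the set holds only one line, so each access evicts the other.
p0 = cache_set_index(0x0000, L1_BYTES, L1P_LINE, ways=1)
p1 = cache_set_index(0x8000, L1_BYTES, L1P_LINE, ways=1)

# Two-way set associative: two data addresses 16 kbytes apart also share
# a set, but the set holds two lines, so both can stay resident.
d0 = cache_set_index(0x0000, L1_BYTES, L1D_LINE, ways=2)
d1 = cache_set_index(0x4000, L1_BYTES, L1D_LINE, ways=2)

print(p0 == p1, d0 == d1)
```

Under these assumptions the program cache has 1024 one-line sets and the data cache has 256 two-line sets.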

The C6452 also includes two Serial Gigabit Media Independent Interface (SGMII) Ethernet media access control (MAC) ports and one gigabit switch. The switch improves the efficiency of multichip designs by automatically monitoring the data stream to ensure that only the appropriate traffic is passed along. TI added a decision gate to the switch that can, for example, be used to distinguish between voice and data traffic. If the DSP is dedicated entirely to voice processing, it can block data traffic from entering, which makes much more effective use of its processing bandwidth. In addition, the device comes with two telecom serial interface ports (TSIPs), providing a seamless connection to common telecom serial data streams.

Other I/Os on the C6452 include a 66-MHz PCI interface or Universal Host Port Interface (UHPI); a double-data-rate (DDR2) interface to external memory; VLYNQ, a proprietary serial communications interface developed by TI; a 16-bit external memory interface (EMIFA); a multichannel general-purpose audio serial port (McASP); and other familiar interfaces. Judging from this DSP’s I/Os, there’s no doubt its home will be in telecom applications. For other applications, a different set of I/Os would be in order.

At the heart of the C6452 and several other DSPs from Texas Instruments lies the C64x+ mega module, which consists of several components: the C64x+ processor, L1 program and data memory controllers, L2 memory controller, internal DMA (IDMA), interrupt controller, power-down controller, and external memory controller (Fig. 2). The mega module also supports memory protection for L1P, L1D, and L2 memories. It provides bandwidth management for resources local to the mega module as well.

The C64x+ processor on the module is a very fast DSP that can operate at speeds up to 1.2 GHz. It employs eight functional units, two register files, and two data paths. Two of these eight functional units are multipliers or M units. Each M unit performs four 16- by 16-bit multiply-accumulates (MACs) every clock cycle.

Thus, eight 16- by 16-bit MACs can be executed every cycle on the C64x+ core. At a 1.2-GHz clock rate, that works out to 9600 million 16-bit MACs per second (9600 MMACS). Moreover, each multiplier on the C64x+ core can compute one 32- by 32-bit MAC or four 8- by 8-bit MACs every clock cycle. By the way, the C6452 itself doesn’t run at the core’s fastest speed, topping out at 900 MHz.
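The peak-throughput arithmetic is simple enough to check by hand or in a few lines of code. In this sketch the function name is made up for illustration; the unit counts and clock rates are the ones quoted above, and the 900-MHz result is derived from them rather than taken from a data sheet.

```python
def peak_mmacs(clock_mhz, m_units=2, macs_per_unit=4):
    """Peak 16x16-bit MAC rate in millions of MACs per second."""
    return clock_mhz * m_units * macs_per_unit


print(peak_mmacs(1200))  # fastest C64x+ clock quoted in the text
print(peak_mmacs(900))   # top speed quoted for the C6452
```

These are peak figures. Sustained throughput depends on keeping both M units fed with data every cycle.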

A new feature of the C64x+ processor has the endearing name of the SPLOOP. This small instruction buffer aids in the creation of software pipelining loops where multiple iterations of a loop are executed in parallel. The SPLOOP buffer reduces the code size associated with software pipelining.
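Software pipelining itself can be pictured as a schedule in which a new loop iteration starts before the previous one finishes. The sketch below is a generic illustration, not a model of the SPLOOP buffer: it assumes a hypothetical three-stage loop body (load, MAC, store) and assumes one new iteration can start every cycle.

```python
def pipelined_schedule(iterations, stages):
    """Return, for each cycle, the (iteration, stage) pairs that run together."""
    total_cycles = iterations + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        active = []
        for stage_index, stage_name in enumerate(stages):
            iteration = cycle - stage_index   # iteration occupying this stage
            if 0 <= iteration < iterations:
                active.append((iteration, stage_name))
        schedule.append(active)
    return schedule


schedule = pipelined_schedule(4, ["load", "mac", "store"])
for cycle, active in enumerate(schedule):
    print(cycle, active)
```

Four iterations of the three-stage body finish in six cycles instead of twelve. Written out by hand, the ramp-up and ramp-down cycles (the prolog and epilog) need their own code, which is the code-size cost that a loop buffer is meant to reduce.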

DSP + MICROCONTROLLER CHIPS Another class of DSPs employs an additional microcontroller core on chip. Sometimes this is a separate core, such as an ARM processor. In other cases, the processor core contains both DSP and MCU functionality. This is the case with the well-known Blackfin DSP architecture from Analog Devices.

The Blackfin is based on a 10-stage RISC MCU/DSP pipeline with a mixed 16/32-bit instruction set architecture, which includes dual 16-bit MAC DSP instructions and a 32-bit RISC-like instruction set. This combination provides signal-processing functionality with the ease-of-use attributes associated with general-purpose microcontrollers. The Blackfin processor architecture is fully SIMD-compliant (single-instruction, multiple-data) and includes instructions for accelerated video and image processing.

This combination of processing attributes differentiates Blackfin processors from their brethren. They’re designed to perform equally well in both signal-processing and control-processing applications, in many cases eliminating the requirement for separate heterogeneous processors in a design. Blackfin processors offer up to 756 MHz in single-core products.

Beyond native support for 8-bit data, which is the word size common to many pixel-processing algorithms, the Blackfin architecture includes instructions specifically defined to enhance performance in video-processing applications. For instance, the “SUM ABSOLUTE DIFFERENCE” instruction supports the motion-estimation algorithms used in video-compression schemes such as MPEG2, MPEG4, and JPEG.
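The sum of absolute differences (SAD) is easy to state in software, which helps show what a dedicated instruction accelerates. The sketch below is a plain reference version with made-up 2-by-2 pixel blocks and hypothetical function names. It says nothing about how the Blackfin instruction is encoded or how many pixels it handles per cycle.

```python
def sum_absolute_difference(block_a, block_b):
    """SAD between two equally sized blocks of 8-bit pixel values."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(reference, candidates):
    """Index of the candidate block with the lowest SAD (the core of block matching)."""
    scores = [sum_absolute_difference(reference, c) for c in candidates]
    return scores.index(min(scores))


reference = [[10, 20], [30, 40]]
candidates = [
    [[90, 80], [70, 60]],   # poor match
    [[12, 18], [30, 45]],   # close match
]
print(sum_absolute_difference(reference, candidates[1]))
print(best_match(reference, candidates))
```

A motion estimator evaluates this measure for many candidate positions per block, so the subtract, absolute-value, and accumulate steps dominate the workload.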

The architecture handles multi-length instruction encoding. Very frequently used control-type instructions are encoded as compact 16-bit words, with more mathematically intensive signal-processing instructions encoded as 32-bit values. The processor will intermix and link 16-bit control instructions with 32-bit signal-processing instructions into 64-bit groups to maximize memory packing. When caching and fetching instructions, the core automatically fully packs the length of the bus, since it doesn’t have alignment constraints.

All Blackfin processors, such as the ADSP-BF523, contain independent DMA controllers that support automated data transfers with minimal overhead from the processor core (Fig. 3). DMA transfers can occur between the internal memories and any of the many DMA-capable peripherals. Transfers can also occur between the peripherals and external devices connected to the external memory interfaces, including the SDRAM controller and the asynchronous memory controller.

Memory architecture includes both L1 and L2 memory blocks. L1 memory is connected directly to the processor core, runs at full system clock speed, and offers maximum system performance for time-critical algorithm segments. Also, L1 memory can be configured as SRAM, cache, or a combination of both.

By supporting both SRAM and cache programming models, system designers can allocate critical real-time signal-processing data sets that require high bandwidth and low latency into SRAM, while storing “soft” real-time control and operating-system (OS) tasks in the cache memory. L2 memory is a larger, bulk memory storage block that offers slightly reduced performance, but is still faster than off-chip memory.

Every Blackfin processor employs multiple power-saving techniques based on a gated-clock core design that selectively powers down functional units on an instruction-by-instruction basis. These processors also support multiple power-down modes for periods where little or no CPU activity is required.

In this self-contained dynamic power-management scheme, the operating frequency and voltage can be independently manipulated to meet the performance requirements of the algorithm currently being executed. Most Blackfin processors offer on-chip core voltage-regulation circuitry as well as operation to as low as 0.8 V, and they’re particularly well suited for portable applications that require extended battery life.
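Why lower both voltage and frequency? A common first-order model of CMOS logic, which is a textbook approximation rather than anything stated in this article, puts dynamic power in proportion to the square of the supply voltage times the clock frequency. The sketch below applies that model. Apart from the 0.8-V floor mentioned above, the operating points are hypothetical and not taken from any Blackfin data sheet.

```python
def relative_dynamic_power(volts, freq_mhz, ref_volts, ref_freq_mhz):
    """Dynamic power relative to a reference point, assuming P ~ V^2 * f."""
    return (volts / ref_volts) ** 2 * (freq_mhz / ref_freq_mhz)


# Hypothetical operating points: 1.2 V at 600 MHz versus 0.8 V at 200 MHz.
ratio = relative_dynamic_power(0.8, 200, ref_volts=1.2, ref_freq_mhz=600)
print(round(ratio, 3))
```

Under this model, dropping the frequency alone cuts power linearly, while dropping the voltage as well adds a quadratic saving. Leakage power is ignored here.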

Blackfin processors come with a variety of microcontroller-style peripherals, including 10/100 Ethernet MAC, UARTs, SPI, CAN controller, timers with pulse-width-modulation (PWM) support, watchdog timers, real-time clock, and a glueless synchronous and asynchronous memory controller.

MULTICORE DSPs A good example of a multicore DSP is Freescale’s MSC8144 DSP, which is based on the company’s StarCore technology, specifically the third-generation SC3400 DSP core.

The chip incorporates four DSP subsystems. Within each subsystem is an SC3400 DSP core, 16-kbyte L1 instruction cache, 32-kbyte L1 data cache, memory management unit (MMU), extended programmable interrupt controller (EPIC), and two general-purpose 32-bit timers. The subsystem has debug and profiling support and low-power Wait and Stop processing modes. Each DSP core runs at up to 1 GHz, so when the workload divides cleanly across all four cores, the chip delivers up to the equivalent performance of a 4-GHz single-core DSP.

The MSC8144 also contains the company’s QUICC Engine technology subsystem, which includes dual RISC processors, 48-kbyte multi-master RAM, and 48-kbyte instruction RAM. This subsystem supports three communication controllers with one asynchronous transfer mode (ATM) and two Gigabit Ethernet interfaces. It can offload scheduling tasks from the DSP cores as well.

The ATM controller supports UTOPIA level II 8/16 bits at 25/50 MHz in UTOPIA/POS mode with adaptation layer support for AAL0, AAL2, and AAL5. The two Ethernet controllers support 10/100/1000-Mbit/s operation via MII/RMII/SMII/RGMII/SGMII, with the SGMII protocol using a four-pin serializer/deserializer (SERDES) interface at a 1000-Mbit/s data rate only.

Like the DSP chips mentioned earlier, this one surrounds the DSP and QUICC subsystems with memory, interfaces, and I/Os. As for memory, the chip contains a 128-kbyte L2 shared instruction cache, 512 kbytes of M2 memory for critical data and temporary data buffering, a 96-kbyte boot ROM, and a whopping 10 Mbytes of 128-bit-wide M3 memory.

DDR and DMA controllers also reside on the chip. The DDR controller has up to a 200-MHz clock (400-MHz data rate) and a 16/32-bit data bus. It supports up to 1 Gbyte of DDR1 and DDR2 in one or two banks. The DMA controller has 16 bidirectional channels with up to 1024 buffer descriptors and programmable priority, buffer, and multiplexing configuration.
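The DDR figures above translate directly into a theoretical peak transfer rate, since double-data-rate memory moves data on both clock edges. The sketch below only does that arithmetic. The function name is made up for illustration, and the results are upper bounds that ignore refresh, bank conflicts, and bus turnaround.

```python
def ddr_peak_mbytes_per_s(clock_mhz, bus_bits):
    """Theoretical peak DDR bandwidth in Mbytes/s (two transfers per clock)."""
    transfers_per_us = clock_mhz * 2      # 200-MHz clock gives a 400-MHz data rate
    return transfers_per_us * bus_bits / 8


print(ddr_peak_mbytes_per_s(200, 32))  # 32-bit data bus
print(ddr_peak_mbytes_per_s(200, 16))  # 16-bit data bus
```

Halving the bus width halves the peak rate, which is the tradeoff behind the selectable 16/32-bit bus.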

A chip-level arbitration and switching system (CLASS) provides full fabric non-blocking arbitration between the processing elements (and other initiators) and targets such as the M2 memory, DDR SDRAM controller, and device configuration control and status registers.

The MSC8144 supports next-generation and legacy interfaces, such as dual Gigabit Ethernet, Serial RapidIO interconnect, UTOPIA, PCI, and time-division multiplexing (TDM).

The Serial RapidIO 1x/4x endpoint corresponds to Specification 1.2 of the RapidIO trade association. It supports read, write, messages, doorbells, and maintenance accesses in inbound mode and messages and doorbells in outbound mode. The PCI interface complies with PCI specification revision 2.2 at 33 or 66 MHz with access to all PCI address spaces.

Up to eight on-chip independent TDM modules offer features like programmable word size (2-, 4-, 8-, or 16-bit), hardware-based A-law/µ-law conversion, up to a 128-Mbit/s data rate for all channels, a glueless interface to E1 or T1 framers, and the ability to interface with H-MVIP/H.110 devices, TSI, and codecs such as AC’97.
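For readers unfamiliar with companding, the following sketch shows the idea behind µ-law conversion using the continuous textbook formula with µ = 255. It is not a description of the MSC8144’s hardware converter. Telephony codecs following ITU-T G.711 use a piecewise-linear 8-bit approximation of this curve, and the function names here are made up for illustration.

```python
import math

MU = 255  # value used in North American and Japanese telephony


def mu_law_compress(x):
    """Compress a sample in [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log(1 + MU * abs(x)) / math.log(1 + MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


quiet = mu_law_compress(0.01)   # small inputs are boosted before quantizing
restored = mu_law_expand(quiet)
print(round(quiet, 3), round(restored, 4))
```

The curve spends more of the output range on quiet samples, so an 8-bit code word can cover the dynamic range of speech. A-law follows the same principle with a different curve.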

With its multicore architecture and next-generation and legacy interfaces, the MSC8144 DSP is well-suited for high-capacity infrastructure applications. These include triple-play (voice, video, and data) services, carrier class/enterprise Voice over Internet Protocol (VoIP) media gateway equipment, video-conferencing equipment, and WCDMA and WiMAX base stations.