Electronic Design

Highly Parallel DSP Architecture Makes Quick Work Of Image-Processing Algorithms

The combination of a high-speed RISC processor and an array of DSP blocks that form a single-instruction/multiple-data (SIMD) parallel processor delivers 4000 MIPS of compute throughput. Targeted at image-processing applications, the CW4011 visual signal processor (ViSP) is the first implementation of an embeddable core developed by ChipWrights Inc. of Newton, Mass.

This core supports a vector array of up to 32 fully pipelined DSP units that can perform 32 multiply-accumulates (MACs) every instruction cycle. Clocked at 266 MHz and performing 32 8-bit MACs per cycle, it can therefore deliver about 8500 MMACs (million MACs per second). Designed for low-power operation, it consumes just 0.1 mW/MIPS from a 1.5-V supply.
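The headline figures follow from simple products, and checking them makes the units explicit. The sketch below multiplies MACs per cycle by the clock rate. It also multiplies the 0.1-mW/MIPS efficiency figure by the 4000-MIPS rating. That second product assumes the efficiency figure holds at full throughput, which the article does not state. The helper names are illustrative only.

```python
# Back-of-envelope check of the quoted throughput and power figures.

def peak_mmacs(macs_per_cycle: int, clock_mhz: float) -> float:
    """Peak rate in millions of MACs per second: MACs/cycle x clock (MHz)."""
    return macs_per_cycle * clock_mhz

def power_mw(mw_per_mips: float, mips: float) -> float:
    """Power implied by an efficiency figure given in mW per MIPS."""
    return mw_per_mips * mips

mmacs = peak_mmacs(32, 266.0)   # 32 x 266 = 8512, quoted as "8500 MMACs"
mw = power_mw(0.1, 4000.0)      # assumes 0.1 mW/MIPS applies at the full 4000-MIPS rating
print(f"{mmacs:.0f} MMACs, about {mw:.0f} mW")
```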

That low power consumption suits system-on-a-chip (SoC) solutions for portable imaging applications like digital cameras. The image-processing capabilities can also be used in laser and color printers, scanners, image transcoders, photo kiosks, and digital color copiers. The CW4011 can handle real-time video processing, too: running an MPEG-2 encoder, it can produce main-profile, main-level (MP@ML) coded data.

To achieve the high throughput, ChipWrights crafted the heart of the CW4011—the CWv8 processor core—to include both a RISC processor and the vector SIMD array (see the figure). The RISC processor (dubbed the serial datapath unit since it executes instructions sequentially) coordinates the algorithmic operations and has a moderate throughput of about 150 MIPS. The serial datapath lets the entire core function as the master CPU in many embedded applications, eliminating the need for a separate host processor.

The processor's SIMD portion lets a single instruction execute simultaneously on several DSP units (two, four, eight, or 16) called parallel datapath units. The CW4011 implementation uses eight of them. Each operates on 32-bit longwords, 16-bit words, or 8-bit bytes and contains a 31-word by 32-bit register file, an extractor, a multiplier, an ALU with accumulator, and an inserter. Because individual datapaths can be enabled or disabled during operation, software can exercise some degree of power management.
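The SIMD model described above can be pictured in software. One operation is broadcast to every lane, and a per-lane enable bit decides which lanes actually execute. The following is a minimal behavioral sketch, not ChipWrights' programming interface. `simd_step`, the bitmask encoding of the enable state, and the 32-bit wraparound are all assumptions made for illustration.

```python
from typing import Callable, List

MASK32 = 0xFFFF_FFFF  # each lane holds 32-bit longwords

def simd_step(acc: List[int], operand: List[int], enable_mask: int,
              op: Callable[[int, int], int]) -> List[int]:
    """Hypothetical model: broadcast one operation to every lane.
    Bit i of enable_mask gates lane i."""
    result = []
    for lane, (a, x) in enumerate(zip(acc, operand)):
        if (enable_mask >> lane) & 1:
            result.append(op(a, x) & MASK32)  # enabled lane: execute, wrap to 32 bits
        else:
            result.append(a)                  # disabled lane: state left unchanged
    return result

# Eight lanes, as in the CW4011, all running the same multiply-accumulate
# with a broadcast coefficient of 3; only lanes 0-3 are enabled.
coeff = 3
acc = simd_step([0] * 8, [1, 2, 3, 4, 5, 6, 7, 8], 0b0000_1111,
                lambda a, x: a + x * coeff)
print(acc)  # [3, 6, 9, 12, 0, 0, 0, 0]
```

In hardware, a disabled datapath would also stop switching, and that is where the power saving comes from. The sketch only shows that the disabled lanes' state is preserved.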

Each parallel datapath's multiplier performs 32- by 16-bit two's-complement multiplication. For smaller operands, it can be logically subdivided to perform two 16- by 16-bit or four 8- by 16-bit multiplications in a single cycle. It can also complete a full 32- by 32-bit multiplication in two cycles. The multiplier can be configured to compute vector dot products or sums of absolute differences (SADs), which are handy for frame-to-frame motion estimation in MPEG codecs.
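The multiplier modes above can be modeled arithmetically. This sketch makes several assumptions that the article does not spell out. In 8- by 16-bit mode, each signed byte of the 32-bit operand is taken to multiply one shared signed 16-bit operand. The two-cycle 32- by 32-bit product is formed by splitting the second operand into a signed high half and an unsigned low half. The SAD routine and the exhaustive block search are the standard textbook motion-estimation technique, not the CW4011's firmware. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mul_4x8x16(packed: int, coeff: int) -> List[int]:
    """Four 8- by 16-bit products: each signed byte of a 32-bit word times one
    signed 16-bit operand (lane 0 = least-significant byte). Assumed layout."""
    c = to_signed(coeff, 16)
    return [to_signed(packed >> (8 * lane), 8) * c for lane in range(4)]

def mul_32x32_two_pass(a: int, b: int) -> int:
    """Signed 32- by 32-bit product from two 32- by 16-bit partial products."""
    a, b = to_signed(a, 32), to_signed(b, 32)
    b_lo = b & 0xFFFF           # low half of b, treated as unsigned
    b_hi = b >> 16              # high half of b, signed (arithmetic shift)
    p_lo = a * b_lo             # pass 1
    p_hi = a * b_hi             # pass 2
    return (p_hi << 16) + p_lo  # b == b_hi * 2**16 + b_lo, so this equals a * b

def sad(block_a: List[List[int]], block_b: List[List[int]]) -> int:
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(x - y)
               for row_a, row_b in zip(block_a, block_b)
               for x, y in zip(row_a, row_b))

def best_motion_vector(ref: List[List[int]], cur: List[List[int]],
                       top: int, left: int,
                       search: int) -> Optional[Tuple[int, int, int]]:
    """Exhaustive block match: return (cost, dy, dx) minimizing SAD for the
    n-by-n block `cur` within +/-search pixels of (top, left) in `ref`."""
    n = len(cur)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > len(ref) or x + n > len(ref[0]):
                continue  # candidate falls outside the reference frame
            candidate = [row[x:x + n] for row in ref[y:y + n]]
            cost = sad(cur, candidate)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best

ref = [[r * 10 + c for c in range(6)] for r in range(6)]
cur = [row[3:5] for row in ref[2:4]]          # 2x2 block that sits at (2, 3)
print(best_motion_vector(ref, cur, 1, 2, 2))  # (0, 1, 1)
```

On the chip, the inner SAD accumulation is the part that would map onto the parallel datapaths. The sketch runs it serially for clarity.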

The serial datapath unit resembles a conventional RISC CPU and coordinates the parallel datapaths, supplying them with address and extract/insert information. It also accesses the control registers, manages the program counter, and has its own register file of 32 longword registers. Although this unit has its own set of RISC-type instructions, it shares a single instruction stream with the parallel datapaths.
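The extract/insert information mentioned above can be illustrated with generic bit-field operations. This sketch assumes that the serial unit supplies a bit offset and a field width, that the extractor pulls that field out of a 32-bit longword, and that the inserter writes a field back. The article does not define the actual encoding, so `extract` and `insert` are hypothetical names for a behavioral model only.

```python
MASK32 = 0xFFFF_FFFF

def extract(word: int, offset: int, width: int, signed: bool = False) -> int:
    """Pull a `width`-bit field starting at bit `offset` out of a 32-bit word."""
    field = (word >> offset) & ((1 << width) - 1)
    if signed and field >> (width - 1):
        field -= 1 << width  # sign-extend a two's-complement field
    return field

def insert(word: int, field: int, offset: int, width: int) -> int:
    """Write the low `width` bits of `field` into `word` at bit `offset`."""
    mask = ((1 << width) - 1) << offset
    return ((word & ~mask) | ((field << offset) & mask)) & MASK32

word = 0x11223344
fields = [extract(word, 8 * i, 8) for i in range(4)]  # bytes, least significant first
print([hex(f) for f in fields])                       # ['0x44', '0x33', '0x22', '0x11']
print(hex(insert(word, 0xAB, 16, 8)))                 # 0x11ab3344
```

Keeping offset and width as parameters, rather than hard-coding byte lanes, mirrors the article's point that one datapath handles longwords, words, and bytes.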

In addition to the datapath units, the CW4011's processor block includes an instruction cache that holds up to 2048 32-bit instructions (8 kbytes) and is direct-mapped to the datapaths. The processor block also contains a primary memory organized as four interleaved banks of 8 kwords by 32 bits each (128 kbytes total). Up to four 32-bit longwords can be read or written in each instruction cycle, easing memory-bandwidth bottlenecks. A direct-memory-access controller and a system bus controller manage data exchanges between the cache, the primary memory, and off-chip data sources and destinations.
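Interleaving is what lets four longwords move in one cycle, because accesses that land in different banks do not compete. The sketch below checks the 128-kbyte capacity. It then models one common scheme, low-order interleaving, in which the word address modulo 4 selects the bank. The article does not say how the CW4011 maps addresses to banks, so that mapping and the helper names are assumptions.

```python
from typing import Iterable, Tuple

NUM_BANKS = 4
WORDS_PER_BANK = 8 * 1024   # 8 kwords per bank
BYTES_PER_WORD = 4          # 32-bit longwords

TOTAL_WORDS = NUM_BANKS * WORDS_PER_BANK    # 32768 longwords
TOTAL_BYTES = TOTAL_WORDS * BYTES_PER_WORD  # 131072 bytes = 128 kbytes

def bank_and_row(word_addr: int) -> Tuple[int, int]:
    """Assumed low-order interleaving: the two LSBs of the word address pick
    the bank, and the remaining bits pick the row within that bank."""
    if not 0 <= word_addr < TOTAL_WORDS:
        raise ValueError("address outside primary memory")
    return word_addr % NUM_BANKS, word_addr // NUM_BANKS

def one_cycle_access(word_addrs: Iterable[int]) -> bool:
    """True if at most four accesses all hit distinct banks, so this model
    could serve them in the same cycle."""
    banks = [bank_and_row(a)[0] for a in word_addrs]
    return len(banks) <= NUM_BANKS and len(set(banks)) == len(banks)

print(TOTAL_BYTES // 1024, "kbytes")           # 128 kbytes
print(one_cycle_access([100, 101, 102, 103]))  # True: banks 0, 1, 2, 3
print(one_cycle_access([0, 4]))                # False: both in bank 0
```

Under this mapping, sequential pixel data streams at the full four-longword rate. Strides that are multiples of four words would collide in a single bank.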

To make programming as easy as possible, the visual signal processor can be programmed in a high-level language such as C or in the core's native assembly language, CAS. Software development is supported by the CodeWarrior tools from Metrowerks. The full development kit includes an optimizing C compiler, an assembler, a cycle-accurate software simulator, a visual debugger that works over the core's JTAG test port, and a performance profiler.

ChipWrights developed the CW4011 as a test chip to demonstrate the core's performance. Along with the datapaths and supporting caches and logic, it includes an SDRAM controller that supports 16- or 32-bit data paths at up to 133 MHz, a 16-bit host-peripheral interface, an 8/16-bit-wide interface port, and a host of basic peripherals: DMA channels, counters, SPI and UART serial ports, and up to 32 general-purpose I/O pins.

For core licensing information, contact ChipWrights at www.chipwrights.com or (617) 928-0100.
