More system resources currently concentrate on the man-machine interface that's implemented by graphics, audio, and video signal processing. At the same time, these signal processors demand additional computing performance. A number of media processors have been released for computer or set-top-box solutions. But they've typically been designed so that those systems will meet a specific performance target. A media explosion is taking place, though, with many new classes of systems—from small-format handheld platforms to high-end home-theater systems requiring audio and video capabilities.
The solution is a multiprocessor that makes it easy to develop applications that can run on a wide range of platforms. On the low end, this processor must be able to deliver the audio, video, and graphics at a cost appropriate for personal digital assistants (PDAs). Yet it has to provide outstanding performance for home theaters at the high end.
Tackling this range of requirements is the DVine architecture developed by Silicon Magic. It provides a modular, scalable processor that leverages embedded DRAM to deliver performance levels many times those of RISC or CISC architectures. The company will offer it both as an off-the-shelf chip for evaluation and as a core that companies can license to craft custom solutions. The first implementation is designed to be fabricated on a 0.18-µm, five-layer CMOS process that allows input clocks of 100 MHz. It includes large amounts of embedded DRAM integrated with the logic.
DVine stands for DRAM vector engine. The architecture is based on a symmetrical multiprocessor approach with single-instruction/multiple-data (SIMD) extensions. That combination allows it to deliver compute throughputs higher than those of CISC or RISC processors, or even specialized media engines. DVine also uses well-understood programming techniques employing C-language constructs, so anyone familiar with C programming can craft application software.
The architecture consists of two main modular sections. A compute module contains both scalar and vector processors. The memory-interface unit (MIU) ties banks of embedded DRAM into 128-bit-wide buses that connect everything (Fig. 1). Aside from that pair of blocks, DVine includes an external bus-interface unit (XBIU) that connects the chip to the host system. A data-flow controller (DFC) coordinates the movement of data between the MIUs and the compute modules.
Depending on the amount of horsepower needed, designers can combine multiple compute and memory modules on the same chip. To perform HDTV decoding, which requires MPEG-2 decoding at MP@HL resolution, a designer can use 11 compute modules and eight memory modules. A decoder for a DVD player performs MPEG-2 MP@ML decoding. It requires two compute modules and two memory modules, while a video phone that employs an MPEG-4 video codec algorithm needs just two compute modules and one memory module. Other systems, like an MP3 recorder/player that uses an MPEG-1 layer III codec or a digital camera that performs JPEG image coding, require just one of each.
With the combination of scalar and vector processors in the compute module, a single module can perform both setup and control of the vector computations. The vector engine rips through the computations needed for the audio, graphics, and video algorithms. The scalar engine is a RISC processor with a MIPS-like architecture that executes a single-issue, in-order instruction stream. With an input clock speed of 100 MHz, the processor delivers a throughput of 200 MIPS.
Inside the processor is a five-stage, pipelined, 32-bit data path and separate paths for instruction and data flows. The designers even added special registers to aid in processor-to-processor communications. In the compute module, a block of fast static RAM serves as an instruction cache. A register file that's shared between the scalar and vector processors allows the two units to exchange data easily.
To tie into the wide internal buses, the compute block includes a data-communications-channel controller (DCC) and a direct-memory-access (DMA) controller. The DCC is implemented with a multi-channel crossbar bus that can connect any computing module to any memory-interface unit or to the external bus-interface unit. With the wide buses, the DCC delivers an overall memory bandwidth of 3.2 Gbytes/s.
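The 3.2-Gbyte/s figure is consistent with one 128-bit transfer per clock at 200 MHz. The short calculation below is only a sanity check under that assumption: the 200-MHz internal rate is inferred from the 200-MIPS rating of the single-issue scalar engine, while the article itself states only the 100-MHz input clock. The function and constant names are illustrative.

```python
BUS_WIDTH_BITS = 128               # bus width quoted in the article
INTERNAL_CLOCK_HZ = 200_000_000    # ASSUMPTION: input clock doubled on chip


def bus_bandwidth_bytes_per_s(width_bits, clock_hz):
    """Peak bandwidth of a bus that moves one full word per clock cycle."""
    return (width_bits // 8) * clock_hz


bandwidth = bus_bandwidth_bytes_per_s(BUS_WIDTH_BITS, INTERNAL_CLOCK_HZ)
print(f"{bandwidth / 1e9:.1f} Gbytes/s")
```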
The companion vector engine also is a single-issue, in-order execution processor. It employs a 16-byte vector width, supports variable vector lengths, and achieves a raw throughput of 6.4 GOPS. In the vector unit, 16 identical 16-bit data paths speed the computations. That unit can execute zero-overhead loop iterations and perform horizontal data swapping to rapidly manipulate data structures.
The 16-channel SIMD vector processor is based on a dual-execution unit architecture, rather than the more common multiplier-accumulator (MAC) approach. It offers a very efficient structure for performing motion estimation, which is a key element in image-processing algorithms. At the same time, it doesn't preclude the implementation of MAC functions.
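To make the motion-estimation workload concrete, here is a minimal sketch of block matching with a sum of absolute differences (SAD). The article does not say which matching metric DVine firmware uses, so SAD is an assumption, chosen because it is the common one. The Python lists stand in for 16-lane vectors, and the function names are hypothetical.

```python
LANES = 16  # one sample per lane, mirroring the 16-channel SIMD unit


def sad_16(current, candidate):
    """Sum of absolute differences over one 16-sample vector."""
    if len(current) != LANES or len(candidate) != LANES:
        raise ValueError("both vectors must hold exactly 16 samples")
    # Lane-wise step: every lane computes |a - b| independently.
    lane_diffs = [abs(a - b) for a, b in zip(current, candidate)]
    # Horizontal step: reduce the 16 lane results to a single score.
    return sum(lane_diffs)


def best_match(current, candidates):
    """Return (index, score) of the candidate with the lowest SAD."""
    scores = [sad_16(current, c) for c in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The subtract and absolute-value work is independent per lane, which is why this kind of kernel maps well onto a wide SIMD unit. Only the final reduction crosses lanes.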
Each compute element in the vector engine processes one data sample in parallel with the other elements, so the processor can perform up to 16 operations in parallel. It can thereby deliver the high throughput necessary to handle applications such as motion estimation, motion compensation, quantization, filtering, scaling, and discrete cosine transforms.
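One plausible way to arrive at the 6.4-GOPS raw figure is 16 lanes, times two execution units per lane, times a 200-MHz internal clock. Both the "two operations per lane per cycle" reading of the dual-execution-unit design and the 200-MHz rate are assumptions, not statements from the article, so treat this as a consistency check rather than a vendor formula.

```python
LANES = 16                         # parallel data paths, from the article
EXECUTION_UNITS = 2                # ASSUMPTION: two ops per lane per cycle
INTERNAL_CLOCK_HZ = 200_000_000    # ASSUMPTION: input clock doubled on chip


def peak_ops_per_second(lanes, units, clock_hz):
    """Raw peak rate if every lane and unit retires one op each cycle."""
    return lanes * units * clock_hz


peak = peak_ops_per_second(LANES, EXECUTION_UNITS, INTERNAL_CLOCK_HZ)
print(f"{peak / 1e9:.1f} GOPS")
```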
Strength In Numbers
Multiple compute modules can be co-integrated on a single chip, enabling designers to create monolithic, symmetric multiprocessor systems. To give those modules access to data at speeds that won't slow down the calculations, the on-chip MIUs combine embedded blocks of high-performance DRAM with streaming-memory processors that have been optimized for audio/video algorithms. Employing a packetized data-access protocol ensures that the required MIU arithmetic operation is embedded in each data transfer. Multidimensional data accesses with a variable stride can then be implemented.
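A strided, multidimensional access can be pictured as a small request descriptor that the memory side expands into addresses. The sketch below is illustrative only: the `AccessPacket` fields and the `op` tag are hypothetical and do not describe DVine's actual packet format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPacket:
    """Hypothetical request: a rows x cols block inside a larger 2-D array."""
    base: int          # byte address of the block's first element
    row_stride: int    # bytes between the starts of consecutive rows
    rows: int
    cols: int
    elem_size: int = 1
    op: str = "copy"   # operation the memory side would apply (illustrative)


def expand_addresses(packet):
    """Yield the byte address of every element the packet touches."""
    for r in range(packet.rows):
        row_start = packet.base + r * packet.row_stride
        for c in range(packet.cols):
            yield row_start + c * packet.elem_size


# Fetch a 2 x 4 block from a frame whose rows are 720 bytes apart.
block = AccessPacket(base=100, row_stride=720, rows=2, cols=4)
print(list(expand_addresses(block)))
```

Because the stride is a field of the request, the same descriptor can walk rows, columns, or subsampled blocks without the compute module generating each address itself.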
The memory banks controlled by the MIU are crafted from the company's high-speed memory cells. Similar structures are used in the embedded DRAM employed in Silicon Magic's graphics processors, so these cells have already been proven.
Supporting the memory banks in each MIU is a streaming-media processor (Fig. 2). It performs operations on streaming data to and from the memory, such as interpolation, decimation, alignment, and address translation. The RISC engine in the compute module can then focus on data setup and control functions, simplifying system operation.
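As a rough illustration of the kind of work that moves off the RISC engine, the sketch below shows decimation and a simple 2x linear interpolation on a sample stream. These are textbook versions written for clarity. The article does not describe the filters the streaming-media processor actually applies, and the function names are hypothetical.

```python
def decimate(samples, factor):
    """Keep every factor-th sample (no anti-alias filter in this sketch)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return samples[::factor]


def interpolate_2x(samples):
    """Double the sample count by inserting the mean of each adjacent pair."""
    if not samples:
        return []
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        out.append((a + b) / 2)
    out.append(samples[-1])  # last sample has no right-hand neighbor
    return out
```

For example, `decimate([0, 1, 2, 3, 4, 5], 2)` keeps the samples at even positions, and `interpolate_2x([0, 10, 20])` inserts the midpoints 5 and 15 between the original values.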
To feed the data to the compute modules, the streaming-memory processor in each MIU takes advantage of the low-latency, high-bandwidth DRAM. By embedding that DRAM in the memory-interface unit, the designers were able to eliminate the access delays inherent whenever data must be transferred between devices. Every MIU addresses two internal memory modules. Each module contains a pair of half-megabyte banks of DRAM. The actual amount of memory can be changed to meet the application's requirements.
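The default capacity per MIU follows directly from those numbers. A quick check, using only the figures quoted above (the names are illustrative):

```python
MODULES_PER_MIU = 2            # internal memory modules addressed by each MIU
BANKS_PER_MODULE = 2           # a pair of banks per module
BANK_SIZE_BYTES = 512 * 1024   # half a megabyte per bank


def dram_bytes(miu_count):
    """Embedded DRAM capacity for a chip with miu_count default-sized MIUs."""
    return miu_count * MODULES_PER_MIU * BANKS_PER_MODULE * BANK_SIZE_BYTES


for mius in (1, 2):
    print(f"{mius} MIU(s): {dram_bytes(mius) // (1024 * 1024)} Mbytes")
```

One MIU works out to 2 Mbytes, so a two-MIU chip carries 4 Mbytes, which matches the figure quoted for the evaluation chip.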
Supporting the streaming controller is an access controller that contains all of the circuitry needed to drive, address, and refresh the embedded DRAM. The streaming-memory processor performs advanced data operations, such as logical-to-scatter-mode address conversion, data interpolation, and data subsampling. The MIU also includes a DCC block that ties it into the wide on-chip buses.
The 64-bit XBIU interface joins the DVine engine to a host system. It carries both data and commands to and from that engine. When massively parallel computations must be done, multiple chips can be interconnected. The XBIU interface also can support off-chip SDRAM if an optional SDRAM controller is added to the chip and/or core.
Aside from the dual 128-bit-wide buses over which all of the internal data transfers take place, a 32-bit-wide bus forms a "ring" around the chip, interconnecting all of the blocks. This bus transfers control and communications information between modules within the architecture. Any computing module or memory-interface unit can query or set registers within any other compute module or MIU over this bus. The same can be done within the DCC interface, the 8-bit control I/O bus-interface unit, or the data-flow controller.
The control I/O bus connects all of the compute modules to the external bus-interface unit. It provides a path to eight external, configurable I/O pins that are part of the XBIU. A typical use of this bus would be to implement a bidirectional serial I/O bus for inter-device communications.
The 64-bit XBIU interface runs at clock speeds of up to 54 MHz in the initial versions of the DVine silicon. A 256-byte FIFO register is built into it to buffer read and write operations. The interface also includes configuration, general-purpose, and semaphore registers. A directory within it maps the address ranges of on-chip and off-chip resources. A data packet that points to an external address is passed to the external bus, and the appropriate handshaking signals are generated to handle the transfer.
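The directory lookup can be thought of as a range table consulted for each packet's address. The sketch below assumes a simple list of non-overlapping ranges. The address windows and target names are invented for illustration and are not DVine's real memory map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int    # first address in the range
    end: int      # one past the last address (exclusive)
    target: str   # where a packet for this range is sent


# HYPOTHETICAL address windows, chosen only to make the example concrete.
DIRECTORY = [
    Region(0x0000_0000, 0x0040_0000, "on-chip DRAM"),
    Region(0x8000_0000, 0x9000_0000, "external bus"),
]


def route(address, directory=DIRECTORY):
    """Return the target whose range contains the address."""
    for region in directory:
        if region.start <= address < region.end:
            return region.target
    raise LookupError(f"address {address:#x} is not mapped")
```

A packet whose address resolves to the external window would be handed to the external bus, and an unmapped address is reported rather than silently dropped.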
To develop software for the processor, the company created a fully integrated development environment that employs an intuitive graphical user interface. That interface controls a suite of tools that can run on either a PC or a Unix platform (Fig. 3). They include project-management and compilation software, simulation software, a source-level debugger, an execution trace viewer and analyzer, and software-optimization tools. Because the programming can be done in C, software should be easy to write and understand. In comparison, ASICs often require hand-coded software or programs that must run on distributed memory systems or non-symmetrical multiprocessor systems.
The tool suite's software debugger supports programs written in Concurrent C and can also debug at the assembly source-code level. With an instruction-cache model, designers can do in-line probing and view software execution.
Although the commercial versions of the DVine architecture target implementation with a 0.18-µm process, an evaluation chip fabricated using 0.25-µm design rules will be available next quarter. The chip contains six compute modules and two memory-interface units (4 Mbytes of on-chip DRAM).
It will actually come mounted on a PCI card that contains two reprogrammable FPGAs. Designers will be able to use these gate arrays to customize the system interface. Also included on the evaluation card are the video I/O interfaces necessary to implement an MPEG-2 codec. Firmware to implement that codec is supplied, so users can quickly get an application up and running.
To speed the creation of other applications, Silicon Magic has developed a library of C-callable DSP functions that have been optimized for the DVine architecture. These blocks can further reduce program development time and effort.
Price & Availability
Samples of the DVine evaluation chip, PCI card, and development tool suite will be available next quarter. In small quantities, the full development suite sells for $18,500. The core architecture can be licensed, with terms negotiable based on various factors. The DRAM macro is also available for licensing if the silicon supplier does not have such a macro in its library.
Silicon Magic Corp., 920 Stewart Dr., Sunnyvale, CA 94086. Contact Steve Musallam at (408) 331-8000; www.simagic.com.