LSI Designs Bring MPEG-4's Potent Multimedia Performance To Fruition

Feb. 19, 2001

Multimedia is taking off like gangbusters in entertainment, education, and medicine. But the huge amounts of data—text, speech, music, images, graphics, and video—make representation a challenging task indeed. Admittedly, much work has been done in the fields of effective representation by means of compression, storage, and transmission. Yet only scant attention has been paid to content accessibility and manipulation—that is, until MPEG-4 arrived.

Through object-based representation, the crucial ingredient that distinguishes it from earlier MPEG standards, MPEG-4 enables users, for the first time, to combine graphics, text, and synthetic/natural objects into a single bit stream.

Another attribute that makes MPEG-4 so attractive for wireless technology is its support for scalable content. In fact, one of the initial goals that the developers had in mind was to provide tools and algorithms for very low bit-rate coding of audio/visual data. This means you encode just once, but acquire complete freedom to play back at different rates with acceptable quality for whatever communications environment is at hand. For example, in a mobile video-telephony application, the user can request a higher frame rate and spatial resolution for the talking person, and a lower rate for the background objects.

But there's a catch: MPEG-4 is the most complex standard yet in the multimedia sector. Several papers at the International Solid-State Circuits Conference addressed these issues, finding ways to enhance performance and minimize power consumption. One is a Session 9 paper entitled "A 90-mW, MPEG-4 Video CODEC LSI With The Capability For Core Profiles." Its authors are with Matsushita Electric Industrial Co. Ltd., Fukuoka, Osaka, and Kanagawa, Japan.

This chip contains approximately 31 million transistors on an 8.8- by 8.6-mm die. It's made on a 0.18-µm, 1.8-V, quad-metal CMOS process.

High performance, high flexibility, low power, and low cost are all necessities in an LSI design that's optimized for services based on object-based coding in mobile visual applications. As the authors point out, though, general-purpose high-performance processors consume too much power for this role. They were able to devise dedicated hardware that uses less power while delivering higher performance than a software implementation, even one tailored to fit the defined function.

The chip comprises a 20-Mbit embedded DRAM, a programmable DSP, and eight dedicated hardware engines (Fig. 1). It can simultaneously encode and decode QCIF (quarter common intermediate format, 176 by 144 pixels) video at 15 frames/s for H.263 and MPEG-4 simple profile@Level 1. It decodes CIF (352 by 288 pixels) video at 30 frames/s for simple profile@Level 3, and QCIF video at 15 frames/s for core profile@Level 1 with four objects. When operating at 54 MHz and performing simple@L1 simultaneous encoding and decoding, as well as core@L1 decoding, the chip consumes only 90 mW. There also are three interface units: a video-processing unit, a memory interface, and a host interface.
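
For a rough sense of the workload these figures imply, the macroblock rate can be derived from the frame dimensions. The short calculation below is a back-of-the-envelope sketch; only the frame formats and the 54-MHz clock come from the paper, and the cycle budget is simple arithmetic, not a published figure.

```c
/* Back-of-the-envelope macroblock-rate estimate for the quoted formats.
 * QCIF = 176 x 144 pixels, CIF = 352 x 288 pixels; a macroblock is 16 x 16. */
#include <stdio.h>

static long macroblocks_per_second(int width, int height, int fps)
{
    return (long)(width / 16) * (height / 16) * fps;
}

int main(void)
{
    long qcif15 = macroblocks_per_second(176, 144, 15); /* simple@L1 stream */
    long cif30  = macroblocks_per_second(352, 288, 30); /* simple@L3 decode */

    /* Simultaneous QCIF encode + decode handles two 15-frame/s streams. */
    printf("QCIF@15 encode+decode : %ld macroblocks/s\n", 2 * qcif15);
    printf("CIF@30 decode         : %ld macroblocks/s\n", cif30);

    /* Cycles available per macroblock at the 54-MHz operating point. */
    printf("Cycle budget at 54 MHz: ~%ld cycles/macroblock (CIF@30)\n",
           54000000L / cif30);
    return 0;
}
```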

The DSP core employs a vector pipeline architecture, and the chip has two types of dedicated hardware engines. One type operates within the vector pipeline, performing operations like DCT/Q, IQ/IDCT, and DCT/IDCT; the post-noise-reduction and composite engines are of this type. The other type can be thought of as a coprocessor, with the engine and the DSP each performing independent operations. Motion estimation, variable-length coding, variable-length decoding, padding, and context-based binary arithmetic decoding all fall into that category.

Each block uses clock gating, reducing power consumption by 60%. When any of the dedicated hardware engines completes a task, its clock is disabled until the DSP starts the engine the next time.
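
The gating policy can be pictured with a small behavioral model: the DSP starts an engine, the engine's clock runs only while a task is in flight, and the enable drops as soon as the engine signals completion. The sketch below is an illustrative software model of that policy, not the chip's RTL, and the cycle counts in it are invented.

```c
/* Behavioral sketch of per-engine clock gating: the clock enable for a
 * dedicated engine is asserted only between "start" from the DSP and the
 * engine's own "done" condition. */
#include <stdio.h>
#include <stdbool.h>

struct engine {
    const char *name;
    bool clock_enabled;
    int  cycles_left;              /* remaining cycles for the current task */
};

static void dsp_start(struct engine *e, int task_cycles)
{
    e->clock_enabled = true;       /* gate opens when the DSP kicks it off */
    e->cycles_left   = task_cycles;
}

static void tick(struct engine *e)
{
    if (!e->clock_enabled)
        return;                    /* gated: no switching activity */
    if (--e->cycles_left == 0)
        e->clock_enabled = false;  /* task done: gate closes immediately */
}

int main(void)
{
    struct engine vlc = { "VLC", false, 0 };

    dsp_start(&vlc, 3);            /* hypothetical 3-cycle task */
    for (int cycle = 0; cycle < 6; cycle++) {
        tick(&vlc);
        printf("cycle %d: %s clock %s\n", cycle, vlc.name,
               vlc.clock_enabled ? "running" : "gated off");
    }
    return 0;
}
```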

The three dedicated hardware engines devoted to core-profile decoding are the context-based binary arithmetic decoding, padding, and composite engines. The context-based binary arithmetic decoding engine decodes the shape data one binary alpha block at a time. Note that a software implementation couldn't execute the context-based binary arithmetic decoding at high speed, due to the many bit operations and the complex conditional branching involved.
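
Part of what makes shape decoding so branch-heavy is the context computation itself: for every binary alpha pixel, a context index is assembled bit by bit from already-decoded neighbors before the arithmetic decoder can look up a probability. The sketch below shows only that context-formation step, assuming the 10-pixel neighborhood template of MPEG-4 intra shape coding; the bit ordering and border handling are simplified, and the arithmetic decoding stage itself is omitted.

```c
/* Sketch of context formation for intra shape (binary alpha) decoding.
 * Out-of-block neighbors are treated as 0 here, a simplification of the
 * standard's border rules, and the bit ordering is illustrative only. */
#include <stdio.h>

#define BAB 16   /* a binary alpha block is 16 x 16 pixels */

/* Relative (dx, dy) offsets of a 10-pixel intra template. */
static const int tmpl[10][2] = {
    {-1, 0}, {-2, 0},                              /* current row     */
    {-2, -1}, {-1, -1}, {0, -1}, {1, -1}, {2, -1}, /* row above       */
    {-1, -2}, {0, -2}, {1, -2}                     /* two rows above  */
};

static int pixel(const unsigned char alpha[BAB][BAB], int x, int y)
{
    if (x < 0 || y < 0 || x >= BAB || y >= BAB)
        return 0;                                  /* simplified border */
    return alpha[y][x] & 1;
}

/* Build the 10-bit context index for pixel (x, y). */
static int intra_context(const unsigned char alpha[BAB][BAB], int x, int y)
{
    int ctx = 0;
    for (int i = 0; i < 10; i++)
        ctx = (ctx << 1) | pixel(alpha, x + tmpl[i][0], y + tmpl[i][1]);
    return ctx;                                    /* probability-table index */
}

int main(void)
{
    unsigned char alpha[BAB][BAB] = { {0} };
    alpha[3][4] = 1;                               /* a decoded neighbor */
    printf("context at (5,4): %d\n", intra_context(alpha, 5, 4));
    return 0;
}
```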

To reduce power consumption in the external I/O circuits, the chip employs embedded DRAM. Four 4-Mbit DRAM macros for the core functions and two 2-Mbit DRAM macros for the display are integrated into the single chip, for a total of 20 Mbits.

In the case of a DRAM, the higher the access activity, the larger the access current. The access current also depends on the memory capacity per macro. Successively dividing the DRAM macro into smaller and smaller slices diminishes the power consumption of the embedded DRAM, but the area of a multi-macro arrangement grows in comparison with a single-macro scheme. A configuration comprising four 4-Mbit DRAM macros is used here. The access activity for simple@L1 simultaneous encoding and decoding is about 15%, whereas the estimated activity including graphics data is around 50%. Note that the access activities for the work area and the display area are separate.
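
A crude way to see the trade-off is to model embedded-DRAM power as access activity times per-macro access current, with a fixed area overhead per additional macro. In the toy model below, only the 15% and 50% activity figures come from the article; the current and area constants are placeholders chosen purely to make the shape of the trade-off visible.

```c
/* Toy model of the embedded-DRAM partitioning trade-off: finer macro slicing
 * lowers per-access current (smaller arrays) but adds area overhead.
 * All constants except the access activities are invented. */
#include <stdio.h>

int main(void)
{
    const double activity[] = { 0.15, 0.50 };   /* video only / with graphics */
    const char  *label[]    = { "video only", "with graphics" };

    for (int macros = 1; macros <= 8; macros *= 2) {
        /* Assume access current shrinks with macro size and each extra macro
         * costs ~5% area overhead -- both placeholder figures. */
        double access_current = 100.0 / macros;        /* arbitrary units */
        double area_overhead  = 1.0 + 0.05 * (macros - 1);

        printf("%d macro(s): area x%.2f", macros, area_overhead);
        for (int i = 0; i < 2; i++)
            printf(" | %s power %.1f", label[i], activity[i] * access_current);
        printf("\n");
    }
    return 0;
}
```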

High Performance At 160 mW

In another proposed system, an 80-MHz, ARM9-compatible RISC processor sports a five-stage pipeline and a 32- by 32-bit MAC unit in the data path to enhance its multimedia processing capability. The MAC unit improves performance by up to 23% when executing computation-intensive routines, such as DCT/IDCT, compared to a conventional multiplier-only data path. The overall system consumes just 160 mW with all functions operative.
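
The benefit of a MAC unit in DCT/IDCT-style kernels is that each inner-loop step becomes a single multiply-accumulate instead of a multiply followed by a separate add. Below is a minimal illustration using a plain dot product as a stand-in for one row of an 8-point transform; the 23% figure above is the paper's measurement, not something this sketch reproduces.

```c
/* One row of an 8-point transform reduces to dot products of the input with
 * fixed coefficient vectors. With a 32x32-bit MAC, each term is one
 * multiply-accumulate; without it, each term is a multiply plus an add. */
#include <stdio.h>
#include <stdint.h>

static int64_t dot8(const int32_t x[8], const int32_t c[8])
{
    int64_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += (int64_t)x[i] * c[i];   /* maps to one MAC per iteration */
    return acc;
}

int main(void)
{
    /* Arbitrary sample row and coefficient row, just to exercise the loop. */
    const int32_t x[8] = { 10, 12, 14, 16, 16, 14, 12, 10 };
    const int32_t c[8] = {  4,  3,  2,  1, -1, -2, -3, -4 };

    printf("accumulated value: %lld\n", (long long)dot8(x, c));
    return 0;
}
```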

This approach is described in a Session 9 paper entitled "An 80-/20-MHz, 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator And 3D-Rendering Engine For Mobile Applications." Its authors are with the department of electrical engineering at the Korea Advanced Institute of Science and Technology (KAIST), in Taejon. The team developed an integrated low-power, programmable processor with dedicated accelerators and an embedded DRAM macro to perform multimedia functions.

Motion compensation is the most computation- and memory-I/O-intensive portion of the overall decoding algorithm. It's mapped onto dedicated hardware to support an MPEG-4 simple-profile (SP) QCIF video stream at 15 frames/s. Through a 128-bit internal bus between an embedded DRAM frame buffer and a logic core, the motion-compensation accelerator, which comprises eight processing elements, can process data in parallel at 20 MHz. Also, the integrated DRAM frame buffer has a structure that's tightly coupled with the access patterns for motion compensation. It eliminates any need for an external data I/O and its attendant power consumption during data processing.
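
Functionally, the accelerator's job is the standard motion-compensation copy: fetch a predicted block from the reference frame at the offset given by the motion vector and add the decoded residual. The scalar sketch below shows that kernel; the chip's version moves 128 bits (sixteen pixels) per access over the internal bus, which is what the eight processing elements exploit. The frame dimensions and the lack of edge padding here are assumptions made for illustration.

```c
/* Scalar sketch of block motion compensation: predicted pixels come from the
 * reference frame at (x+mvx, y+mvy); the residual is added and clipped.
 * The real accelerator fetches 16 pixels (128 bits) per access in parallel. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define W 176   /* assumed QCIF luma width  */
#define H 144   /* assumed QCIF luma height */

static uint8_t clip255(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

static void mc_block16(const uint8_t *ref, uint8_t *cur,
                       int bx, int by, int mvx, int mvy,
                       const int16_t residual[16][16])
{
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++) {
            int rx = bx + x + mvx;
            int ry = by + y + mvy;
            int pred = ref[ry * W + rx];              /* no edge padding here */
            cur[(by + y) * W + (bx + x)] = clip255(pred + residual[y][x]);
        }
}

int main(void)
{
    static uint8_t ref[W * H], cur[W * H];
    static int16_t residual[16][16];                  /* all zeros: pure copy */

    memset(ref, 128, sizeof ref);
    mc_block16(ref, cur, 16, 16, -2, 1, residual);
    printf("reconstructed sample: %d\n", cur[17 * W + 17]);
    return 0;
}
```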

After the RISC CPU preprocesses the input polygon data, the 3D-rendering engine (RE), which has Z-compare, smooth-shading, alpha-blending, and double-buffering functions, draws a scene with 256- by 256-pixel resolution at 2.2 Mpolygons/s. The 3.2-Gbyte/s data bandwidth through a 2048-bit internal memory bus, together with a 640-bit processor bus, permits lowering the operating frequency of the 3D RE to 20 MHz.
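
Per pixel, the listed functions boil down to a depth test followed by an alpha blend into the color buffer, with the finished buffer swapped out when the frame completes. Below is a minimal per-pixel sketch of those two steps; the blend equation and pixel formats are assumptions, not details from the paper.

```c
/* Per-pixel sketch of Z-compare plus alpha blending into a 256 x 256 buffer.
 * The blend uses the common src*a + dst*(1-a) form with 8-bit alpha. */
#include <stdint.h>
#include <stdio.h>

#define RES 256

static uint32_t color_buf[RES * RES];   /* 0x00RRGGBB */
static uint16_t depth_buf[RES * RES];

static void plot(int x, int y, uint16_t z, uint32_t rgb, uint8_t alpha)
{
    int idx = y * RES + x;
    if (z >= depth_buf[idx])
        return;                          /* Z-compare: behind what's drawn */
    depth_buf[idx] = z;

    uint32_t dst = color_buf[idx], out = 0;
    for (int shift = 0; shift <= 16; shift += 8) {
        uint32_t s = (rgb >> shift) & 0xFF, d = (dst >> shift) & 0xFF;
        out |= (((s * alpha) + d * (255 - alpha)) / 255) << shift;
    }
    color_buf[idx] = out;                /* alpha-blended write */
}

int main(void)
{
    for (int i = 0; i < RES * RES; i++)
        depth_buf[i] = 0xFFFF;           /* clear depth to "far" */

    plot(10, 10, 100, 0x00FF0000, 128);  /* half-transparent red fragment */
    plot(10, 10, 200, 0x000000FF, 255);  /* rejected: fails the Z test */
    printf("pixel(10,10) = 0x%06X\n", (unsigned)color_buf[10 * RES + 10]);
    return 0;
}
```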

To overcome the processing-speed and data-bandwidth gap between the RISC processor and the dedicated hardware, a bandwidth equalizer built around a 2-kbyte, dual-port SRAM with a 32-bit input and a 512-bit output is inserted. Data arrives from the ARM over a 32-bit bus at 80 MHz and is delivered to the dedicated hardware over a 512-bit-wide bus at 20 MHz. A digital logic loop synchronizes the 80-/20-MHz clocks over a 10- to 200-MHz operating range.
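
The equalizer is essentially a width converter: the 80-MHz side writes 32-bit words into the dual-port SRAM, and the 20-MHz side reads sixteen of them at once as a 512-bit line. The small model below illustrates just that packing step; the buffer depth and flow-control details are assumptions.

```c
/* Width-conversion sketch of the bandwidth equalizer: sixteen 32-bit words
 * written by the fast (80-MHz) side are consumed as one 512-bit line by the
 * slow (20-MHz) side. Depth and flow control are simplified. */
#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_LINE 16                /* 16 x 32 bits = 512 bits */

struct equalizer {
    uint32_t line[WORDS_PER_LINE];
    int      fill;                       /* how many words are buffered */
};

/* Fast side: returns 1 when a full 512-bit line is ready for the slow side. */
static int eq_write32(struct equalizer *eq, uint32_t word)
{
    eq->line[eq->fill++] = word;
    if (eq->fill < WORDS_PER_LINE)
        return 0;
    eq->fill = 0;
    return 1;                            /* hand the whole line downstream */
}

int main(void)
{
    struct equalizer eq = { {0}, 0 };
    int lines = 0;

    for (uint32_t w = 0; w < 64; w++)    /* 64 words -> 4 lines of 512 bits */
        lines += eq_write32(&eq, w);

    printf("512-bit lines delivered: %d\n", lines);
    printf("fast-side rate  : %d Mbit/s\n", 32 * 80);
    printf("slow-side width : %d bits per 20-MHz access\n", 32 * WORDS_PER_LINE);
    return 0;
}
```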

Employing a single bit-line-write scheme improves the power efficiency of the DRAM macro (Fig. 2). The bit-line pair is disconnected from the sense amplifier after the charge-sharing occurs between the bit-line and the cell. Then, only the bit-line connected to a cell is tied to the sense-amplifier node to restore cell data while the other reference bit-line remains disconnected. Redundant reference bit-line transitions, common in conventional folded-bit-line structures, are thereby eliminated. This brings about a 20% reduction of power consumption during the sensing interval.

High power consumption in I/O transactions is avoided by embedding the DRAM on the same chip. By lowering the operating frequency to 20 MHz, the power consumption of the RISC processor, the motion-compensation accelerator, and the 3D RE is held to maximums of 12, 4.6, and 36 mW, respectively. Adopting low-power techniques like the single-bit-line write and distributed nine-tile-block mapping reduces power further.

The chip was made on a 0.18-µm CMOS process with three polysilicon and six metal layers. A 1.5-V power supply is used for the logic core, and 2.5 V and 3.3 V are used for the DRAM and I/O, respectively. Including the I/O cells, the chip measures 12 by 7 mm (84 mm²).

Real-Time Image Processing

Another Session 9 paper also addresses power-consumption and high-performance issues. Entitled "One Chip, 15-Frame/s, Megapixel, Real-Time Image Processor," it was delivered by Sanyo Electric Co. Ltd., Gifu, Japan. It describes how the Sanyo team developed a single-chip LSI for real-time image processing and/or real-time video compression.

The chip contains the necessary functions for digital video or digital still cameras. These include megapixel-CCD processing with a JPEG/MPEG-2 image-compression engine. The chip also includes a 32-bit RISC CPU, an NTSC encoder, and synchronous DRAM controllers. And, it has various peripheral interfaces, like PCMCIA, SSFDC, USB, UART, IDE, and P1284 interface standards, plus an instruction cache and a data cache (both 4 kbytes) for the RISC CPU.

The image data compression/decompression engine comprises six blocks and can encode and decode motion-JPEG (M-JPEG) and MPEG-2 data. Its optimized pipeline structure can process image data in 8- by 8-pixel units in parallel. As a result, each 8- by 8-pixel block can be processed in 68 clock cycles, making high-speed compression and decompression possible, even for images with many pixels.
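
The 68-cycle figure can be sanity-checked against the megapixel, 15-frame/s target: at 68 cycles per 8- by 8-pixel block, the required clock rate stays comfortably inside the 57-MHz operating point quoted later. The chroma factor in the arithmetic below is an assumption made only to keep the estimate conservative; it is not a detail from the paper.

```c
/* Rough cycle-budget check: 8x8 blocks per megapixel frame at 15 frames/s,
 * times 68 cycles per block. The factor of 2 for chroma samples is assumed. */
#include <stdio.h>

int main(void)
{
    const long pixels_per_frame = 1000L * 1000L;      /* "megapixel" CCD      */
    const int  fps              = 15;
    const int  cycles_per_block = 68;
    const int  chroma_factor    = 2;                  /* assumed sample ratio */

    long blocks_per_frame = chroma_factor * pixels_per_frame / 64;
    long cycles_per_sec   = blocks_per_frame * fps * (long)cycles_per_block;

    printf("blocks/frame : %ld\n", blocks_per_frame);
    printf("cycles/second: %ld (vs. 57,000,000 available at 57 MHz)\n",
           cycles_per_sec);
    return 0;
}
```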

A novel method of reducing the total encoding time has been adopted. During the image-compression step, compression parameters like the quantizer scale need to be determined before the image data is compressed. Obtaining these parameters normally requires several trial calculations over all of the image's blocks. The new technique, however, decreases the number of blocks needed to determine the compression parameters.

In the first trial phase, one block out of every 16 is selected for the calculation. In the second trial phase, one block out of every four is used. As a result, the total encoding time is reduced: the method has proven to be three times faster than conventional techniques.
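
In other words, the compression parameters are estimated from a subsample of the blocks rather than from every block, with the sampling refined between the two phases. The sketch below follows that flow under an assumed target-size criterion for picking the quantizer scale; the selection rule and the cost model are stand-ins, since the article doesn't spell them out.

```c
/* Sketch of two-phase parameter estimation: phase 1 samples 1 block in 16,
 * phase 2 samples 1 block in 4, and each phase refines the quantizer scale.
 * estimate_block_bits() is a placeholder for a real trial compression pass. */
#include <stdio.h>

#define NUM_BLOCKS 15625                 /* e.g. one megapixel frame of 8x8 blocks */
#define TARGET_BITS (512LL * 1024 * 8)   /* hypothetical per-frame bit budget */

static long long estimate_block_bits(int block_index, int qscale)
{
    (void)block_index;
    return 4096 / qscale;                /* toy cost model: bits shrink with q */
}

/* Estimate the whole frame's size using only every `stride`-th block. */
static long long trial_frame_bits(int stride, int qscale)
{
    long long sampled = 0;
    int count = 0;
    for (int b = 0; b < NUM_BLOCKS; b += stride, count++)
        sampled += estimate_block_bits(b, qscale);
    return sampled * NUM_BLOCKS / count; /* scale the sample up to the frame */
}

static int refine_qscale(int stride, int start)
{
    int q = start;
    while (q < 31 && trial_frame_bits(stride, q) > TARGET_BITS)
        q++;                             /* coarsen until the budget is met */
    return q;
}

int main(void)
{
    int q1 = refine_qscale(16, 1);       /* phase 1: one block in every 16 */
    int q2 = refine_qscale(4, q1);       /* phase 2: one block in every 4  */
    printf("phase-1 qscale: %d, phase-2 qscale: %d\n", q1, q2);
    return 0;
}
```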

To cut power consumption, two power-management techniques are employed. The first is a clock-gear technique that cuts power usage by up to 20%. It does so by dynamically switching the operating frequency between 57 MHz, 28 MHz, and zero MHz (off), based on the amount of processing required by the CPU. For example, when the image data is recorded onto the card memory, the operating frequency is 57 MHz. When the CPU is controlling the auto focus, auto exposure, and auto white balance for the CCD, the clock frequency drops to 28 MHz. If the CPU isn't processing, no clock will be supplied.
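
The clock-gear decision reduces to a small lookup from the camera's current activity to one of the three frequencies. The sketch below captures that selection logic; the activity names and the trace in main are invented for illustration, while the three frequencies and their uses follow the description above.

```c
/* Sketch of the clock-gear policy: pick 57 MHz, 28 MHz, or 0 MHz (clock off)
 * based on what the CPU is currently doing. */
#include <stdio.h>

enum cpu_activity {
    ACTIVITY_IDLE,            /* nothing to process                        */
    ACTIVITY_CAMERA_CONTROL,  /* auto focus / exposure / white balance     */
    ACTIVITY_RECORD_TO_CARD   /* writing compressed images to card memory  */
};

static int clock_gear_mhz(enum cpu_activity a)
{
    switch (a) {
    case ACTIVITY_RECORD_TO_CARD:  return 57;
    case ACTIVITY_CAMERA_CONTROL:  return 28;
    default:                       return 0;   /* clock withheld entirely */
    }
}

int main(void)
{
    enum cpu_activity trace[] = {
        ACTIVITY_CAMERA_CONTROL, ACTIVITY_RECORD_TO_CARD, ACTIVITY_IDLE
    };
    for (int i = 0; i < 3; i++)
        printf("step %d -> %d MHz\n", i, clock_gear_mhz(trace[i]));
    return 0;
}
```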

A clock-suspension control technique further reduces power consumption of the system by 60%. It turns off clock pulse trains connected to inactive function blocks. Suspension is governed dynamically by firmware, in accordance with the camera's operating mode.

This device operates at clock rates of up to 114 MHz. It's made on a 0.25-µm, four-layer-metal process with chip dimensions of 13 by 13 mm, and it contains approximately 8.5 million transistors. Total power consumption is about 900 mW at a 57-MHz clock rate when powered by a 2.5-V source. This matches the power consumed by an earlier design, even though the new design uses 60% fewer transistors.

A device well suited for next-generation set-top boxes and home servers was presented in a paper by the Semiconductor Network Co. of Sony Corp., Tokyo, Japan. This Session 9 paper, "A 250-MHz Single-Chip Multiprocessor For A/V Signal Processing," describes a 250-MHz, single-chip multiprocessor developed for audio/video signal-processing applications. The chip was fabricated on a 0.25-µm, four-metal-layer process and consumes 2.4 W at 2.5 V.

The chip implements multichannel decoding, encoding, and transcoding of various audio/video codec standards—like MPEG-1, MPEG-2, and MPEG-4—and digital video. It also performs MPEG-2 (MP@HL) video decoding at 20 frames/s. The multiprocessor employs coarse-grained parallelism in audio/video signal processing with a symmetrical multiprocessor architecture. Moreover, it accomplishes fine-grained parallelism with multimedia extended instructions.

The multiprocessor comprises four processing elements, a 64-kbyte L2 cache, a DMA controller, a synchronization unit, and other peripherals on a single chip.

Each processing element, in turn, comprises a CPU based on the MIPS-II ISA, an 8-kbyte L1 instruction cache, a 4-kbyte L1 data cache, and a 4-kbyte scratch-pad memory for audio/video data. Each element also contains a subword-parallel processing unit, a data-transfer control unit for symmetric processing, and a bit-stream processing unit.
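
The fine-grained side of the parallelism is what the subword-parallel unit provides: one instruction operates on several narrow samples packed into a wide register. The function below emulates a packed saturating add of four 16-bit samples in plain C, the kind of work a single multimedia-extended instruction would perform; the chip's actual instruction set is not detailed in the article.

```c
/* Emulation of a subword-parallel operation: add four packed 16-bit samples
 * with saturation, as a single SIMD-style instruction would. */
#include <stdint.h>
#include <stdio.h>

static uint64_t padd16_sat(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        int32_t x = (int16_t)(a >> (16 * lane));
        int32_t y = (int16_t)(b >> (16 * lane));
        int32_t s = x + y;
        if (s >  32767) s =  32767;      /* saturate instead of wrapping */
        if (s < -32768) s = -32768;
        r |= (uint64_t)(uint16_t)s << (16 * lane);
    }
    return r;
}

int main(void)
{
    /* Lanes: {100, -200, 30000, 12} + {50, -100, 10000, -12} */
    uint64_t a = (uint64_t)(uint16_t)100
               | ((uint64_t)(uint16_t)-200  << 16)
               | ((uint64_t)(uint16_t)30000 << 32)
               | ((uint64_t)(uint16_t)12    << 48);
    uint64_t b = (uint64_t)(uint16_t)50
               | ((uint64_t)(uint16_t)-100  << 16)
               | ((uint64_t)(uint16_t)10000 << 32)
               | ((uint64_t)(uint16_t)-12   << 48);

    uint64_t r = padd16_sat(a, b);
    for (int lane = 0; lane < 4; lane++)
        printf("lane %d: %d\n", lane, (int16_t)(r >> (16 * lane)));
    return 0;
}
```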

The multiprocessor is implemented using standard, cell-based design methodology. Even the digital phase-locked loop is synthesized from standard cells. Custom cells were added, but only for the data-path design to enable 250-MHz operation.
