Electronic Design

64-Bit Architecture's New Instructions Take On Digital Entertainment, More

The R20K processor core features a raw throughput estimated at 720 to 1000 MIPS and 1.4 to 2 GFLOPS. This 64-bit CPU also includes an extended instruction set that lets it make short work of 3D-graphics computations. Designers at MIPS Technologies Inc. of Sunnyvale, Calif., achieved this by supplementing the integer-processor portion of the CPU with a single-instruction/multiple-data floating-point unit (SIMD FPU) that can deliver 2 GFLOPS when clocked at 500 MHz. By combining high-throughput integer and floating-point operations, the core can be used in digital consumer, Internet connectivity, and other applications.
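As a quick sanity check, the quoted peak figures can be turned into per-cycle rates. The sketch below uses only the numbers given above. The constant names are illustrative, and the article does not say how the floating-point operations are divided among the SIMD lanes.

```python
# Figures quoted in the article (upper end of each estimated range).
CLOCK_HZ = 500_000_000        # 500 MHz
PEAK_FLOPS = 2_000_000_000    # 2 GFLOPS at that clock
PEAK_MIPS = 1_000             # 1000 MIPS

# Dividing a per-second rate by the clock gives a per-cycle rate.
flops_per_cycle = PEAK_FLOPS / CLOCK_HZ
instructions_per_cycle = PEAK_MIPS * 1_000_000 / CLOCK_HZ

print(f"FLOPs per cycle: {flops_per_cycle:g}")
print(f"Instructions per cycle: {instructions_per_cycle:g}")
```

The two-instructions-per-cycle result agrees with a dual-issue design running at 500 MHz.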

The processor's basic architecture is a new, ground-up design rather than a derivative of the previous-generation R4000, R5000, and R6000 series CPUs. Its microarchitecture consists of a dual in-order issue, out-of-order completion CPU with a seven-stage pipeline. The instruction execution unit can issue two instructions every clock cycle: two integer instructions, one integer and one floating-point instruction, or one load/store operation.

To make the CPU possible, designers used about 2 million logic transistors to implement six integer execution units and the SIMD FPU. They used another 5 million transistors to implement two 32-kbyte four-way set-associative caches, the translation look-aside buffers (TLBs), and other on-chip memory blocks (see the figure).

The MIPS-3D instruction extensions accelerate the geometry-processing algorithms required for 3D graphics. Designers added 13 new instructions to the processor. Coupled with the SIMD architecture, the new commands let the R20K deliver between 18 million and 25 million polygons/s, or 8 million to 10 million polygons/s with lighting. Some of the instructions include specialized addition and multiplication operations for vertex transformations, other operations for clip checks, reciprocal and square-root functions for perspective and lighting computations, and still others for format conversions.
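To make the geometry workload concrete, here is a minimal Python sketch of the three steps named above: a vertex transformation, a clip check, and a perspective divide built on a reciprocal. It is a generic illustration of the arithmetic, not the semantics of any MIPS-3D instruction. The function names and the sample matrix are hypothetical.

```python
def transform(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vertex."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def inside_clip_volume(x, y, z, w):
    """Clip check: the vertex is kept if x, y, and z all lie in [-w, w]."""
    return all(-w <= c <= w for c in (x, y, z))

def project(x, y, z, w):
    """Perspective divide: one reciprocal followed by three multiplies."""
    inv_w = 1.0 / w
    return (x * inv_w, y * inv_w, z * inv_w)

# Hypothetical matrix: identity, except that w takes the value of z.
M = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 1, 0]]

clip = transform(M, [1.0, 0.5, 2.0, 1.0])
visible = inside_clip_volume(*clip)
ndc = project(*clip)
print(clip, visible, ndc)
```

Each vertex costs 16 multiplies and 12 adds for the transform alone, which is why dedicated multiply, add, and reciprocal operations pay off in this kind of loop.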

The ground-up design of the R20K's microarchitecture enabled designers to achieve a balanced and highly modular design that almost splits the chip down the middle. All of the compute resources were grouped on one side of the chip's physical layout, while all of the caches were grouped on the other side. This lets designers expand the caches without impacting the design of the CPU core itself.

This move is a key aspect of the plan to use the R20K both in standalone CPUs and as a core building block that can be embedded into custom system-on-a-chip (SoC) solutions. The basic 64-bit integer core, without the SIMD FPU and the caches, requires about 9 mm² of silicon when fabricated in a six-level metal, 0.18-µm CMOS process. The full core, with the SIMD unit and dual 32-kbyte caches, requires 34 mm².

Designed for operation at clock speeds of up to 500 MHz, the processor's internal buses enable instruction-cache fetch bandwidths of up to 8 Gbytes/s. Instruction-cache fills can be completed at rates of up to 16 Gbytes/s. Dynamic branch prediction and return prediction are performed in hardware to speed those operations. Up to two branches can be predicted in one fetch block per cycle. A fetch buffer that can hold eight instructions helps decouple the fetch unit from the decode/dispatch unit, so each can operate at optimum speed.
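The bandwidth figures are easier to picture as bytes per clock cycle. The sketch below does that conversion from the numbers above. It assumes the standard 32-bit MIPS instruction encoding (the article does not state the instruction width), and the constant names are illustrative.

```python
CLOCK_HZ = 500_000_000              # 500 MHz
FETCH_BYTES_PER_S = 8_000_000_000   # 8 Gbytes/s instruction-cache fetch
FILL_BYTES_PER_S = 16_000_000_000   # 16 Gbytes/s instruction-cache fill
INSTRUCTION_BYTES = 4               # assumption: 32-bit instruction encoding

fetch_bytes_per_cycle = FETCH_BYTES_PER_S // CLOCK_HZ
fill_bytes_per_cycle = FILL_BYTES_PER_S // CLOCK_HZ
instructions_per_fetch = fetch_bytes_per_cycle // INSTRUCTION_BYTES

print(f"Fetch: {fetch_bytes_per_cycle} bytes/cycle "
      f"({instructions_per_fetch} instructions)")
print(f"Fill:  {fill_bytes_per_cycle} bytes/cycle")
```

Under that assumption, the fetch unit can bring in four instructions per cycle while the core issues two, which is the kind of rate mismatch the eight-instruction fetch buffer absorbs.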

The data-cache bandwidth is half that of the instruction cache for read operations. For write operations, however, it can handle the same 16-Gbyte/s fill rate as the instruction cache. It has a nonblocking architecture that allows misses on up to four unique cache lines. Also, a four-entry miss-transaction queue supports up to four outstanding load misses without address restrictions.
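The nonblocking behavior can be illustrated with a toy model that tracks which cache lines have a miss in flight. This is only a sketch. The 32-byte line size, the class and method names, and the policy of merging a second miss to a pending line are assumptions, since the article gives only the limit of four unique lines.

```python
LINE_BYTES = 32        # assumption: the article does not give the line size
MAX_OUTSTANDING = 4    # from the article: misses on up to four unique lines

class MissTracker:
    """Toy model of a nonblocking cache's outstanding-miss bookkeeping."""

    def __init__(self):
        self.pending = set()   # line numbers with a fill in flight

    def miss(self, address):
        """Return 'merged', 'allocated', or 'stall' for a load miss."""
        line = address // LINE_BYTES
        if line in self.pending:
            return "merged"        # line already requested (assumed policy)
        if len(self.pending) == MAX_OUTSTANDING:
            return "stall"         # no free slot for a fifth unique line
        self.pending.add(line)
        return "allocated"

    def fill(self, address):
        """The line has returned from memory, so its slot is freed."""
        self.pending.discard(address // LINE_BYTES)

tracker = MissTracker()
results = [tracker.miss(a) for a in (0x000, 0x020, 0x024, 0x040, 0x060, 0x080)]
tracker.fill(0x000)
after_fill = tracker.miss(0x080)
print(results, after_fill)
```

In this model the pipeline keeps running through the first four unique misses and stalls only on the fifth, until an earlier fill completes.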

In addition to the wide on-chip buses that permit high-speed data movement between the various blocks, high-speed external buses collectively known as the MGB-Link give system designers a 3.6-Gbyte/s interface that clocks at 150 MHz. To achieve this speed, the bus interfaces employ a source-synchronous double-data-rate (DDR) transfer mode. The Link interface includes both a 32-bit address/data processor output and a 64-bit address/data processor input.
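The 3.6-Gbyte/s figure matches the combined bandwidth of the two buses when each transfers data twice per 150-MHz clock. The sketch below shows the arithmetic. Reading the total as the sum of both directions is an inference from the numbers rather than something the article states, and the names are illustrative.

```python
BUS_CLOCK_HZ = 150_000_000   # 150 MHz
TRANSFERS_PER_CLOCK = 2      # double data rate: two transfers per clock
OUTPUT_BITS = 32             # processor output bus width
INPUT_BITS = 64              # processor input bus width

def bytes_per_second(width_bits):
    """Peak bandwidth of one bus of the given width."""
    return width_bits // 8 * BUS_CLOCK_HZ * TRANSFERS_PER_CLOCK

out_bw = bytes_per_second(OUTPUT_BITS)
in_bw = bytes_per_second(INPUT_BITS)
total_bw = out_bw + in_bw

print(f"Output: {out_bw / 1e9:.1f} Gbytes/s")
print(f"Input:  {in_bw / 1e9:.1f} Gbytes/s")
print(f"Total:  {total_bw / 1e9:.1f} Gbytes/s")
```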

The interface also supports features such as out-of-order data return, credit-based flow control, and external invalidate and intervention transactions for coherency. It can operate in either synchronous or source-synchronous (DDR) modes. Additionally, series-terminated 1.5-V HSTL signaling minimizes noise and switching power.

The R20K and its core version, the R20Kc, were designed not only for low power consumption but also for manufacture on multiple processes, giving licensees maximum flexibility in selecting a foundry. To keep the power low, MIPS engineers crafted the CPU to power up with the instruction cache in a one-way mode. The 48-entry jump TLB only powers up if a miss occurs in the micro-translation buffer. Dynamic fine-grained clock conditioning reduces clock and logic-block power dissipation by sending clocks only to the active logic blocks on the chip. Together, these power-saving techniques keep the processor's power dissipation to about 4.4 mW/MHz when powered by a 1.8-V supply.
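The per-megahertz figure converts directly into an estimate of total power. The sketch below assumes power scales linearly with clock frequency at a fixed 1.8-V supply. The function name and the 350-MHz sample point are illustrative and do not come from the article.

```python
MICROWATTS_PER_MHZ = 4_400   # 4.4 mW/MHz at 1.8 V, from the article

def dissipation_mw(clock_mhz):
    """Estimated power in milliwatts, assuming linear scaling with clock."""
    return MICROWATTS_PER_MHZ * clock_mhz / 1000

for mhz in (350, 500):
    print(f"{mhz} MHz: {dissipation_mw(mhz) / 1000:.2f} W")
```

At the 500-MHz design target, this works out to roughly 2.2 W.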

Although MIPS doesn't sell CPUs, it has already licensed the R20K and R20Kc designs to NEC Corp. and Toshiba Corp., both of Tokyo, Japan, and to TSMC of Hsin-Chu, Taiwan, the Republic of China. First samples of the R20K are expected by next quarter. The first test silicon for the R20Kc version should be available in the fourth quarter. The standalone CPU version, the R20K, will be housed in a 352-contact BGA package. For more information, check out www.mips.com.
