Hardware Directory

Feb. 5, 2001

ARM9T Core

The ARM9T family defines a set of RISC cores optimized to minimize cost and power dissipation. The family is built around the ARM9 Thumb architecture: the full 32-bit ARM ISA, plus the ability to drop into a 16-bit instruction mode that still operates on 32-bit data. This 16-bit ISA subset lets developers compact their code by using 16-bit instructions rather than 32-bit instructions. Code-size savings are on the order of 30% for application code that can use the 16-bit Thumb mode.

Initially developed for European personal computers, the 32-bit ARM RISC architecture has made a home for itself in the embedded and low-power space. ARM is one of the three 32-bit RISC architectures accepted as a 21st century standard. It's available as ICs, application-specific standard parts (ASSPs), and cores. In addition, the design is licensed to companies that can, if they are willing to obtain a special license, modify the ARM microarchitecture. For example, the Intel StrongARM RISC is a modified ARM that has been tuned for high performance. StrongARM is the base for Intel's XScale communications processor systems.

The ARM architecture builds from simple beginnings. As such, it has a small set of RISC instructions and a register file of 16 32-bit registers (there are 32 32-bit general registers in all). Most instructions are conditional; that is, they execute if and only if a test condition specified by the instruction itself is met. This ISA tactic minimizes branches and raises execution throughput. ARM has defined a sophisticated multilevel busing system that's available to designers. Called the Advanced Microcontroller Bus Architecture (AMBA), it consists of a local bus and a peripherals bus. The ARM9T processors interface to the AMBA local bus.

Architecture

ARM implements 11 basic instruction types. Most instructions are conditional: the hardware tests the specified condition before executing the operation defined by the instruction. This conditional feature cuts down on branches, increasing code execution throughput. Many architectures are now putting some conditional instructions into their ISAs.
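To make the conditional-execution idea concrete, here is a minimal Python sketch of how a condition field can be checked against the N, Z, C, and V status flags before an operation runs. The condition mnemonics follow the standard ARM set; the helper names (`Flags`, `compare`, `execute_if`, `max_signed`) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    n: bool = False  # negative
    z: bool = False  # zero
    c: bool = False  # carry (no borrow after a subtract)
    v: bool = False  # signed overflow


# Each condition is a predicate over the status flags.
CONDITIONS = {
    "EQ": lambda f: f.z,
    "NE": lambda f: not f.z,
    "CS": lambda f: f.c,
    "CC": lambda f: not f.c,
    "MI": lambda f: f.n,
    "PL": lambda f: not f.n,
    "VS": lambda f: f.v,
    "VC": lambda f: not f.v,
    "HI": lambda f: f.c and not f.z,
    "LS": lambda f: (not f.c) or f.z,
    "GE": lambda f: f.n == f.v,
    "LT": lambda f: f.n != f.v,
    "GT": lambda f: (not f.z) and f.n == f.v,
    "LE": lambda f: f.z or f.n != f.v,
    "AL": lambda f: True,
}


def compare(a, b):
    """Set flags the way a 32-bit CMP a, b (computing a - b) would."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    diff = (a - b) & 0xFFFFFFFF
    sa, sb, sd = a >> 31, b >> 31, diff >> 31
    return Flags(n=bool(sd), z=(diff == 0), c=(a >= b),
                 v=(sa != sb and sd != sa))


def execute_if(cond, flags, operation, regs):
    """Run operation(regs) only if the condition passes; report whether it ran."""
    if CONDITIONS[cond](flags):
        operation(regs)
        return True
    return False


def max_signed(r0, r1):
    """Branch-free max: CMP r0, r1 followed by a conditional MOVLT r0, r1."""
    regs = {"r0": r0, "r1": r1}
    flags = compare(r0, r1)
    execute_if("LT", flags, lambda r: r.update(r0=r["r1"]), regs)
    return regs["r0"]
```

The `max_signed` routine shows the payoff: a compare followed by one conditional move replaces a compare, a branch, and a move.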

On the negative side, ARM uses a smaller-than-usual general register file: 16 32-bit registers instead of the normal 32. This makes coding and compiling a bit more difficult, although it keeps the cores small. There are 32 32-bit general registers, but only 16 are in the general register file. Four of those registers are used for a fast interrupt response, enabling the CPU to respond in 4 clock cycles.

As in many of the early RISCs, ARM lacks a full complement of arithmetic operations. For example, it doesn't have a hardware divide function. Divide is built from general RISC instructions, as was done in many early RISCs. The ARM ISA has hardware multiply, however, and some of the latest cores have hardware MAC instructions too. New instructions in the ARM9E add fractional and saturating arithmetic to the ISA. These enable code to pack in 16-bit operands/results and to have guard bits for MAC operations.
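As a rough illustration of what "divide built from general RISC instructions" means, the Python sketch below uses the textbook restoring shift-and-subtract method, which needs only shifts, compares, and subtracts. It is not taken from any ARM runtime library. A clamping add is included to show the idea behind saturating arithmetic. The names `udiv32` and `sat_add32` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def udiv32(dividend, divisor):
    """Unsigned 32-bit divide by restoring shift-and-subtract.

    Returns (quotient, remainder), one quotient bit per loop iteration.
    """
    dividend &= MASK32
    divisor &= MASK32
    if divisor == 0:
        raise ZeroDivisionError("divide by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        # Subtract only when it fits; otherwise the remainder is kept as is.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << bit
    return quotient, remainder


def sat_add32(a, b):
    """Signed 32-bit add that clamps at the limits instead of wrapping."""
    return max(INT32_MIN, min(INT32_MAX, a + b))
```

The 32-iteration loop makes it clear why a software divide costs tens of cycles, while a saturating add simply pins an overflowing result to the largest or smallest representable value.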

The Thumb ISA is a recent addition to the architecture, and other vendors have copied the tactic. The idea is basically simple: 16-bit ISAs take up less code space, even with 32-bit datapaths. The original 68K had an extensible 16-bit ISA and a 32-bit datapath. It's still being used as a code compactness benchmark. Thumb implements a 16-bit ISA mode in the ARM architecture. The processor can run along in a 32-bit, normal mode, and switch to a 16-bit mode in a function call or return. In the 16-bit mode, the CPU uses a smaller register file made of 8 registers (to minimize register fields in the ISA), resulting in more compact code. A thread can easily switch back to a 32-bit mode by switching the mode bit in a function call or return.
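The mode switch on a call or return can be sketched as follows. On Thumb-capable ARM cores, the branch-and-exchange instruction (BX) takes the mode from bit 0 of the target address. The Python model below is a simplification, and the function names are hypothetical.

```python
def branch_exchange(target):
    """Model an interworking branch.

    Bit 0 of the target address selects the instruction set (1 = Thumb,
    0 = ARM); the remaining bits become the new program counter.
    Returns (new_pc, thumb_mode).
    """
    thumb_mode = bool(target & 1)
    new_pc = target & 0xFFFFFFFE
    return new_pc, thumb_mode


def instruction_bytes(thumb_mode):
    """Fetch width per instruction: 2 bytes in Thumb mode, 4 in ARM mode."""
    return 2 if thumb_mode else 4
```

Because the mode rides along with the function address, a caller in 32-bit mode can call a 16-bit routine, and return from it, without any separate mode-setting step.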

See associated figure

PERFORMANCE

Scalar RISC architecture

Static design—0 to 140,...200 MHz

5-stage pipeline

Most instructions execute in 1 cycle

16 KB I, D caches (8 KB for ARM922T)

Core includes caches

Supports a fast context switch

Runs as 32-bit or 16-bit RISC to compact code

11.8 mm² on 0.18-µm CMOS

160 mW dissipation w/caches, MMU @ 200 MHz

FUNCTION

32-bit ARM RISC architecture

Includes Thumb ISA, a 16-bit subset of ARM ISA

32 general registers, 16 in register file. Thumb uses a subset of the 16 registers

External coprocessor interface can add FPU

Supports memory buffering

64-entry data TLB & 46-entry instr. TLB

Compact ISA with built-in conditionals

Supports 32-bit AMBA bus, multiple masters, nonmultiplexed

CORES

ARM7

ARM9

ARM10

StrongARM

MIPS32 4Kc Core

The MIPS 4Kc core is a later revision of the MIPS architecture, targeting hard- and soft-core SOC implementations, including FPGA chips. It's one of the newer revisions of a classic RISC processor. MIPS, one of the first RISCs developed, has now settled down as a 21st century standard architecture for 32-bit and 64-bit processors. Licensed from MIPS Technologies, the MIPS architecture has a wide variety of suppliers and customers. These range from classic chip houses doing application-specific standard parts (ASSPs), to ASIC and FPGA vendors that use MIPS-based cores, to OEMs that do their own chips with an on-chip MIPS CPU. MIPS dominates as the standard multimedia 32-bit/64-bit processor.

The MIPS 4Kc core was created with custom SOC applications in mind. It delivers 200+ MHz performance with relatively low power dissipation. It was designed to be portable across multiple processes. Also, like most MIPS implementations, the MIPS 4Kc makes for a very compact silicon layout. Level 1 caches (I and D caches of up to 16 KB each) can be added, as can a JTAG test block. Clock rates depend on the silicon process used. For 0.25-µm processes, the MIPS core supports clock rates of up to 200 MHz. Finer geometries raise the clock rates to 266 MHz or more. Furthermore, the processor implements a special multiply/divide unit that supports 1- or 2-cycle MACs (32 x 16 bit in 1 cycle, 32 x 32 bit in 2 cycles).

Architecture

A classic MIPS RISC processor, the MIPS 4Kc implements the MIPS32 architecture along with MIPS II instructions. It also adds special multiply/accumulate (MAC) instructions, conditional moves, prefetch, wait, and leading 0/1 detect instructions. The processor provides a 32-bit privileged resource architecture with R4000-style memory management.

This is a 5-stage, classic load-store, scalar RISC implementation. It can issue and execute most instructions in 1 pipelined cycle. As such, it builds around a register file of 32 32-bit registers. It has 4 execution units: integer (ALU + shifter), multiply/divide, branch control, and processor control (privileged functions and exceptions). Moreover, it supports the traditional MIPS coprocessor interface to add even more functionality.

The processor is highly adaptable to different L1 cache sizes and organizations. Caches are added to the core as standard cell blocks. The core supports an I and a D cache. Each can range from 0 to 16 KB. The cache organization may be 1-, 2-, or 4-way set associative, depending upon the necessary cache efficiency.

The MIPS 4Kc implements an MMU to translate virtual addresses to physical addresses. It also supports memory protection. The MMU is built from translation lookaside buffers (TLBs). It has 3 address translation buffers: a 16 dual-entry, fully associative joint TLB, a 3-entry instruction micro TLB, and a 3-entry data micro TLB. When an address is translated, the appropriate micro TLB is consulted first. If a miss occurs, then the joint TLB is consulted. If that too is a miss, then a miss exception is taken.
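The two-level lookup order described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the core's actual logic: pages are fixed at 4 KB, the joint TLB is a plain dictionary rather than 16 dual entries, a micro TLB is refilled from the joint TLB on a micro miss, and the oldest micro entry is evicted. The source states none of these policies. All class and function names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4-KB pages


class TLBMiss(Exception):
    """Raised when neither the micro TLB nor the joint TLB holds a mapping."""


class MicroTLB:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = {}  # virtual page number -> physical frame number

    def lookup(self, vpn):
        return self.entries.get(vpn)

    def refill(self, vpn, pfn):
        if vpn not in self.entries and len(self.entries) >= self.capacity:
            # Assumption: evict the oldest entry (dicts keep insertion order).
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpn] = pfn


def translate(vaddr, micro_tlb, joint_tlb):
    """Translate vaddr: micro TLB first, then joint TLB, else a miss exception."""
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    pfn = micro_tlb.lookup(vpn)
    if pfn is None:
        pfn = joint_tlb.get(vpn)
        if pfn is None:
            raise TLBMiss(hex(vaddr))
        micro_tlb.refill(vpn, pfn)  # assumed refill on a micro TLB miss
    return (pfn << PAGE_SHIFT) | offset
```

In use, one `MicroTLB(3)` instance would stand in for the instruction side and another for the data side, both sharing the same joint TLB mapping.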

The processor instruction set architecture is an upgrade of the standard MIPS32 ISA (MIPS I + MIPS II ISAs). Among the additions are conditional instructions and prefetch. Conditional instructions minimize branches by having the test condition as part of the instruction. An example is Move Conditional On Zero, where if the condition is met, the move is executed, all in one instruction. The prefetch instruction enables programmers to get a data cache line in early, before it's needed, avoiding a cache miss and its attendant overhead. But if the prefetch results in a memory exception, that exception isn't taken and the operation is dropped.
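Both additions are easy to model. The Python sketch below shows Move Conditional On Zero as a register-file update, and a prefetch that quietly drops the request when the address is bad. The dictionary-based register file, cache, and memory are hypothetical stand-ins, and the 16-byte line size is an assumption made only for the example.

```python
def movz(regs, rd, rs, rt):
    """Move Conditional On Zero: copy rs into rd only if rt holds zero."""
    if regs[rt] == 0:
        regs[rd] = regs[rs]


def prefetch(cache, memory, addr, line_bytes=16):
    """Bring the line holding addr into the cache ahead of use.

    A bad address does not raise: the request is dropped, mirroring a
    prefetch whose memory exception is suppressed rather than taken.
    """
    base = addr - (addr % line_bytes)
    try:
        cache[base] = memory[base]
    except KeyError:
        pass  # exception suppressed; the prefetch simply has no effect
```

A compiler can use `movz` to turn a short if-then assignment into straight-line code, and can issue `prefetch` speculatively without guarding it, since a bad pointer costs nothing but the wasted instruction.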

See associated figure

PERFORMANCE

Scalar RISC processor

Static design—0 to 200 MHz

5-stage pipeline

Most instructions execute in 1 cycle

1-cycle branch delay

I, D L1 caches, to 16 KB each

32-bit x 16-bit MAC executes in 1 cycle

32- x 32-bit MAC executes in 2 cycles

3 mm² at 0.25-µm CMOS

2 mW/MHz power dissipation w/caches

FUNCTION

32-bit MIPS32 architecture

Some R4000/R5000 features: prefetch, conditional moves, privileged mode instructions

32 32-bit registers register file

4 execution units including Mpy/Div unit

8-word (32-B) write buffer

Easy integration with off-core L1 caches

Optional JTAG test block

EC nonmultiplexed 32-bit bus

Available in synthesizable, hard formats

CORES

MIPS32 4Kp: low power core compatible with MIPS R3000 and R4000

MIPS32 4Km: higher-performance 32-bit core with a fast 1-cycle MAC

MIPS32 4Kc: 32-bit core optimized for higher-performance levels. Used for SOCs and FPGAs. Supports MAC

MIPS64 5Kc: a higher-performance 64-bit core for data flow applications. Can deliver a 360 DMIPS peak performance

MIPS64 20Kc: latest MIPS core. High-end, 64-bit MIPS processor supports MIPS 3D ISA extensions. Delivers 1,000 DMIPS peak performance

PowerPC 405C Core

The PowerPC 405C core is a member of the 400 series of cores and silicon processors. The 405C is compatible with the PowerPC family, but it's designed for hard- and soft-core applications. It's a scalar RISC that supports dual I and D caches, and it runs at clock rates up to 380 MHz for 0.13-µm, 1.8-V implementations.

The PowerPC is one of the 3 mainstream 21st century 32/64-bit standard architectures. It started as a joint development of Apple, IBM, and Motorola to overpower the x86 architecture with a classic RISC processor family—hence the PowerPC name. This effort failed, as the x86 kept stepping up the silicon curve, upping its processor power. Only lately has the PowerPC come into parity with the x86 Pentium-class processors. In some cases, PowerPC has passed x86 in delivered processing power.

So, the PowerPC is not a small RISC architecture. Instead, it was designed for desktop, server, and embedded applications. As such, it has a full set of RISC instructions and a range of implementations running from the minimal 400 series to the PowerPC G4, a 4th-generation RISC with an on-board vector processor. The PowerPC is available as standard IC processors, ASSPs, and cores. IBM supplies the PowerPC cores.

To support the PowerPC cores, IBM has developed the CoreConnect busing system. This is a sophisticated, split-transaction, multilevel busing system for ASICs and FPGAs.

Architecture

The PowerPC 405C core builds on the PowerPC ISA. It is a classic load-store RISC, but it has multiply/divide instructions. It centers on a register file of 32 32-bit registers, with 2 write and 3 read ports. The CPU is very straightforward, with a 3-buffer fetch queue holding up to two instructions each. These instructions are issued to 1 of 2 execution units: the ALU and the MAC unit. The ALU handles basic arithmetic and shifting operations, and the MAC unit handles multiply, divide, and the DSP-ish multiply-accumulate operations.

The 405C core implements a static branch-prediction algorithm, based on a statistical analysis of standard code blocks. Branches that have a negative address displacement or do not have a test condition are taken as default. The decode and prebuffers can handle 2 branches simultaneously. For example, a branch followed by a branch is directly handled.
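The default rule described here fits in one line of code. The Python sketch below encodes it and scores it against a list of branch outcomes. The function names and the tuple layout are hypothetical, and the rule covers only the default behavior the text describes.

```python
def predict_taken(displacement, conditional):
    """Static rule: predict taken if the branch is unconditional or backward."""
    return (not conditional) or displacement < 0


def prediction_accuracy(branches):
    """Fraction of correct predictions.

    branches is a list of (displacement, conditional, actually_taken) tuples.
    """
    if not branches:
        return 0.0
    correct = sum(
        1 for disp, cond, taken in branches
        if predict_taken(disp, cond) == taken
    )
    return correct / len(branches)
```

A loop-closing branch shows why the rule works: it jumps backward on every iteration except the last, so a 10-iteration loop is predicted correctly 9 times out of 10.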

The basic instructions take 1 pipelined cycle to execute. (They appear to execute in 1 cycle as the next instruction is started up, but they must pass through the 5-stage pipeline.) Some instructions will take multiple cycles. These include most multiply instructions (1 to 4 clock cycles) and divide instructions (which can take 35 latency cycles).
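The difference between latency and throughput can be worked out with a small cycle-count model. The Python sketch assumes an ideal in-order pipeline with no cache misses or hazards, and it assumes a multicycle instruction simply holds up the pipeline for its extra cycles. Both are simplifications, and `pipeline_cycles` is a hypothetical helper.

```python
def pipeline_cycles(extra_cycles, stages=5):
    """Total cycles to run a straight-line instruction sequence.

    extra_cycles lists, per instruction, the cycles beyond the first that
    it occupies the execute stage (0 for a basic 1-cycle instruction).
    The first instruction needs `stages` cycles to drain through the pipe;
    each later one completes one cycle after its predecessor, plus stalls.
    """
    count = len(extra_cycles)
    if count == 0:
        return 0
    return stages + (count - 1) + sum(extra_cycles)
```

Under these assumptions, 100 basic instructions finish in 104 cycles, or about 1.04 cycles each. Replacing one of them with a 35-cycle divide (34 extra cycles) raises the total to 138.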

The core supports I and D caches. Each cache is configurable as no cache (0 KB), 8 KB, 16 KB, or 32 KB. The caches are 2-way set-associative and nonblocking. An MMU supports the caches. This MMU implements a 64-entry unified TLB with faster 4-entry (I) and 8-entry (D) shadow TLBs for the caches. The MMU supports 8 page sizes, from 1 KB to 16 MB. The system handles both big- and little-endian orientations.
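Eight page sizes spanning 1 KB to 16 MB are consistent with each size being four times the previous one, since 1 KB x 4^7 = 16 MB. The text gives only the count and the endpoints, so the Python sketch below treats that spacing as an inference. It also shows how a virtual address splits into page number and offset for a given size; `split_address` is a hypothetical helper.

```python
# Inferred spacing: 1 KB, 4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB.
PAGE_SIZES = [1024 * 4**i for i in range(8)]


def split_address(vaddr, page_size):
    """Split a 32-bit virtual address into (page number, offset in page)."""
    if page_size not in PAGE_SIZES:
        raise ValueError("unsupported page size")
    offset_bits = page_size.bit_length() - 1  # log2 of a power of two
    vaddr &= 0xFFFFFFFF
    return vaddr >> offset_bits, vaddr & (page_size - 1)
```

Larger pages leave fewer page-number bits to match, so a single TLB entry covers more memory. That matters when the unified TLB has only 64 entries.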

The MAC adds 9 operations to the 405C ISA. MAC instructions operate on 16-bit operands and accumulate 32-bit results. All operations take a single pipelined cycle. The MAC operations include both positive and negative MAC high-halfword-to-word, MAC low-halfword-to-word, and MAC cross-halfword-to-word, as well as MPY high-halfword-to-word, MPY low-halfword-to-word, and MPY cross-halfword-to-word operations.
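The halfword selection behind these operations can be modeled as follows. This Python sketch treats operands as signed 16-bit halves of 32-bit registers and wraps results to 32 bits. Two points are assumptions rather than statements from the text: that "cross" pairs the low half of the first operand with the high half of the second, and that results wrap instead of saturating. The function names are hypothetical.

```python
def _signed16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def _high(word):
    return _signed16((word & 0xFFFFFFFF) >> 16)


def _low(word):
    return _signed16(word)


# Which half of each operand feeds the multiplier.
SELECT = {
    "high": (_high, _high),
    "low": (_low, _low),
    "cross": (_low, _high),  # assumed pairing: low half of a, high half of b
}


def mpy(kind, a, b):
    """Halfword-to-word multiply; result wrapped to 32 bits."""
    pick_a, pick_b = SELECT[kind]
    return (pick_a(a) * pick_b(b)) & 0xFFFFFFFF


def mac(kind, acc, a, b, negative=False):
    """Halfword multiply-accumulate; negative=True subtracts the product."""
    pick_a, pick_b = SELECT[kind]
    product = pick_a(a) * pick_b(b)
    total = acc - product if negative else acc + product
    return total & 0xFFFFFFFF
```

Packing two 16-bit samples per register and choosing the halves in the instruction lets DSP-style loops process paired data without separate unpacking steps.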

See associated figure

PERFORMANCE

Scalar RISC processor

Static design—0 to 266 MHz (0.18 µm, 2.5 V), 0 to 380 MHz (0.18 µm, 1.8 V)

5-stage pipeline

1-cycle execution for most instructions

Supports 0-, 8-, 16-, or 32-KB I and D L1 caches

Multiply and divide

64-bit static branch prediction

1.5 mm² to 2.0 mm²

0.5- to 1.0-W power dissipation

FUNCTION

32-bit PowerPC RISC

MMU with 64 TLB entries

32 32-bit registers register file

2 execution units: ALU, MAC

JTAG block, also Trace FIFO

Processor local bus (PLB) connects to each cache

Device Control Register Bus for component configuration

64-bit timebase with 3 timers (fixed and programmable interval, and watchdog)

CORES

The 400 series has a number of cores. These cores differ in process, voltage, and cache sizes. The two supply voltages are 2.5 and 1.8 V. Values for 2.5 and 1.8 V are represented as x/y, with x the 2.5-V value. Power dissipation is given at the top clock rate for 2.5 V.

405A3: 0.25-µm, 2-mm² core, 0 to 200/300 MHz, 1.0 W, and 32-KB I, D caches

405B3: 0.25-µm, 2-mm² core, 0 to 200/300 MHz, 650 mW, and 1-KB/8-KB I, D caches

405D4: 0.18-µm, 1.4-mm² core, 0 to 266/390 MHz, 500 mW, and 16-KB I, D caches

The 440 Core is a high-performance, 2-way superscalar processor with out-of-order instruction issue, execution, and completion. It supports 32-KB I and D caches and has a 7-stage pipeline.

440: 0.1 µm, 0 to 440/550 MHz, 1.0 W, and 32-KB I and D caches