Electronic Design

64-Bit Processors Promise Power-Packed Solutions

Wide buses, sophisticated pipelines, and large caches help high-end processors handle hundreds of threads.

Smaller transistors and larger die sizes are radically changing the way 64-bit processors are implemented and where they're found. This holds true particularly for high-performance server and desktop computing, but also for embedded applications where 64-bit performance and throughput make many jobs practical.

There's a surprising amount of variety in the 64-bit arena, even taking into consideration the target audience. Some architectures span the space from embedded to giant server clusters (e.g., the PowerPC architecture). Others span the compatibility space (e.g., MIPS Technologies' 32- and 64-bit embedded processor line). Most provide some level of upward compatibility (e.g., AMD's support of the x86 architecture). Then there are those that start a line of their own (e.g., Intel's Explicitly Parallel Instruction Computing, or EPIC, architecture).

Products such as AMD's Opteron and Athlon 64, Intel's Itanium 2, and Sun's UltraSparc III architectures are explicitly designed and marketed for PC systems. Others, like SuperH's SH-5 (see Drill Deeper 7364, "64-Bit IP Provides Embedded Solutions," at www.elecdesign.com), are only available as intellectual property (IP) to be incorporated into custom, embedded applications. We'll look at the architectures that can be found in off-the-shelf products. For example, Broadcom is just one of many vendors providing a number of standard parts that incorporate the MIPS64 architecture.

Don't look for a dominant architecture. CISC, RISC, and the new EPIC designs are all delivering impressive performance and are unlikely to disappear any time soon. Most surprising is that each approach delivers comparable throughput. Even so, performance varies widely based on what surrounds the core, including the amount and speed of the cache, the type and speed of the bus interface, and the silicon technology.

Compiler technology is also very important to the performance of 64-bit processors (see Drill Deeper 7365, "Compilers Critical To CPU's Success," at www.elecdesign.com). This is especially true when utilizing Intel's EPIC architecture and the SIMD vector support in IBM's AltiVec enhancements to the PowerPC.

A number of common tactics are used in the design of these high-performance processors. The first is large, multiple caches. While it's possible to get a MIPS processor with only a level 1 cache, most systems have at least a level 2 cache. Some, like Intel's Itanium 2, have megabytes of level 3 cache. Cache is key not only for individual execution threads, but also for handling the large number of threads normally found in most application environments. The second common item is on-chip memory controllers. Moving memory closer to the core reduces latency.

This crop of 64-bit processors may not be on top in terms of numbers shipped, but they definitely come out on top when it comes to crunching numbers.

THE 64-BIT x86
AMD aims to put a 64-bit processor on the desktop, laptop, and server. The Athlon 64 and Opteron have different names, yet they share a common AMD64 64-bit core that extends the x86 architecture in a fashion similar to past x86 migration, from the 8086 to the 80286. In fact, the 32-bit legacy mode simply makes the processor look like a very fast 32-bit Athlon processor.

The AMD64 doubles the size and number of registers compared to the 32-bit Athlon and Pentium 4. AMD determined that 16 registers were the best combination for high performance, system overhead, and hardware real estate. The 64-bit registers are accessible in native 64-bit mode or in mixed 32/64-bit mode. AMD accomplishes this magic by including only three new instructions. Two are for mode changing, and one is a prefix byte that allows the CISC instruction stream to refer to the 64-bit registers. The average 32-bit instruction length is 3.2 bytes, whereas the 64-bit average only grows to 3.7 bytes.
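
The prefix byte described above is what the published AMD64 encoding calls the REX prefix. The following is a minimal sketch, not AMD's own code: the function name `decode_rex` is hypothetical, and the bit layout (a byte in the range 0x40 to 0x4F whose low four bits are W, R, X, and B) comes from the AMD64 instruction encoding rather than from this article. The last lines work out the code growth implied by the 3.2-byte and 3.7-byte averages quoted above, which comes to roughly 16%.

```python
def decode_rex(byte):
    """Return the REX flag bits for a prefix byte, or None if it is not a REX prefix."""
    if byte & 0xF0 != 0x40:
        return None  # REX prefixes occupy 0x40 through 0x4F only
    return {
        "W": (byte >> 3) & 1,  # 1 selects a 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the ModRM r/m or SIB base field
    }


# The bytes 48 89 D8 encode "mov rax, rbx": the 32-bit form 89 D8
# ("mov eax, ebx") preceded by a REX prefix with only the W bit set.
print(decode_rex(0x48))
print(decode_rex(0x89))  # an opcode byte, not a prefix

# Relative code growth implied by the article's average instruction lengths.
growth = (3.7 - 3.2) / 3.2
print(f"{growth:.1%}")
```

Because the prefix is needed only when an instruction touches a 64-bit operand or one of the eight added registers, most instructions keep their existing encoding, which is consistent with the modest growth in average length.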

HyperTransport is central to the AMD64 design. It provides high I/O bandwidth and doubles as a NUMA (non-uniform memory access) SMP (symmetrical multiprocessing) link that makes the creation of multiprocessor systems a snap. Single-processor incarnations have a single, non-cache-coherent HyperTransport link. Multiprocessor chips have three cache-coherent HyperTransport links.

As with most 64-bit designs, the AMD64 increases performance through a number of methods such as the use of HyperTransport and a low-latency, on-chip double-data-rate (DDR) memory controller. A superscalar design with a number of execution units helps the AMD64 maintain high code execution performance.

The AMD64 architecture is new. Its success may push others to develop their own 64-bit x86 processors, but that's another story.

• Target Servers and PC
• Availability AMD
• Architecture CISC
• Operating systems x86-compatible OS, including Linux, Unix, Windows
• Core CISC, 6 instructions/cycle with double dispatch operations
• Execution units 3 integer, 3 address generation, 1 multiplier, 1 FP, 1 load/store, 1 branch
• Register file 16 integer, 8 FP, 16- by 128-bit media registers
• Instruction set 8-bit CISC/SIMD (MMX, 3D Now!, SSE, and SSE 2)
• Memory Level 1 cache: separate code/data 2- by 64-kbyte, 2-way; Level 2 cache: integrated 1 Mbyte, 16-way; On-chip DDR memory controller
• Bus Interface 16-bit HyperTransport (1 to 3 links)
• Power management Variable voltage and clock
• Features 32-bit x86-compatible mode, mixed 32/64-bit mode, built-in NUMA SMP support

Intel's Itanium 2 EPIC (Explicitly Parallel Instruction Computing) architecture is the newest of all 64-bit processors. But it's backed by a tremendous amount of research and development. The architecture moves the burden of optimizing the use of multiple execution and system resources from the hardware to software. In particular, the compiler now has the job of addressing instruction ordering and branch prediction. These tasks are normally handled by complex hardware on other designs.
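
To make the idea of compiler-side instruction ordering concrete, here is a simplified sketch of static scheduling. It is not Intel's algorithm or the actual EPIC bundle format. The function `schedule`, the instruction tuples, and the register names are all hypothetical, and the model tracks only read-after-write dependencies under the assumption that each register is written once. The default width of six matches the six instructions per cycle listed in the Itanium 2 specifications in this article.

```python
def schedule(instrs, width=6):
    """Greedy static scheduling into issue groups.

    Each instruction is a (name, dest, sources) tuple. It is placed in the
    earliest group that comes after every group producing one of its sources
    and that still has a free slot.
    """
    groups = []        # each entry lists the instructions issued together
    written_in = {}    # register -> index of the group that writes it
    for name, dest, srcs in instrs:
        earliest = 0
        for reg in srcs:
            if reg in written_in:
                earliest = max(earliest, written_in[reg] + 1)
        g = earliest
        while g < len(groups) and len(groups[g]) >= width:
            g += 1     # skip groups that are already full
        while len(groups) <= g:
            groups.append([])
        groups[g].append(name)
        written_in[dest] = g
    return groups


program = [
    ("load a", "r1", []),
    ("load b", "r2", []),
    ("add", "r3", ["r1", "r2"]),
    ("load c", "r4", []),
    ("mul", "r5", ["r3", "r4"]),
    ("sub", "r6", ["r1", "r4"]),
]
for cycle, group in enumerate(schedule(program)):
    print(cycle, group)
```

In this toy program the three loads land in the first group, the add and subtract share the second, and the multiply waits for the third. All of that is decided before the program runs, which is the work an EPIC compiler takes over from the hardware.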

EPIC utilizes 32-bit RISC-like instructions, making a large cache critical to good performance as with most RISC designs. Itanium 2's three-level, on-chip cache configuration lets the chip deliver the kind of performance necessary for servers and high-end workstations. The level 1 cache operates at one clock cycle, while the level 2 cache operates at five to seven clocks.
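
A rough way to see why these cache latencies matter is the standard average-access-time calculation. The sketch below takes the one-clock level 1 figure and the upper end of the five-to-seven-clock level 2 figure from the paragraph above. Everything else is an assumption made for illustration: the hit rates, the single 200-clock cost for anything that misses level 2 (which lumps level 3 and main memory together), and the function name `average_access_clocks`.

```python
def average_access_clocks(l1_hit, l2_hit, l1_clocks=1, l2_clocks=7,
                          beyond_l2_clocks=200):
    """Average clocks per memory access for a simplified two-level lookup.

    l1_hit and l2_hit are hit rates between 0 and 1. beyond_l2_clocks is an
    assumed flat cost for any access that misses both levels.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_clocks + l1_miss * (l2_clocks + l2_miss * beyond_l2_clocks)


# Hypothetical hit rates: 95% in level 1, 90% of the remainder in level 2.
print(round(average_access_clocks(0.95, 0.90), 2))
# The same workload if level 2 caught only half of the level 1 misses.
print(round(average_access_clocks(0.95, 0.50), 2))
```

Even with made-up numbers, the shape of the result holds: a small drop in hit rate at the outer level dominates the average, which is why large caches figure so heavily in these designs.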

The Itanium 2 adds execution units to the original Itanium. Applications are upward-compatible, but without recompilation, programs will only take advantage of the number of execution units for which they were compiled. These programs can still gain from other chip improvements, including a higher clock rate. The Itanium 2 has a good deal of resources for the compiler to exploit. This includes a large integer and floating-point register file with register stack management. The latter handles 96 registers within the register file.

Even with EPIC's software-optimized architecture, the core of the Itanium 2 is actually smaller than the 32-bit Pentium core. The Itanium 2's large chip size comes from the large level 3 cache.

As with most 64-bit designs, Intel will push its next generation into multiple on-chip processors. Multithreading will make the next dual-processor chip look like four processors.

• Target Servers and workstations
• Availability Intel
• Architecture EPIC
• Operating systems Linux, Unix, Windows
• Core EPIC, 11 issue ports, 8-stage, 6 instructions/cycle
• Execution units 6 integer, 2 FP, 2 load/store, 3 branch
• Register file 128 integer, 128 FP, 64 predicate registers
• Instruction set 32-bit EPIC/SIMD
• Memory Level 1 cache: integrated 32-kbyte code, 32-kbyte data, 2-way; Level 2 cache: integrated 256-kbyte minimum; Level 3 cache: up to 6 Mbytes; On-chip memory controller
• Bus Interface 6.4 Gbytes/s, 128 bit
• Power management Variable voltage and clock
• Features x86 emulation mode, bi-endian support, Processor Abstraction Layer (PAL), 4-Gbyte maximum page size, zero clock branch prediction

A 64-bit processor comes in handy in embedded applications, as well as in workstations and servers. The PowerPC has found a home on high-end servers driven by IBM's Power4 and Power5 chips, down to embedded processors like the PowerPC 970 with high-performance AltiVec SIMD support. That's not bad for an architecture that maintains compatibility with most 32-bit PowerPCs.

The PowerPC has a classic RISC architecture. An out-of-order, speculative superscalar design keeps it pumping out up to 8.5 instructions per cycle. Still, by keeping the design simple, it's possible to pack two processor cores with up to a dozen execution units apiece in one chip. Multithreading in future products will let the chips handle even more threads simultaneously.

SIMD performance has made the PowerPC the architecture of choice for a variety of applications. It offers greater power and flexibility than SIMD instructions found in many other processors designed to handle multimedia processing. The AltiVec SIMD support is more general with features like 32 dedicated 128-bit vector registers, four register operands, 162 vector instructions, and concurrent scalar floating-point operations. No restrictions on the use of floating-point registers or in context switching are required. Some implementations incorporate two execution units and support up to two concurrent vector operations per cycle.
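
The four-register-operand form mentioned above is easiest to see in a multiply-add, which reads three source vectors and writes one result. The sketch below is only a model in plain Python, not AltiVec code. It treats a 128-bit register as four 32-bit floating-point lanes, and while the function is named `vec_madd` after the AltiVec C intrinsic of that name, this version is a stand-in that uses ordinary Python floats and so is not bit-exact with the hardware.

```python
LANES = 4  # one 128-bit vector register viewed as four 32-bit floats


def vec_madd(a, b, c):
    """Lane-wise a * b + c: three source vectors in, one result vector out."""
    assert len(a) == len(b) == len(c) == LANES
    return [x * y + z for x, y, z in zip(a, b, c)]


def scalar_madd(a, b, c):
    """The same work done one element at a time, as a scalar unit would."""
    result = []
    for i in range(LANES):
        result.append(a[i] * b[i] + c[i])
    return result


a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]
c = [10.0, 10.0, 10.0, 10.0]
print(vec_madd(a, b, c))  # one vector operation covers all four lanes
```

The scalar version needs four multiply-add steps for the same data. Since the operation does not overwrite any of its inputs, the source vectors remain available for later instructions without extra copies.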

The PowerPC 970 doesn't use HyperTransport, but its processor bus has some similarities. The bus is 32 bits with a pair of unidirectional 35-bit channels providing up to 7.1 Gbytes/s of usable bandwidth. It supports SMP processor synchronization, cache coherency, and out-of-order data transfers. Additionally, sideband signals efficiently support snoop coherency operations for SMP configurations.

The 64-bit PowerPC architecture successfully addresses a wide range of applications, making it one of the most flexible solutions available.

• Target Embedded, servers and PC
• Availability IBM, Motorola
• Architecture RISC
• Operating systems Linux, Unix, Mac OS, plus a wide range of RTOSs
• Core Issue up to 10 instructions/cycle, Finish up to 5 instructions/cycle, 10 pipelines, 5- to 13- stage
• Execution units 2 integer, 2 FP, 2 load/store, 1 branch, 2 SIMD, 1 condition
• Register file 32 integer, 32 floating point, 32 128-bit vector
• Instruction set 32-bit RISC/AltiVec SIMD
• Memory Level 1 cache: 64-kbyte code, direct-mapped, 32-kbyte data, 2-way; Level 2 cache: integrated 512-kbyte, 2-way
• Bus Interface 7.1-Gbyte/s usable bandwidth
• Power management Variable voltage and clock
• Features Multicore, multithreaded; supports 32-bit PowerPC code
• Note These specifications are for the IBM PowerPC 970

MIPS Technologies recognized early on that the desktop and server market was getting a bit crowded, and the embedded environment was growing tremendously. It's now one of the dominant 32- and 64-bit vendors in this arena. But don't look for MIPS chips from MIPS Technologies because it licenses designs to vendors like Broadcom.

Single-processor-core MIPS chips have been the norm, but the architecture's compact design and rising transistor counts have led to quad-processor chips. MIPS architects have kept a number of tenets in mind, such as "don't add instructions to an architecture that would impede implementation." A streamlined design leads to higher performance with minimal hardware.

Because of the IP-oriented nature of MIPS Technologies, the incarnation of MIPS processors is quite varied. Low-end 64-bit processors may incorporate only a level 1 cache, while others may implement a four-way superscalar core. Minimizing cache complexity can improve determinism that's necessary in many embedded environments. In many instances, custom instructions are added to the standard MIPS instruction set. The 64-bit architecture supports compact 32-bit instructions, as well as 64-bit instructions and even DSP-style instructions.

MIPS' multithreading support includes two key features: virtual processing elements (VPEs) and fine-grain thread support. These enhancements to the architecture raise the number of simultaneous threads that a particular implementation can handle.

The embedded arena has different demands compared to desktops and servers. Low interrupt latency, fast task switching, and bit-manipulation instructions are key to the success of the MIPS64 architecture.

• Target Embedded
• Availability IP from MIPS, chips from a variety of sources
• Architecture RISC
• Operating systems Wide variety of embedded RTOSs, including Linux and Windows CE
• Core Single pipeline up to 5-stage, 4-way superscalar
• Execution units 1 integer, 1 FP, 1 load/store, 1 multiply/divide
• Register file 32 integer, 32- by 64-bit FP (optional)
• Instruction set 32/64-bit RISC/SIMD (optional)
• Memory Level 1 cache: chip-specific, 4-way; Level 2 cache: chip-specific; On-chip DDR memory controller; Virtual or physical addressing
• Bus Interface Chip-specific
• Power management Variable voltage and clock
• Features DSP-style instructions, fixed-point arithmetic, 3-operand instruction format, compatible with MIPS32, co-processor support

Look under the hood of large enterprise clusters and the engine you'll find is Sun's 64-bit UltraSparc processor. One of the first successful RISC architectures, the UltraSparc has been available in a variety of incarnations. Occasionally, it has found a home in high-end embedded applications, but its real calling is high-performance workstations and servers.

The architecture uses a register window design instead of a memory-based stack. This proven technology is similar to the one adopted in Intel's EPIC architecture. The UltraSparc has included a number of features now appearing in other 64-bit designs, such as an integrated memory controller. Its 4-Mbyte maximum page size helps keep large memory applications (e.g., database servers) humming. The UltraSparc III and IV are designed to work well in multiprocessor systems from dual-processor systems through systems with hundreds of processors.
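
The register window idea can be illustrated with a small model. This is a simplified sketch rather than a description of Sun's hardware. The class `RegisterWindows` and its method names are hypothetical, the model covers only the windowed registers (no globals), and it ignores what happens when a program nests deeper than the number of windows. The point it shows is the overlap: a caller's "out" registers are the same physical registers as the callee's "in" registers, so arguments and return values pass without touching a memory stack.

```python
class RegisterWindows:
    """Toy model of overlapping register windows (8 in, 8 local, 8 out)."""

    def __init__(self, nwindows=8):
        self.nwindows = nwindows
        self.regs = [0] * (16 * nwindows)  # each window adds 16 new registers
        self.cwp = 0                       # current window pointer

    def _index(self, kind, n):
        # The "in" block of one window is the "out" block of the next one up.
        offset = {"out": 0, "local": 8, "in": 16}[kind]
        return (16 * self.cwp + offset + n) % len(self.regs)

    def read(self, kind, n):
        return self.regs[self._index(kind, n)]

    def write(self, kind, n, value):
        self.regs[self._index(kind, n)] = value

    def save(self):
        """Enter a procedure: slide to a fresh window."""
        self.cwp = (self.cwp - 1) % self.nwindows

    def restore(self):
        """Leave a procedure: slide back to the caller's window."""
        self.cwp = (self.cwp + 1) % self.nwindows


w = RegisterWindows()
w.write("out", 0, 42)      # caller places an argument
w.save()                   # call
arg = w.read("in", 0)      # callee sees the argument with no memory copy
w.write("local", 0, 7)     # callee scratch, private to this window
w.write("in", 0, arg + 1)  # callee leaves its return value
w.restore()                # return
print(w.read("out", 0), w.read("local", 0))
```

After the return, the caller finds the result in its "out" register, and its own local registers are untouched by the callee's scratch writes.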

The newest UltraSparc design employs a dual-processor core design. Each core can execute four instructions per cycle using a superscalar design. Niagara, the next-generation design, promises to incorporate eight processor cores on a single chip, with each core handling up to four threads at a time for a whopping total of 32 threads per chip.

While good compilers are key to the success of many 64-bit processors, the Solaris operating system has been key to the success of the UltraSparc. Managing large multiprocessor systems isn't simply a matter of increasing the size of a control array. Of course, keeping Solaris on top means that the UltraSparc architecture must keep delivering the underlying hardware performance.

• Target Servers and workstations
• Availability Sun
• Architecture RISC
• Operating systems Solaris, Linux, assorted RTOSs
• Core 4 instructions/cycle, 6 execution pipelines, 14-stage
• Execution units 2 integer, 2 FP, 1 load/store, 1 branch
• Register file 160 integer (8 register window), 32 FP or 16- by 128-bit FP
• Instruction set 32-bit RISC/SIMD
• Memory Level 1 cache: code 32-kbyte, 4-way, data 64-kbyte, 4-way; Level 2 cache: off-chip, 8-Mbyte maximum, 2-way; On-chip memory controller
• Bus Interface Sun Fireplane
• Power management Variable voltage and clock
• Features Virtual memory pages up to 4 Mbytes
