Smaller transistors and larger die sizes are radically changing the way 64-bit processors are implemented and where they're found. This holds true particularly for high-performance server and desktop computing, but also for embedded applications where 64-bit performance and throughput make many jobs practical.
There's a surprising amount of variety in the 64-bit arena, even taking into consideration the target audience. Some architectures span the space from embedded to giant server clusters (e.g., the PowerPC architecture). Others span the compatibility space (e.g., MIPS Technologies' 32- and 64-bit embedded processor line). Most provide some level of upward compatibility (e.g., AMD's support of the x86 architecture). Then there are those that start a line of their own (e.g., Intel's Explicitly Parallel Instruction Computing, or EPIC, architecture).
Products such as AMD's Opteron and Athlon 64, Intel's Itanium 2, and Sun's UltraSparc III architectures are explicitly designed and marketed for PC systems. Others, like SuperH's SH-5 (see Drill Deeper 7364, "64-Bit IP Provides Embedded Solutions," at www.elecdesign.com), are only available as intellectual property (IP) to be incorporated into custom, embedded applications. We'll look at the architectures that can be found in off-the-shelf products. For example, Broadcom is just one of many vendors providing a number of standard parts that incorporate the MIPS64 architecture.
AND THE WINNER IS...
Don't look for a dominant architecture. CISC, RISC, and the new EPIC designs are all delivering impressive performance and are unlikely to disappear any time soon. Most surprising is that each approach delivers comparable throughput. That still leaves lots of variance in performance based on what surrounds the core, including the amount and speed of the cache, the type and speed of the bus interface, and the silicon technology.
Compiler technology is also very important to the performance of 64-bit processors (see Drill Deeper 7365, "Compilers Critical To CPU's Success," at www.elecdesign.com). This is especially true when utilizing Intel's EPIC architecture and the SIMD vector support in IBM's AltiVec enhancements to the PowerPC.
A number of common tactics are used in the design of these high-performance processors. The first is large, multiple caches. While it's possible to get a MIPS processor with only a level 1 cache, most systems have at least a level 2 cache. Some, like Intel's Itanium 2, have megabytes of level 3 cache. Cache is key not only for individual execution threads, but also for handling the large number of threads normally found in most application environments. The second common item is on-chip memory controllers. Moving memory closer to the core reduces latency.
This crop of 64-bit processors may not be on top in terms of numbers shipped, but they definitely come out on top when it comes to crunching numbers.
THE 64-BIT x86
AMD aims to put a 64-bit processor on the desktop, laptop, and server. The Athlon 64 and Opteron have different names, yet they share a common AMD64 64-bit core that extends the x86 architecture in a fashion similar to past x86 migrations, from the 8086 to the 80286. In fact, the 32-bit legacy mode simply makes the processor look like a very fast 32-bit Athlon processor.
The AMD64 doubles the size and number of registers compared to the 32-bit Athlon and Pentium 4. AMD determined that 16 registers were the best combination for high performance, system overhead, and hardware real estate. The 64-bit registers are accessible in native 64-bit mode or in mixed 32/64-bit mode. AMD accomplishes this magic by including only three new instructions. Two are for mode changing, and one is a prefix byte that allows the CISC instruction stream to refer to the 64-bit registers. The average 32-bit instruction length is 3.2 bytes, whereas the 64-bit average only grows to 3.7 bytes.
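To make the prefix-byte idea concrete, here is a minimal Python sketch. It assumes the prefix in question is the one AMD's documentation calls REX, a single byte laid out as 0100WRXB. The helper name rex_prefix is ours and purely illustrative; the average instruction lengths are the figures quoted above.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte: fixed high nibble 0100, then the W, R, X, B bits.

    W selects a 64-bit operand size; R, X, and B each supply the extra
    register-number bit needed to reach registers 8 through 15.
    """
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


# 32-bit form: mov eax, ebx -> opcode 0x89, ModRM byte 0xD8
mov_32 = bytes([0x89, 0xD8])

# 64-bit form: mov rax, rbx -> the same two bytes with REX.W in front
mov_64 = bytes([rex_prefix(w=1)]) + mov_32

# Reaching a new register: mov r8, rbx sets REX.B as well
mov_r8 = bytes([rex_prefix(w=1, b=1)]) + mov_32

print(mov_32.hex(), mov_64.hex(), mov_r8.hex())

# Code-size growth implied by the quoted averages (3.2 vs. 3.7 bytes)
avg_32, avg_64 = 3.2, 3.7
growth = (avg_64 - avg_32) / avg_32
print(f"average instruction length grows by {growth:.1%}")
```

One extra byte per affected instruction is why the average length rises by only about half a byte: instructions that touch neither 64-bit operands nor the new registers need no prefix at all.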
HyperTransport is central to the AMD64 design. It provides high I/O bandwidth and doubles as a NUMA (non-uniform memory access) SMP (symmetrical multiprocessing) link that makes the creation of multiprocessor systems a snap. Single-processor incarnations have a single, non-cache-coherent HyperTransport link. Multiple processor chips have three cache-coherent HyperTransport links.
As with most 64-bit designs, the AMD64 increases performance through a number of methods such as the use of HyperTransport and a low-latency, on-chip double-data-rate (DDR) memory controller. A superscalar design with a number of execution units helps the AMD64 maintain high code execution performance.
The AMD64 architecture is new. Its success may push others to develop their own 64-bit x86 processors, but that's another story.
AMD ATHLON 64 AND OPTERON | |
• Target | Servers and PC |
• Availability | AMD |
• Architecture | CISC |
• Operating systems | x86-compatible OS, including Linux, Unix, Windows |
• Core | CISC, 6 instructions/cycle with double dispatch operations |
• Execution units | 3 integer, 3 address generation, 1 multiplier, 1 FP, 1 load/store, 1 branch |
• Register file | 16 integer, 8 FP, 16- by 128-bit media registers |
• Instruction set | 8-bit CISC/SIMD (MMX, 3D Now!, SSE, and SSE 2) |
• Memory | Level 1 cache: separate code/data 2- by 64-kbyte, 2-way; Level 2 cache: integrated 1 Mbyte, 16-way; On-chip DDR memory controller |
• Bus Interface | 16-bit HyperTransport (1 to 3 links) |
• Power management | Variable voltage and clock |
• Features | 32-bit x86-compatible mode, mixed 32/64-bit mode, built-in NUMA SMP support |
EPIC utilizes 32-bit RISC-like instructions, making a large cache critical to good performance as with most RISC designs. Itanium 2's three-level, on-chip cache configuration lets the chip deliver the kind of performance necessary for servers and high-end workstations. The level 1 cache operates at one clock cycle, while the level 2 cache operates at five to seven clocks.
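The effect of those latencies can be estimated with a back-of-the-envelope average-access-time calculation. The sketch below is only illustrative: the one-clock level 1 and roughly six-clock level 2 figures come from the text, while the hit rates and the 200-cycle main-memory penalty are made-up numbers, not Itanium 2 measurements. The function name is hypothetical.

```python
def average_access_cycles(levels, memory_cycles):
    """Expected cycles per access for a cache hierarchy.

    levels: list of (hit_rate, latency_cycles), ordered from level 1 outward.
            Each latency is the total cost of an access satisfied at that level.
    memory_cycles: cost of an access that misses every cache level.
    """
    expected = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, latency in levels:
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return expected + reach * memory_cycles


MEMORY_CYCLES = 200  # assumption for illustration only

# Level 1 only: 95% hit rate (assumed), 1-cycle latency (from the text)
l1_only = average_access_cycles([(0.95, 1)], MEMORY_CYCLES)

# Add a level 2 cache: 80% hit rate (assumed), about 6 cycles (from the text)
l1_l2 = average_access_cycles([(0.95, 1), (0.80, 6)], MEMORY_CYCLES)

print(f"L1 only: {l1_only:.2f} cycles, L1 + L2: {l1_l2:.2f} cycles")
```

Under these assumed numbers, the second cache level cuts the average cost of a memory access by more than two thirds, which is the kind of gain that justifies spending so much die area on cache.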
The Itanium 2 adds execution units to the original Itanium. Applications are upward-compatible, but without recompilation, programs will only take advantage of the number of execution units for which they were compiled. Such programs can still gain from other chip improvements, including a higher clock rate. The Itanium 2 gives the compiler a good deal of resources to exploit, including large integer and floating-point register files with register stack management. The latter handles 96 registers within the register file.
Even with EPIC's software-optimized architecture, the core of the Itanium 2 is actually smaller than the 32-bit Pentium core. The Itanium 2's large chip size comes from the large level 3 cache.
As with most 64-bit designs, Intel will push its next generation into multiple on-chip processors. Multithreading will make the next dual-processor chip look like four processors.
INTEL ITANIUM 2 | |
• Target | Servers and workstations |
• Availability | Intel |
• Architecture | EPIC |
• Operating systems | Linux, Unix, Windows |
• Core | EPIC, 11 issue ports, 8-stage, 6 instructions/cycle |
• Execution units | 6 integer, 2 FP, 2 load/store, 3 branch |
• Register file | 128 integer, 128 FP, 64 predicate registers |
• Instruction set | 32-bit EPIC/SIMD |
• Memory | Level 1 cache: integrated 32-kbyte code, 32-kbyte data, 2-way; Level 2 cache: integrated 256-kbyte minimum; Level 3 cache: up to 6 Mbytes; On-chip memory controller |
• Bus Interface | 6.4 Gbytes/s, 128 bit |
• Power management | Variable voltage and clock |
• Features | x86 emulation mode, bi-endian support, Processor Abstraction Layer (PAL), 4-Gbyte maximum page size, zero clock branch prediction |
The PowerPC has a classic RISC architecture. An out-of-order, speculative superscalar design keeps it pumping out up to 8.5 instructions per cycle. Still, by keeping the design simple, it's possible to pack two processor cores with up to a dozen execution units apiece in one chip. Multithreading in future products will let the chips handle even more threads simultaneously.
SIMD performance has made the PowerPC the architecture of choice for a variety of applications. Its AltiVec support offers greater power and flexibility than the SIMD instructions found in many other processors designed to handle multimedia processing. AltiVec is more general, with features like 32 dedicated 128-bit vector registers, four register operands, 162 vector instructions, and concurrent scalar floating-point operations. Because the vector registers are dedicated, AltiVec places no restrictions on the use of the floating-point registers or on context switching. Some implementations incorporate two execution units and support up to two concurrent vector operations per cycle.
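A four-operand vector instruction is easiest to picture as a multiply-add: three source registers and one destination, with the same operation applied to every element. The Python sketch below models a 128-bit register as four 32-bit floats. It only illustrates the lane structure; the function names are hypothetical, and the model ignores the single-rounding behavior of a real fused multiply-add.

```python
import struct

LANES = 4            # 128 bits divided into 32-bit single-precision elements
FORMAT = ">4f"       # four big-endian floats = 16 bytes = 128 bits


def pack(values):
    """Pack four floats into a 16-byte 'vector register'."""
    return struct.pack(FORMAT, *values)


def unpack(register):
    """Unpack a 16-byte 'vector register' into a tuple of four floats."""
    return struct.unpack(FORMAT, register)


def vec_madd_model(a, b, c):
    """Per-lane a * b + c: three source registers in, one destination out."""
    return pack(x * y + z for x, y, z in zip(unpack(a), unpack(b), unpack(c)))


samples = pack([1.0, 2.0, 3.0, 4.0])
gain = pack([0.5] * LANES)
offset = pack([10.0] * LANES)

result = vec_madd_model(samples, gain, offset)
print(len(result) * 8, "bits:", unpack(result))
```

A single operation here scales and offsets four samples at once, which is why this style of instruction suits signal and multimedia processing.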
The PowerPC 970 doesn't use HyperTransport, but its processor bus has some similarities. The bus is 32 bits wide, with a pair of unidirectional 35-bit channels providing up to 7.1 Gbytes/s of usable bandwidth. It supports SMP processor synchronization, cache coherency, and out-of-order data transfers. Additionally, sideband signals efficiently support snoop coherency operations for SMP configurations.
The 64-bit PowerPC architecture successfully addresses a wide range of applications, making it one of the most flexible solutions available.
IBM/MOTOROLA PowerPC | |
• Target | Embedded, servers and PC |
• Availability | IBM, Motorola |
• Architecture | RISC |
• Operating systems | Linux, Unix, Mac OS, plus a wide range of RTOSs |
• Core | Issue up to 10 instructions/cycle, Finish up to 5 instructions/cycle, 10 pipelines, 5- to 13- stage |
• Execution units | 2 integer, 2 FP, 2 load/store, 1 branch, 2 SIMD, 1 condition |
• Register file | 32 integer, 32 floating point, 32 128-bit vector |
• Instruction set | 32-bit RISC/AltiVec SIMD |
• Memory | Level 1 cache: 64-kbyte code, direct-mapped; 32-kbyte data, 2-way; Level 2 cache: integrated 512-kbyte, 2-way |
• Bus Interface | 7.1-Gbyte/s usable bandwidth |
• Power management | Variable voltage and clock |
• Features | Multicore, multithreaded; supports 32-bit PowerPC code |
• Note | These specifications are for the IBM PowerPC 970 |
Single-processor-core MIPS chips have been the norm, but the architecture's compact design and rising transistor counts have led to quad-processor chips. MIPS architects have kept a number of tenets in mind, such as "don't add instructions to an architecture that would impede implementation." A streamlined design leads to higher performance with minimal hardware.
Because of the IP-oriented nature of MIPS Technologies, the incarnation of MIPS processors is quite varied. Low-end 64-bit processors may incorporate only a level 1 cache, while others may implement a four-way superscalar core. Minimizing cache complexity can improve determinism that's necessary in many embedded environments. In many instances, custom instructions are added to the standard MIPS instruction set. The 64-bit architecture supports compact 32-bit instructions, as well as 64-bit instructions and even DSP-style instructions.
MIPS' multithreading support includes two key features: virtual processor emulation (VPE) and fine-grain thread support. These enhancements to the architecture raise the number of simultaneous threads that a particular implementation can handle.
The embedded arena has different demands compared to desktops and servers. Low interrupt latency, fast task switching, and bit-manipulation instructions are key to the success of the MIPS64 architecture.
MIPS MIPS64 | |
• Target | Embedded |
• Availability | IP from MIPS, chips from a variety of sources |
• Architecture | RISC |
• Operating systems | Wide variety of embedded RTOSs, including Linux and Windows CE |
• Core | Single pipeline up to 5-stage, 4-way superscalar |
• Execution units | 1 integer, 1 FP, 1 load/store, 1 multiply/divide |
• Register file | 32 integer, 32- by 64-bit FP (optional) |
• Instruction set | 32/64-bit RISC/SIMD (optional) |
• Memory | Level 1 cache: chip-specific, 4-way; Level 2 cache: chip-specific; On-chip DDR memory controller; Virtual or physical addressing |
• Bus Interface | Chip-specific |
• Power management | Variable voltage and clock |
• Features | DSP-style instructions, fixed-point arithmetic, 3-operand instruction format, compatible with MIPS32, co-processor support |
Sun's UltraSparc architecture uses a register window design instead of a memory-based stack. This proven technology is similar to the one adopted in Intel's EPIC architecture. The UltraSparc has included a number of features now appearing in other 64-bit designs, such as an integrated memory controller. Its 4-Mbyte maximum page size helps keep large memory applications (e.g., database servers) humming. The UltraSparc III and IV are designed to work well in multiprocessor systems, from dual-processor systems through systems with hundreds of processors.
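The register window idea can be sketched in a few lines of Python. This is a simplified model, not Sun's implementation: each window sees eight "in," eight "local," and eight "out" registers, and a caller's outs become the callee's ins, so arguments pass without touching memory. The class and method names are hypothetical, the direction in which the window pointer moves is an arbitrary modeling choice, and window overflow (spilling to memory when calls nest deeper than the number of windows) is left out. The count of four global register sets is an assumption used to reproduce the 160-register figure in the table below.

```python
class RegisterWindows:
    """Toy model of overlapping register windows (no overflow handling)."""

    REGS_PER_WINDOW = 16  # 8 locals + 8 outs owned by each window

    def __init__(self, nwindows=8):
        self.nwindows = nwindows
        self.file = [0] * (nwindows * self.REGS_PER_WINDOW)
        self.cwp = 0  # current window pointer

    def _slot(self, kind, n):
        if kind == "local":
            return self.cwp * 16 + n
        if kind == "out":
            return self.cwp * 16 + 8 + n
        if kind == "in":
            # A window's ins are physically the caller's outs.
            caller = (self.cwp - 1) % self.nwindows
            return caller * 16 + 8 + n
        raise ValueError(kind)

    def write(self, kind, n, value):
        self.file[self._slot(kind, n)] = value

    def read(self, kind, n):
        return self.file[self._slot(kind, n)]

    def call(self):
        self.cwp = (self.cwp + 1) % self.nwindows

    def ret(self):
        self.cwp = (self.cwp - 1) % self.nwindows


rw = RegisterWindows()
rw.write("out", 0, 42)          # caller places an argument in an out register
rw.call()
argument = rw.read("in", 0)     # callee sees it as an in register
rw.write("in", 0, argument + 1) # callee leaves its return value in place
rw.ret()
returned = rw.read("out", 0)    # caller reads the result from the same register
print(argument, returned)

# Register count: 8 windows x 16 registers, plus 4 sets of 8 globals (assumed)
total_integer_registers = 8 * 16 + 4 * 8
print(total_integer_registers)
```

The call and return steps move only a pointer, which is where the speed advantage over saving and restoring registers through a memory-based stack comes from.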
The newest UltraSparc design employs a dual-processor core design. Each core can execute four instructions per cycle using a superscalar design. Niagara, the next-generation design, promises to incorporate eight processor cores on a single chip, with each core handling up to four threads at a time for a whopping total of 32 threads per chip.
While good compilers are key to the success of many 64-bit processors, the Solaris operating system has been key to the success of the UltraSparc. Managing large multiprocessor systems isn't simply a matter of increasing the size of a control array. Of course, keeping Solaris on top means that the UltraSparc architecture must keep delivering the underlying hardware performance.
SUN MICROSYSTEMS ULTRASPARC | |
• Target | Servers and workstations |
• Availability | Sun |
• Architecture | RISC |
• Operating systems | Solaris, Linux, assorted RTOSs |
• Core | 4 instructions/cycle, 6 execution pipelines, 14-stage |
• Execution units | 2 integer, 2 FP, 1 load/store, 1 branch |
• Register file | 160 integer (8 register window), 32 FP or 16- by 128-bit FP |
• Instruction set | 32-bit RISC/SIMD |
• Memory | Level 1 cache: code 32-kbyte, 4-way, data 64-kbyte, 4-way; Level 2 cache: off-chip, 8-Mbyte maximum, 2-way; On-chip memory controller |
• Bus Interface | Sun Fireplane |
• Power management | Variable voltage and clock |
• Features | Virtual memory pages up to 4 Mbytes |
Need More Information? | |
• AMD | www.amd.com |
• Broadcom | www.broadcom.com |
• IBM | www.ibm.com |
• Intel | www.intel.com |
• MIPS Technologies | www.mips.com |
• Motorola | www.motorola.com |
• Sun Microsystems | www.sun.com |
• Texas Instruments | www.ti.com |