High performance is relative. A 64-bit processor is typically faster than a 32-bit processor, but not always. A 3-GHz, 32-bit processor will run rings around a 200-MHz, 64-bit processor, but speed is only part of the puzzle. Power consumption, compact code, small physical size, and reliability all come into play when considering a system design. That slower 64-bit processor may be just the thing for a high-performance MP3 player.
Processor designers deal with a variety of tradeoffs when developing high-performance systems. While most companies claim that all their features deliver high performance, only some tend to be unique. For example, the use of caching is almost universal for 32- and 64-bit processors. Likewise, single-instruction, multiple-data (SIMD) instructions are the norm for processors in multimedia environments.
Target-specific features like SIMD multimedia support are often included in a processor design because they give software developers a flexible programming environment. This is taken to the extreme with Java hardware acceleration where an entire environment gains from this support (see "Hardware Speeds Up Java," right). Java acceleration highlights the way that a feature can be implemented in varying degrees across a range of solutions to provide different levels of performance.
The table shows a range of features that we will examine in more detail. They span the processor spectrum from 8-bit microcontrollers through 64-bit multiprocessors. Some features address system interconnects using new high-speed interconnect standards like HyperTransport and RapidIO that are necessary if fast processors are to quickly exchange data with the outside world (see "Getting Data On- And Off-Chip Faster," p. 50).
A 64-bit device usually implies a high-performance system that often uses high-speed interconnect technology. This is a good place to begin examining high-performance processor features.
64 Bits—Big Integers, Address Space: All computing could be done with a 1-bit processor if it were infinitely fast. Because that's not the case, designers and programmers have pushed register width ever higher since 4-bit processors were born. The 64-bit powerhouses have proven their worth in areas like high-end servers and workstations that require a large address space and in embedded environments where integer computations benefit from a large number of bits.
A wide, 64-bit word means that a single instruction can handle an address calculation and a single register can hold any address. This reduces the number of instructions that must be executed to finish a job, increasing overall performance.
The MIPS64 from MIPS cuts the number of instructions required with its SIMD floating-point support. Most processors that handle SIMD address only byte or integer data. The MIPS64 floating-point support eliminates the need for specialized floating-point hardware for applications such as radar data processing.
Of course, integer SIMD is still very important in a variety of environments. The SH5 from SuperH implements a four-way SIMD processing unit that greatly speeds up SIMD data computations. Integer SIMD is important to handling multimedia data.
Intel chose a more radical approach to speeding up the system with its Itanium and the very-long-instruction-word (VLIW) Explicitly Parallel Instruction Computing (EPIC) architecture. EPIC places the job of scheduling instructions to its multiple computational units on the compiler instead of having the processor do this during program execution. The theory is that a compiler can do a better job with extensive static analysis of a program than a processor could do while the program is running. Unfortunately, it sacrifices performance with 32-bit x86 programs when running in compatibility mode. But most EPIC systems are expected to run new 64-bit code.
The EPIC architecture has been used primarily in high-end servers and workstations because of its high power requirement and cost. It probably won't find its way into embedded applications, but its architectural features may. Transmeta's Crusoe uses a VLIW architecture but hides its existence from programmers by presenting an x86 execution engine. The result is a low-power system that's ideal for embedded and portable applications.
IBM's PowerPC also tries to keep its processor humming with fast memory accesses and uses new process technology to keep things moving. Its copper connections and silicon-on-insulator technology reduce component size and increase connection speed—key features for embedded applications.
AMD's Opteron employs a range of high-performance features. One of its more novel components is its use of HyperTransport for shared memory access and peripheral support. The processor contains not one, but three HyperTransport links. The chip has a built-in dynamic memory controller, so each processor shares local memory with other processors via the HyperTransport links.
The memory architecture is called ccNUMA, for cache-coherent nonuniform memory access. Caches on all processors maintain up-to-date information, but accesses to nonlocal memory take more time than accesses to local memory.
Nonlocal data is forwarded through a mesh of processors with HyperTransport links until it reaches its destination. The impact of this overhead is minimized by caching, moving blocks of data in a packet, and prefetching. The resulting effective nonlocal access times are only a fraction longer than local access times.
32-Bit Power: Motorola's MPC8540 PowerPC doesn't use its RapidIO connections to address shared memory environments, though it takes advantage of the high-speed interconnect to connect to off-chip devices. It still has a built-in PCI-X interface and other built-in peripherals. But this is due more to the embedded nature of the chip's target market where a one-chip solution is frequently the answer. The RapidIO port offers significant expansion capabilities many one-chip solutions lack.
Embedded processor designers try to eke out even small improvements to gain an edge. ARM's ARM10 uses a variety of techniques, including its return-stack caching mechanism. In this case, register 14 holds the return address of a subroutine call. Without this caching mechanism, the indirect jump performed when returning from the subroutine causes a cache miss and a pipeline flush, slowing the return. The mechanism keeps the instruction that will execute after the return in the cache, so the pipeline can continue to be filled.
Intel's Pentium 4 designers know about the impact of flushing a pipeline. Its hyper-pipeline architecture is twice the depth of the pipelines in previous Intel designs. By executing a wide range of decode and execution actions simultaneously, it extends the advantages of pipelining. It allows the chip's multiple rapid execution engines to complete an operation in half a clock cycle.
The Pentium 4 also adds an execution-trace cache system that operates at the micro-op level. Typical caches simply keep the normal program instructions in local memory and decode them when the instruction is executed again. The Pentium 4 performs this decode but then caches the micro-ops generated by an instruction so the decode process won't have to be repeated.
16-Bit Microcontrollers: Motorola's HCS12 16-bit microcontroller family employs a specialized queue of only three instructions. This is sufficient due to the way the programs are executed from flash memory, which operates at half the speed of the processor.
The trick was to split the flash memory in half so two words could be read at once while the processor was executing the current instruction, leaving one more instruction at the head of the queue. The system simply has to keep ahead of the processing unit to run at full speed. So making the queue larger would not affect performance, but using the queue doubles performance. Designers of 32- and 64-bit chips would love to be able to add a feature providing this kind of payback.
Cyan's eCog can take advantage of a larger cache in more than one way. Normally, a cache is maintained transparently to the executing program. The data cached is based on the frequency of execution. Often-executed code is kept in the cache, while code not executed recently is discarded to make room for new information.
The eCog operates in the usual fashion. Yet, the program also can manipulate it so that portions of the cache can be locked down, preventing them from being discarded when additional information needs to be cached. This helps in many embedded environments where the designers know that rapid response is necessary for interrupts or applications like multimedia decoding. The approach essentially reduces the automatically maintained cache size but allows programmers to tune the system for faster response to anticipated actions. The ability to lock down information in the cache can also be used for debugging purposes.
Super 8-Bit Microcontrollers: The 16-bit high-performance processors keep pushing the 32-bit processors, and the 8-bitters still plug away at the 16-bit systems. Devices like Cygnal Integrated Products' 8051-based C8051F120 and Ubicom's IP2022 keep designers in the 8-bit realm even for high-performance microcontroller applications, such as wireless home gateways.
Cygnal's C8051F120 delivers 100-MIPS one-cycle instruction execution speed, versus the typical 6 to 12 cycles for other 8051s. The company incorporates prefetch buffers and a 256-byte cache similar to those found in 16- and 32-bit systems. Of course, the cache is smaller, but the performance gain is significant. The added speed lets the processor handle network communications easily.
Ubicom's IP2022 cranks single-cycle instructions at 120 MIPS. Its memory-to-memory architecture and tight integration with high-speed serial peripherals let it handle network-processing tasks that would drown most 16-bit processors. Many designers turn to 32-bit processors when the IP2022 would be sufficient.
More Is Better: Multiple processors are often used in high-performance systems because a single processor doesn't meet the mark. Multiprocessor architectures also provide incremental improvements or upgrades not supplied by single-processor solutions.
Broadcom's BCM1400 takes this a step further, packing four 64-bit MIPS CPU cores in one chip with three HyperTransport links. Multiple BCM1400s can be connected into a larger ccNUMA system with the links.
Texas Instruments' (TI) OMAP product line takes a more targeted approach by pairing an ARM processor with a TI DSP. An ideal combination for portable and multimedia applications, the pairing lets each processor handle the types of tasks it does best.
Intel's Xeon processor's Hyper-Threading technology makes a single-processor chip appear as a two-processor system. By adding a small amount of logic, the Xeon makes better use of its execution resources, although actual performance is less than that of a true dual-processor system.
Awareness of high-performance alternatives can save time and money when designing systems. Typically, a high-performance 8-bit solution costs less than the average 16-bit solution. The same is true when moving up the curve.
Need More Information?
Ajile Systems Inc.
Cyan Technology Inc.
Cygnal Integrated Products
MIPS Technologies Inc.
PCI-SIG PCI Express
RapidIO Trade Association
Texas Instruments Inc.