Enhanced Technology Moves The x86 Into The 21st Century

The Intel x86 architecture has had a long and sometimes convoluted history. Challenged by advanced RISC architectures and by very-long-instruction-word (VLIW) designs, it may be kept out of high-end applications. Still, the x86 architecture has a staying power that not only keeps it around, but allows it to flourish even in new areas, such as low-end embedded applications.

Originally available from a single source, the x86 architecture started as the 16-bit, multiple-clock/instruction 8086. Today, x86-architecture processors are available from multiple sources. Performance-oriented chips use highly pipelined execution units, and the architecture is moving from 32 bits into the 64-bit realm. While the original segmented-memory architecture of the 8086 remains, a flat memory space is the environment of choice for most x86-based operating systems, such as Windows ME, Windows 2000, and Linux.

All of this growth in the x86 space is due to such companies as Intel Corp., Advanced Micro Devices Inc. (AMD), VIA Technologies Inc., and Transmeta Corp. bringing the architecture to places where it hadn't been before. Intel is pushing state-of-the-art pipelining to keep performance climbing. In addition, it has merged the x86 architecture with its new 64-bit Itanium processor line. But this hardware support is intended more for compatibility than as a long-term 64-bit plan for the x86 architecture. AMD, on the other hand, has pushed the x86 architecture into the 64-bit realm using new instructions and registers.

Adding new instructions and registers to the x86 architecture isn't a new phenomenon. In fact, the 8086 architecture first advanced along this path with the integration of a floating-point unit. Later advancement came with the addition of single-instruction/multiple-data (SIMD) instructions. These are part of Intel's multimedia extension (MMX) support and AMD's 3DNow! instructions.

The x86 architecture is being pushed and prodded from all sides. Intel's new Pentium 4 line sticks with the 32-bit architecture, but packs it with a 20-stage hyperpipeline to provide increased performance. AMD, alternatively, has pushed the x86 architecture into the 64-bit space. It will be interesting to see how this arena takes to the x86 architecture, especially given Intel's preference for combining its 32-bit x86 core with its 64-bit VLIW core in the 64-bit Itanium line.

The VLIW approach has cropped up more than once with the x86 architecture. The Transmeta Crusoe, for example, utilizes a VLIW core to execute x86 instructions, but it does this through a process called code morphing. The approach is significantly different from Intel's Itanium approach.

Power and performance aren't the only areas where innovation with the x86 architecture can be found. Higher integration with peripheral and peripheral-support chips is taking the x86 architecture into the embedded space, which is dominated by a large collection of different processors.

New implementations of the x86 architecture are more compact and use less power than previous versions, making them a real alternative to non-x86 designs. The x86 architecture is showing up in a number of system-on-a-chip (SoC) designs. Many of these provide PC compatibility, incorporating everything from parallel printer port interfaces to Universal Serial Bus (USB) hubs.

VIA Technologies incorporates the Northbridge support with the processor while National Semiconductor's GX1-based products incorporate a two-dimensional (2D) video accelerator. Furthermore, STMicroelectronics' STPC and Rise Technology's SCX501 combine video support with an x86 core. ZF Linux Devices' MachZ provides an interesting twist that lets designers use its cache as conventional memory so that the system can always boot.

Given this variety of x86 implementations, it's best to begin by examining one. Intel's Pentium 4 is a good place to start, as it's the successor to the popular Pentium III.

Hyperpipelining And Other Magic

Intel's Pentium 4 maintains the logical x86 architecture supported by the Pentium III, but its internal architecture is significantly different. Called the NetBurst microarchitecture, it also differs considerably from the internals of most other x86 processors (Fig. 1).

The Pentium 4 has a 3.2-Gbyte/s system bus interface that provides access to the external 400-MHz system bus. The Pentium 4 processor speed starts at 1.4 GHz.
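As a quick sanity check on the bus figures, the short sketch below works backward from the quoted 3.2 Gbytes/s to the transfer rate it implies. The 8-byte (64-bit) data-path width is an assumption that the text doesn't state, and the function name is purely illustrative.

```python
# Assumption (not stated in the article): the system bus data path is 8 bytes wide.
BUS_WIDTH_BYTES = 8
QUOTED_BANDWIDTH = 3_200_000_000  # bytes per second, the 3.2-Gbyte/s figure from the text


def transfers_per_second(bandwidth_bytes_per_s, width_bytes):
    """Transfer rate needed to sustain a bandwidth on a bus of the given width."""
    return bandwidth_bytes_per_s // width_bytes


rate = transfers_per_second(QUOTED_BANDWIDTH, BUS_WIDTH_BYTES)
print(f"{rate // 1_000_000} million transfers per second")
```

Under that assumption, the quoted bandwidth corresponds to 400 million transfers per second.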

Inside the Pentium 4 are an advanced transfer Level-2 (L2) cache, an execution trace cache, and support for Streaming SIMD Extensions 2 (SSE2). SSE2 includes 144 new instructions and 128-bit register support. Rapid execution engines handle simple instructions in half a clock tick each. The chip also features enhanced multimedia and floating-point support.

Tying this all together is a rather deep, 20-stage execution pipeline that's twice the length of the P6 pipeline in the Pentium III (Fig. 2). This hyperpipelined technology allows the Pentium 4 to execute more than one instruction per clock cycle.

The speed of the pipeline is such that wire delays must be included. Likewise, some stages must be repeated to handle the amount of traffic in the pipeline and the complexity of the job. This allows designers to fine-tune the pipeline.

MicroOPs flow through the pipeline. They're generated from the x86 instructions that enter at the beginning of the pipeline, where the instructions are decoded and the resulting microOPs are placed into the execution trace cache. The cache has room for approximately 12k microOPs.

The trace cache is unique in that it actually caches microOPs based on the instruction address. The cached information includes execution-order information, allowing predicted branches to be cached along with the code following the branch. Conditional code is executed in a speculative fashion. If a branch condition turns out differently than predicted, the unneeded results are simply discarded. With this approach, the Pentium 4 can have as many as 126 instructions in process at one time.

A long pipeline is both an advantage and a disadvantage. While the pipeline remains filled, the Pentium 4 chugs at top speed. Unfortunately, a bad guess on a branch requires flushing the pipeline and, of course, it will take longer to refill than a shorter pipeline. This is one reason that branch prediction is so important on the Pentium 4. The long pipeline is an advantage for many applications, but it may not provide enhanced performance for office or scientific applications.
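The tradeoff can be illustrated with a simple back-of-the-envelope model, sketched below. It isn't taken from Intel's documentation. Every number in it (base cycles per instruction, branch frequency, misprediction rate, and the flush penalties of 10 and 20 cycles) is hypothetical and chosen only to show the trend, and treating the flush penalty as equal to the pipeline depth is a simplifying assumption.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Hypothetical workload figures, for illustration only.
BASE_CPI = 0.5          # cycles per instruction with a perfectly fed pipeline
BRANCH_FRACTION = 0.2   # one instruction in five is a branch
MISPREDICT_RATE = 0.05  # 5% of branches are guessed wrong

short_pipe = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, flush_penalty=10)
long_pipe = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, flush_penalty=20)
# Same long pipeline, but with an assumed 30% reduction in mispredictions.
long_pipe_better_predictor = effective_cpi(
    BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE * 0.7, flush_penalty=20
)

print(f"10-cycle flush penalty:           {short_pipe:.2f} CPI")
print(f"20-cycle flush penalty:           {long_pipe:.2f} CPI")
print(f"20-cycle penalty, fewer misses:   {long_pipe_better_predictor:.2f} CPI")
```

In this model, doubling the flush penalty doubles the cycles lost to bad guesses, and a better predictor recovers part of that loss. The deeper pipeline's higher clock rate, which the model leaves out, is what has to make up the rest.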

While the Pentium 4 retains the x86 register architecture, internally it has 128 registers. The register-renaming stage of the pipeline is implemented to map the logical x86 registers to these registers. The large number of registers lets up to 48 speculative loads be in process, greatly increasing overall system parallelism. The other reason for a large number of registers is to have information for multiple microOPs be handled simultaneously by different execution units.
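A minimal sketch of register renaming follows. It's an illustration of the general technique rather than a description of the Pentium 4's actual rename logic. The class and method names are hypothetical, only the eight general-purpose registers are modeled, and the recycling of physical registers when microOPs retire is deliberately left out.

```python
from collections import deque

# The eight logical general-purpose registers of the 32-bit x86 architecture.
LOGICAL_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


class Renamer:
    """Hypothetical rename stage: maps logical registers onto a larger physical file."""

    def __init__(self, physical_count=128):
        # Initially, logical register i lives in physical register i.
        self.mapping = {name: i for i, name in enumerate(LOGICAL_REGS)}
        # Every other physical register starts out free.
        self.free = deque(range(len(LOGICAL_REGS), physical_count))

    def rename(self, dest, sources):
        """Rename one operation: read sources via the current map, then remap dest."""
        physical_sources = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical registers; the pipeline would stall")
        physical_dest = self.free.popleft()
        self.mapping[dest] = physical_dest
        return physical_dest, physical_sources


renamer = Renamer()
first = renamer.rename("eax", ["ebx"])   # eax <- f(ebx)
second = renamer.rename("eax", ["ecx"])  # eax <- g(ecx), independent of the first write
third = renamer.rename("edx", ["eax"])   # edx <- h(eax), must see the second eax
print(first, second, third)
```

Because the two writes to eax land in different physical registers, they no longer conflict and can be handled by different execution units at the same time, while the third operation still reads the most recent value.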

Another area where the Pentium 4 is a step above the Pentium III is in branch prediction. This is important because the longer the pipeline is, the more incentive exists to minimize flushing the pipeline due to a bad prediction. Here, the Pentium 4 provides a better than 30% improvement over the Pentium III.

The Pentium 4 also uses a low overhead replay or re-execute mechanism for microOPs that were incorrectly executed as a result of speculation. Dependent microOPs are replayed when a microOP must be replayed.

The Pentium 4 represents a significant jump forward in performance as well as complexity. Still, it retains the 32-bit architecture, while strides are being made in 64-bit solutions as well.

One of the most notable departures from the 32-bit x86 architecture is AMD's x86-64 64-bit architecture (Fig. 3). The x86-64 has both 32-bit and 64-bit operating modes and 64-bit registers. The 64-bit operation is simply an extension of the 32-bit CISC instructions of the conventional x86 architecture.

The 64-bit support is a requirement for high-end applications, like database servers, CAD, and EDA. These applications exceed the address space of 32-bit processors. This space is currently dominated by non-x86 64-bit architectures.

The x86-64 provides an environment similar to Intel's Itanium. The big difference is that the 64-bit instruction set of the x86-64 is comparable to its 32-bit instruction set. With Intel's Itanium, the 32-bit and 64-bit modes are distinct, each having its own dispatch and decoding support. The 32-bit side uses a CISC instruction set while the 64-bit side uses a VLIW instruction set.

The x86-64 operates in either a 64-bit "long" mode or a "legacy" mode. Long mode requires a new 64-bit operating system, but it can still run x86 protected-mode applications. Legacy mode employs a conventional 32-bit operating system, allowing the x86-64 to be used with an existing operating system without changes to the software. Legacy mode doesn't support 64-bit instructions or registers.

The long mode has a 64-bit mode plus a compatibility mode. The 64-bit mode uses a flat, 64-bit address space. It provides access to the extended 64-bit registers, such as the 64-bit instruction pointer. Additionally, 64-bit applications can address the new registers.

The compatibility mode lets 16- and 32-bit applications run unchanged. Only the operating system needs to handle 64 bits. The operating system can switch between 64-bit and compatibility modes, allowing a mixture of applications to run on the x86-64 architecture.

Access to the new streaming SIMD extension (SSE) registers and instructions is provided only by the 64-bit mode. The existing SSE support remains available in both legacy and compatibility modes.

The 64-bit instruction set uses variable-length instructions that match the x86 instruction set. An optional prefix to an instruction is the new REX prefix byte. It's used to select 64-bit operation for the instruction and to provide additional bits for register selection. REX is comparable in function and format to the MODRM byte that follows an instruction byte. The MODRM byte specifies addressing modes and registers, although it has insufficient bits to address the new registers in the x86-64 architecture. These new registers are accessed by combining bits from both bytes when both are part of the instruction stream.

This approach generates some convoluted encodings, but the complexity is easily hidden by an assembler or compiler. There's still a legacy instruction-length limit of 15 bytes that includes the REX byte.
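To make the bit combining concrete, here is a small decoding sketch. It assumes the published x86-64 encoding, in which a REX prefix has the binary form 0100WRXB and the MODRM byte is split into 2-bit mod, 3-bit reg, and 3-bit r/m fields. The function names are hypothetical, and the sketch handles only the register-to-register case (mod = 3) without scaled-index addressing.

```python
def decode_rex(byte):
    """Split a REX prefix (binary 0100WRXB) into its flag bits, or return None."""
    if byte & 0xF0 != 0x40:
        return None  # not a REX prefix
    return {
        "W": (byte >> 3) & 1,  # 1 selects 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the MODRM reg field
        "X": (byte >> 1) & 1,  # extends the index field (unused in this sketch)
        "B": byte & 1,         # extends the MODRM r/m field
    }


def decode_modrm(byte):
    """Split a MODRM byte into its mod, reg, and r/m fields."""
    return {"mod": (byte >> 6) & 3, "reg": (byte >> 3) & 7, "rm": byte & 7}


def register_numbers(rex_byte, modrm_byte):
    """Combine REX and MODRM bits into 4-bit register numbers (register operands only)."""
    rex = decode_rex(rex_byte) or {"W": 0, "R": 0, "X": 0, "B": 0}
    modrm = decode_modrm(modrm_byte)
    reg = (rex["R"] << 3) | modrm["reg"]
    rm = (rex["B"] << 3) | modrm["rm"]  # a register number only when mod == 3
    return reg, rm, bool(rex["W"])


# The byte sequence 4D 89 C8 encodes a 64-bit register-to-register move.
reg, rm, is_64_bit = register_numbers(0x4D, 0xC8)
print(reg, rm, is_64_bit)
```

For the bytes 4D 89 C8, the two 3-bit fields on their own would name registers 1 and 0. With the REX bits prepended, they name registers 9 and 8, two of the registers that the MODRM byte alone can't reach.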

A number of other instructions have been added or take on new semantics when the processor is operating in the 64-bit mode. For example, the new relative branches deal with the entire 64-bit instruction pointer register, eliminating the requirement for a REX prefix. Other examples are the SYSCALL and SYSRET instructions, which take the 64-bit mode into account by default.

There are several advantages with AMD's approach to extending the instruction set. First, changes to compilers are significantly simpler than changing architectures as you would with Intel's Itanium. Second, development tools like debuggers also require minimal changes.

Plus, the x86-64 starts up running 32-bit applications. This allows the x86-64 to boot an existing 32-bit operating system. Though the 64-bit support isn't available, it will allow evaluation of the hardware in 32-bit mode until a 64-bit operating system is available.

Other architectural implementation details aren't readily available. The first incarnation of the x86-64 architecture isn't expected until the end of 2001.

Intel's Itanium processor has a fresh start when it comes to 64-bit processing (Fig. 4). Its VLIW architecture is brand new and doesn't have a limitation based on the x86 architecture with respect to the VLIW portion of the processor.

On the other hand, the Itanium retains x86 compatibility by incorporating x86 support on the processor. The 32- and 64-bit portions of the processor are linked by two common subsystems. The first is the instruction cache. Using a common cache allows 32-bit applications to utilize a large cache.

The second common subsystem is made up of the execution units. Having these units do double duty allows 32-bit applications to take advantage of 64-bit execution units.

Implementing common execution units means that only one type of application can run at a time with a 64-bit operating system handling the switch. Branch resolution is still handled by 32-bit-specific support. Yet this isn't surprising, as decoding and dispatching for the 32-bit code is distinct from the 64-bit support.

In this sense, the Itanium isn't very different from the AMD x86-64. Both require new compilers, operating systems, and applications in order to take advantage of 64-bit support. Both run 32-bit x86 code, but the emphasis for performance will be on the newer 64-bit code. Running 32-bit applications quickly will be important, but not as high a priority as running 64-bit applications.

The big question is whether the VLIW Itanium architecture will have an edge over the CISC x86-64 architecture. VLIW architectures promise to provide better performance, although even Intel is still pushing the Pentium 4 to new heights. AMD is simply doing the same thing with 64 bits.

Transmeta's Crusoe is an interesting combination of VLIW hardware and software. It's really a 32-bit VLIW processor that supports only x86 applications and operating systems. From a user's standpoint, the Crusoe is effectively an x86-style processor. An application programmer sees the same thing. Only the Transmeta programmer sees something different.

The Crusoe doesn't execute or interpret x86 code. Instead, it compiles the x86 code into Crusoe VLIW code. Transmeta refers to this process as code morphing.

The Crusoe starts out by running VLIW code normally copied into RAM. This code consists of two main components. One is the VLIW operating system designed to handle x86 applications, including x86 operating systems. The other component is the code-morphing compiler.

Memory is divided into four areas (Fig. 5). The two components just mentioned fill two of them. A third is the code-morphing cache. The fourth and largest area is the x86 memory. This is what the logical x86 processor has access to. The limits are placed on the logical processor by the code-morphing compiler and the VLIW operating system because they effectively implement an x86 virtual-memory system.

The virtual execution of x86 code begins by loading a copy into memory or running x86 code from ROM, typically an x86 boot ROM. At that point, the code-morphing compiler takes over. It checks out the block of code to be executed and compiles it to the VLIW code that's placed in the VLIW cache. This block of VLIW code is then executed. Any exit points from the code effectively restart the code-morphing process, unless the matching x86 code has already been morphed into a block in the VLIW code-morphing cache.

This process repeats indefinitely. Eventually the cache will fill up and compiled blocks will be freed for new code. In this sense, the cache works very much like the microOP cache in the Pentium 4. The big difference is that Crusoe keeps a very large cache in main memory while the Pentium 4 has a fixed-size cache on-chip. Also, the Pentium 4 has a fixed x86-to-microOP decoder while the Crusoe's compiler is in main memory. The Crusoe gains flexibility with its approach while the Pentium 4 has hardware performance. The Crusoe can keep up and possibly gain an advantage if the active portion of an application can fit into the cache.
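The translate-once, reuse-often behavior of the code-morphing cache can be sketched as follows. This is only a conceptual model and not Transmeta's implementation. The class name, the `translate` callback standing in for the code-morphing compiler, and the least-recently-used eviction policy are all assumptions, since the description above says only that compiled blocks are freed when the cache fills.

```python
from collections import OrderedDict


class TranslationCache:
    """Hypothetical model of a cache of translated blocks, keyed by x86 address."""

    def __init__(self, capacity, translate):
        self.capacity = capacity      # maximum number of cached blocks
        self.translate = translate    # stand-in for the code-morphing compiler
        self.blocks = OrderedDict()   # x86 entry address -> translated block
        self.translations = 0         # how many times the compiler had to run

    def lookup(self, x86_address):
        """Return the translated block for an x86 entry point, translating on a miss."""
        if x86_address in self.blocks:
            self.blocks.move_to_end(x86_address)  # mark as recently used
            return self.blocks[x86_address]
        block = self.translate(x86_address)
        self.translations += 1
        self.blocks[x86_address] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)       # assumed policy: evict the oldest block
        return block


cache = TranslationCache(capacity=2, translate=lambda addr: f"vliw-block@{addr:#x}")
for address in (0x1000, 0x1000, 0x2000, 0x3000, 0x1000):
    cache.lookup(address)
print(cache.translations)
```

In this run, the second lookup of 0x1000 is a hit, but by the final lookup the block has been evicted and must be translated again. That's the situation in which the active portion of an application no longer fits into the cache.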

Transmeta's approach has some distinct advantages. In particular, the entire process is handled by software. If a bug is found in the VLIW code, it can be fixed, usually by a flash-memory upgrade. Likewise, an improvement in the compiler or VLIW operating system can provide a matching improvement in overall system performance.

One interesting aspect of the code-morphing compiler's construction is that blocks of VLIW code are tracked for usage, with frequently executed blocks being recompiled with higher levels of optimization. The tradeoff is an additional compilation and a longer overall compile time, increasing the amount of overhead. This overhead has to be weighed against the performance improvement gained. Whether or not the additional overhead is worth the improvement in speed depends upon the application and the compiler.
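A usage-tracking scheme of this kind might look like the sketch below. The threshold value, the two optimization levels, and all of the names are hypothetical. The sketch shows only the bookkeeping, not the compilation itself.

```python
class TieredTranslator:
    """Hypothetical usage tracker that promotes hot blocks to a higher optimization level."""

    def __init__(self, hot_threshold=50):
        self.hot_threshold = hot_threshold  # executions before a block counts as hot
        self.exec_counts = {}               # x86 entry address -> times executed
        self.opt_level = {}                 # x86 entry address -> 0 (quick) or 1 (optimized)

    def execute(self, x86_address):
        """Record one execution of a block and return the level it runs at."""
        count = self.exec_counts.get(x86_address, 0) + 1
        self.exec_counts[x86_address] = count
        if x86_address not in self.opt_level:
            self.opt_level[x86_address] = 0  # first sight: quick, lightly optimized translation
        elif count >= self.hot_threshold and self.opt_level[x86_address] == 0:
            self.opt_level[x86_address] = 1  # hot block: pay for a second, slower compile
        return self.opt_level[x86_address]


translator = TieredTranslator(hot_threshold=3)
levels = [translator.execute(0x1000) for _ in range(4)]
print(levels)
```

The threshold is where the tradeoff lives. Set it too low and rarely used code pays for an expensive recompile. Set it too high and hot loops run for a long time in their less optimized form.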

Through Transmeta's approach, the underlying VLIW processor architecture doesn't have to be fixed for a product line. The two chips that Transmeta developed have different VLIW architecture characteristics. Both are able to present the same x86 architecture by having VLIW software that hides the underlying physical architecture.

Separating the underlying hardware from the logical x86 architecture provides Transmeta's designers with the freedom to design new processors that incorporate different hardware. For example, a higher-performance processor may be built with more execution units. The VLIW software just has to take the additional execution units into account when morphing the x86 code in order to take advantage of these units. Conventional VLIW architectures, like the Intel Itanium, are limited by the portions of the architecture that they expose.

Another area where Crusoe could excel would be in a 64-bit environment. In theory, it could use the AMD x86-64 code-extension approach. Although this is speculation, Transmeta has pushed the technology in one area using this type of approach. The area is power consumption.

Long Run, Low Power

Transmeta calls this feature Long Run. Effectively, the Crusoe's code-morphing compiler takes power-consumption considerations into account. This is combined with the Crusoe's ability to throttle down the speed and voltage levels to reduce performance and power consumption.

Essentially, the compiler takes a look at the kind of x86 code that it's compiling and later tracking, and adjusts the power settings via the VLIW code that has been generated.
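The reason for lowering voltage along with clock speed can be seen in the usual first-order approximation for CMOS dynamic power, which is proportional to capacitance times voltage squared times frequency. That relation is a general rule of thumb and doesn't come from Transmeta's documentation. The scale factors in this sketch are hypothetical.

```python
def relative_dynamic_power(voltage_scale, frequency_scale):
    """Dynamic power relative to full speed, using the approximation P ~ C * V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


# Hypothetical operating points, expressed as fractions of full voltage and frequency.
frequency_only = relative_dynamic_power(voltage_scale=1.0, frequency_scale=0.5)
frequency_and_voltage = relative_dynamic_power(voltage_scale=0.8, frequency_scale=0.5)

print(f"half speed, full voltage: {frequency_only:.0%} of full power")
print(f"half speed, 80% voltage:  {frequency_and_voltage:.0%} of full power")
```

Because the voltage term is squared, dropping the voltage along with the clock saves noticeably more power than slowing the clock alone.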

This approach is transparent to the x86 applications and operating systems. It doesn't prevent more conventional power-down support where either an application or the operating system actively reduces power settings. For example, a laptop user may set up the system to run at half speed when on batteries and full speed when connected to a charger. Sleep and hibernation modes fall into a similar category.

Transmeta has touted some impressive numbers for power savings, but these should be considered in the context of the entire system. The numbers look especially impressive because they apply to the processor alone, which typically uses only a fraction, though possibly a large one, of the total power consumed by a system. Many low-power alternatives, especially SoC-based solutions, provide similar or possibly better power savings depending upon the overall environment.
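The system-level view is simple arithmetic, sketched here with hypothetical figures. The sketch assumes that the power used by the rest of the system stays the same.

```python
def system_power_saving(processor_share, processor_saving):
    """Fraction of total system power saved when only the processor's power is reduced."""
    return processor_share * processor_saving


# Hypothetical figures: the processor draws 25% of system power and cuts its own use by 60%.
saving = system_power_saving(processor_share=0.25, processor_saving=0.60)
print(f"system-level saving: {saving:.0%}")
```

With these numbers, a large saving at the processor amounts to a much smaller saving for the system as a whole. The bigger the processor's share of the total, the more its savings matter.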

AMD's Athlon has performed well against Intel's Pentium III. Faster versions for Slot A and for chip sockets are being delivered, but the basic architecture remains the same. Still, it's impressive even compared to the Pentium 4.

The Athlon incorporates a nine-issue superpipelined, superscalar x86 processor microarchitecture. There are three out-of-order, superscalar, pipelined integer units and three out-of-order, superscalar, pipelined address-calculation units. The control unit can have up to 72 active instructions.

The Athlon has multiple, parallel x86 instruction decoders and three out-of-order, superscalar, fully pipelined floating-point execution units. The floating-point units additionally handle the MMX and 3DNow! instructions. The latter has 19 new instructions designed for speech and video encoding. Plus, five new DSP instructions improve soft-modem, soft-ADSL, MP3, and Dolby Digital Surround Sound applications.

The architecture has its own brand of advanced dynamic branch prediction. With support for 8-bit error checking and correction (ECC), the Athlon system bus runs at 200 MHz. Like the Pentium line, the Athlon has SMP support.

The Intel Celeron, the AMD Duron, and the VIA Technologies Cyrix III represent mid-range x86 architectures. Designed to be inexpensive and cranked out in massive quantities, these processors tend to show up in desktop and laptop systems that are priced under $1000. Although they typically have a smaller cache than the high-end x86 processors, it's well known that even a small cache makes a big difference in performance.

These processors are found in embedded systems too, although cost- or space-conscious designs often revert to the SoC x86 solutions, which will be discussed later.

Intel's Celeron uses a 66-MHz system bus and the P6 microarchitecture with Dynamic Execution technology, which is the basis for the Pentium III as well. The Celeron has a 32-kbyte L1 cache and a 128-kbyte L2 cache. The Celeron uses speculative execution, superscalar execution units, and pipelining. But, as with most mid-range designs, it's less aggressive than the high-end Pentium 4.

Test and monitoring support is important. The Celeron's built-in self test (BIST) provides single stuck-at fault coverage of the microcode and large logic arrays. Plus, it tests the cache, ROMs, and translation lookaside buffers (TLBs) on-chip. JTAG support allows testing of the Celeron in conjunction with other JTAG-enabled peripherals within a system design.

AMD's Duron is comparable to the Celeron, with some exceptions. The Duron has a large 128-kbyte L1 cache and a smaller 64-kbyte L2 cache.

The Duron has a nine-level superpipeline architecture with multiple instruction decode units. It has three independent integer pipelines and three independent address-calculation units. The three-way floating-point engine supports out-of-order execution.

MMX and 3DNow! instructions are supported by the Duron. The chip's 200-MHz system bus provides a performance edge.

Cyrix found its way into the VIA Technologies fold. VIA has continued to sell and develop the Cyrix line of x86 processors, including the Cyrix III. Long gone is the PR performance rating system. Now the Cyrix III competes with Intel and AMD products in terms of bus and processor clock speed.

The Cyrix III has a Socket 370 interface. It's designed to compete with the Duron and Celeron, especially when it comes to low-power applications. The chip has a 128-kbyte L1 cache and uses a 100/133-MHz front-side system bus. It supports MMX and 3DNow! instructions too. The Cyrix III targets low-end PC system designs as well as high-end embedded designs. But, it has lots of competition from new integrated solutions.

The idea of a PC-on-a-chip (PCoC) isn't new. Yet, the level of integration available using the latest technology makes it very appealing and cost effective. This is especially the case for embedded applications, like Internet appliances. The latest set of products uses Pentium-class cores.

National Semiconductor's GX1 is one of the x86 products targeted at this space (Fig. 6). Although the GX1 requires additional chips to reach the PC-compatibility level, the mix-and-match approach allows design flexibility.

A 2D accelerated video adapter is incorporated with an x86 core in the GX1. The memory and PCI bus interface is included on-chip, making it an SoC. A Southbridge chip is necessary if PC compatibility is required because the GX1 lacks PC peripheral support, such as USB and keyboard interfaces.

National Semiconductor uses the GX1 core in the SC1200, SC2200, and SC3200 chips, which are PCoCs. (See "System-On-A-Chip Line Features Pentium Core And 2D Video Acceleration," p. 72.) These include peripheral support like a USB hub or AC'97 audio support. Each of the chips wraps further video support around the GX1 2D accelerated graphics support.

A Version For Set-Top Boxes

For instance, the SC1200 is specifically designed for the TV set-top box market. Its video support includes a video overlay processor, and the chip has a video input port. This configuration allows an application to present information on the television screen with minimal, mostly passive, external components.

The SC2200 is the design closest to a conventional PC. Its video support provides outputs to either an LCD or a VGA monitor. This design is targeted at low-cost, low-power laptop applications and thin-client desktop applications. The SC3200 also supports an LCD, but it's designed for more compact applications, such as those in a handheld device, where a VGA connection is unnecessary.

Another PCoC solution is ZF Linux's MachZ chip. Though it's based on a 486 core, it's wrapped with PCI Northbridge support. This provides external compatibility while retaining low-power performance.

The MachZ includes a full complement of PC peripherals. It doesn't incorporate a video controller, however, so it's at least a two-chip solution. But that gives designers more flexibility.

As with the National Semiconductor solutions, the MachZ PCoC has USB, IDE, PCI, and PC peripheral support. The same is true for interrupt and DMA support. MachZ, though, has some unusual and useful architectural system enhancements.

The first is the FailSafe boot support. Essentially, this is a preboot sequence that checks the system configuration. It utilizes the processor's cache as RAM, so the system will boot even if the external RAM isn't installed or is defective. The boot process checks various systems and, based on configuration settings, can invoke a download from any one of many sources, such as a serial port. This is especially useful when a flash-memory update becomes interrupted. A conventional system would normally be dead in the water because the flash-based application couldn't run. With FailSafe, the problem is detected and the download is restarted. Dual watchdog timers also help detect problems that FailSafe can then attempt to correct.

The second item is the Z-TAG interface. It provides a high-speed link for downloading information. Typically, it's used for system initialization and is supported by the FailSafe boot feature.

Rise Technology's SCX501 is similar to National Semiconductor's SC1200. It incorporates the same kind of peripheral support, including the memory controller and 2D video adapter with television video overlay support. Many of the PC-compatible peripherals are dispensed with to simplify the chip. The SCX501 has both ISA and PCI bus support.

STMicroelectronics' STPC line is very much like the SCX501. That's not surprising because Rise and STMicroelectronics have signed a codevelopment agreement. STMicroelectronics provides customization of the product line for applications like Internet appliances. Both companies use the Rise core designed for very low power and high performance.

Enhancements Versus Complexity

Though they aren't changes to the x86 architecture itself, the x86 SoCs and PCoCs represent one avenue of x86 system enhancement that's tightly coupled with the architecture. Often these enhancements are more important to embedded-system designers than how complex the execution pipeline is or whether or not the underlying system is a VLIW processor.

Additional features will continue to show up in these integrated solutions as chip geometries get smaller and as the demand for more powerful solutions grows. Ethernet and Bluetooth support are just two possible additions to this type of architecture.

The number of new enhancements and the interest in the market make it clear that the x86 architecture still has a lot of growth potential. RISC and now VLIW architectures will continue to compete with the x86 solutions. But the amount of software support, the large number of experienced x86 programmers and designers, and the quality of x86 processor design will continue to give the x86 architecture an edge.

Companies Mentioned In This Report
Advanced Micro Devices Inc. (800) 538-8450 www.amd.com
Intel Corp. (408) 765-8080 www.intel.com
National Semiconductor Corp. (408) 721-5000 www.nsc.com
Rise Technology Co. (408) 330-8800 www.rise.com
STMicroelectronics Inc. (408) 452-8585 www.st.com
Transmeta Corp. (408) 919-6800 www.transmeta.com
VIA Technologies Inc. www.viatech.com
ZF Linux Devices (650) 965-3800 www.zflinux.com