Electronic Design

Revamped Microarchitectures Let CPUs Deliver Top-Notch Performance

A quartet of CPU unveilings, showing off new microarchitectures and significant performance increases, was put before the audience at the recent Intel Developer Forum in San Jose, Calif. The Santa Clara company presented its forthcoming Pentium 4 CPU, Timna high-integration Pentium III CPU, modular StrongARM CPU core, and Itanium 64-bit processor.

The Pentium 4 CPU, which drew the most attention at the event, was designed to power next-generation servers and workstations. Code-named Willamette, the CPU has a microarchitecture that lets it operate at 1.4 GHz. But that's just the beginning. During various demonstrations, its clock frequency was cranked up as high as 2 GHz. With this kind of speed, the fastest yet demonstrated for a Pentium CPU, designers can create top-performing servers and workstations.

To achieve this performance, Intel's designers developed the NetBurst architecture. They implemented a hyperpipelined technology and a rapid-execution engine that works in conjunction with a 400-MHz system bus as part of the new microarchitecture (Fig. 1). The bus permits data to transfer into or out of the processor at up to 3.2 Gbytes/s, which is about three times the bus bandwidth of the Pentium III CPU. A first-level data cache and a full-speed unified 8-way second-level cache were both integrated on the chip.
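
As a rough check on the bandwidth figure, peak bus throughput is the transfer rate multiplied by the bus width. The sketch below assumes a 64-bit data path on both buses and a 133-MHz Pentium III bus. Neither value is stated here, and the function name is purely illustrative.

```python
def peak_bandwidth_bytes(transfers_per_second: float, bus_width_bits: int) -> float:
    """Peak bytes per second for a bus that moves bus_width_bits per transfer."""
    return transfers_per_second * bus_width_bits / 8


# Assumption: 64-bit data bus (not stated in the article).
pentium4_bw = peak_bandwidth_bytes(400e6, 64)

# Assumption: 133-MHz, 64-bit bus for the Pentium III comparison.
pentium3_bw = peak_bandwidth_bytes(133e6, 64)

ratio = pentium4_bw / pentium3_bw
print(f"Pentium 4 bus: {pentium4_bw / 1e9:.1f} Gbytes/s, ratio {ratio:.1f}x")
```

Under those assumptions, the result matches the 3.2-Gbytes/s figure and the roughly threefold advantage quoted above.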

Next, Intel's designers integrated an execution-trace cache that stores micro-operations and aids in system debugging. It helps the debug process by letting programmers better analyze the instruction flow and gain better visibility into the CPU's internal operations. The first-level instruction cache is also fed by the unified on-chip 8-way second-level cache. Several versions of the CPU are planned. On-chip second-level cache sizes will range from 128 kbytes up to 1 Mbyte. Support for external L3 caches will cover sizes from 512 kbytes to 4 Mbytes.

Other CPU enhancements include an improved dynamic execution model, an advanced transfer cache, an upgraded floating-point unit, and better multimedia support. The processor's instruction set includes a second generation of streaming single-instruction/multiple-data (SIMD) instructions to more efficiently perform many key multimedia algorithms. In all, designers added 144 instructions, including SIMD double-precision floating-point operations, SIMD 128-bit integer instructions, and others.

The NetBurst microarchitecture includes an advanced cache subsystem and better branch prediction. Data speculation and replay capabilities were added to speed instruction flow. The processor, then, can have over 100 instructions in flight at any time. To improve the pipeline throughput, the processor includes 48 load and 24 store buffers.

Powering a high-speed CPU like the Pentium 4 will significantly challenge system designers. A new standard for the voltage regulator modules, version 9.0, defines a unit that accepts a 12-V dc input and delivers a 1.7-V output with a current of 39 A when the CPU clocks at 1.4 GHz. To achieve such levels, the regulators use a multiphase design and an adaptive voltage-positioning scheme. Together, these approaches reduce the number of bypass capacitors and trim the power consumption at full load.
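
To put the regulator numbers in perspective, the output power is simply the output voltage times the current. The sketch below also estimates the 12-V input current and the per-phase load. The 85% efficiency and the three-phase split are illustrative assumptions, not values from the version 9.0 standard.

```python
def vrm_budget(v_out: float, i_out: float, v_in: float,
               efficiency: float, phases: int) -> dict:
    """Estimate the power and current figures for a multiphase regulator."""
    p_out = v_out * i_out          # power delivered to the CPU
    p_in = p_out / efficiency      # power drawn from the input rail
    return {
        "output_w": p_out,
        "input_a": p_in / v_in,    # current drawn from the 12-V rail
        "loss_w": p_in - p_out,    # heat dissipated in the regulator
        "phase_a": i_out / phases, # output current carried by each phase
    }


# 1.7 V at 39 A from a 12-V input, as quoted in the article.
# Assumptions: 85% efficiency and three phases (hypothetical values).
budget = vrm_budget(1.7, 39.0, 12.0, efficiency=0.85, phases=3)
for name, value in budget.items():
    print(f"{name}: {value:.1f}")
```

The CPU load comes to about 66 W. Splitting the 39 A across phases is what keeps the current in each inductor and switch manageable.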

At the family's high end, the 64-bit Itanium was thoroughly discussed in a number of presentations. Many details of its EPIC architecture have been disclosed already. But designers attending the forum got to see for themselves a number of prototype systems in action. They also learned much more about the instruction set operation and programming techniques.

Moving from servers and workstations to value-priced desktop computers, Intel revealed much about its high-integration Timna processor. This Pentium III CPU uses technology from the current Katmai processor and the logic used in the GFX Northbridge chip (Fig. 2). Clocking at 800 MHz, its core talks to the outside world through a 200-MHz interface bus. An on-chip 400-MHz Rambus memory controller supplies the memory interface, providing a low-pin-count connection to RDRAMs. Using an off-chip memory protocol translator, the Rambus port can tie into 100-MHz SDRAMs.

Moreover, the processor includes a high-performance graphics controller and a 230-MHz RAMDAC. The RAMDAC's output can drive an RGB monitor with resolutions ranging from 320 by 200 up to 1600 by 1200 pixels at refresh rates of 85 Hz and pixel depths of up to 16 bits. On-chip graphics caches offer local caching and a texture cache to support 3D graphics applications.
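
A quick calculation shows why the top display mode calls for a RAMDAC of this speed. The pixel clock must cover the full raster, including blanking intervals, not just the visible pixels. The sketch below assumes raster totals of 2160 by 1250 for the 1600-by-1200 mode, a value commonly used in VESA monitor timings. That assumption and the function name are not from the article.

```python
def pixel_clock_hz(h_total: int, v_total: int, refresh_hz: int) -> int:
    """Pixel clock needed to scan h_total x v_total pixels refresh_hz times per second."""
    return h_total * v_total * refresh_hz


RAMDAC_HZ = 230_000_000  # RAMDAC speed quoted in the article

# Visible pixels only: a lower bound on the required pixel rate.
active_rate = pixel_clock_hz(1600, 1200, 85)

# Assumption: 2160 x 1250 total raster, including blanking intervals.
clock_with_blanking = pixel_clock_hz(2160, 1250, 85)

fits = clock_with_blanking <= RAMDAC_HZ
print(active_rate, clock_with_blanking, fits)
```

With that blanking assumption, the required clock lands just under the 230-MHz rating.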

Designers also have available a 12-bit digital-video output port on the CPU to drive a support chip that controls a digital flat panel (up to 1280 by 1024 pixels) or drives a National Television System Committee (NTSC) or PAL device (TV or VCR, for example). Targeting shared-memory frame-buffer graphics system architectures, the Timna cuts system costs by eliminating the need for a separate graphics frame-buffer memory.

For I/O support, the Timna chip employs a hub-type interface rather than the traditional PCI interface as the primary link between the CPU and the I/O subsystem. The hub interface consists of an 8-bit data bus that's clocked at 66 MHz but uses an Intel-proprietary quad-pumping scheme to achieve a data transfer speed of 266 Mbytes/s. The companion I/O support chip offers four USB ports, twice as many as most previous motherboard chips. It also provides an integrated LAN controller in addition to the more standard hard-disk, AC'97, SMBus, and PCI-33 interfaces.
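
The hub-interface transfer rate follows directly from the bus width, the clock, and the four transfers per clock. The sketch below works through that arithmetic. Treating the nominal 66-MHz clock as 66.67 MHz (200/3 MHz) is an assumption that accounts for the quoted 266 Mbytes/s. The function name is illustrative.

```python
def pumped_bus_rate(clock_hz: float, width_bits: int, transfers_per_clock: int) -> float:
    """Bytes per second for a bus that makes several transfers per clock cycle."""
    return clock_hz * width_bits / 8 * transfers_per_clock


# 8-bit bus, quad pumped, with the clock taken as exactly 66 MHz.
rate_66 = pumped_bus_rate(66e6, 8, 4)

# Assumption: the nominal 66-MHz clock is really 66.67 MHz.
rate_nominal = pumped_bus_rate(200e6 / 3, 8, 4)

print(f"{rate_66 / 1e6:.0f} Mbytes/s at 66 MHz")
print(f"{rate_nominal / 1e6:.1f} Mbytes/s at 66.67 MHz")
```

An exact 66-MHz clock gives 264 Mbytes/s, while the 66.67-MHz clock gives about 266.7 Mbytes/s, in line with the quoted figure.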

As programs get more complex, the challenge of combining the application and graphics tasks becomes greater. Because of the close coupling of the CPU and graphics engine, the processor's designers had to provide better visibility into the program flow. To do that, the Timna includes more software debug hooks, starting with an on-chip logic analyzer with a three-level bus-triggering capability. Designers can use it to monitor the software flow and trigger on various internal bus events that normally wouldn't be visible from the outside. Also, a branch-trace message lets users save branches in a buffer that can be read offline through the JTAG test port. Programmers, then, can track the CPU code flow better.

Although the Timna processor will consume considerably less power than the Pentium 4, its 8- to 15-W power budget is far too high for the next generation of portable Internet appliances and many embedded and networking applications. Another Intel design team responded by revamping the 32-bit StrongARM architecture to craft the XScale microarchitecture. It includes a 7- to 8-stage superpipelined RISC processor that operates at up to 1 GHz while consuming from tens of milliwatts on standby to just a watt or two at full speed.

Fully compliant with the ARM 5TE instruction set, excluding the floating-point instructions, the XScale architecture supports the ARM 16-bit Thumb instructions as well as the full 32-bit instructions and integrated DSP instructions to better handle multimedia and data communications algorithms. Designers estimate that, when clocked at 1 GHz, the XScale architecture will deliver a raw throughput of about 1200 MIPS. At 400 MHz and powered by a 1-V supply, the processor should deliver about 500 MIPS while consuming less than 1 W.
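
The two operating points can be compared by normalizing throughput to clock frequency. The sketch below does that with the estimates quoted above. The dictionary layout and variable names are illustrative, and the efficiency figure is only a lower bound because the article gives the power as "less than 1 W."

```python
# Estimated operating points quoted in the article.
operating_points = {
    "1 GHz": {"mhz": 1000, "mips": 1200},
    "400 MHz": {"mhz": 400, "mips": 500},
}

# Throughput per megahertz at each operating point.
mips_per_mhz = {
    name: point["mips"] / point["mhz"]
    for name, point in operating_points.items()
}

# Power is quoted only as "less than 1 W" at 400 MHz,
# so 500 MIPS/W is a lower bound rather than an exact value.
min_mips_per_watt_400 = operating_points["400 MHz"]["mips"] / 1.0

for name, value in mips_per_mhz.items():
    print(f"{name}: {value:.2f} MIPS/MHz")
print(f"400 MHz: at least {min_mips_per_watt_400:.0f} MIPS/W")
```

Both points work out to roughly 1.2 MIPS per megahertz, so the estimates scale almost linearly with clock speed.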

The XScale architecture combines the CPU core with instruction and data memory management units and instruction and data caches as well as a smaller "mini" data cache. The instruction and data caches are 32 kbytes each, while the mini data cache is 2 kbytes (Fig. 3). The memory support also includes write, fill, pend, and branch-target buffers. Surrounding the core is the support logic that implements the power management, performance monitoring, software and hardware debug, and I/O functions. Additional peripherals can be co-integrated with the XScale core to create a system-on-a-chip solution for handheld devices, low-cost countertop appliances, or even a high-performance network processor.

To support media processing, this next-generation core includes a multiply-accumulate coprocessor that performs two simultaneous 16-bit multiplications with 40-bit accumulations. Additionally, a 128-entry branch-target buffer keeps the pipeline filled with the most statistically correct branch choices. The mini data cache helps minimize data cache "thrashing" when data streams frequently change.
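
The dual multiply-accumulate operation can be modeled in a few lines. The sketch below assumes that two signed 16-bit values are packed into each 32-bit operand and that both products are summed into a single accumulator that wraps at 40 bits. The operand packing, the wraparound behavior, and all function names are assumptions made for illustration, not details of the actual coprocessor.

```python
ACC_BITS = 40  # accumulator width quoted in the article


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack(hi: int, lo: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word (hypothetical layout)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def dual_mac(acc: int, a_packed: int, b_packed: int) -> int:
    """Multiply both 16-bit halves pairwise and add the products to the accumulator."""
    a_lo, a_hi = to_signed(a_packed, 16), to_signed(a_packed >> 16, 16)
    b_lo, b_hi = to_signed(b_packed, 16), to_signed(b_packed >> 16, 16)
    # Assumption: the result wraps at 40 bits rather than saturating.
    return to_signed(acc + a_lo * b_lo + a_hi * b_hi, ACC_BITS)


samples = pack(3, -2)
coeffs = pack(4, 5)
acc = dual_mac(0, samples, coeffs)  # 3*4 + (-2)*5
print(acc)
```

The eight guard bits above a 32-bit result let a long run of products accumulate before overflow becomes a concern, which suits filter and transform loops.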

A performance-monitoring unit and a debug breakpoint block were added for hardware debugging. The monitoring unit provides two 32-bit event counters and a 32-bit cycle counter to help analyze hit rates and other operations. The debug block provides hardware breakpoints and a 256-entry trace history buffer to assist program flow analysis.
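
Software that reads these counters has to cope with their 32-bit width. The sketch below shows a wraparound-safe difference between two counter readings, a hit-rate calculation, and the time a cycle counter takes to wrap. It assumes the two event counters are set up to count cache accesses and misses. That setup and the function names are hypothetical.

```python
COUNTER_BITS = 32  # counter width quoted in the article
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_delta(start: int, end: int) -> int:
    """Events between two readings, allowing for a single counter wraparound."""
    return (end - start) & COUNTER_MASK


def hit_rate(accesses: int, misses: int) -> float:
    """Fraction of accesses that hit, or 0.0 if nothing was counted."""
    return 1.0 - misses / accesses if accesses else 0.0


def seconds_to_wrap(clock_hz: float) -> float:
    """Time for a free-running cycle counter to overflow at the given clock."""
    return (1 << COUNTER_BITS) / clock_hz


# The counter wrapped between these two readings.
events = counter_delta(0xFFFFFFF0, 0x00000010)
rate = hit_rate(accesses=1000, misses=50)
wrap_time = seconds_to_wrap(1e9)
print(events, rate, wrap_time)
```

At 1 GHz, a 32-bit cycle counter wraps in a little over four seconds, so long measurements need to sample it periodically.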

For more information about the processors, go to www.intel.com.
