Electronic Design

Designers Of Next-Generation Game Systems Aim For Component Balance To Eliminate Processing Bottlenecks

Multithreaded CPU anticipates game engine utilization.

Last month, at the German Games Convention in Leipzig, Microsoft Corp. stated that it expects its next-generation game/entertainment console, the Xbox 360, to reach North America, Europe, and Japan in time for the holiday season. Pricing will begin at $299.99. At this event, Microsoft revealed the first details of the new product's design. Immediately following the Games Convention, Microsoft provided even more details at the Hot Chips 17 conference at Stanford University, Calif.

At Hot Chips, Microsoft unveiled architectural details of the key chips that have been developed for its Xbox 360. More recently, Infineon released details of three additional chips it has developed with Microsoft for the Xbox 360. Also at Hot Chips, Toshiba researchers described a critical audio/video I/O chip expected to surface in Sony's Playstation 3 when it debuts sometime next spring. The SCC, a "super companion chip" with A/V interface, works with the so-called Cell Processor. Unveiled at ISSCC earlier this year, the Cell Processor is a highly parallel compute engine, developed jointly by IBM Corp., Sony, and Toshiba.

Keys to the success of both systems are breadth of functionality, quality of video and audio, and ease of programmability. Both the Xbox 360 and PS3 promise to deliver the most realistic and highest-performance game graphics to date, rivaling the best graphics workstations but at consumer costs.

"While the (Xbox 360) has the muscle to power awe-inspiring graphics, audio, and online play thanks to its 720p/1080i (HDTV) output, a 16x9 cinematic aspect ratio, anti-aliasing for smooth textures, and full surround sound, it also has the intelligence to serve as an all-in-one entertainment device that plays CDs, DVDs, MP3s, and digital content from an array of devices, such as portable music players and digital cameras," said Robbie Bach, Microsoft's chief Xbox officer.

Xbox product group VP, Todd Holmdahl, added that Xbox 360 designers focused on achieving a balance among system components, rather than aiming for great theoretical performance in any one component. The latter approach could mean lots of performance left on the table as a result of system bottlenecks.

"Right from the start, we built the system as the hardware instantiation of our software vision: that systems should be extremely powerful when running the type of software that game developers build," said Holmdahl. "This meant that we had to look very carefully at where bottlenecks happen when running game systems, and project what this would look like in the next generation. We analyzed hundreds of Xbox 1 games in detail, and we have great insight into the future due to the innovations that happen on our Windows platform."

The Xbox 360 CPU chip consists of three symmetric PowerPC cores running at 3.2 GHz. Each core includes data and instruction caches of 32 kbytes each, and the chip has a shared 1-Mbyte L2 cache that is eight-way set associative. Additionally, each core includes a customized vector floating-point math unit (the VMX128), and can run two program threads (Fig. 1). Holmdahl noted that the key here is the six hardware threads available. "This matches where we predict game engine utilization will peak," said Holmdahl. The symmetric nature of the cores and the shared cache make it easier to move code between the cores, as well as implement communication and synchronization. "Multi-threaded simulation code is hard to write," he said, "and using elaborate DMA controller and DSP-style programming, as in previous generations, complicates developers' lives and is wasteful of their time."

The CPUs can provide high-bandwidth data streaming support with minimal cache thrashing by using a 128-byte cache line size. A tight data streaming capability between the CPU and the graphics processor unit (GPU) is also available. The GPU can read 128 bytes at a time from the L2 cache and it provides low-latency cacheable writebacks to the CPU. The GPU also shares Direct3D (D3D) compressed data formats with the CPU to at least double bus bandwidth for graphics data.
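
To make the cache figures concrete, the following minimal Python sketch works out the geometry of a 1-Mbyte, eight-way set-associative cache with 128-byte lines. It assumes the 128-byte line size applies to the shared L2 and that 1 Mbyte means 2**20 bytes; the function names are illustrative only and do not come from Microsoft.

```python
# Sketch: geometry of a set-associative cache from the article's figures.
# Assumptions: the 128-byte line size applies to the shared L2, and
# 1 Mbyte = 2**20 bytes. Function names are hypothetical.

def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits)."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 1 Mbyte, eight ways, 128-byte lines
L2_SETS, L2_OFFSET_BITS, L2_INDEX_BITS = cache_geometry(1 << 20, 8, 128)
```

Under these assumptions the cache has 1024 sets, so an address uses 7 offset bits and 10 index bits. A large line size means each miss brings in a long contiguous run of data, which suits streaming access patterns.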

A 5.4-Gbit/s front-side bus offers 10.8-Gbyte/s read and 10.8-Gbyte/s write speeds. Supporting the custom CPU is a 500-MHz GPU developed in conjunction with ATI Technologies (Fig. 2). The GPU includes 48 parallel unified shaders and 10 Mbytes of embedded DRAM (a custom memory chip co-located in the same package as the GPU) for 256-Gbyte/s frame buffering. Rounding out the system hardware support are 512 Mbytes of 700-MHz GDDR3 memory with a 22.4-Gbyte/s memory interface bus bandwidth; a 12X, dual-layer DVD drive; and a 20-Gbyte hard drive that's standard on the premium model and an option for the core version. The system employs a unified memory architecture that allows the 512 Mbytes to be shared between the CPUs and the GPU with no partitioning headaches for the programmers.
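
The memory bandwidth figure can be sanity-checked with simple clock-times-width arithmetic. The sketch below assumes a 128-bit GDDR3 interface and two transfers per clock (double data rate); neither value is stated in the article, so treat both as assumptions.

```python
# Sketch: peak bandwidth = transfers per second x bytes per transfer.
# Assumptions (not stated in the article): a 128-bit GDDR3 interface and
# two data transfers per clock (double data rate).

def peak_bandwidth_gbytes(clock_mhz, transfers_per_clock, bus_bits):
    """Peak bandwidth in Gbytes/s, with 1 Gbyte = 10**9 bytes."""
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * (bus_bits / 8) / 1e9


gddr3_gbytes = peak_bandwidth_gbytes(700, 2, 128)  # 700-MHz GDDR3
fsb_total_gbytes = 10.8 + 10.8                     # read + write, from the article
```

With those assumptions, the 700-MHz memory works out to the quoted 22.4 Gbytes/s, and the front-side bus carries 21.6 Gbytes/s in aggregate when read and write traffic are counted together.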

Support from Infineon includes a removable solid-state memory unit, a single-chip wireless game-pad controller, and an advanced security chip. In the memory unit, Infineon crafted a custom memory controller and developed the embedded software that runs on the chip. For the security chip, the company created a custom solution based on its proven authentication technology. The game-pad controller eliminates the cables needed for game play, thus freeing the players from the tethers of wired controllers.

According to Holmdahl, most game/entertainment systems have a significant limiting factor, such as a lack of sufficient memory to feed the CPU and GPU, not enough bandwidth between system components, or an inability to apply processing resources automatically to a changing load profile. "Those and other limitations and restrictions have been dealt with in the design of Xbox 360," Holmdahl asserted. "The Xbox 360 was designed from the ground up to map to exactly what game developers told us they needed."

Holmdahl cited the GPU as one example, with its improved raw shader power over earlier Xbox systems. The 48 parallel shader ALU pipelines are dynamically scheduled and up to 24 billion shader instructions/s can be executed by the pipeline. "This means that the GPU dynamically adjusts to the incoming stream of work that the gaming program is sending it. It also allows the GPU to dynamically schedule around events, such as waiting for data from memory, that would normally block ALUs from processing. GPU resources are kept as active as possible without programmer intervention." The pixel fill rate can hit 4 billion pixels/s (eight per cycle) and the geometry engines can deliver up to 500 million triangles/s. Texture fills can be done at up to 8 billion bilinear filter samples/s.
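
The quoted peak rates follow directly from the 500-MHz GPU clock. As a rough check, this Python sketch converts between per-cycle and per-second figures; it assumes each shader ALU retires one instruction per cycle, which the article implies but does not state outright.

```python
# Sketch: relate the GPU's per-cycle and per-second peak rates.
# Assumption: each of the 48 shader ALUs issues one instruction per cycle.

GPU_CLOCK_HZ = 500e6  # 500-MHz GPU clock, from the article


def per_second(units_per_cycle, clock_hz=GPU_CLOCK_HZ):
    """Peak rate per second for a unit that completes units_per_cycle each clock."""
    return units_per_cycle * clock_hz


def per_cycle(units_per_second, clock_hz=GPU_CLOCK_HZ):
    """Per-clock rate implied by a quoted per-second peak."""
    return units_per_second / clock_hz


shader_rate = per_second(48)            # shader instructions/s
fill_rate = per_second(8)               # pixels/s at eight pixels per cycle
triangles_per_cycle = per_cycle(500e6)  # implied by 500 million triangles/s
samples_per_cycle = per_cycle(8e9)      # implied by 8 billion samples/s
```

The results match the article: 24 billion shader instructions/s and 4 billion pixels/s. The triangle and texture figures imply one triangle and 16 bilinear samples per clock, respectively.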

Moreover, the GPU can optimize vertex or pixel shading automatically for each game without the developer having to write extra code. The embedded DRAM and high-speed frame buffering effectively provide free anti-aliasing to eliminate the jagged edges characteristic of earlier game systems. "The embedded DRAM is unique to Xbox 360 and can process information about 40 times faster than Xbox 1, while allowing for roughly ten times more bandwidth than the competition," Holmdahl asserted. "With the special logic on the memory die we can process the information there without having to send it back to the core GPU. This means no signal aliasing, as well as the ability to process a great deal of information quickly, resulting in more pixels, faster, and the delivery of crisp, realistic images. The GPU also has a tessellator to provide smooth, round silhouettes."

"A traditional memory architecture will get 'fill-bound' as it tries to read and write to memory to build the scene," Holmdahl explained. "This is especially true of next-generation visuals that require high scene complexity, high dynamic range, HD resolutions and anti-aliasing."

According to Holmdahl, balance among system components can create more time for developers to spend focusing on building their game content, rather than trying to constantly re-factor their game systems to work with poor architectural balance. "We believe we have selected the right balance between floating point and integer instruction performance," he continued. "About 80% of game code is integer, general-purpose code. So it makes sense to set the CPU balance towards this level and then skew to account for increased simulation processing."

Expected to drive Sony's Playstation 3, the Cell Processor will support concurrent real-time and conventional computing applications. It is designed to operate at frequencies beyond 4 GHz and to deliver single-precision compute throughput of over 256 GFLOPS.

The Cell-Processor chip is based on a 64-bit IBM Power Processor Element (PPE) that performs all control/coordination operations for the rest of the chip, and up to eight high-performance streaming processors called Synergistic Processor Elements (SPEs). The PPE implements the Power instruction-set architecture, but has a leaner microarchitecture than previous implementations. It can run multiple operating systems (including Linux) and two simultaneous program threads. Including 32-kbyte L1 data and instruction caches, and a double-precision floating-point multiplier, the PPE is supported by an on-chip 512-kbyte L2 cache.

Each of the eight SPEs has its own local memory and can run an independent instruction stream. All SPEs connect to the PPE and the rest of the chip via a high-bandwidth quad-ring bus called an Element Interconnect Bus (EIB). The four 16-byte-wide data rings that make up the EIB can transfer up to 96 bytes per cycle. Ten simultaneous instruction threads are possible: the dual threads that run on the PPE and the eight threads possible on the eight SPEs. The PPE and SPE combination can also handle over 128 outstanding memory requests.
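
The thread count and the headline throughput figure can be reproduced with a few lines of arithmetic. The Python sketch below assumes each SPE completes eight single-precision floating-point operations per cycle (a four-wide SIMD fused multiply-add); that per-SPE rate is not given in the article and is labeled as an assumption in the code.

```python
# Sketch: Cell Processor thread count and peak single-precision throughput.
# Assumption (not in the article): each SPE completes 8 single-precision
# flops per cycle, i.e. a 4-wide SIMD fused multiply-add.

PPE_THREADS = 2  # dual threads on the PPE, from the article
SPE_COUNT = 8    # one instruction stream per SPE, from the article


def hardware_threads(ppe_threads=PPE_THREADS, spes=SPE_COUNT):
    """Simultaneous instruction threads across the PPE and the SPEs."""
    return ppe_threads + spes


def spe_peak_gflops(clock_ghz, spes=SPE_COUNT, flops_per_cycle=8):
    """Peak single-precision GFLOPS contributed by the SPEs alone."""
    return clock_ghz * spes * flops_per_cycle


threads = hardware_threads()
peak_at_4ghz = spe_peak_gflops(4.0)
```

Under that assumption, eight SPEs at 4 GHz give 256 GFLOPS, which lines up with the "beyond 4 GHz" and "over 256 GFLOPS" figures quoted for the chip.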

SPEs represent the first implementation of a new processor architecture designed to accelerate media and streaming workloads. Optimized for power efficiency and area, the SPEs are well suited for multicore implementations that can take advantage of parallelism. Load and store instructions for the SPEs are performed within a local address space served by a Local Store (LS) memory (256 kbytes) that's attached to each SPE. A 128-bit data bus (16 byte) connects the LS memory to each SPE.

Supporting the audio and video inputs and outputs, the SCC described by Toshiba at Hot Chips has an architecture optimized to deliver a high quality of service. It handles resource bandwidth allocation via a bus arbitration mechanism with priority in every cycle (Fig. 3). Internally, it employs two buses. One processes video and audio streams in real time and the other handles "best-effort" processing for data movement to storage and I/O ports, such as USB and Gigabit Ethernet interfaces.
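
To illustrate what per-cycle priority arbitration means, here is a toy Python model in which real-time requests always win over best-effort requests when both are pending. It is a hypothetical sketch, not Toshiba's design: the two-class scheme, the names, and the one-grant-per-cycle rule are all assumptions made for illustration.

```python
# Toy model of per-cycle, fixed-priority arbitration.
# Hypothetical illustration only; this is not the SCC's actual arbiter.
from collections import deque

REAL_TIME, BEST_EFFORT = 0, 1  # lower value = higher priority (assumed encoding)


def simulate(arrivals, cycles):
    """Grant at most one request per cycle, highest priority first.

    arrivals maps a cycle number to a list of (priority, name) requests.
    Returns the name granted in each cycle, or None when the bus is idle.
    """
    pending = {REAL_TIME: deque(), BEST_EFFORT: deque()}
    grants = []
    for cycle in range(cycles):
        for priority, name in arrivals.get(cycle, []):
            pending[priority].append(name)
        granted = None
        for priority in (REAL_TIME, BEST_EFFORT):
            if pending[priority]:
                granted = pending[priority].popleft()
                break
        grants.append(granted)
    return grants


demo = simulate(
    {0: [(BEST_EFFORT, "usb0"), (BEST_EFFORT, "usb1")], 1: [(REAL_TIME, "video0")]},
    cycles=4,
)
```

In the demo, the video request that arrives in cycle 1 is granted ahead of the USB request that was already waiting, which is the behavior a real-time stream needs to meet its deadlines.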

The chip also includes management logic to prevent conflicts between or among operating systems, and it provides content security via a hardware random number generator and multiple encryption/decryption provisions. The SCC includes a DDR2 DRAM interface for video RAM, with a dedicated DMA controller for streaming data. Also on the chip are high-definition and standard-definition video and audio inputs and outputs, an IEEE 1394 (FireWire) interface for connecting digital A/V equipment, and a transport stream interface for a digital tuner. Connectivity options include PCI, PCI-Express, USB 2.0, a Gigabit Ethernet network interface, and a parallel ATA interface for connecting storage devices.

Also in development for the PS3 is an advanced graphics processor that Nvidia is basing on its GeForce 7800 GTX architecture. Dubbed the RSX Reality Synthesizer, it can perform 136 shader operations per cycle with its multiple programmable shading processors. Pixel precision will be 128 bits for exceptional image creation and dynamic range. Furthermore, the chip will deliver a screen resolution of 2k by 1k pixels, the highest high-definition video resolution defined by the HD standard.

Nvidia has implemented a fixed-function texture mapping architecture and claims the chip will be the most advanced GPU implemented to date. It will have even more power than two of its GeForce 6800 GPUs. Packing over 300 million transistors, the RSX will be implemented in a 90-nm, 8-level metal process. Supporting the RSX GPU will be 512 Mbytes of render memory. The RSX GPU combined with the Cell Processor will deliver a compute throughput of 2 TeraFLOPS.

Need More Information?
ATI Technologies Inc.
IBM Corp.
Infineon Technologies AG
Microsoft Corp.
Nvidia Corp.
Sony Corp.
Toshiba Corp.
