Games Flourish in a Parallel Universe

Gaming platforms such as Microsoft’s Xbox 360 and Sony’s PlayStation 3 push the proverbial envelope when it comes to graphics and computation, delivering rich, realistic games. Their multicore 64-bit processing architectures let programmers create sophisticated, multithreaded applications.

The computational processors are tightly integrated with the graphical processing units, minimizing system response time for a better gaming experience. Even small delays can disrupt the flow of a game or its multimedia presentation. Performance and balance on both the hardware and software fronts will provide an optimal gaming experience.

Gamers tend to grade a system on the basis of the game’s playing capabilities, regardless of how well it takes advantage of the underlying hardware. Still, looking under the hood shows each system’s potential. As with most programming platforms, applications rarely take full advantage of the hardware the first time around. It takes time to learn about system idiosyncrasies and to mold application frameworks to exploit the hardware.

Game developers have an additional challenge because game vendors often target multiple platforms with the same game. Obviously, this is desirable from a vendor’s perspective, because it widens the market. Unfortunately, even slight differences in platforms or their capabilities can significantly impact the software.

Differences between Microsoft’s and Sony’s platforms are quite substantial, so a seemingly minor problem can become a major one. The Xbox 360 uses a more conventional symmetrical multiprocessing (SMP) architecture. Sony’s PlayStation 3 is built on IBM’s Cell processor. The Cell forgoes large caches for its eight Synergistic Processing Elements (SPEs), forcing application programmers to use software-based caching support.

THE SYMMETRICAL APPROACH

Microsoft and IBM developed a multicore chip based on the Power architecture (Fig. 1). Its three 3.2-GHz processing cores are identical, and each has its own 32-kbyte L1 instruction and data caches. The two-way, set-associative caches include parity error checking on the 128-bit lines.

Each core can run two threads. The processing cores share a 1-Mbyte L2 cache, and this cache has an interesting architecture. Half of the cache runs at the processors’ clock frequency, while the rest of the L2 cache runs at 1.6 GHz. The cores also add a new instruction called Extended Data Cache Block Touch.

The instruction is designed to prefetch data from main memory into the L1 cache. It’s often easier to take advantage of this instruction in a gaming environment, where the size and use of data are well-defined. Moving data straight into the L1 cache reduces L2 thrashing, so the instruction can be used to quickly build up a thread’s working set. In a conventional processor, the working set is brought in incrementally, slowing down the overall thread operation.
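The prefetch idea can be sketched in portable C. Extended Data Cache Block Touch is specific to the Xbox 360 cores, so this sketch assumes GCC or Clang and substitutes their `__builtin_prefetch` hint, which does not promise a particular cache level. The function name and the prefetch distance are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; real code would tune it. */
#define PREFETCH_AHEAD 16

/* Sum an array while hinting that data a few iterations ahead will be needed.
 * __builtin_prefetch is a GCC/Clang builtin. It is only a hint and never
 * changes the result; it stands in for the Xbox-specific instruction. */
float sum_with_prefetch(const float *data, size_t count)
{
    assert(data != NULL || count == 0);
    float total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0 /* read */, 3 /* keep cached */);
        total += data[i];
    }
    return total;
}
```

Because a game usually knows the size and layout of its data up front, the prefetch distance can be chosen per data structure rather than guessed.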

The processing chip accesses main memory through the front-side bus connected to the graphics chip. The front-side bus runs at 5.4 GHz with a bandwidth of 21.6 Gbytes/s. The graphics chip provides a unified memory system to the on-chip graphics processing unit (GPU) and the Power cores in the processing chip. The GPU can read data directly from the L2 cache for even better interaction with application code.

The processors also support cacheable and cache-inhibited store operations, which are handled by different pipelines. The cacheable operations use eight store-gathering, non-sequential buffers per core, while the noncacheable operations use four sequential buffers. By understanding these instructions, developers can optimize their applications.

For example, data written to main memory for use by the GPU will often benefit from bypassing the cache if the application threads no longer need to access this data. Running data through the cache would simply flush data that might be useful later. However, the cache isn’t the only concern for software developers.

Each processing core includes a VMX128 (Vector/SIMD Multimedia eXtension) unit. The VMX128 was specifically designed to accelerate 3D graphics and game physics. Developers can benefit from this feature because it was built on the VMX accelerator, which is already found in many Power architecture cores like those in Apple’s G4 and G5 Power Macs. Enhancing SIMD support in a compiler is a relatively straightforward process and typically allows a programmer to exploit the underlying hardware without significantly modifying the software.
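As an illustration of SIMD support that stays hidden in the compiler, here is a minimal sketch of a loop written so a vectorizing compiler can map it onto a SIMD unit. It is plain C99, not VMX128 code. The function name and the physics step it performs are hypothetical, and whether the loop is actually vectorized depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical physics step: pos[i] += vel[i] * dt for every component.
 * 'restrict' promises the arrays do not overlap, which lets a vectorizing
 * compiler process several floats per instruction without source changes. */
void integrate_positions(float *restrict pos, const float *restrict vel,
                         float dt, size_t count)
{
    assert(pos != NULL && vel != NULL);
    for (size_t i = 0; i < count; ++i)
        pos[i] += vel[i] * dt;
}
```

The same source compiles for a scalar target or a SIMD target, which is the portability benefit the paragraph above describes.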

There are significant advantages to Microsoft’s more conventional gaming hardware approach. SMP with multilevel, transparent coherent caches is standard fare on PCs. Thus, it’s significantly easier to develop multithreaded applications that will run on different platforms, often with minimal application architectural changes other than recompilation. The same is true for utilization of VMX128, since this support is often hidden by the compiler.


WE DON’T NEED NO STINKIN’ CACHE

The PlayStation 3 uses IBM’s Cell processor (Fig. 2). The Cell has a Power architecture core, called the Power Processor Element (PPE), similar to the ones used in Microsoft’s multicore solution (see “Cell Processor Gets Ready To Entertain The Masses”). But the Cell’s core is designed to manage a set of eight Synergistic Processing Elements (SPEs).

The PPE is a Power architecture core that includes caching. It often runs a typical operating system like Linux. Applications running on the PPE coordinate the operation of the SPEs in addition to executing parts of the application that may not benefit from the SPEs’ parallel nature.

The Cell chip’s architecture and layout are very different from those of the Xbox chip, primarily due to the lack of caches on the SPEs. As such, designers are able to add 256 kbytes of RAM to each of the eight SPEs. This RAM is used for code and data storage.

The Xbox chip can run six threads on three processors, while each SPE is single-threaded. One big difference between the two approaches is that the SPE operation is deterministic, a feature not possible with a cache. Determinism is key in some environments, especially gaming.

There’s a tradeoff, though. Programmers or programming tools need to account for the different access times. Access to the 256 kbytes of local memory is the fastest. There’s a small overhead to access another SPE’s memory, but a lot of overhead to access main memory. In addition, DMA transfers between main memory and an SPE’s memory are fast, but it takes time to set up the transfers.
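The setup-versus-transfer tradeoff can be expressed as a simple cost model. This is a minimal sketch: it assumes a fixed setup cost plus a transfer time proportional to size. The function names are hypothetical, and any numbers passed in are placeholders rather than measured Cell figures.

```c
#include <assert.h>

/* Estimated time for one DMA transfer, in nanoseconds.
 * setup_ns and bytes_per_ns are hypothetical inputs, not Cell measurements. */
double dma_transfer_ns(double setup_ns, double bytes_per_ns, double bytes)
{
    assert(bytes_per_ns > 0.0);
    return setup_ns + bytes / bytes_per_ns;
}

/* Fraction of the total time spent moving data rather than on setup. */
double dma_efficiency(double setup_ns, double bytes_per_ns, double bytes)
{
    double total = dma_transfer_ns(setup_ns, bytes_per_ns, bytes);
    return total > 0.0 ? (bytes / bytes_per_ns) / total : 0.0;
}
```

Under this model, larger transfers amortize the setup cost, which is one reason to move a whole working set at once instead of fetching it piecemeal.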

The SPE architecture affects how applications are coded to take advantage of the multiple units as well as contend with the memory limitations. Overall, programmers take what Peter Hofstee, Distinguished Engineer with IBM, calls a shopping-list technique when it comes to scheduling. SPEs are given jobs from a list and come back for another upon completion of their current job.
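The shopping-list idea can be sketched as a single-threaded simulation. It is not Cell code. It assumes each job has a known cost in arbitrary time units, and the function name and the cap of eight workers are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 8   /* matches the article's eight SPEs */

/* Simulate list scheduling: whenever a worker becomes free, it takes the
 * next job from the list. job_cost[] holds hypothetical per-job durations.
 * Returns the time at which the last job finishes. */
unsigned schedule_from_list(const unsigned *job_cost, size_t job_count,
                            size_t worker_count)
{
    assert(worker_count >= 1 && worker_count <= MAX_WORKERS);
    unsigned free_at[MAX_WORKERS] = {0};
    unsigned finish = 0;

    for (size_t j = 0; j < job_count; ++j) {
        /* The worker that frees up first comes back for the next job. */
        size_t next = 0;
        for (size_t w = 1; w < worker_count; ++w)
            if (free_at[w] < free_at[next])
                next = w;
        free_at[next] += job_cost[j];
        if (free_at[next] > finish)
            finish = free_at[next];
    }
    return finish;
}
```

With job costs of 4, 2, 2, 3, and 1, one worker finishes at time 12 and two workers finish at time 7. No worker sits idle while jobs remain on the list.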

There are two approaches to deliver code and data from the list. The first essentially splits an application into chunks that will fit into an SPE. DMA is used to bring in the necessary code and working set data (Fig. 3). The SPE code may access other memory, but most of the data is loaded when the chunk starts. Data may be written back to main memory when the task is done. Often, the task runs to completion, and then the next chunk is loaded.

The chunking approach can be used to handle streams of data. For example, game programs typically process data for a display frame. This processing can be split into chunks, and the chunks are then distributed among the SPEs. A single frame may be broken up into more chunks than SPEs. Consequently, it’s simply a matter of running the chunks through the SPEs at a rate fast enough to complete a frame in time to display it.
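The chunking approach can be sketched in plain C, with `memcpy` standing in for the DMA transfers and a static array standing in for an SPE’s local memory. The 64-kbyte chunk size, the brightening operation, and all function names are hypothetical. Only the 256-kbyte local memory size comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_STORE_BYTES (256 * 1024)  /* per-SPE RAM size from the article */
#define CHUNK_BYTES (64 * 1024)         /* hypothetical; leaves room for code */

static unsigned char local_store[CHUNK_BYTES];  /* stand-in for local memory */

/* Hypothetical per-chunk work: brighten each byte, saturating at 255. */
static void process_chunk(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        buf[i] = (unsigned char)(buf[i] > 245 ? 255 : buf[i] + 10);
}

/* Run a frame through one worker, chunk by chunk.
 * Returns the number of chunks processed. */
size_t process_frame(unsigned char *frame, size_t frame_bytes)
{
    assert(CHUNK_BYTES <= LOCAL_STORE_BYTES);
    assert(frame != NULL || frame_bytes == 0);
    size_t chunks = 0;
    for (size_t off = 0; off < frame_bytes; off += CHUNK_BYTES) {
        size_t left = frame_bytes - off;
        size_t len = left < CHUNK_BYTES ? left : CHUNK_BYTES;
        memcpy(local_store, frame + off, len);   /* "DMA in" */
        process_chunk(local_store, len);
        memcpy(frame + off, local_store, len);   /* "DMA out" */
        ++chunks;
    }
    return chunks;
}
```

A real implementation would hand different chunks to different SPEs. The loop here only shows how a frame larger than local memory is cut into pieces that fit.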

The other approach is similar to chunking, but either the code or data stays in place. For instance, an application applies the same algorithm to a stream of data. The code is loaded once into an SPE, and then data is moved in and out as it’s processed. The flip side is a chunk of data that’s transformed by some code and then another and another. Double-buffering reduces the amount of swappable data or code, but it may improve efficiency.

This swapping approach was quite common in the past when memory was at a premium. Think back to Fortran COMMON statements or Basic program switching on minicomputers.
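Here is a minimal sketch of the code-stays-in-place case with double-buffering, again in plain C. On real hardware the fetch of the next block would be an asynchronous DMA that overlaps the computation. Here `memcpy` stands in for it, so the overlap is structural only. The block size and function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_FLOATS 1024   /* hypothetical block size */

static float buffers[2][BLOCK_FLOATS];  /* the two halves of a local buffer */

static size_t block_len(size_t count, size_t off)
{
    return count - off < BLOCK_FLOATS ? count - off : BLOCK_FLOATS;
}

/* Scale a stream in place. The kernel (multiply by gain) stays resident
 * while data blocks rotate through the two buffers. */
void scale_stream(float *stream, size_t count, float gain)
{
    assert(stream != NULL || count == 0);
    size_t blocks = (count + BLOCK_FLOATS - 1) / BLOCK_FLOATS;
    if (blocks == 0)
        return;

    memcpy(buffers[0], stream, block_len(count, 0) * sizeof(float)); /* prime */

    for (size_t b = 0; b < blocks; ++b) {
        size_t cur = b & 1, nxt = cur ^ 1;
        size_t off = b * BLOCK_FLOATS;
        size_t len = block_len(count, off);

        if (b + 1 < blocks) {   /* start fetching the next block */
            size_t noff = off + BLOCK_FLOATS;
            memcpy(buffers[nxt], stream + noff,
                   block_len(count, noff) * sizeof(float));
        }
        for (size_t i = 0; i < len; ++i)   /* compute on the current block */
            buffers[cur][i] *= gain;
        memcpy(stream + off, buffers[cur], len * sizeof(float)); /* write back */
    }
}
```

Each buffer holds half of what a single buffer of the same total size could, which is the cost the paragraph above mentions. The payoff is that a transfer and a computation can be in flight at once.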

Code and data can be pushed into SPEs or pulled in by code running on the SPEs. Such software-based caching varies greatly from one application to another, but it gives the control to programmers instead of the hardware. Huge benefits can be derived from software caching and SPE communication when the hardware is fully brought to bear on a problem. IBM optimized a ray tracing program that caches seven different kinds of data blocks among multiple SPEs. The end result: performance rose by almost an order of magnitude.
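To make software-based caching concrete, here is a minimal sketch of a direct-mapped, read-only software cache in plain C. It is not IBM’s ray tracing cache. The line size, line count, type, and function names are all hypothetical, and `memcpy` stands in for the DMA that would fill a line on a miss.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_WORDS 16   /* hypothetical line size, in ints */
#define NUM_LINES  8    /* hypothetical number of cached lines */

typedef struct {
    const int *main_memory;            /* backing store being cached */
    int data[NUM_LINES][LINE_WORDS];   /* lines held in local memory */
    size_t tag[NUM_LINES];             /* which block each line holds */
    int valid[NUM_LINES];
    unsigned hits, misses;
} SoftCache;

void cache_init(SoftCache *c, const int *main_memory)
{
    memset(c, 0, sizeof *c);
    c->main_memory = main_memory;
}

/* Read one int through the cache. On a miss the whole line is copied in.
 * Assumes the backing array length is a multiple of LINE_WORDS. */
int cache_read(SoftCache *c, size_t index)
{
    assert(c != NULL && c->main_memory != NULL);
    size_t block = index / LINE_WORDS;
    size_t line = block % NUM_LINES;
    if (!c->valid[line] || c->tag[line] != block) {
        memcpy(c->data[line], c->main_memory + block * LINE_WORDS,
               sizeof c->data[line]);
        c->tag[line] = block;
        c->valid[line] = 1;
        c->misses++;
    } else {
        c->hits++;
    }
    return c->data[line][index % LINE_WORDS];
}
```

Because the policy lives in ordinary code, an application can pick a line size and replacement rule per data type, which a fixed hardware cache cannot do.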