The Memory Wall Is Ending Multicore Scaling


Multicore processors dominate today’s computing landscape, in platforms as diverse as Apple’s iPad and the Fujitsu K supercomputer. In 2005, as power consumption limited single-core CPU clock rates to about 3 GHz, Intel introduced its first dual-core processors, and the two-core Core 2 Duo followed in 2006. Since then, multicore CPUs and graphics processing units (GPUs) have become the norm in computer architectures. Integrating more cores per socket has become the way that processors can continue to exploit Moore’s law.

But a funny thing happened on the way to the multicore forum: processor utilization began to decrease. At first glance, Intel Sandy Bridge servers, with eight 3-GHz cores, and the Nvidia Fermi GPU, featuring 512 floating-point engines, seem to offer linearly improved multicore goodness.

Yet a worrying trend has emerged in supercomputing, which deploys thousands of multicore CPU and GPU sockets for big data applications, and it foreshadows severe problems with multicore. Measured against their peak rate in millions of floating-point operations per second (Mflops), today’s supercomputers are less than 10% utilized. The reason is simple: input-output (I/O) rates have not kept pace with multicore compute rates, measured in millions of instructions per second (MIPS).

The Memory Hierarchy And The Memory Wall

The term memory wall was coined in the mid-1990s to describe a disparity that had been growing since the 1980s: CPU clock rates were pulling away from off-chip memory and disk drive I/O rates. An example from the GPU world clearly illustrates the memory wall.

In 2005, a leading-edge GPU had 192 floating-point cores, while today’s leading-edge GPU contains 512 floating-point cores. In the intervening six years, the primary GPU I/O pipe remained the same. The GPU of six years ago utilized 16 lanes of PCI Express Gen2, and so does today’s GPU. As a result, per-core I/O rates for GPUs have dropped by a factor of 2.7 since 2005.
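The factor of 2.7 is simply a fixed pipe divided among more cores. A minimal sketch of that arithmetic follows, assuming a nominal 8 Gbytes/s in one direction for a 16-lane PCI Express Gen2 link. That bandwidth figure and the function name are assumptions added for illustration; the drop factor does not depend on them.

```python
# Per-core I/O bandwidth when the link stays fixed and the core count grows.
# LINK_GBYTES_PER_S is an assumed nominal figure for x16 PCIe Gen2 (one
# direction); the drop factor computed below is independent of it.
LINK_GBYTES_PER_S = 8.0

def per_core_mbytes_per_s(cores, link_gbytes_per_s=LINK_GBYTES_PER_S):
    """Link bandwidth shared evenly across all cores, in Mbytes/s per core."""
    return link_gbytes_per_s * 1000.0 / cores

old = per_core_mbytes_per_s(192)   # the earlier GPU
new = per_core_mbytes_per_s(512)   # today's GPU
drop_factor = old / new            # same as 512 / 192

print(f"{old:.1f} -> {new:.1f} Mbytes/s per core, a drop of {drop_factor:.2f}x")
```

Whatever the real link rate, the ratio is 512/192, which rounds to 2.7.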

On-chip cache memory, which is 10 to 100 times faster than off-chip DRAM, was supposed to knock down the memory wall. But cache memories have their own set of problems. The L1 and L2 caches found on ARM-based application processors occupy more than half of the chip’s silicon area. As a result, a significant percentage of processor power is consumed by cache memory, not by computations.

Cache control algorithms are notoriously application-dependent: sometimes a cache contains the data that an application needs, but sometimes it doesn’t. Rather than allowing cache control algorithms to make educated guesses about data locality, cache utilization could be improved, and cache power consumption lowered, if programmers controlled the data exchanged between cache and off-chip memory.

Today’s memory hierarchy includes four tiers: cache, DRAM, flash, and disk drives. Mainstream DDR3 memory runs at 4 Gbytes/s, while flash memory operates at 500 Mbytes/s. Disk drives deliver data at a pedestrian 100 Mbytes/s.

As we leave on-chip cache memory, not only does bandwidth decrease but latency also increases, further decreasing effective memory rates. Cache bandwidth of more than 20 Gbytes/s is only achieved when an application exhibits good data locality or reuse. Without data locality, caches can’t reduce memory bottlenecks.
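To put those rates in perspective, here is a rough sketch of how long it would take to stream one dataset through each tier at the bandwidths quoted above. The 1-Tbyte dataset size is an illustrative assumption, and latency is ignored, so real figures would be worse.

```python
# Time to stream a dataset through each tier at the quoted bandwidths.
# The dataset size is an illustrative assumption; latency is ignored.
TIER_MBYTES_PER_S = {
    "cache": 20_000,        # "more than 20 Gbytes/s", given good locality
    "DRAM (DDR3)": 4_000,
    "flash": 500,
    "disk": 100,
}

DATASET_MBYTES = 1_000_000  # 1 Tbyte, decimal units

def stream_seconds(dataset_mbytes, rate_mbytes_per_s):
    """Seconds needed to move the dataset at a sustained rate."""
    return dataset_mbytes / rate_mbytes_per_s

for tier, rate in TIER_MBYTES_PER_S.items():
    secs = stream_seconds(DATASET_MBYTES, rate)
    print(f"{tier:12s} {rate:>6d} Mbytes/s  {secs:>8.0f} s")
```

On these figures the same dataset that streams through cache in under a minute takes nearly three hours to come off a disk drive, a 200-fold gap.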

Supercomputing used to distinguish between compute-bound and I/O-bound applications, with a somewhat arbitrary threshold at one operand per instruction. If an application required more than one operand per instruction, it was considered I/O-bound. If it required fewer, the application was compute-bound.
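That rule of thumb can be written down in a few lines. The sketch below is only an illustration: the function name and the operation counts are hypothetical, and treating a ratio of exactly one as compute-bound is an assumption, since the rule does not say which side the boundary falls on.

```python
def classify(operands_fetched, instructions_executed, threshold=1.0):
    """Label a workload by its operands-per-instruction ratio.

    Assumption: a ratio exactly at the threshold counts as compute-bound.
    """
    ratio = operands_fetched / instructions_executed
    return "I/O-bound" if ratio > threshold else "compute-bound"

# Hypothetical counts, for illustration only.
# A vector add touches three operands (two loads, one store) per addition.
vector_add = classify(operands_fetched=3_000, instructions_executed=1_000)
# Evaluating a long polynomial reuses each loaded value many times.
polynomial = classify(operands_fetched=1_000, instructions_executed=32_000)

print(vector_add, polynomial)
```

The vector add lands on the I/O-bound side, while the polynomial, which reuses each operand many times, is compute-bound.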

With features such as multiple cores, single-instruction, multiple-data (SIMD) registers, and multimedia accelerators, CPU and GPU computational abilities have far outstripped sustainable I/O rates. Such computational advances made most multicore applications I/O-bound and led to the sub-10% supercomputer utilization mentioned earlier.

Architectural Approaches

Computer architects have updated direct memory access (DMA) techniques to reduce I/O bottlenecks. DMA is often supported in hardware, allowing memory accesses to operate simultaneously with CPU and GPU operations.

Example hardware DMA techniques include Intel’s I/O Acceleration Technology (I/OAT), Nvidia’s GPUDirect, and ARM’s AMBA DMA Engine. Recently, Nvidia added CudaDMA software to its CUDA toolkit. Using CudaDMA, application threads can perform I/O independently from computation. Similarly, supercomputing applications are starting to dedicate certain cores exclusively to I/O (DMA).
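The common idea behind these techniques is to let one agent fetch the next block of data while another computes on the current one. The sketch below mimics that pattern in software with a dedicated I/O thread and a two-slot queue (double buffering). It is an analogy only: it uses none of the products named above, all names are hypothetical, and the "I/O" here is just handing over in-memory lists where a real program would read from a file or a device.

```python
import queue
import threading

def io_worker(chunks, buffers):
    """Dedicated I/O thread: stands in for a DMA engine or an I/O-only core."""
    for chunk in chunks:
        buffers.put(chunk)      # blocks while both buffer slots are full
    buffers.put(None)           # sentinel: no more data

def process(chunks):
    """Sum of squares over all chunks, computed while the next chunk is fetched."""
    buffers = queue.Queue(maxsize=2)   # two slots = double buffering
    worker = threading.Thread(target=io_worker, args=(chunks, buffers))
    worker.start()
    total = 0
    while (chunk := buffers.get()) is not None:
        total += sum(x * x for x in chunk)
    worker.join()
    return total

data = [list(range(i, i + 4)) for i in range(0, 12, 4)]
result = process(data)
print(result)
```

The bounded queue is the design choice that matters: it lets the fetcher run at most two blocks ahead, so memory use stays fixed while the compute loop rarely waits for data.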

Finally, enterprise software companies are deploying new parallel programming models such as MapReduce (MR), Google Sawzall, and Apache Hadoop. Ironically, the Map and Reduce steps exchange intermediate data using disk files, the slowest storage level in the memory hierarchy. MR and Hadoop threads execute at or near disk drives that contain the big data inputs and the intermediate output files.
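A toy word count shows the shape of the model and where the disk files come in. This is a single-process sketch, not the Hadoop API: the function names and the JSON file format are assumptions chosen for brevity, and a real deployment would run many map and reduce tasks on the machines that hold the data.

```python
import json
import os
import tempfile
from collections import defaultdict

def map_phase(documents, workdir):
    """Map: emit (word, 1) pairs, written to one intermediate file per document."""
    paths = []
    for i, text in enumerate(documents):
        path = os.path.join(workdir, f"map-{i}.json")
        with open(path, "w") as f:
            json.dump([(word, 1) for word in text.split()], f)
        paths.append(path)
    return paths

def reduce_phase(paths):
    """Reduce: read the intermediate files back and sum the counts per word."""
    counts = defaultdict(int)
    for path in paths:
        with open(path) as f:
            for word, n in json.load(f):
                counts[word] += n
    return dict(counts)

docs = ["memory wall", "memory hierarchy", "wall"]
with tempfile.TemporaryDirectory() as workdir:
    word_counts = reduce_phase(map_phase(docs, workdir))

print(word_counts)
```

Every pair emitted by the map step makes a round trip through the file system before the reduce step sees it, which is the irony noted above.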

Rather than building traditional supercomputers from three well-delineated but expensive subsystems (a storage server, a compute server, and a high-speed interconnect between the two), MR and Hadoop turn traditional supercomputing on its head by operating on locally available, distributed data. The reason is simple: data movement, not computation, consumes most of the power in supercomputing.

Moving “Smaller” Datasets

In addition to performing computations closer to the disk drives that hold the input datasets, some researchers (including me) are developing ways to reduce the size of datasets used in supercomputing and multicore processing. “Big data” sets from climate modeling, multi-physics experiments, genomic sequencing, and seismic processing regularly reach the terabyte range, so a 2x to 4x decrease in the size of these data structures would be economically significant.

Recent data reduction results from medical imaging and wireless infrastructure demonstrate that real-world x-ray datasets, ultrasound reflections, and 3G and 4G wireless signals can be compressed without changing the end result. Similar techniques are being applied to reduce the multicore memory wall for a broader set of datatypes, including floating-point values.
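As a crude stand-in for such techniques, the sketch below stores 64-bit floating-point values as 32-bit values, which halves the data at the cost of a small rounding error. This is not the method behind the results cited above, only the simplest possible illustration of trading precision for size. The sample data and function name are hypothetical, and whether the end result is unchanged depends entirely on how much precision the application actually needs.

```python
from array import array

def reduce_to_float32(values):
    """Store 64-bit floats as 32-bit floats: a fixed, lossy 2x size reduction."""
    return array("f", list(values))

# Hypothetical sample data: 1,000 values between 0 and 100.
samples = array("d", [0.1 * i for i in range(1000)])
reduced = reduce_to_float32(samples)

bytes_before = len(samples) * samples.itemsize
bytes_after = len(reduced) * reduced.itemsize
ratio = bytes_before / bytes_after
max_abs_error = max(abs(a - b) for a, b in zip(samples, reduced))

print(f"{bytes_before} -> {bytes_after} bytes ({ratio:.0f}x), "
      f"max error {max_abs_error:.2e}")
```

A 2x reduction halves the time spent at every tier of the memory hierarchy, so the payoff is largest where the pipe is narrowest.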

Taken together, retooled DMA techniques, novel programming models, and innovative data reduction techniques may substantially reduce the multicore memory wall, giving multicore users faster time to results.

Al Wegener, CTO and founder of Samplify Systems, earned a BSEE from Bucknell University and an MSCS from Stanford University.