Compression Reduces Memory Bottlenecks, From Supercomputing To Video

Applications like supercomputing and video need lots of bandwidth. Compression can reduce this requirement.

June 4, 2012

In 1899, a writer for the popular magazine Punch’s Almanack famously wrote, “Everything that can be invented has been invented.” In 1943, longtime IBM chief Thomas J. Watson allegedly said, “I think there is a world market for maybe five computers.” And in 1981, Bill Gates reportedly predicted, “640 kbytes ought to be enough for anybody.” Today we all laugh at these off-the-mark predictions, as they demonstrate the difficulty of predicting technology’s future.

Which of today’s technical prognostications might fall short? We engineers are confidently told about the end of Moore’s Law, the impossibility of automating software to extract parallelism, and (my personal favorite) the futility of expecting commercially useful results from new compression research. I’d like to refute that last prediction with two examples that illustrate how a novel compression technology is yielding big payoffs in two radically different applications.

Super Supercomputing

Supercomputing is facing the proverbial memory wall as increasing multicore CPU and GPU core counts strain already overloaded DDRx memory, PCI Express, and InfiniBand interfaces (see “HPC and ‘Big Data’ Apps Tap Floating-Point Number Compression”). Let’s first examine how the memory and bus bandwidth walls are challenging one of supercomputing’s key semiconductor components, graphics processing units (GPUs).

In the mid-2000s, Nvidia repurposed GPUs that it originally developed for gaming and video applications toward high-performance computing (HPC). Using Nvidia’s CUDA or the vendor-neutral OpenCL software framework, HPC algorithms coded in ANSI C can be parallelized to utilize hundreds of floating-point rendering engines per GPU. The industry calls this technology general-purpose GPU (GPGPU) computing.

GPGPU is used for such diverse HPC applications as finite element analysis, computational fluid dynamics, seismic processing, and weather forecasting. GPGPU technology has also become a key enabler in HPC’s push toward Exascale (10^18 floating-point operations per second), supercomputing’s goal for 2018.

To achieve Exascale performance, GPGPU technology must overcome daunting I/O challenges in both GDDR5 memory bandwidth and PCI Express bus bandwidth. A 2008 DARPA report identified the need for a 16x improvement in memory bandwidth and a 100x increase in bus bandwidth if Exascale performance is to be achieved. Such order-of-magnitude I/O improvements will not be achieved by 2018 simply with evolutionary memory and bus standards, such as the 2x increase in memory bandwidth from DDR3 to DDR4, or the 2x increase in PCI Express bus bandwidth from Gen2 to Gen3.

Disturbingly, GPU memory and bus bandwidth per core has actually decreased by a factor of six since 2003, because the number of floating-point GPU cores rose by 24x (from 64 to 1536) while memory and bus rates only increased by a factor of four. Similar technology forces have kept per-core I/O rates flat since the mid-2000s for multicore CPUs like Intel’s Sandy Bridge and AMD’s Opteron. A novel compression approach that compensates for this missing I/O acceleration factor of 2x to 6x could relieve the struggling CPU and GPU interfaces that carry HPC’s floating-point operands.
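
The per-core arithmetic behind that factor of six can be checked in a few lines. This is a minimal sketch that uses only the figures quoted above; the function and variable names are illustrative.

```python
def per_core_bandwidth_change(core_growth: float, bandwidth_growth: float) -> float:
    """Return the factor by which bandwidth per core changes.

    Values below 1.0 mean each core sees less bandwidth than before.
    """
    return bandwidth_growth / core_growth


# Figures quoted in the text: GPU cores grew from 64 to 1536,
# while memory and bus rates grew by about 4x.
core_growth = 1536 / 64                       # 24x more cores
change = per_core_bandwidth_change(core_growth, 4.0)
shortfall = 1.0 / change                      # how many times less bandwidth per core

print(f"cores grew {core_growth:.0f}x, per-core bandwidth fell {shortfall:.0f}x")
```

A compression ratio equal to the shortfall would, in principle, restore the per-core bandwidth of 2003.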

Compression’s improvements to memory and bus bandwidth can be quantified. In 1948, Bell Labs researcher Claude Shannon succinctly described a theory of information content illustrating the tradeoffs between compression ratio and distortion using rate-distortion (RD) curves. RD curves show where compression algorithms reach their limits. However, only certain RD curves are affordable in real-time applications where compression and decompression must operate at tens or hundreds of megahertz. Thus, RD curves have an implicit third axis: complexity.
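
To make the rate-versus-distortion tradeoff concrete, here is a minimal sketch that traces a few points of an RD curve. It assumes a plain uniform quantizer applied to random samples and a 32-bit original sample width. It is not Shannon's bound and not the APAX algorithm, and it ignores the complexity axis entirely.

```python
import random


def quantize(samples, bits, lo=-1.0, hi=1.0):
    """Uniformly quantize samples in [lo, hi] to 2**bits levels and reconstruct."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    out = []
    for x in samples:
        index = min(int((x - lo) / step), levels - 1)   # clamp the top edge
        out.append(lo + (index + 0.5) * step)           # mid-point reconstruction
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)  # fixed seed so the sketch is repeatable
signal = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# Sweep the rate (bits per sample) and record the resulting distortion.
rd_curve = [(bits, mse(signal, quantize(signal, bits))) for bits in range(2, 11, 2)]

for bits, distortion in rd_curve:
    ratio = 32 / bits  # compression ratio relative to 32-bit samples (assumed)
    print(f"{bits:2d} bits/sample  ratio {ratio:4.1f}:1  MSE {distortion:.2e}")
```

Each row is one point on the curve: fewer bits per sample give a higher compression ratio and a larger error. A better algorithm would reach lower distortion at the same rate, usually at the cost of more computation.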

Consumer Compression

In today’s consumer electronics devices, video compression algorithms must simultaneously exceed a target compression ratio with acceptable distortion while implementation complexity meets a silicon or MIPS constraint. The H.264 compression algorithm is broadly used for video distribution across wireless networks, both for downloads of streaming video and uploads of user-captured video. H.264 compression exhibits excellent quality at 20:1 to 30:1 compression by using inter-frame correlations between groups of pixels, typically 16- by 16-pixel units called macroblocks (MBs).

During compression, a memory-intensive process called Motion Estimation (ME) compares the current MB to hundreds of candidate MBs in previously stored frames, which are also interpolated by a factor of four so that matches can be found at sub-pixel positions. The most similar candidate is then encoded using a motion vector and the quantized transform coefficients of the MB error signal. The corresponding process during H.264 decompression is called Motion Compensation (MC).
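
The reason ME is so memory-intensive is easier to see in code. The following is a minimal sketch of full-search block matching using the sum of absolute differences (SAD) as the similarity measure. The toy 8x8 frames, the 4x4 block size, and the +/-2 pixel search window are assumptions chosen for brevity; a real H.264 encoder uses larger blocks and windows, sub-pixel interpolation, and faster search strategies, none of which are shown here.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def motion_estimate(current, reference, x, y, size=4, search=2):
    """Full-search block matching.

    Returns (motion_vector, cost) for the block of `current` whose top-left
    corner is (x, y), searching +/- `search` pixels in `reference`.
    """
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = (0, 0), None
    for my in range(-search, search + 1):
        for mx in range(-search, search + 1):
            rx, ry = x + mx, y + my
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(current, x, y, reference, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (mx, my), cost
    return best_mv, best_cost


# Toy 8x8 frames: the reference holds a synthetic texture and the
# current frame is the same texture shifted one pixel right and one down.
W = H = 8
reference = [[(x * 7 + y * 13 + x * y * 3) % 251 for x in range(W)] for y in range(H)]
current = [[reference[(y - 1) % H][(x - 1) % W] for x in range(W)] for y in range(H)]

mv, cost = motion_estimate(current, reference, x=2, y=2)
print("motion vector:", mv, "SAD:", cost)
```

With these toy frames the search finds an exact match one pixel up and to the left in the reference, so it reports a motion vector of (-1, -1) with an SAD of zero. Every candidate position requires reading a full block from the reference frame, which is where the memory traffic comes from.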

For 1080p HD video displayed at 60 frames/s, H.264 DDR memory bandwidth requirements exceed 3 Gbytes/s. Rather than accelerating video signal processing operations, recent H.264 research has tried to optimize cache access to the megabytes of stored frame data using special software or hardware techniques. As in supercomputing, I/O is now the critical bottleneck in H.264 video compression. Could the same I/O acceleration technology that provides a 2x to 6x speed-up for supercomputing be used to relieve H.264 memory bottlenecks?
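
A rough calculation puts the 3-Gbyte/s figure in perspective. This sketch assumes 8-bit 4:2:0 sampling (1.5 bytes per pixel) and decimal gigabytes; neither assumption comes from the original figure, so treat the result as an order-of-magnitude estimate only.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60
BYTES_PER_PIXEL = 1.5          # assumption: 8-bit 4:2:0 sampling
QUOTED_BANDWIDTH = 3e9         # bytes/s, the "exceeds 3 Gbytes/s" figure from the text

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
one_pass = frame_bytes * FPS   # bytes/s needed to touch every pixel once per frame
passes = QUOTED_BANDWIDTH / one_pass

print(f"one frame: {frame_bytes / 1e6:.2f} Mbytes")
print(f"one pass per frame: {one_pass / 1e6:.1f} Mbytes/s")
print(f"3 Gbytes/s is about {passes:.0f} frame-sized transfers per displayed frame")
```

Under these assumptions, reading or writing each frame once costs well under 0.2 Gbytes/s. The quoted requirement is roughly 16 times larger, which is consistent with the repeated reference-frame accesses that motion estimation and compensation demand.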

A new Samplify compression technique called APAX (“APplication AXceleration”) efficiently reduces memory bottlenecks for both supercomputing and video. For HPC applications, APAX floating-point compression reduces memory and storage bandwidth by 2x to 6x while generating the same results for climate modeling, weather forecasting, seismic processing, and 3D rendering. For video and graphics applications such as H.264 and rendering, APAX’s integer compression technology reduces frame buffer, texture, and mesh traffic across memory and bus interfaces by a user-selected factor, from lossless to 8:1, with visually imperceptible results.

In conclusion, an emerging compression technology that accelerates supercomputing’s push toward Exascale computing also accelerates video decoding, one of compression’s best-researched and most ubiquitous application areas.

About the Author

Al Wegener

Al Wegener is the CTO and founder of Samplify Systems, a fabless semiconductor startup in Santa Clara, Calif. He holds 17 patents and is named on additional Samplify patent applications. He earned a BSEE from Bucknell University and an MSCS from Stanford University.