Compression Reduces The I/O Bottlenecks Of Multicore Computing

July 27, 2011
Samplify's Prism FP can compress floating-point data using lossy or lossless algorithms. Reducing data size helps move data through GPUs and other bandwidth-limited environments.

Fig 1. Moore’s Law is outpacing I/O limits.

Fig 2. Multicore has compute and I/O limitations.

Fig 3. Compression and decompression can reduce the required bandwidth for almost any interface.

Fig 4. Prism FP compresses floating-point data in lossless, fixed-rate, or fixed-quality modes. Translation takes under eight clock cycles.

Fig 5. Lossy compression creates decompressed data that still provides sufficient detail for images and data processing.

Multicore central processing unit (CPU) and graphics processing unit (GPU) processors are used in an astounding variety of computational platforms, including supercomputers, desktop and laptop PCs, and smart phones. All of these platforms are taking advantage of inexpensive compute fabrics built with thousands of multicore sockets.

However, per-socket input-output (I/O) becomes increasingly limited as tens of CPU cores and hundreds of GPU cores compete for finite per-socket memory and bus bandwidth. If I/O problems could be solved by adding gates rather than pins, multicore I/O could also be accelerated by Moore’s Law, the doubling of transistor density roughly every 18 months, rather than being constrained by Amdahl’s Law, which states that the speedup from parallelizing a task is limited by the fraction of the task that cannot be parallelized.

Figure 1 illustrates the interface gap between computational rates and I/O rates, using five generations of Nvidia GPUs as relevant examples. The Nvidia GeForce 8800GT has 112 floating-point cores and supports I/O rates of up to 20 Gbytes/s. The Nvidia Quadro Fx580 contains 256 improved floating-point cores and supports I/O rates of almost 36 Gbytes/s.

In this specific example, the GPU’s computational performance roughly tripled, while its I/O rate increased by only 1.8 times. Computational elements scale with Moore’s Law, while I/O scales with package size. As more cores are added to CPUs and GPUs, package pin counts fall ever farther behind.

Figure 2 illustrates both the multicore benefit and the multicore I/O challenge. The “one-core profile” shows the duration of I/O versus the duration of computation using a single core.

N cores can perform this hypothetical application’s computations up to N times faster, assuming the computation does not run up against Amdahl’s Law: if 10% of a task must be performed sequentially, its overall speedup can never exceed 10 times, no matter how many cores are applied. Finally, Figure 2’s lower bars illustrate the improvement from I/O compression and decompression, which speeds up the I/O-related part of the task.
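To make the arithmetic concrete, here is a minimal sketch of the Amdahl’s Law calculation for the 10% example above (the function name and core counts are illustrative, not taken from the article):

```python
# Amdahl's Law: speedup(N) = 1 / (s + (1 - s) / N), where s is the serial fraction
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

s = 0.10  # 10% of the task must run sequentially
for n in (1, 4, 16, 64, 1024):
    print(f"{n:5d} cores -> {amdahl_speedup(s, n):5.2f}x speedup")
# The speedup approaches the 1/s = 10x ceiling, no matter how many cores are added.
```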

Specific I/O Bottlenecks

CPU and GPU vendors regularly try to overcome I/O restrictions by increasing interface bandwidths and signaling rates. As of 2011, CPU DDR3 memories can sustain 18 Gbytes/s across 64 pins, and GPUs boast GDDR5 memories with more than 120 Gbytes/s across 384 pins. Similarly, optical networks are now deploying InfiniBand links at 40 Gbits/s, while 10-Gbit/s Ethernet, already widely deployed in the Internet infrastructure, is now moving toward 100 Gbits/s. With this much available bandwidth, how can there possibly be a multicore I/O problem?

The problem lies in the “pins per core” limitation. When single-core Intel CPUs first reached 3 GHz in 2004, an Intel socket had about 800 balls, with 500 for power and ground and 300 for I/O. In 2004, the DDR2 memory interface delivered about 5 Gbytes/s, with PCI Express and Ethernet providing an additional 4 Gbytes/s.

In contrast, today’s Intel Sandy Bridge processor has a 1155-pin package, where roughly 500 pins are dedicated for I/O. However, all four Sandy Bridge cores share dual DDR3 controllers that operate at 1.33 Gtransfers/s across 204 pins per dual-inline memory module (DIMM), providing a maximum transfer rate of 21 Gbytes/s.

A 16-lane general-purpose PCI Express interface provides 8-Gbyte/s interface bandwidth. Per core, the aggregate Sandy Bridge DDR3 data rate is just above 5 Gbytes/s, which is no faster than the per-core memory rate in 2004. And, Sandy Bridge’s PCI Express bandwidth has only doubled since 2004.

The I/O problem is similarly challenging for the latest generation of Fermi GPUs, whose 448 cores share a 16-lane PCI Express Gen2 interface of 8 Gbytes/s and a 384-pin GDDR5 memory interface of 120 Gbytes/s. On average, each Fermi core can exchange just 256 Mbytes/s with GDDR5 memory and less than 18 Mbytes/s per core across PCI Express.
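The per-core figures quoted above are simple division of aggregate bandwidth by core count; a quick sketch of that arithmetic, using the numbers cited in this article (the function name is illustrative):

```python
# Per-core bandwidth = aggregate interface bandwidth / number of cores sharing it
def per_core_bandwidth(total_gbytes_per_s: float, cores: int) -> float:
    return total_gbytes_per_s / cores

# Sandy Bridge: ~21 GB/s of DDR3 shared by 4 cores
print(f"Sandy Bridge DDR3 per core: {per_core_bandwidth(21, 4):.2f} GB/s")
# Fermi: 120 GB/s of GDDR5 and 8 GB/s of PCIe Gen2 shared by 448 cores
print(f"Fermi GDDR5 per core: {per_core_bandwidth(120, 448) * 1000:.0f} MB/s")
print(f"Fermi PCIe per core:  {per_core_bandwidth(8, 448) * 1000:.1f} MB/s")
```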

New memory and interface standards won’t make much difference. DDR4 memory with its 45 Gbytes/s won’t become mainstream until 2015, while PCI Express Gen3 will only double PCI Express Gen2 bandwidth starting in 2012. For both CPUs and GPUs, I/O rates continually fall behind the geometric rise in the number of cores per socket.

A New Idea: Numerical Compression

Now that we’ve described why improving multicore I/O is important, let’s consider how compression and decompression might be integrated into existing multicore designs to help reduce this bottleneck. Figure 3 illustrates several locations where compression (“C” blocks) and decompression (“D” blocks) could be added to a multicore CPU to reduce various I/O bottlenecks.

Figure 3 also shows a generic six-core CPU whose cores are connected to each other, and to a memory and peripheral interface subsystem, via a high-speed, on-chip ring. The front-side bus block includes two DDRx memory controllers (up to 18 Gbytes/s per DDRx DIMM) and at least 16 lanes of PCI Express Gen2 (up to 8 Gbytes/s). CPU compression and decompression could be added in at least three on-chip locations:

  • at each core-to-ring interface
  • to each DDRx memory interface controller
  • to the PCI Express interface controller

Conceptually, compression would be invoked on every eligible transaction as each core writes data to any other CPU, to off-chip memory, or to off-chip peripherals. Compressed data would be decompressed just before it is delivered to each CPU from other CPUs, from off-chip memory, or from off-chip peripherals.
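A minimal software sketch of this compress-on-write, decompress-on-read pattern follows. It uses zlib purely as a stand-in for the hardware “C” and “D” blocks, and the class and method names are illustrative:

```python
import zlib

class CompressedLink:
    """Toy model of an I/O link that compresses on write and decompresses on read."""
    def __init__(self):
        self._remote = {}                       # stands in for off-chip memory

    def write(self, address: int, payload: bytes) -> int:
        packed = zlib.compress(payload)         # "C" block at the sending side
        self._remote[address] = packed
        return len(packed)                      # bytes that actually cross the link

    def read(self, address: int) -> bytes:
        return zlib.decompress(self._remote[address])  # "D" block at the receiving side

link = CompressedLink()
data = b"0.0,1.5,3.0," * 1000
on_wire = link.write(0x1000, data)
assert link.read(0x1000) == data                # lossless round trip
print(f"{len(data)} bytes written, {on_wire} bytes on the wire")
```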

Let’s get specific about what kind of data could be compressed by the generic compress and decompress blocks. CPU and GPU vendors have already added dedicated compress and decompress accelerator intellectual property (IP) blocks to their existing chips. What’s unique about the “C” and “D” blocks in Figure 3?

First, existing compression accelerators are limited to compressing or decompressing consumer media. These existing blocks accelerate well-known compression algorithms such as MP3 for music, JPEG2000 for photos, and H.264 for video. However, these algorithms can only compress or decompress audio, image, and video files.

In contrast, the “C” and “D” blocks are designed to process any numerical data—integers and floating-point values. Since all CPUs and GPUs include dedicated hardware for numerical processing, it seems reasonable to conclude that numerical data makes up a significant part of CPU and GPU workloads.

In fact, given the increasing interest in cloud computing, the computational loads of modern CPUs and GPUs are tilting towards numerical processing applications. Hybrid CPU + GPU chips, such as Intel’s Knights Ferry and AMD’s Fusion, are aimed at high-end servers and high-performance computing (HPC), both of which perform lots of number-crunching.

Given Nvidia’s and AMD/ATI’s marketing efforts to grow the use of GPUs for low-cost, high-speed computation, the GPU market is increasingly focused on numerical computing for non-graphics applications. With GPU computing, Nvidia and AMD offer a “supercomputer on a desk” with Cray-like compute capabilities at the cost and power consumption of a desktop PC.

In summary, multicore numerical compression and decompression reduces I/O bottlenecks for a large and growing class of applications that include not only media, graphics, and video, but also HPC and cloud computing.

Existing Compression Algorithms

Consumer compression algorithms for speech, audio, images, and video have seen wide deployment and great success, but they only operate on integer datatypes. Also, such compression algorithms are signal-specific. You can’t use MP3 to compress an image, just as you can’t use H.264 to compress speech. Speech, audio, and video compression algorithms are too specific to be useful for general-purpose numerical compression.

In HPC and cloud computing, 32-bit single precision and 64-bit double precision floating-point values are the most common data types, so a numerical compressor for I/O would have to compress floating-point values. While there’s been some academic research about lossless compression of floats, the results are underwhelming (not much compression and slow execution speed), and these algorithms don’t support lossy compression.

Finally, while lossless compression tools such as WinZip, bzip2, and PKZIP are appropriate for byte-oriented character strings, such as those found in text documents and some spreadsheets, they usually achieve less than 10% compression on numerical data. ZIP-type algorithms also don’t offer lossy modes.
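This is easy to check: running a ZIP-style compressor over floating-point data typically saves very little. Here is a small sketch, assuming zlib as a representative byte-oriented lossless compressor and synthetic sensor-like data:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Sensor-like data: a smooth signal plus a little noise, stored as float32
signal = np.sin(np.linspace(0, 20, 100_000)) + 0.01 * rng.standard_normal(100_000)
raw = signal.astype(np.float32).tobytes()

packed = zlib.compress(raw, level=9)
saved = 100 * (1 - len(packed) / len(raw))
print(f"original {len(raw)} bytes, zlib {len(packed)} bytes, {saved:.1f}% saved")
# The noisy low-order mantissa bits look random to a byte-oriented dictionary
# coder, so the savings are small.
```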

Ideal Characteristics

If we could design the ideal numerical compressor, what features would it include? First, it would compress and decompress 8-bit, 16-bit, and 32-bit integers, as well as 32-bit and 64-bit floats. The compressor would offer lossless compression for those applications that simply can’t tolerate any mismatches between the original and the decompressed numerical stream.

But when higher compression is needed, the algorithm would offer a fixed-rate mode with a continuous range of user-selectable compression ratios and a fixed-quality mode with a continuous range of user-selectable dynamic ranges and/or maximum error choices. By the time most users adopt the numerical compression algorithm, it should already have been proven in many numerical processing applications, so guidance on a “good enough” compression ratio or compression quality would be readily available.

The algorithm should be available in software (for flexibility) and in hardware (for speed). Integrating the compression and decompression steps should be simple, perhaps by replacing existing file I/O function calls such as fwrite and fread with their compressed equivalents, or, for HPC, by adding message-passing interface (MPI) calls such as MPI_C_SEND (compress and send) and MPI_RECV_D (receive and decompress).
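As a rough illustration of how such drop-in calls might look, here is a sketch of hypothetical fwrite_c/fread_d wrappers. The function names and the length-prefix framing are assumptions, and zlib again stands in for the numerical compressor:

```python
import struct
import zlib

def fwrite_c(f, payload: bytes) -> None:
    """Compress and write one record (hypothetical compressed fwrite)."""
    packed = zlib.compress(payload)
    f.write(struct.pack("<I", len(packed)))   # length prefix so the reader knows how much follows
    f.write(packed)

def fread_d(f) -> bytes:
    """Read and decompress one record (hypothetical compressed fread)."""
    (length,) = struct.unpack("<I", f.read(4))
    return zlib.decompress(f.read(length))

with open("operands.bin", "wb") as f:
    fwrite_c(f, b"\x00" * 4096)
with open("operands.bin", "rb") as f:
    assert fread_d(f) == b"\x00" * 4096
```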

An algorithm called Prism FP already has most of these characteristics. The only feature that Prism FP doesn’t yet support is integration into CPUs and GPUs.

A Numerical Compressor

The Prism FP compression algorithm accepts integer and floating-point values at rates up to 3 GHz (Fig. 4). Its fixed-rate mode allows users to specify the desired compression ratio (such as 2:1 or 3.45:1) in increments of 0.05. Its fixed-quality mode allows users to specify the dynamic range to be maintained, in 0.5-dB increments, such as 83 dB or 142.5 dB.

Prism FP then generates compressed packets that consist of a small header (2 to 4 bytes) and the compressed numerical payload. Depending on the format of the numerical input, certain fields may be compressed independently, such as floating-point sign bits and mantissa bits.
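As a concrete illustration of field-wise handling (Prism FP’s actual packet format is not public, so this only shows how an IEEE-754 single-precision value splits into independent fields):

```python
import numpy as np

# Split IEEE-754 single-precision values into sign, exponent, and mantissa fields
x = np.array([3.14159, -0.001, 65504.0, 1e-8], dtype=np.float32)
bits = x.view(np.uint32)

sign     = bits >> 31            # 1 sign bit
exponent = (bits >> 23) & 0xFF   # 8 biased exponent bits
mantissa = bits & 0x7FFFFF       # 23 fraction bits

print(sign, exponent, mantissa, sep="\n")
```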

Also, Prism FP offers user-selectable complexity settings that trade off algorithm complexity and compression performance. On a 3-GHz single-core CPU, the lowest complexity level (fastest compression performance) operates at 300 Mfloats/s in software, while its highest complexity level (best compression quality) operates at 50 Mfloats/s in software. Software and hardware versions are totally compatible. In advanced process geometries, Prism FP compression and decompression will operate at 2 Gfloats/s or faster.

Prism FP achieves a range of compression results on a variety of integer and floating-point datatypes, including typical lossless compression ratios between 1.5:1 and 2:1 and typical lossy compression ratios between 2:1 and 4:1 (see the table).

Many of the integer-valued compression examples in the table represent samples from analog-to-digital converters (ADCs), where FPGA implementations compressed tens to tens of thousands of ADC channels in real time. Real-time compression for ADCs dates back to 2004, which demonstrates the feasibility of a hardware IP block for floating-point compression in future CPUs and GPUs.

Apply Lossy Compression For Numerical Data

Compression algorithms come in two fundamental flavors: lossless and lossy. Lossless compression of computer text files began in the 1970s, when Abraham Lempel and Jacob Ziv created the famous Lempel-Ziv lossless compression algorithms LZ77 and LZ78. Lossy compression of speech signals in telephony dates back to the 1960s.

Lossy compression may cause some trepidation in the hearts of prospective users because they fear something meaningful will be lost. But people have become more comfortable with lossy compression, simply because it works. Very few consumers ever modify the default lossy compression settings of their iPod and MP3 music players, or select a different lossy compression setting for their digital camera, because the resulting songs and photos have acceptable quality.

It wasn’t always that way, though. The professional audio community initially refused to use audio compression when it was introduced in the early 1990s, because its members saw themselves as the keepers of audio quality.

But professional audio started using compression by the mid-1990s because not only was the audio quality judged “good enough,” users also got hooked on the benefits of compressed audio, like faster downloads on slow 56-kbit/s dialup modems and the ability to store an individual’s entire music library in compressed format on a USB thumb drive.

Most consumers accept lossy compression of consumer media (audio, images, and video) because the convenience of handling compressed media is coupled with a “good enough for me” rating of sound and image quality.

Given the adoption cycle that compression always faces in every market it has entered, Prism compression for high-speed integer data (such as from ADCs) has faced some challenges. But just a few years after Prism compression’s introduction to the medical imaging, wireless infrastructure, and high-speed sensor markets, the comfort level of engineers with lossy compression for these markets is quite good.

Once users realize that Prism compression gives them a choice of compression ratios and control over signal quality, they are relieved because they control the rate-distortion tradeoff. Giving users control over quality makes them feel more comfortable with compression.

The final column of the table lists a quality metric whenever lossy compression is used. Every numerical application listed in the table has a number of ways to quantify signal quality. For instance, in 3G and 4G wireless, a metric called error vector magnitude (EVM) quantifies the quality of demodulated signals.

Wireless infrastructure manufacturers already measure EVM when they qualify their equipment. As wireless basestation manufacturers evaluated Prism lossy compression, they realized they didn’t need to develop any new signal quality metrics. They simply had to monitor the effects of Prism compression on their existing signal-quality metrics.

Having a methodology or framework that quantifies the effects of lossy compression is critical for the adoption of numerical compression in HPC and cloud computing. While HPC and cloud computing users may initially be nervous about using lossy compression, they should also remember the compelling benefit of lossy compression: a controllable way to reduce ever-worsening multicore I/O bottlenecks.

HPC and cloud computing users can use lossless compression mode and still gain a benefit, even if it’s just 1.5:1 compression. After all, DDR3 memory isn’t ever going to get 50% faster, but 1.5:1 lossless compression delivers operands 50% faster. And in many I/O-bound HPC and cloud computing applications, it’s clear how a 3:1 or 4:1 speed-up of disk drive or PCI Express or DDRx memory bandwidth improves “time to results.”
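The effective-bandwidth arithmetic behind that claim is simple multiplication; a small sketch, using interface figures quoted earlier in the article (the function name is illustrative):

```python
# Effective bandwidth after compression = physical bandwidth x compression ratio
def effective_bandwidth(physical_gbytes_per_s: float, ratio: float) -> float:
    return physical_gbytes_per_s * ratio

print(effective_bandwidth(21, 1.5))   # 21-GB/s DDR3 delivers operands like a 31.5-GB/s interface at 1.5:1
print(effective_bandwidth(8, 3.0))    # 8-GB/s PCIe Gen2 behaves like 24 GB/s at 3:1
```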

Because lossy modes guarantee the compression ratio, or guarantee a “not to exceed” error tolerance, HPC and cloud computing users can slowly ease into lossy compression, knowing that they control the rate-distortion tradeoff.

Compression Results

Memory-bound HPC applications in pattern recognition, financial modeling, and 3D graphics already utilize less than 20% of the available multicore MIPS. We now examine the effects of compression on k-means clustering and Black-Scholes models taken from the Nvidia CUDA software development kit (SDK). The third example, compression of 3D wireframe models whose vertices are specified in floating-point format, was obtained from a 3D graphics Web site.

The k-means clustering algorithm is a general-purpose way to group large datasets into a small number of clusters. It is used in applications as diverse as fingerprint analysis, market segmentation, and computer vision. Figure 5a illustrates typical results from a k-means clustering experiment, where thousands of 2D observations (the multi-colored dots in Figure 5a) are assigned to three 2D clusters (the three ovals in Figure 5a).

The upper k-means solution (three ovals) can be characterized by the (x, y) center of each oval and the oval’s x and y axis length. The upper graph in Figure 5a is the original three-cluster result, while the lower graph shows the three cluster decisions after compressing and decompressing all of the 2D 32-bit floating-point input observations by a factor of 2.5:1.

Even though the decompressed floats are not identical to the original floats, the k-means clustering algorithm still identified the same three oval locations (clusters), to six significant decimal digits, an error of less than 0.000169%. This level of error is insignificant compared to the four-digit accuracy with which the 2D measurements were taken. Thus for k-means clustering, compression at 2.5:1 had no effect on k-means cluster decisions.
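A small experiment in the same spirit can be run with off-the-shelf tools. The sketch below uses synthetic 2D data and mantissa truncation as a crude stand-in for lossy compression (it is not the Prism FP algorithm or the CUDA SDK benchmark), then compares the cluster centers found before and after:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-4.0, 6.0]], dtype=np.float32)
X = np.vstack([c + rng.standard_normal((2000, 2)) for c in true_centers]).astype(np.float32)

def truncate_mantissa(a: np.ndarray, keep_bits: int) -> np.ndarray:
    """Zero the low-order mantissa bits -- a crude stand-in for lossy compression."""
    drop = 23 - keep_bits
    mask = np.uint32(0xFFFFFFFF & ~((1 << drop) - 1))
    return (a.view(np.uint32) & mask).view(np.float32)

X_lossy = truncate_mantissa(X, keep_bits=10)   # discard 13 of the 23 mantissa bits

km_ref   = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
km_lossy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_lossy)

order = lambda c: c[np.argsort(c[:, 0])]       # align clusters by their x coordinate
print(order(km_ref.cluster_centers_))
print(order(km_lossy.cluster_centers_))        # the recovered centers match closely
```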

Black-Scholes methods are widely used to price commodities options for financial markets. Black-Scholes models are usually I/O-bound and thus would benefit from compression. At first glance, Black-Scholes seems an inappropriate candidate for lossy compression because it generates financial data. However, Black-Scholes modeling also reflects the imperfect, statistical nature of several input parameters, including stock volatility, interest rate risk, and liquidity risk.

To accurately model these imperfectly known input parameters, Black-Scholes implementations randomize the inputs of the financial model. This randomization makes lossy compression suitable for Black-Scholes modeling. The specific Black-Scholes model we examined priced 256 different options and predicted when (in months) each option should be exercised. After 2:1 compression, the error in Black-Scholes option pricing and exercise date varied by less than 0.004% when compared to the original results.
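For readers who want to reproduce the flavor of this experiment, here is a toy sketch (not the CUDA SDK benchmark): a standard Black-Scholes call-price formula evaluated on 256 random options, with a small relative perturbation of the spot prices standing in for the error a lossy compressor might introduce:

```python
import numpy as np
from scipy.stats import norm

def black_scholes_call(S, K, T, r, sigma):
    """European call price under the Black-Scholes model."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

rng = np.random.default_rng(2)
S = rng.uniform(20.0, 120.0, 256)          # spot prices
K = rng.uniform(20.0, 120.0, 256)          # strike prices
T = rng.uniform(0.1, 2.0, 256)             # years to expiry
r, sigma = 0.03, 0.3

reference = black_scholes_call(S, K, T, r, sigma)
# Perturb spot prices by ~0.01% to mimic lossy-compression error on the inputs
perturbed = black_scholes_call(S * (1 + 1e-4 * rng.standard_normal(256)), K, T, r, sigma)

print("max absolute price change:", np.max(np.abs(perturbed - reference)))
```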

Computer graphics often begin with wireframe meshes of 3D objects, which are then interpolated and textured in a graphics pipeline before being displayed on a computer monitor. The 3D wireframe coordinates are represented as 32-bit floating-point values. 3D graphics are sometimes limited by the vertex I/O rate at the input to the graphics pipeline, so compression could increase the vertex rate.

But if the vertices are compressed by 2.75:1, would the visual effect of such compression be visible in the final 3D object? The upper image in Figure 5b illustrates the original Space Shuttle rendered at a given orientation, while the lower image illustrates the rendered Space Shuttle at the same orientation, but after compressing its vertices by 2.75:1. The differences between the upper and lower images in Figure 5b certainly appear to be minor.

Our subjective visual comparison of Figure 5b’s images is corroborated by an objective metric called Structural Similarity (SSIM), which quantifies the difference between two images. SSIM scores have a maximum value of 1.0, and an SSIM score above 0.97 is often considered visually indistinguishable. The two images in Figure 5b have an SSIM score of 0.99. At 2.75:1 lossy compression, then, the throughput of 3D vertices in graphics pipelines can be increased without visibly affecting the quality of the rendered image.
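SSIM is available in common image-processing libraries; a brief sketch of how such a before/after comparison could be scored, using synthetic stand-in images rather than the actual Figure 5b renders:

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(3)
# Synthetic grayscale "render" and a slightly perturbed copy standing in for
# the original and vertex-compressed images of Figure 5b
original = np.cumsum(np.cumsum(rng.uniform(0, 1, (256, 256)), axis=0), axis=1)
original /= original.max()
perturbed = np.clip(original + 0.002 * rng.standard_normal(original.shape), 0.0, 1.0)

score = structural_similarity(original, perturbed, data_range=1.0)
print(f"SSIM = {score:.4f}")   # close to 1.0 for visually indistinguishable images
```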

Compression may soon be offered in multicore CPU and GPU hardware as a way to accelerate I/O for many numerical processing applications. Because of its flexible blend of lossless and lossy compression, as well as its efficiency as a low-complexity algorithm, Prism FP is a good candidate for hardware implementation. Adding high-speed compression into the numerical computing pipeline would give HPC and cloud computing users a way to improve the throughput of numerical processing applications while still generating the same results, perhaps faster than ever before.
