1. GPUs are just for rendering video and processing images.
As the name would imply, graphics processing units (GPUs) were originally designed to process and render images and video. Whether in the form of a discrete device or embedded alongside CPU cores on a single die, they are resident in every laptop, desktop, and gaming console. They are also used in their primary function within defense applications to render displays and compress captured video.
Approximately 15 years ago, engineers discovered that GPUs could make great math-acceleration engines, thanks to the introduction of programmable shaders with floating-point support. They realized this could be applied in situations where the data in motion moves much like images, and the required processing is well-suited for vector and matrix mathematical operations.
Subsequently, the term general-purpose graphics processing unit (GPGPU) was born, and languages such as CUDA and OpenCL were developed to abstract away the complexities of the underlying GPGPU architecture. As a result, developers could focus on the linear algebraic algorithms and digital-signal-processing functions.
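To make the abstraction concrete, here is a minimal CUDA sketch of a vector operation (SAXPY, y = a*x + y). It is an illustration only, not taken from any specific product; the function names `saxpy` and `saxpy_demo` and the launch configuration of 256 threads per block are assumptions chosen for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes one element: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs SAXPY on the GPU and checks the result on the host.
bool saxpy_demo(int n) {
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr;
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return false; }
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                          // assumed block size
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    saxpy<<<blocks, threads>>>(n, 3.0f, dx, dy);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    if (err != cudaSuccess) return false;
    for (float v : hy) {
        if (v != 5.0f) return false;            // 3 * 1 + 2
    }
    return true;
}

int main() {
    printf("saxpy %s\n", saxpy_demo(1 << 20) ? "ok" : "failed");
    return 0;
}
```

The developer writes the per-element math once; the mapping of a million elements onto the hardware is expressed only through the grid and block sizes.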
2. GPGPUs are harder to program than CPUs.
For large algorithms that require hundreds of GFLOPS or even TFLOPS of processing power, CPU programmers often have to aggregate many multicore processors in a cluster of silicon devices. In embedded systems, this often means that a complex software infrastructure must be developed to coordinate direct memory access (DMA) transfers of data between processors, inclusive of asynchronous methods of data handoff signaling and transfer completion.
In large multi-stage DSP algorithms, this might require scatter-gather techniques between the stages, equating to significant complexity. However, some of the larger GPGPUs can boast benchmarks of 2 to 3 TFLOPS, where the entire algorithm is able to reside in a single piece of silicon. Transfer between processing stages is immensely simplified as data is always resident and coherent on the single die. Even inter-GPGPU transfers are simplified by abstracted functions such as NVIDIA’s GPUDirect. Finally, GPGPU programming languages such as CUDA and OpenCL continue to improve in terms of usability and popularity.
3. GPGPUs are hard to consider because they don’t stick around.
Despite the unparalleled performance of GPGPUs, often measured in FLOPS per watt, many defense programs have dismissed GPGPUs in the past for logistical reasons. The overwhelming concern has been longevity of supply. This is understandable given that commercial gaming GPUs often have a shelf life of only 18 months to two years. However, both NVIDIA and AMD have bolstered their embedded groups and offer device availability figures that are getting closer to the ideal 7+ years in the defense industry.
4. GPGPU applications typically run out of memory capacity or I/O bandwidth before they run out of TFLOPS of compute.
GPGPUs such as the NVIDIA Maxwell GM204 on the Tesla M6 MXM module can deliver between two and three TFLOPS. Some algorithms, such as a fast convolution with a half-million-point fast Fourier transform (FFT), are able to make use of that kind of compute power. However, the problem is usually that data can’t reach all of those GPGPU cores fast enough. Perhaps there isn’t enough external GDDR5 memory attached to the GPGPU to buffer ingested sensor data, or perhaps there isn’t enough PCI Express bandwidth coming into and/or exiting the GPGPU. There are always exceptions, but in general, sensor processing applications often find themselves memory-bound or I/O-bound as opposed to being compute-bound.
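A back-of-the-envelope estimate shows why. The host-only sketch below compares the time to move one block of samples over the link with the time to compute a fast convolution on it. Every number in it is an assumption for illustration: complex float32 samples (8 bytes each), roughly 5·N·log2(N) floating-point operations per FFT, a forward FFT plus a pointwise multiply plus an inverse FFT, a link of about 15.75 GB/s (a typical theoretical figure for PCIe Gen3 x16), and 2.5 TFLOPS of peak compute. The names `Estimate` and `estimate` are hypothetical.

```cuda
#include <cmath>
#include <cstdio>

struct Estimate {
    double transfer_ms;  // time to move the input block over the link
    double compute_ms;   // time to compute the fast convolution at peak rate
};

// Rough cost model for an N-point FFT-based fast convolution.
Estimate estimate(double n_points, double link_gbytes_per_s, double peak_tflops) {
    double bytes_in = 8.0 * n_points;                          // complex float32
    double fft_flops = 5.0 * n_points * std::log2(n_points);   // common estimate
    double flops = 2.0 * fft_flops + 6.0 * n_points;           // FFT + multiply + IFFT
    Estimate e;
    e.transfer_ms = bytes_in / (link_gbytes_per_s * 1e9) * 1e3;
    e.compute_ms = flops / (peak_tflops * 1e12) * 1e3;
    return e;
}

int main() {
    Estimate e = estimate(524288.0, 15.75, 2.5);  // half-million-point FFT
    printf("transfer: %.3f ms, compute: %.3f ms\n", e.transfer_ms, e.compute_ms);
    return 0;
}
```

Under these assumptions, the input transfer takes about 0.27 ms while the arithmetic takes about 0.04 ms at peak rate, so the cores would sit idle most of the time unless the data stays resident on the GPGPU across several processing stages.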
5. Cooling large GPGPUs is an insurmountable problem within the operational environment of defense applications.
Fortunately, as is the case with CPUs and FPGAs, GPGPUs come in different sizes with scaling thermal-design-power (TDP) figures to match. With that said, certain defense applications demand the largest GPGPUs possible on the VITA 46 VPX form factor.
Major enhancements have been made to more traditional board-level cooling techniques, such as the VITA 48.2 design guidelines for conduction cooling. For instance, one recent COTS product places two NVIDIA Tesla M6 MXMs, 75 W each, on a conduction-cooled 6U VPX carrier. That may represent the upper limit with conduction cooling. But cooling techniques such as VITA 48.5 Air Flow-Through and liquid-cooling techniques will allow for even higher thermal-dissipation ceilings in VPX, in addition to the incorporation of GPGPUs with even higher TDP numbers.
6. GPGPUs are strictly constrained to streaming processing.
Because GPGPUs were originally designed for graphics, they were relegated to applications where the data was streaming. In other words, there was no way to have shared or static data, and each GPGPU core could only “tap into the stream”: read data from an ingress, perform some function in real time, and write to an egress.
GPGPUs are now being built with multi-level caches and branching functionality. However, the clear GPGPU performance strength still lies in processing large data sets that are highly parallel and have minimal dependency between data elements in time. Therefore, branching should be kept outside of inner loops.
7. GPUs can’t handle recursion, irregular loop structures, or decision branches.
With dynamic parallelism, GPU kernels can launch other kernels, optionally synchronize on kernel completion, perform device memory management, and create and use streams and events. The launching kernel is the “parent,” and the new grid it launches is the “child.” Child kernels may themselves launch work, creating a “nested” execution hierarchy. All child launches must complete before the parent is complete. These advances allow hierarchical algorithms to be written, where the data from a previous generation is used to calculate the partition of work on the next lower level.
GPUs support the standard control-flow constructs such as “if,” “if-else,” and “for.” However, threads execute in warps, which are groups of 32 threads that issue one common instruction at a time. When threads in the same warp take different paths, those paths are executed one after another. Because of this, the best efficiency is achieved when all of the threads in a given warp take the same execution path.
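The difference can be sketched with two kernels that contain the same branch but choose the path differently. This is an illustrative example, not a benchmark; the kernel and helper names are hypothetical, and for branches this small a compiler may replace the branch with predicated instructions, which hides much of the cost.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ float path_a(float v) { return v * 2.0f + 1.0f; }
__device__ float path_b(float v) { return v * 0.5f - 1.0f; }

// Divergent: even and odd lanes of the same warp take different paths,
// so the warp has to execute both paths in turn.
__global__ void divergent_branch(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) d[i] = path_a(d[i]);
    else                      d[i] = path_b(d[i]);
}

// Uniform: the condition depends on the warp index, so all 32 threads
// of a warp agree and only one path is executed per warp.
__global__ void warp_uniform_branch(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) d[i] = path_a(d[i]);
    else                                   d[i] = path_b(d[i]);
}

// Hypothetical helper: fills a buffer with 4.0f, runs one kernel, copies back.
bool run_kernel(void (*kernel)(float *, int), std::vector<float> &h) {
    int n = static_cast<int>(h.size());
    size_t bytes = n * sizeof(float);
    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    for (float &v : h) v = 4.0f;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}

bool branch_demo() {
    std::vector<float> h(1024);
    // path_a(4) = 9, path_b(4) = 1
    if (!run_kernel(divergent_branch, h)) return false;
    if (h[0] != 9.0f || h[1] != 1.0f) return false;    // alternates per thread
    if (!run_kernel(warp_uniform_branch, h)) return false;
    return h[0] == 9.0f && h[31] == 9.0f && h[32] == 1.0f;  // alternates per warp
}

int main() {
    printf("branch demo %s\n", branch_demo() ? "ok" : "failed");
    return 0;
}
```

Note that the two kernels assign the paths to different elements, so their outputs differ; the point is only where the branch boundary falls relative to the 32-thread warp.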
8. GPU kernels can only be launched from the CPU.
Originally, the CPU performed a sequence of kernel launches, and each kernel needed enough parallelism to efficiently offset the calling and memory-passing overhead. Now, child kernels can be launched from within a parent CUDA kernel. The parent kernel can use the output produced by the child and optionally synchronize on its completion, all without CPU involvement. This is simple to adopt because a kernel launch from the GPU uses the same syntax as a launch from the CPU.
CPU-controlled programs are limited by a single point of control that can only run a limited number of threads at a given time, and the CPU is consumed with controlling the kernels on the GPUs. By transferring the top-level loops to the GPU, thousands of independent tasks can be run, which releases the CPU for other work.
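A minimal sketch of a device-side launch follows. It assumes a GPU with compute capability 3.5 or higher and a build with relocatable device code, for example `nvcc -rdc=true demo.cu -lcudadevrt` (the file name is hypothetical). The kernel names and the work split are invented for illustration. The parent here does not wait on its children explicitly, since explicit device-side synchronization has changed across CUDA versions; it relies on the rule that a parent grid is not complete until its children are.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child grid: each thread fills one element of the segment it was given.
__global__ void child(int *out, int offset, int count, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) out[offset + i] = value;
}

// Parent grid: each thread launches a child grid for its own segment,
// using the same <<<...>>> syntax as a launch from the CPU.
__global__ void parent(int *out, int segments, int seg_len) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < segments) {
        child<<<(seg_len + 63) / 64, 64>>>(out, s * seg_len, seg_len, s);
    }
    // No explicit wait: the parent grid completes only after all children do.
}

// Hypothetical helper: launches the hierarchy and checks every segment.
bool nested_launch_demo(int segments, int seg_len) {
    int total = segments * seg_len;
    size_t bytes = total * sizeof(int);
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;

    parent<<<1, segments>>>(d_out, segments, seg_len);  // one CPU-side launch

    std::vector<int> h(total, -1);
    cudaError_t err = cudaMemcpy(h.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int s = 0; s < segments; ++s)
        for (int j = 0; j < seg_len; ++j)
            if (h[s * seg_len + j] != s) return false;
    return true;
}

int main() {
    printf("nested launch %s\n", nested_launch_demo(8, 100) ? "ok" : "failed");
    return 0;
}
```

The CPU issues a single launch and one copy; the decision about how many child grids to start is made on the GPU.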
9. Memory must be created by the CPU and then passed to the GPU to move data and receive results.
Unified Virtual Memory (UVM) provides a single address space for CPU and GPU memory, which eliminates system-memory copy overhead. This unified memory space can also be shared by multiple GPGPUs. By allowing the physical memory location to be determined from a single pointer variable, UVM lets libraries simplify their interfaces. Before UVM, separate commands were needed for each type of memory copy: CPU to GPU, GPU to CPU, and so on. Now one function call can handle all cases, and the user is freed from specifying the source and destination memory spaces. Zero-copy memory is truly achieved with UVM.
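In the CUDA runtime, these ideas surface as managed allocations (`cudaMallocManaged`) and the `cudaMemcpyDefault` copy kind, which lets the runtime infer the direction from the pointers. The sketch below is a minimal illustration that assumes a 64-bit platform with unified addressing; the kernel and helper names are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add_one(int *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

// Hypothetical helper: one pointer is used by both the CPU and the GPU.
bool unified_demo(int n) {
    size_t bytes = n * sizeof(int);
    int *m = nullptr;
    if (cudaMallocManaged(&m, bytes) != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) m[i] = i;        // CPU writes directly

    add_one<<<(n + 255) / 256, 256>>>(m, n);     // GPU uses the same pointer
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(m); return false; }

    // cudaMemcpyDefault: the runtime works out source and destination spaces.
    std::vector<int> copy(n);
    cudaError_t err = cudaMemcpy(copy.data(), m, bytes, cudaMemcpyDefault);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        if (m[i] != i + 1 || copy[i] != i + 1) ok = false;  // CPU reads directly
    }
    cudaFree(m);
    return ok;
}

int main() {
    printf("unified memory %s\n", unified_demo(1 << 16) ? "ok" : "failed");
    return 0;
}
```

The call to `cudaDeviceSynchronize` matters: the CPU should not touch the managed buffer again until the kernel that uses it has finished.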
10. GPUs need a CPU to handle data ingest and to pass data between GPUs.
Using GPUDirect, multiple GPUs, third-party network adapters, solid-state drives (SSDs), and other devices can directly read and write GPU memory. GPUDirect Peer-to-Peer enables high-speed DMA transfers between the memories of two GPUs on the same PCIe or NVLink bus. A GPU kernel can also directly access (load/store) the memory of other GPUs, with data being cached in the target GPU.
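A peer-to-peer copy can be sketched with the CUDA runtime calls `cudaDeviceCanAccessPeer`, `cudaDeviceEnablePeerAccess`, and `cudaMemcpyPeer`. The example assumes devices 0 and 1 are the pair of interest and simply skips the copy on systems with fewer than two GPUs or without peer access; the helper name and return convention are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper. Returns 1 if a peer copy succeeded, 0 if it was
// skipped (fewer than two GPUs or no peer access), and -1 on error.
int peer_copy_demo(size_t bytes) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return 0;

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) return 0;

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // device 0 may access device 1
    if (cudaMalloc(&buf0, bytes) != cudaSuccess) return -1;
    cudaMemset(buf0, 0x5A, bytes);               // recognizable pattern on GPU 0

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);            // device 1 may access device 0
    if (cudaMalloc(&buf1, bytes) != cudaSuccess) {
        cudaSetDevice(0);
        cudaFree(buf0);
        return -1;
    }

    // Copy GPU 0 -> GPU 1 without staging the data in CPU memory.
    cudaError_t err = cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    unsigned char probe = 0;
    if (err == cudaSuccess)
        err = cudaMemcpy(&probe, buf1, 1, cudaMemcpyDeviceToHost);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    if (err != cudaSuccess) return -1;
    return probe == 0x5A ? 1 : -1;
}

int main() {
    int r = peer_copy_demo(1 << 20);
    printf("peer copy: %s\n", r == 1 ? "ok" : (r == 0 ? "skipped" : "failed"));
    return 0;
}
```

Whether the transfer actually avoids host memory depends on the system topology, which is why the capability check comes first.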
GPUDirect RDMA is an application programming interface (API) between the InfiniBand CORE and the GPUs. It gives a Mellanox Host Channel Adapter (HCA) access to read and write the GPU’s memory buffers, so GPUs can ingest data directly from an RDMA interconnect without the need to first copy data to the CPU memory.
Using only GPUDirect RDMA, the CPU is still driving the computing and communications. Adding GPUDirect Async, though, removes the CPU from the critical path. The CPU prepares and queues the compute and communication tasks on the GPU. The GPU then signals the HCA, and the HCA directly accesses the GPU’s memory.
11. Using MPI with GPUs is slow.
With a regular MPI, only pointers to host memory can be used. When combining MPI and CUDA, GPU buffers need to be sent, and that requires staging of GPU buffers in CPU memory. With a CUDA-aware MPI library, the GPU buffers can be directly passed to MPI, which enables message transfers to be pipelined and acceleration techniques such as GPUDirect to be transparent to the user.
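With a CUDA-aware MPI build, the device pointer is handed straight to the standard MPI calls. The sketch below assumes such a build (for example, an MPI library compiled with CUDA support) and at least two ranks; with a regular MPI library the same calls on a device pointer would typically fail. The helper name `exchange_demo` is hypothetical.

```cuda
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical helper: rank 0 sends a GPU buffer to rank 1.
// Assumes MPI is already initialized. Returns 0 on success or when skipped.
int exchange_demo(int n) {
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) return 0;                      // nothing to exchange

    float *dbuf = nullptr;
    if (cudaMalloc(&dbuf, n * sizeof(float)) != cudaSuccess) return -1;

    int rc = MPI_SUCCESS;
    if (rank == 0) {
        cudaMemset(dbuf, 0, n * sizeof(float));
        cudaDeviceSynchronize();                 // buffer is ready before sending
        // Device pointer passed directly: no staging copy in CPU memory.
        rc = MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        rc = MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
    }
    cudaFree(dbuf);
    return rc == MPI_SUCCESS ? 0 : -1;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int result = exchange_demo(1 << 20);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("exchange %s\n", result == 0 ? "ok" : "failed");
    MPI_Finalize();
    return 0;
}
```

Without CUDA-aware MPI, each side would need an extra host buffer and a `cudaMemcpy` before the send and after the receive, which is the staging step described above.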
CUDA’s Multi-Process Service (MPS) is a feature that allows multiple CUDA processes to share a single GPU context. Each process receives some subset of the available connections to that GPU. MPS allows kernel and memory operations from different processes to overlap on the GPU, which helps legacy MPI applications achieve maximum utilization. No application modifications are needed; just start the MPS daemon. By setting the GPU in exclusive mode, MPS can run in a multi-GPU system.