Programming The CUDA Architecture: A Look At GPU Computing

Graphics processing units (GPUs) were originally designed to perform the highly parallel computations required for graphics rendering. But over the last couple of years, they’ve proven to be powerful computing workhorses across more than just graphics applications. Designed with more resources devoted to data processing rather than flow control and data caching, GPUs can be leveraged to significantly accelerate portions of codes traditionally run on CPUs, ranging from science and engineering applications to financial modeling.

CUDA is Nvidia’s parallel computing architecture, which manages computation on the GPU in a way that provides a simple interface for the programmer. The CUDA architecture can be programmed in industry-standard C with a few extensions. One takes a traditional C code that runs on the host (CPU) and offloads dataparallel sections of the code, or kernels, to the device (GPU). The CUDA architecture also supports a device-level application programming interface (API) and supports the upcoming OpenCL and DirectX 11 Compute standards.

This article presents an example that introduces various concepts in programming the CUDA architecture with standard C. Only the basic ideas are discussed here. For full coverage of the CUDA architecture, consult the CUDA Programming Guide. The process of programming the CUDA architecture follows a relatively simple path for users familiar with the C programming language. The GPU is viewed as a compute device capable of executing a large number of threads in parallel, and it operates as a coprocessor to the CPU, or host. Functions executed on the GPU are referred to as compute kernels.

Code to be executed on both the host and device can be contained in a single source file with a “.cu” extension. The code is compiled using “nvcc,” the CUDA C-compiler. When the resultant binary file is run, the C Runtime for CUDA coordinates the execution of host and device components.

This article will describe source code that demonstrates:

• Allocation of memory on the GPU device and moving data to/from that memory.
• A kernel—a function callable from the host and executed on the device by many threads in parallel.
• Calling the kernel from the host using an execution configuration—a means of specifying the number and grouping of parallel threads used to execute the kernel.
• Performing parallel computations using the built-in variables threadIdx and block- Idx to subdivide the problem domain.

In our example, CPU and GPU code is used to solve the Poisson equation by employing a first-order Jacobi finite difference scheme. The code simply iterates over every element in a 2D array, computing the value of the element from the four neighboring values and the source term at that point. This is done iteratively, gradually relaxing the array toward a convergent solution of the Poisson equation:

?u = f

discritized to:

u_x,y = ¼( u_x-1,y + u_x+1,y + u_x,y-1 + u_x,y+1 + h² × f_x,y)

where u is the value being computed, f is the source term (which is known), and h is the grid spacing (assumed to be equal in X and Y).

This type of finite difference computation is used in many more sophisticated simulations, including multigrid, thermal analysis, RF modeling, and transport simulation.

In the CPU example of this computation (Fig. 1), every element in the 2D array must be looped over in a serial manner, one element at a time.

Although there are more optimal methods for iterating over a 2D array with a CPU using cache-blocking and vectorization, this basic implementation facilitates porting our simple application to CUDA (see “CPU Code For Jacobi Relaxation”).

On a CUDA-enabled GPU, many processors can be executed at once, which enables developers to conveniently take advantage of a data parallel-programming methodology. On the Nvidia Tesla C1060 hardware, each of 30 streaming multiprocessors (SMs) contains eight single- precision processor cores and one double-precision core (Fig. 2). In total, there are 240 singleprecision processing cores and 30 double-precision cores on-chip. Each SM also has a small amount of on-chip shared memory, as well as specialized hardware for caching data formatted as textures or cast as constants.

The CUDA architecture is exposed to software applications as a set of grids, blocks, and threads (Fig. 3). Grids are of thread blocks that get executed on the device. Thread blocks are groups of threads that execute on individual SMs. Each thread has a unique ID and effectively maps to a single processor of an SM.

On the Tesla C1060, each thread has access to its own local register set, 16k of shared memory, and 4 Gbytes of device memory. All threads can read/write to any location in device memory and read/ write to any location in their block’s shared memory. Shared memory has much faster access latency than device memory, so it’s often used as a scratchpad to store intermediate results for computations, for buffering reads and writes to achieve optimal memory-access patterns, and for providing inter-thread communication within a block.

To port the Jacobi relaxation code to CUDA, we need to re-formulate the relaxation computation so that it can be computed in parallel. In the CPU code, we had to iterate an index over every element in serial. In CUDA, we assign a thread to each element in the 2D array. During each relaxation, each thread will take the neighboring four elements and the local value of f as input and compute one element as output, writing that element back to the array (Fig. 4).

Mapping this parallelism to the CUDA software architecture requires subdividing the array into 2D blocks, which are composed of 16x16 elements (with corresponding 16x16 threads). These will automatically be queued up and executed on the SMs. Blocks are overlapped to enable communication between them, creating a ‘halo’ of boundary source elements (blue in Fig. 5) loaded along with the elements to be computed (green in Fig. 5).

Continue to page 2

For a Jacobi relaxation, results shouldn’t be written back to the same buffer being used as input during the parallel computation. Therefore, all input values for the 16x16 block are first loaded into shared memory (which is local to the SM), then block relaxation is performed, and finally the results are written back to the array in device memory. This also boosts performance by reducing the number of device memory fetches, as seen in the CUDA implementation (see “CUDA Kernel Code For Jacobi Relaxation”).

In the CUDA implementation, the CUDA kernel function is identified by the type qualifier __global__. The built-in variables threadIdx and blockIdx provide the indices for the local thread and block. They also allow a unique index to be computed for each 2D array element. Threads within the same block have access to 16k of on-chip common shared memory (declared with the __shared__ type qualifier), which has much lower latency than device memory.

The __syncthreads() command provides a barrier such that all threads within a block synchronize before execution continues. In this example, the block of data is loaded from device memory to shared memory, then __syncthreads() is called to ensure that the data is all loaded before performing the relaxation computation and writing back directly to device memory. Using shared memory reduces the number of device memory reads by a factor of four, which is important because reads from device memory can experience several hundred cycles of latency compared to just a few cycles of latency for reads from shared memory.

Also, memory coalescing can be exploited. Here, adjacent threads cooperate to load a contiguous segment of up to 128 bytes of device memory in one read operation.

Now that a kernel is written in C for CUDA, it must be called from the host CPU. First, allocate memory on the GPU device and copy the array data from the host RAM to the GPU device memory. This is done using the cudaMalloc(…) and cudaMemcpy(…) commands (see “CUDA Setup Code For Jacobi Relaxation”). Note that cudaMalloc(…) returns a pointer to an address in device memory. It also can be used in kernel calls that will access that device memory. However, it can’t be accessed from the host code, because the address it points to is not in the host system RAM, but in device RAM on the GPU.

Via the PCI-e bus, cudaMemcpy(…) initiates a DMA transfer from host RAM to device RAM, with the last parameter controlling direction of the transfer between host and device memory.

Next comes the setup of an execution configuration. Algorithmically, this involves dividing the problem into blocks of threads so that it can be computed in parallel. In this example, we subdivide the 2D array into blocks that are 16x16 elements in size (with a two-element overlap boundary shared between blocks). To set this up in CUDA, use dimGrid (a variable of type dim3) to specify how many blocks there are in X and Y and dimBlock to specify how many threads are in a block (one thread per array element).

The kernel with the execution configuration is called by using the <<>> qualifier between the function name and parameter list. Because kernel calls are non-blocking, the CPU could run other code while the kernel computation completes.

The computation results are then moved from device to host memory using the cudaMemcpy(…) command. From here, it can be output to a file or formatted for visualization. Once completed, the device memory is freed using the cudaFree(…) command.

In a real application, the kernel gets called multiple times to iterate on the array in device memory, converging toward the solution much faster than if we had to move the memory back and forth between device and host. The PCI Express bandwidth limits host-to-device transfers to about 5 Gbytes/s, while the device memory’s bandwidth is on the order of 100 Gbytes/s. Thus, it’s advantageous to keep computations within the 4 Gbytes of device memory on the Tesla C1060.

Although this is a simple example, it shows how to rewrite a typical serial computation to execute in parallel and demonstrates an implementation on the GPU using C for CUDA. With minimal coding effort, it went from having one CPU core serially processing each element of an array to having a GPU use 240 cores to process the array elements in parallel.

In addition to having more cores to accomplish the computation, the GPU has higher memory bandwidth than a CPU. Thus, a significant performance increase is expected.

Figure 6 shows how the performance scales on a Tesla C1060 GPU with the code in this example. For larger arrays, the single-precision version can process nearly 5 billion array elements per second and the float-precision version about 2.5 billion elements per second.

Although there’s eight times the number of single-precision processing cores, performance only increases twofold from double to single precision. This is due to the application being memory bandwidth bound and double-precision data being twice the size of single precision. This simple example of Jacobi relaxation of the Poisson equation demonstrates a significant performance increase over CPUbased implementations. It also shows how one can achieve at least an order of magnitude boost in speed over a CPU for many similar applications.

For more computationally intensive applications, such as molecular dynamics and n-body problems, even greater acceleration occurs because memory bandwidth is no longer the bottleneck and the 240 cores can run at full occupancy. In these cases, 20X jumps in speed are possible and have been demonstrated in academic and commercial codes such as Nanoscale Molecular Dynamics (NAMD) from the University of Illinois at Urbana-Champaign.

Computations such as finite difference methods can also scale to multiple GPUs and multiple nodes. It’s possible to configure a workstation with up to four Tesla cards, giving it a total of 960 cores, 16 Gbytes of device memory, and almost 4 TFLOPS of peak computing power.