NVidia's GPUs adorn everything from desktops to supercomputers and HPC (high-performance computing) clusters. GPUs in general have changed the way developers view parallel computing. NVidia's CUDA (see Programming The CUDA Architecture: A Look At GPU Computing) has allowed developers to improve computational performance by a factor of 10x to 100x compared to multicore CPU architectures. Part of this advantage comes from the sheer number of cores a GPU contains compared to a CPU. Still, GPGPUs (general-purpose graphics processing units) cannot improve all parallel applications, but a very large number fit under the GPU umbrella.
NVidia's prior Fermi architecture moved GPUs into the GPGPU space in a big way. The new Kepler architecture (Fig. 1) improves on Fermi with a 3x performance-per-watt increase. It can deliver 1 petaflops of performance using 10 racks and consuming only 400 kW. Kepler also incorporates a range of architectural changes designed to improve the computational chores running on it.
Kepler employs a new SMX (streaming multiprocessor) processor architecture (Fig. 2). Fermi's SM architecture has 32 cores/block while Kepler has 192 cores/block. They have about the same amount of control logic, so the control logic overhead for Kepler is significantly lower. The latest chip has half a dozen memory controllers and PCI Express Gen 3 support.
Programmers gain access to the GPGPU capabilities using tools such as NVidia's CUDA and the Khronos Group's OpenCL. One of the challenges with these frameworks is managing memory. Usually the CPU needs to be involved with multi-chip GPU solutions. Kepler reduces the need for CPU interaction with the ability to move data directly between GPUs (Fig. 3).
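A rough sketch of what this looks like from the CUDA runtime API: once peer access is enabled between two GPUs, a `cudaMemcpyPeer` call moves data directly from one device's memory to the other's without staging it through host memory. The device IDs and buffer size here are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

int main(void) {
    // Check whether GPU 0 can directly address GPU 1's memory.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // flags argument must be 0
    }

    const size_t bytes = 1 << 20;           // 1 MB, arbitrary for this sketch
    float *src, *dst;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // Direct device-to-device copy: with peer access enabled, the data
    // travels over PCI Express without a detour through CPU memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    return 0;
}
```

Without peer access, the same copy would have to be broken into a device-to-host transfer followed by a host-to-device transfer, with the CPU coordinating both.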
Multiple GPU environments are cropping up more often. This is especially true for environments like HPC clusters that are connected via Ethernet adapters. Kepler will work with many, but not all, Ethernet interfaces connected to the PCI Express interface without the need to interact with the CPU.
Reducing CPU overhead is significant because the CPU is often handling other aspects of a parallel application that cannot be handled efficiently by the GPUs. Likewise, data flow takes longer with CPU/GPU interaction because of synchronization issues. Parallel applications are designed around code kernels that work on arrays of information that must be local to the GPU. In typical applications, the output of one kernel is utilized by another kernel. The ability for the GPU to move the data from one GPU to another allows the kernel code to handle this part of the transaction. It also means the subsequent kernels can be local or on any GPU within a cluster. This allows better distribution of compute capabilities.
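The single-GPU version of this pattern is worth seeing concretely. In the sketch below (kernel names and sizes are my own illustration), the intermediate buffer `d_tmp` stays in device memory the whole time: the second kernel consumes the first kernel's output with no copy back to the host.

```cuda
__global__ void square(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host side: chain two kernels. The intermediate array d_tmp never
// leaves the GPU, so the CPU is only involved in launching the work.
void pipeline(const float *d_in, float *d_tmp, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    square<<<blocks, threads>>>(d_in, d_tmp, n);  // kernel 1 produces d_tmp
    scale<<<blocks, threads>>>(d_tmp, 2.0f, n);   // kernel 2 consumes it
}
```

What Kepler adds is the ability to extend this chain across GPUs: the hand-off from `square` to `scale` could target a buffer on a different device, moved via a peer copy rather than through the CPU.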
Another way Kepler reduces CPU overhead is with support for dynamic parallelism (Fig. 4). In Fermi, a CPU had to initiate all kernel jobs on the GPU. The CPU still initiates a kernel to get the ball rolling, but now active jobs on Kepler can initiate other jobs. With Fermi, a kernel would have to indicate to the portion of the application running on the CPU that additional work would be necessary. The CPU application would then schedule the new workload. This meant that the CPU would have to respond and then the GPU would have to pick up the new jobs. In addition to the handshaking overhead, there was the potential for the GPU to sit idle.
Dynamic parallelism improves overall throughput and reduces overhead. It can be combined with the distribution of data to other GPUs, although it is not possible to initiate jobs across GPUs at this point. Still, being able to initiate jobs on the same GPU makes a lot of sense since the results from one job are normally needed by the next; otherwise, the jobs would be independent and could be scheduled by the CPU. This approach means jobs can be initiated based on the data.
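In CUDA terms, dynamic parallelism means a kernel can use the familiar `<<<...>>>` launch syntax itself. A minimal sketch, assuming a device that supports the feature (compute capability 3.5 or higher, compiled with relocatable device code enabled); the kernels and the data-dependent condition are invented for illustration:

```cuda
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// The parent kernel decides, based on the data it sees, whether more
// work is needed -- and launches it directly, without signaling the CPU
// and waiting for the host application to schedule a follow-on kernel.
__global__ void parent(float *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        if (data[0] > 0.0f)                         // data-dependent decision
            child<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```

Note that the initial Tesla K10 hardware described below does not support this feature; it arrives with the K20.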
Keeping job queues full and GPUs running fully loaded is typically the way to get the best performance out of a GPU system. Another feature that the Kepler architecture supports, and that will appear in subsequent chips, is Hyper-Q (Fig. 5). Hyper-Q allows multiple CPUs to feed job queues that are in turn used to feed the GPU.
Fermi does support 16-way concurrency, but eventually all tasks are funneled through a single hardware queue. Kepler supports 32 simultaneous tasks, each with its own hardware queue. The Kepler blocks have access to all the queues, thereby reducing GPU idle time. The queues support CUDA and MPI (message passing interface) processes.
In theory, Kepler's Hyper-Q eliminates false data dependencies that could occur due to Fermi's single-queue architecture. This allows some applications to run on Kepler and see up to a 32x performance increase without reprogramming, simply because the jobs can now operate in parallel on a single GPU.
Hyper-Q is needed because the CPU/GPU combination these days is really a multicore CPU/multicore GPU combination. The CPU can be running multiple tasks that will initiate jobs on the GPU, and the GPU is running a large number of relatively or totally independent jobs. Hyper-Q simplifies the programmer's job because it eliminates the bottleneck of feeding jobs to the GPU.
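From the programmer's side, the work is expressed as CUDA streams; Hyper-Q changes what the hardware does with them. A sketch, with stream count and kernel invented for illustration: each stream is an independent queue of work, and on a Hyper-Q-capable Kepler part the streams can map to separate hardware queues and truly run concurrently, where Fermi would serialize them through its single queue.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main(void) {
    const int nStreams = 8;
    const int n = 1 << 16;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // Launch independent work into each stream. The source code is
        // identical on Fermi and Kepler; only the hardware queueing differs.
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

This is why existing multi-stream applications can speed up without reprogramming: the streams were always declared independent; Kepler's hardware can finally treat them that way.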
Overall, the new Kepler architecture is designed to reduce overhead and idle time and simplify GPU application development. The architecture also adds new instructions, and the memory architecture has been improved as well. It supports faster atomic processing, and the memory system employs a low-overhead ECC architecture.
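The faster atomics matter for common patterns like the histogram sketch below, where many threads contend for the same global-memory counters; the kernel here is my own minimal example, not from NVidia's material.

```cuda
// Many threads increment shared counters concurrently. Each atomicAdd
// on global memory is a serialization point, so the throughput of
// global atomics directly bounds this kernel's performance -- the
// operation Kepler's memory system speeds up.
__global__ void histogram(const unsigned char *in, unsigned int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);
}
```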
NVidia is delivering the Tesla K10 with the GK104 Kepler chip. It supports single precision floating point and fixed point data and is based on the SMX block of cores. The Tesla K10 has two GPUs with a total of 3072 cores.
The Tesla K20 will follow in the near future with support for Hyper-Q and dynamic parallelism. It supports double-precision floating point and delivers 3x the performance of the Fermi-based solutions.
NVidia had a number of announcements related to Kepler that I will write about later, but one is of note. This is the contribution of the CUDA compiler support to the open source LLVM compiler infrastructure project. This allows mixing of programming languages as well as allowing CUDA to be used on a range of targets, not just NVidia hardware. It puts CUDA on a more level playing field with OpenCL.