Battle Of The Supercomputing Nodes

This year's supercomputing conference, SC12, delivered some of the hottest new technology that you can actually buy. Offerings from Intel, NVidia, and AMD target computing nodes in high performance computing (HPC) applications. These compute platforms include GPGPUs and are typically paired with one or more multicore CPUs that, at a minimum, handle network traffic. They are designed to deliver massive double precision floating point throughput in minimal time, assuming programmers can tame the data and the compute nodes.

Intel MIC (Many Integrated Core)

Intel's Many Integrated Core architecture has been in the works for a while, but getting one's hands on the hardware was a bit difficult (see Get Ready For Some Hard Work With Multicore Programming). The new Xeon Phi (Fig. 1) changes that. It is based on Intel's 22nm 3-D Tri-gate technology (see Moore's Law Continues With 22nm 3D Transistors) that was initially available with Intel's Ivy Bridge processors (see Understanding The Ivy Bridge Architecture).

Figure 1. The Xeon Phi chip implements Intel's MIC (Many Integrated Core) architecture.

The Xeon Phi Coprocessor 5110P (Fig. 2) is now available as a x16 PCI Express card similar to the offerings by AMD and NVidia. It is designed to work with a host processor that interfaces to the coprocessor via the PCI Express interface. The big difference between Intel's offering and the GPU solutions is that the Xeon Phi is a 1.053 GHz cluster of 60 quad-threaded x86 processors. This translates to 240 threads that can deliver up to 1 TFLOPS of double precision performance. The board hosts 8 Gbytes of GDDR5 memory with a 320 Gbyte/s bandwidth.

Figure 2. The Xeon Phi Coprocessor 5110P implements Intel's MIC (Many Integrated Core) architecture.
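The 1 TFLOPS peak and 320 Gbyte/s memory bandwidth quoted for the 5110P suggest how much arithmetic a kernel must perform per byte fetched before the cores, rather than the memory, become the limit. The sketch below applies a simple roofline-style bound to those two figures. The roofline model is not mentioned in the article, the function name and the sample intensities are illustrative assumptions, and sustained performance on real code will be lower than either bound.

```python
# Roofline-style bound built from the two Xeon Phi 5110P figures in the text.
PEAK_GFLOPS = 1000.0      # ~1 TFLOPS double precision (from the article)
BANDWIDTH_GBYTES = 320.0  # GDDR5 bandwidth in Gbyte/s (from the article)


def attainable_gflops(flops_per_byte):
    """Upper bound on throughput for a kernel with this arithmetic intensity."""
    return min(PEAK_GFLOPS, BANDWIDTH_GBYTES * flops_per_byte)


# Intensity at which the bound switches from memory-limited to compute-limited.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBYTES

if __name__ == "__main__":
    print(f"ridge point: {ridge_point:.3f} flops/byte")
    for intensity in (0.125, 1.0, 4.0):  # hypothetical kernels
        bound = attainable_gflops(intensity)
        print(f"{intensity:5.3f} flops/byte -> at most {bound:7.1f} GFLOPS")
```

Under these assumptions a kernel needs a little over three double precision operations per byte of memory traffic before the 1 TFLOPS ceiling, rather than the GDDR5 bandwidth, is the binding limit.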

The Xeon Phi cores trail the latest Intel processors in architecture but they do incorporate features like a 512-bit wide vector engine (Fig. 3). The Knights Corner core has a 32 Kbyte L1 instruction and data cache and a 512 Kbyte L2 cache. The cores support four threads each, but they do not employ the same type of "hyperthreading" found in Intel's Core processors. The Xeon Phi's multithreading is more amenable to HPC applications, where each thread has enough resources so it will not be blocked. Hyperthreading is designed to optimize hardware utilization. Multithreading is designed to optimize thread performance. Many HPC applications disable hyperthreading. Virtualization is also absent from the cores because the initial target is HPC rather than virtualized cloud services.

Figure 3. The 60 cores in the Xeon Phi are based on the Knights Corner architecture that includes a 512-bit wide vector engine.
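The thread count and the roughly 1 TFLOPS rating can be cross-checked from the core count, clock rate, and vector width given in the text. The sketch below is a back-of-envelope calculation, not a vendor formula. It assumes each 64-bit vector lane retires one fused multiply-add (two floating point operations) per clock, which the article does not state.

```python
# Back-of-envelope peak for the Xeon Phi 5110P.
CORES = 60            # from the article
THREADS_PER_CORE = 4  # from the article
CLOCK_GHZ = 1.053     # from the article
VECTOR_BITS = 512     # from the article
DOUBLE_BITS = 64      # width of an IEEE 754 double
FLOPS_PER_LANE = 2    # ASSUMPTION: one fused multiply-add per lane per clock


def hardware_threads(cores=CORES, threads_per_core=THREADS_PER_CORE):
    """Total hardware threads across the chip."""
    return cores * threads_per_core


def peak_dp_gflops(cores=CORES, clock_ghz=CLOCK_GHZ):
    """Theoretical double precision peak in GFLOPS under the FMA assumption."""
    lanes = VECTOR_BITS // DOUBLE_BITS  # 8 doubles fit in one 512-bit vector
    return cores * clock_ghz * lanes * FLOPS_PER_LANE


if __name__ == "__main__":
    print(f"hardware threads: {hardware_threads()}")
    print(f"peak double precision: {peak_dp_gflops():.1f} GFLOPS")
```

With these inputs the calculation gives 240 threads and about 1011 GFLOPS, in line with the "up to 1 TFLOPS" figure.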

The cores are linked via a high speed ring that connects the caches and the memory controllers as well as the shared PCI Express interface (Fig. 4). The system is cache coherent, providing a very large symmetrical multiprocessing (SMP) environment.

Figure 4. The Xeon Phi employs a high speed ring to link the cores and their memory systems together.

The Xeon Phi can run standard operating systems like Red Hat Enterprise Linux (RHEL). This alone sets it apart from the GPU platforms that can only run specialized applications written using frameworks like CUDA and OpenCL.

Typically the 60 cores are running a single operating system and appear within a larger network as a node with its own IP address. The host processor provides the communication support but applications that would run on the host can work equally well within the Xeon Phi assuming they were written in a portable platform like Java or recompiled to address the Xeon Phi nuances.

Intel's Parallel Development Studio (see Dev Tools Target Parallel Processing) supports the Xeon Phi. Likewise, a number of applications like the massively parallel Lustre Parallel File System can run on the platform. The system can support a range of multiprogramming paradigms that already run on CPU clusters, such as OpenMP, Intel's Cilk Plus, and Intel's Threading Building Blocks (TBB). It can handle applications that are heavily dependent upon vectorization performance as well as lots of serial applications that would not fare well on GPUs.

NVidia Tesla

NVidia's latest Tesla platform is based on its Kepler architecture (see GPU Architecture Improves Embedded Application Support). Kepler employs a new SMX (streaming multiprocessor) processor architecture (Fig. 5). Where Fermi's SM architecture had 32 cores/block, Kepler provides 192 cores/block. This block-oriented architecture is why GPUs differ from CPU solutions like Xeon Phi. Each block of cores operates in tandem on different data. This works well for vector-oriented applications such as graphics.

Figure 5. NVidia's new Kepler employs 192 cores/block using the SMX (streaming multiprocessor) architecture.

Kepler makes a number of improvements over NVidia's earlier Fermi architecture. One of these is the ability to dispatch new jobs locally (Fig. 6). NVidia calls this "dynamic parallelism".

Figure 6. NVidia's dynamic parallelism support allows a kernel to dispatch other kernels without host support.

GPUs run blocks of code called kernels. These are initially dispatched by the host, which then uses the results of the computation. Typically those results call for further computation, so new kernels are applied to them, generating additional results. With Fermi, and most GPUs, the host processor handles this dispatch process as well as memory management. Dynamic parallelism allows a kernel to start additional kernels that use the results from the initial kernel. Eventually the host comes into play, but without having to manage the internal interactions. This approach is not a fully programmable environment but it does reduce overhead.
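The difference between the two dispatch styles can be illustrated with a toy model. The sketch below is plain Python, not CUDA. The names kernel, host_driven, and device_driven are hypothetical, the range-splitting workload is invented for illustration, and launches are only counted, not timed. It also assumes the host issues one launch per kernel in the Fermi-style case, whereas real code might batch them.

```python
from collections import deque


def kernel(lo, hi, leaf_size):
    """Toy 'kernel': finish small ranges, split large ones into two follow-up jobs."""
    if hi - lo <= leaf_size:
        return sum(range(lo, hi)), []
    mid = (lo + hi) // 2
    return 0, [(lo, mid), (mid, hi)]


def host_driven(n, leaf_size):
    """Fermi-style model: every follow-up kernel is launched by the host."""
    total, host_launches = 0, 0
    pending = deque([(0, n)])
    while pending:
        lo, hi = pending.popleft()
        host_launches += 1                # the host dispatches this kernel
        partial, children = kernel(lo, hi, leaf_size)
        total += partial
        pending.extend(children)          # results go back to the host to schedule
    return total, host_launches


def device_driven(n, leaf_size):
    """Dynamic-parallelism model: one host launch, kernels launch their children."""
    device_launches = 0

    def run(lo, hi):
        nonlocal device_launches
        partial, children = kernel(lo, hi, leaf_size)
        for child_lo, child_hi in children:
            device_launches += 1          # launched from inside the parent kernel
            partial += run(child_lo, child_hi)
        return partial

    return run(0, n), 1, device_launches


if __name__ == "__main__":
    print(host_driven(1024, 128))    # (result, host launches)
    print(device_driven(1024, 128))  # (result, host launches, device launches)
```

For 1024 items split down to 128-item leaves, both models run the same 15 kernels and produce the same sum. The host issues all 15 launches in the first model and only one in the second, with the remaining 14 started on the device.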

The NVidia K20 (Fig. 7) and the K20X are based on the Kepler architecture. They have 2496 and 2688 cores respectively, delivering 3.52 and 3.95 TFLOPS of peak single precision performance (1.17 and 1.31 TFLOPS of double precision performance). The K20 has 5 Gbytes of memory while the K20X has an additional 1 Gbyte. Both are built on TSMC's 28nm technology.

Figure 7. NVidia's K20 runs 2496 cores at 706 MHz delivering 3.52 TFLOPS of peak single precision performance.
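The core counts and the 3.52 TFLOPS figure can be tied back to the 192-core SMX block and the 706 MHz clock. The sketch below is a rough check rather than a vendor formula. It assumes every core retires one fused multiply-add (two floating point operations) per clock, which the article does not state. Under that assumption the result matches the figure quoted for the K20, which corresponds to NVidia's single precision rating. The K20X clock is not given in the text, so only its block count is computed.

```python
CORES_PER_SMX = 192           # from the article
FLOPS_PER_CORE_PER_CLOCK = 2  # ASSUMPTION: one fused multiply-add per core per clock


def smx_count(cores):
    """Number of SMX blocks implied by a total core count."""
    blocks, remainder = divmod(cores, CORES_PER_SMX)
    if remainder:
        raise ValueError("core count is not a whole number of SMX blocks")
    return blocks


def peak_tflops(cores, clock_mhz):
    """Theoretical peak in TFLOPS under the FMA assumption."""
    return cores * clock_mhz * 1e6 * FLOPS_PER_CORE_PER_CLOCK / 1e12


if __name__ == "__main__":
    print(f"K20:  {smx_count(2496)} SMX blocks, {peak_tflops(2496, 706):.2f} TFLOPS")
    print(f"K20X: {smx_count(2688)} SMX blocks")
```

The K20 works out to 13 SMX blocks and the K20X to 14.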

NVidia's hardware supports programmers using CUDA and OpenCL. CUDA targets NVidia's platforms while OpenCL works with a wide range of GPUs as well as CPUs.

AMD FirePro

NVidia has platforms that target computing such as Tesla as well as more conventional graphics adapters. AMD's FirePro (see Graphics Adapter Targets Medical and Legacy Applications) targets all these markets. The FirePro V9800 (Fig. 8) has 1600 stream processors and 4 Gbytes of GDDR5 memory. The double wide card plugs into a x16 PCI Express slot and consumes 225W of power. The FirePro V9800 delivers 528 GFLOPS of double precision performance and 2.64 TFLOPS of single precision performance.

Figure 8. The FirePro V9800 has 1600 stream processors and 4 Gbytes of GDDR5 memory.

The FirePro supports OpenGL and DirectX for graphics output and OpenCL for computing chores. This type of support spans AMD's GPU line as does NVidia's CUDA support. This allows most GPU-equipped PCs to take advantage of the GPU as a processing engine for some applications because even consumer applications like image processing can benefit from a less powerful GPU than the V9800.

The FirePro S9000 competes with platforms like the Tesla K20. The S9000 packs in 6 Gbytes of GDDR5 memory with ECC support. It bumps up the double precision performance to 806 GFLOPS and single precision performance is 3.23 TFLOPS. It also has a DisplayPort connection.
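Two simple ratios help when comparing these cards: how much of the single precision rate survives in double precision, and how much double precision throughput is delivered per watt. The sketch below uses only the figures quoted in the text. The dictionary layout and function names are illustrative, and the S9000 power figure is left empty because the article does not give one.

```python
# Figures quoted in the article for the two FirePro cards.
FIREPRO = {
    "V9800": {"dp_gflops": 528.0, "sp_gflops": 2640.0, "watts": 225.0},
    "S9000": {"dp_gflops": 806.0, "sp_gflops": 3230.0, "watts": None},  # power not given
}


def dp_to_sp_ratio(card):
    """Fraction of the single precision rate available in double precision."""
    spec = FIREPRO[card]
    return spec["dp_gflops"] / spec["sp_gflops"]


def dp_gflops_per_watt(card):
    """Double precision GFLOPS per watt, or None if no power figure is known."""
    spec = FIREPRO[card]
    if spec["watts"] is None:
        return None
    return spec["dp_gflops"] / spec["watts"]


if __name__ == "__main__":
    for name in FIREPRO:
        ratio = dp_to_sp_ratio(name)
        per_watt = dp_gflops_per_watt(name)
        efficiency = "n/a" if per_watt is None else f"{per_watt:.2f} GFLOPS/W"
        print(f"{name}: DP/SP = {ratio:.3f}, {efficiency}")
```

On these numbers the V9800 runs double precision at one fifth of its single precision rate and delivers roughly 2.3 double precision GFLOPS per watt, while the S9000 runs double precision at about one quarter of its single precision rate.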

AMD has also tied the FirePro architecture into its APU (Accelerated Processing Unit) architecture (see APU Blends Quad Core x86 With 384 Core GPU). The GPU in most of the systems to date is less powerful than something like the V9800, but the pairing offers some interesting scaling possibilities within the HPC community.

AMD also has the SeaMicro fabric-based computing platform that sits inside a 10U rack (see 10U Rack Packs 512 Atoms). It has created a hypercube linkage between a range of processors that started with dual core Atoms. It now encompasses AMD Opteron and Intel Xeon processors and delivers hundreds of conventional cores. Using an APU would bring GPUs into the mix without bursting the power budget the way high performance GPUs typically would.

Arm Mali

Talking about HPC and Arm is probably a bit premature at this point but a few things are changing that may put Arm cores into the HPC space. The first is Arm's 64-bit offering that companies are starting to deliver on (see The Waves Of 64-bit ARM and Windows 8 Systems Are Going To Have An Impact). The Arm Cortex-A57 targets performance computing while the 64-bit Cortex-A53 is designed for low power applications.

The ARMv8 architecture brings the CPU architecture into the realm of x86 processors that dominate this space. The Arm Mali GPU (see Mobile GPU Architecture Supports Emerging Compression Standard) adds another leg to the discussion. The Mali GPU will not be found in a standalone configuration. Instead it is blended in an AMD APU-style configuration with the GPU and CPU coexisting within a cache coherent environment. The number of GPU and CPU cores varies depending upon the implementation but the idea is the same.

Mali supports OpenCL and OpenGL so it is likely to be used for display as well as compute chores. At this point the GPUs have been at the low to mid range of the performance spectrum with designers stressing low power and efficiency since most of the target applications are mobile or consumer oriented. Higher performance GPU computing environments need more horsepower and that requires more power.

Arm platforms are just starting to make inroads into the server space. The HPC space is a smaller and more demanding environment but one that may be on the horizon.

Other Many Core Options

Intel's Xeon Phi and the GPUs from AMD and NVidia are all the rage in HPC because of the large number of cores that can be delivered. SeaMicro's architecture could be used in that space but for now it is targeting the cloud. There are also other vendors delivering many core solutions that target specific applications other than HPC. For example, Tilera has a family of multicore chips that actually do the Xeon Phi one better when it comes to programmability (see Many Core Systems Handle Network Processing Chores).

Tilera's architecture uses a mesh fabric with multiple channels to provide a cache coherent environment to 64-bit VLIW processors. The SMP environment supports standard operating systems like Red Hat Linux, and the architecture supports virtualization as well as partitioning. This allows cores to be grouped and isolated, a feature invaluable for cloud computing. The latest Tile-Gx9 (Fig. 9) actually moves in the opposite direction, reducing the number of cores to 9.

Figure 9. Tilera's Tile-Gx9 has nine 64-bit processing cores.

The other aspect of Tilera's chips is the heavy support for communication and interfaces. It has multiple PCI Express lanes as well as multiple Ethernet ports including 10G support. Hardware accelerators are tailored for network communication which is the target market for Tilera.

No one many core solution fits all problems, and finding the best alternative for a particular problem can be a challenge. Requirements vary from application to application. Performance per watt is sometimes the most important aspect, while in other cases throughput, possibly network throughput, matters more. Knowing what the possibilities are helps.