ARMv8, GPUs And Knights Landing At ISC 2014

The International Supercomputing Conference is the place to see the latest massively parallel processors and the humongous storage systems that are taking on big data. It is also fun to check out the latest architectures and combinations that are finding their way into the cloud, where everyone from academics to experimenters can grab a few billion cycles.

This year's event is no different from past years. All the big names are on hand to show off their new toys. Many of these have been hinted at or highlighted before, but now we get to see the details and, in some instances, real hardware that is shipping.

ARM's 64-bit ARMv8 architecture (see “Delivering 64-Bit Arm Platforms”) is finally seeing the light of day with production systems from a range of vendors. AMD has already released its Opteron A1100 based on the Cortex-A57 (see “AMD ARMs 64-Bit Servers”). Applied Micro Circuits' X-Gene (Fig. 1) is based on the ARMv8 architecture and incorporates Applied Micro's enhancements, including a high-speed interconnect. X-Gene has found a home in platforms like Hewlett-Packard's Moonshot and the Aurora water-cooled HPC system from Eurotech. Eurotech's Brick Technology employs direct hot liquid cooling.

Figure 1. Applied Micro's SoC uses their own design based on the ARMv8 architecture.

Cirrascale is delivering a system that combines the X-Gene with NVidia's Tesla K20 GPU (see “Expect 3D Printers, 3D Vision, And More In 2013”). The Cirrascale RM1905D 1U server (Fig. 2) runs the X-Gene SoCs on the motherboard and supports a pair of NVidia boards. The interface is via PCI Express.

Figure 2. Cirrascale's RM1905D ties a pair of NVidia Tesla K20 GPUs to Applied Micro's X-Gene SoCs.

The Tesla K20 is based on the Kepler GK110 GPU. It delivers a peak of 1.17 TFLOPS double precision and 3.52 TFLOPS single precision. It has 5 Gbytes of GDDR5 memory with 208 Gbytes/s of memory bandwidth, and it incorporates 2496 CUDA cores. CUDA is NVidia's parallel computing platform and programming model for its GPUs. The Tesla K20 also supports OpenCL.
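
To put the compute and bandwidth figures in perspective, here is a minimal Python sketch of a roofline-style estimate. It uses only the double-precision peak and memory bandwidth quoted above. The DAXPY kernel and the function names are illustrative assumptions, not something NVidia or Cirrascale presented.

```python
# Peak figures for the Tesla K20 as quoted in the article.
PEAK_DP_FLOPS = 1.17e12   # double-precision floating-point operations per second
MEM_BANDWIDTH = 208e9     # bytes per second
BYTES_PER_DOUBLE = 8


def machine_balance(peak_flops=PEAK_DP_FLOPS, bandwidth=MEM_BANDWIDTH):
    """FLOPs the GPU can execute per byte moved when both run at peak."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops=PEAK_DP_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Roofline bound: the lower of peak compute and bandwidth * intensity."""
    return min(peak_flops, bandwidth * intensity)


# Illustrative kernel (an assumption, not from the article): DAXPY, y = a*x + y.
# Each element costs 2 FLOPs and moves 3 doubles (read x, read y, write y).
DAXPY_INTENSITY = 2 / (3 * BYTES_PER_DOUBLE)

if __name__ == "__main__":
    print(f"machine balance: {machine_balance():.3f} FLOP/byte")
    print(f"DAXPY bound: {attainable_flops(DAXPY_INTENSITY) / 1e9:.1f} GFLOPS")
```

With these numbers, a kernel needs roughly 5.6 double-precision FLOPs per byte of memory traffic before the K20 becomes compute bound. A streaming kernel like DAXPY stays bandwidth bound at about 17 GFLOPS, far below the 1.17-TFLOPS peak.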

ARM has garnered quite a few partners, but it was not the only story at ISC 2014. Intel's “Knights Landing” Xeon Phi (Fig. 3), based on Intel's Many Integrated Core (MIC) architecture, is being unwrapped at the show. It is the follow-on to Knights Corner. The 14-nm chip has 72 cores based on the Intel Atom Silvermont architecture. Each core executes four threads and supports AVX-512 vector math (see “Intel's AVX Scales To 1024 bit Vector Math”). The cores are organized in a 2D fabric with two cores per tile, or node, and there are two AVX-512 units per core. Single-core performance is three times that of the previous generation.
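
The core, thread, and vector-unit counts above translate into some simple aggregate numbers. The following Python sketch works them out. The class and method names are hypothetical. The default of two FLOPs per lane assumes one fused multiply-add per lane per cycle, and the clock rate is a placeholder because no frequency was given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManyCoreChip:
    """Hypothetical container for the counts quoted in the article."""
    cores: int
    threads_per_core: int
    vector_units_per_core: int
    vector_bits: int

    def hardware_threads(self):
        return self.cores * self.threads_per_core

    def tiles(self, cores_per_tile=2):
        return self.cores // cores_per_tile

    def dp_lanes(self):
        # A double-precision value occupies 64 bits of the vector register.
        return self.vector_bits // 64

    def dp_flops_per_cycle(self, flops_per_lane=2):
        # flops_per_lane=2 ASSUMES one fused multiply-add per lane per cycle.
        return (self.cores * self.vector_units_per_core
                * self.dp_lanes() * flops_per_lane)

    def peak_dp_tflops(self, clock_ghz):
        # clock_ghz is a caller-supplied assumption; the article gives no clock.
        return self.dp_flops_per_cycle() * clock_ghz / 1000.0


knl = ManyCoreChip(cores=72, threads_per_core=4,
                   vector_units_per_core=2, vector_bits=512)

if __name__ == "__main__":
    print("hardware threads:", knl.hardware_threads())
    print("tiles:", knl.tiles())
    print("DP FLOPs per cycle:", knl.dp_flops_per_cycle())
    print("peak at a placeholder 1.0 GHz:", knl.peak_dp_tflops(1.0), "TFLOPS")
```

Under those assumptions the chip exposes 288 hardware threads across 36 tiles, and each clock cycle can retire up to 2304 double-precision operations. The actual peak depends on the shipping clock rate.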

Figure 3. Intel's Knights Landing Xeon Phi uses a new Omni Scale Fabric to connect Intel Atom Silvermont architecture cores with 16 Gbytes of L3 memory from Micron that employs technology similar to that found in the Hybrid Memory Cube.

Knights Landing has 16 Gbytes of memory from Micron that uses technology similar to the Hybrid Memory Cube (HMC), with a 15-Gbit/s port speed (see “Hybrid Memory Cube Shows New Direction For High Performance Storage”).

Intel tailored the interconnect and memory solution for the new system. The Omni Scale Fabric will be the interconnect for Xeon Phi going forward. It is functionally similar to InfiniBand. The True Scale technology used by Intel in prior iterations was compatible with InfiniBand. Omni Scale is a different system, but it operates the same way from a protocol-stack view, so it will be transparent to most applications. Omni Scale uses Intel's Silicon Photonics optical interconnect for off-board communication.

Knights Landing also has 36 PCI Express v3 lanes. These can be used with other peripherals, including GPUs. The chip has a 200-W TDP.
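
For a rough sense of what that PCI Express capacity means, the Python sketch below estimates raw link bandwidth. It assumes the figure of 36 counts individual PCI Express 3.0 lanes, each running at 8 GT/s with 128b/130b encoding. The function names are hypothetical, and packet and protocol overhead is ignored, so real throughput will be lower.

```python
PCIE3_TRANSFER_RATE = 8e9    # transfers per second per lane (8 GT/s)
PCIE3_ENCODING = 128 / 130   # 128b/130b line-code efficiency
LANES = 36                   # figure quoted in the article, assumed to be lanes


def lane_bandwidth_bytes(transfer_rate=PCIE3_TRANSFER_RATE,
                         encoding=PCIE3_ENCODING):
    """Raw payload bytes per second for one lane in one direction."""
    return transfer_rate * encoding / 8


def link_bandwidth_gbytes(lanes):
    """Aggregate one-direction bandwidth in Gbytes/s for a group of lanes."""
    return lanes * lane_bandwidth_bytes() / 1e9


if __name__ == "__main__":
    print(f"per lane: {link_bandwidth_gbytes(1):.3f} Gbytes/s")
    print(f"x16 slot: {link_bandwidth_gbytes(16):.2f} Gbytes/s")
    print(f"all {LANES} lanes: {link_bandwidth_gbytes(LANES):.2f} Gbytes/s")
```

That works out to a little under 1 Gbyte/s per lane in each direction, or about 35 Gbytes/s across all 36 lanes, enough for two x16 devices with four lanes to spare.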

The 16 Gbytes of multichannel DRAM (MCDRAM) supplied by Micron has five times the bandwidth of the off-chip DDR4. The Xeon Phi has six DDR4 controllers. The on-package memory is very dense, consuming less than one third the space of conventional DDR while being five times more power efficient. The MCDRAM is comparable to the short-reach HMC channel definition.

Intel also recently announced that it has combined a Xeon and an FPGA in a single package. This is being delivered to select customers. The FPGA vendor was not disclosed, although Intel has worked closely with Altera in the past to create the Intel E600C system-on-chip, which combined an Atom core with an Altera Arria II FPGA. Altera and Xilinx FPGAs have also been interfaced directly with Xeon processors using Intel's QuickPath Interconnect (QPI) interface (see “Storage And Computation Capacity Continues To Grow”). In those cases, FPGA modules were plugged into a processor socket on a multiprocessor motherboard.

Developers using OpenCL can take advantage of these FPGA/CPU combinations using tools like Altera's SDK for OpenCL (see “OpenCL FPGA SDK Arrives”). This tool takes OpenCL code that could also run on CPUs or GPUs and converts it to FPGA configuration information.