For a look at multicore solutions available for general embedded development, Intel's hex-core Xeon (see the fig.), Nvidia's GPUs (see the fig.), and XMos' XC-1 (see the fig.) are a great starting point. I also took a look at some of the software support for these platforms, like Intel's Threading Building Blocks (TBB) for x86 platforms, plus CUDA and OpenCL for GPUs. The Open Computing Language (OpenCL) actually has a broader working context.
At the top is the server workhorse from Intel, its hex-core Xeon. This particular platform may be superseded by the time this article is published, but its successor will simply be a faster, more power-efficient platform. The four-chip system Intel loaned me allowed testing of TBB and virtual machine environments with 24 cores to manage.
The next multicore platform was Nvidia's GTX 260 video adapter. I provide a short overview of this board as a video adapter, but the main reason for testing it was to check out CUDA and OpenCL. CUDA opened up Nvidia's graphics adapters to programmers, allowing their use in non-graphics applications in addition to providing closer access for video applications.
Finally, there is the tiny XMos XC-1. This chip is so cool it doesn't need a heat sink even while running 32 hardware-managed threads on its four cores. Compare this to the robust cooling systems of the other two platforms in this article and you can get a feel for multicore in mobile devices.
Lots of Xeon Cores
Intel's development platform belongs in a rack well away from prying eyes and ears. Turn it on and it sounds like an aircraft carrier, due to the pair of hot-swappable power supply/cooling systems. It's par for the course when it comes to a 3U rack-mount system, but it made the lab a bit noisy.
The box also highlights the trend toward 2.5-in. hard drives in the corporate server environment. The smaller form factor allows more drives to be part of a RAID system, and it is leading to changes in RAID support and the arrangement of drives. The row of drives on this system utilizes only a fraction of the interior space, leaving more room for the motherboard and four hex-core processors. But in many instances a rack system may just be storing drives, leaving lots of wasted space if the drives are only in front (as was the case with mine).
I used the platform to check out a range of software from development tools to virtual machine management. The first chunk of software I looked at was Intel's TBB, running under Windows since Windows Server 2008 was already installed on the system. I've seen TBB before (click here to read the article) so I won't get into the details here. Suffice it to say that having 24 cores available to TBB means applications are very, very fast. What has changed lately is the release of Intel's Parallel Studio. This includes TBB along with a lot of other tools, plus integration with Microsoft's Visual Studio. It was usable with Microsoft tools, but it was a bit of a challenge. One of the new items in Parallel Studio is Parallel Advisor, which came in handy when starting a project. The Parallel Inspector adds debugging support that is integrated with Visual Studio. Finally, there is Parallel Amplifier, which uses Intel's Thread Profiler and the VTune Performance Analyzer.
That’s a lot of software, and I did not get to exercise all of it to any great degree, but I can see where it will be invaluable for developers. The tools provide significantly more insight into the operation of a TBB application even if you are using a runtime library rather than writing your own parallel code.
After putting TBB through its paces, I overwrote Windows Server with a couple of Linux installations, CentOS and Ubuntu, just to see what they recognized and to try out the system with lots of virtual machines. Xen was the virtual machine manager (VMM) of choice since I already have a number of systems configured for Xen and could grab a couple of images to use as test subjects.
Limited primarily by the amount of RAM Intel sent along, the system hummed merrily away. With 24 cores, you want as much RAM as you can afford.
VMM experience is something many IT specialists already have, but embedded developers tend to deal with a much smaller number of cores. Still, a standard single-processor desktop with VMM support can handle many virtual machines, so the management of such platforms is easy to exercise. Things become more interesting with 24 cores, but it tends to be more of the same.
TBB and CUDA run on Linux as well. I tried CUDA while I had Linux installed; it is the same version I looked at later with Nvidia's GPU. I did not get a chance to check out OpenCL, though the latest version of CUDA actually supports it; maybe something for the future. What was more interesting was installing CUDA on a virtual machine and then using the same image on the Intel system, and later on another server with a similar configuration but only a single multicore chip.
The only difference I found between systems was performance when dealing with large datasets, which were arbitrary items like hi-res images. The bottom line: this system is clearly one of the fastest around.
Working with such a high-end server was a new experience for me. It is one that many IT managers are used to, as most corporate environments have racks and racks of systems of this caliber; but it is unlikely that many embedded developers will have the same opportunity. Still, the number of cores on a single chip continues to climb and embedded dual and quad core systems are growing in number. Developers will need to be aware of the depth and breadth of such systems before it is too late. Debugging is one area where the change will be more radical, but that will have to be left to another article.
GPUs: Not Just For Graphics Anymore
While Nvidia has a number of products, including a mobile embedded platform, it is primarily known for its graphics boards. Lately, it has also become known as a purveyor of multicore GPUs for the more general computing space. Its GPU uses a single instruction, multiple thread (SIMT) architecture (read "SIMT Architecture Delivers Double-Precision Teraflops"), which is unique and lends itself to group task-style multithreading. That is a good match for a range of applications beyond graphics rendering.
This has led to a whole line of boards from Nvidia dedicated to computing, like the Tesla C1060 PCI Express board. The C1060 is much like the GTX 260 I tested except that it has no display output. The GTX 260 contains 192 1.2-GHz processing cores plus 896 Mbytes of RAM. It is one of the heftier PCI Express-based graphics adapters and requires the extra PCI Express power cable. The new top of the line is the GTX 295. If you have the right motherboard, you can put in more than one board and connect them together in a cooperative fashion to drive one or more graphics displays. This over-the-top cable connection is not required if the units are being used as compute platforms.
I won't get into a detailed review of the GTX 260 since plenty of gaming magazines and Web sites have done more detailed overviews of it and its later siblings. It can handle 3D games with true 3D output via LCD shutter glasses, if you have the right display hardware. The results are great, but that's for another article.
What I did look at was the CUDA support provided by this board. CUDA is an acronym for Compute Unified Device Architecture, but no one ever uses that mouthful, so CUDA it is. I took a look at CUDA alone, but in the future OpenCL will likely be the support most will talk about. OpenCL is Apple's brainchild, and it is being adopted by Nvidia as part of CUDA. The big question will be what winds up where.
For this discussion I concentrated on exercising CUDA.
So what is CUDA? Well, check out Nvidia's Web site for the details, but essentially it is a programming environment that provides controlled access to the GPU on the GTX 260 or any other late-model Nvidia graphics chip or platform like Tesla. Typically a GTX 260 would handle graphics display chores, but even running Vista's Aero it usually has plenty of horsepower left over for other things. It is only when playing computationally or graphically demanding 3D games that the GTX 260 really gets a workout. The rest of the time the chip, like most PCs, sits idle.
In the past, the graphics processor was off limits, hidden behind a device driver. Well, the device driver is still there, but now it lets a lot more in and out. In particular, it can support CUDA applications and runtime libraries. It can do this to the exclusion of graphics display support (hence the Tesla line, with no graphics output connectors), or the GPU can be used on demand, as in a multi-card GTX 260 installation.
This has advantages in 3D gaming as well as other areas because those 192 cores can be used for other chores, such as physics simulation support in games or computations that analyze 3D ultrasonic information from a breast-scanning machine. The latter turned a multi-day processing chore on a multicore x86 platform into something like a fifteen-minute wait using four Tesla boards, providing nearly real-time medical feedback.
So, to take advantage of CUDA, I simply popped the x16 PCI Express GTX 260 video board into a desktop PC, installed the latest drivers, and downloaded the CUDA development software, and I was in business. I was able to test some CUDA applications simply by having the board and device driver installed. Nothing special is required. This is the same kind of support that will be needed for incorporating more advanced physics engines in games, just as an example.
Things get more interesting from a programmer's standpoint when writing applications to take advantage of a CUDA-enabled system. The first step was to use the array manipulation runtime support available with the CUDA tools. This is relatively easy, as you can essentially ignore the SIMT architecture, but you will have to get used to the memory allocation functions and other runtime support provided by the library, because data must be moved into the GPU's memory for processing. This always happens with a graphics display, but there it is effectively hidden from the programmer. CUDA makes it a little more explicit.
Other than that, you have access to filters and other array processing functions, just as you would with a conventional math library. This same type of support is utilized by tools such as The MathWorks' Matlab when using CUDA as a backend processor for array operations.
The next step was to move closer into the CUDA framework and take advantage of the extensions to the C compiler. This took a bit more investigation into the documentation. CUDA is not hard to work with, but it is involved. I don't purport to have any great insights into the extensions, given the limited amount of time I have used the system, but they should quickly become second nature to any programmer. For the most part, the extensions are easy to understand and utilize.
As with TBB, the first thing to master is memory management, including the use of CUDA arrays. While TBB works with arrays in memory, CUDA must assume that data might have to be copied to a work area such as the GPU’s onboard memory. As noted, CUDA can run on platforms that support TBB, but the idea is to make the interface transparent so the application can run on either platform. Next comes the parallelization and synchronization syntax and semantics. There are also the runtime library functions if you plan on using those.
Plan on a couple of days to get a system set up and to read through the docs. This is definitely a time when reading helps. There are plenty of demo apps on the CUDA Web site, and incorporating the CUDA compiler and support is no more than a makefile configuration away.
I did not get into the debugging aspects in much detail since I was primarily working with the demo code and applications. There is an emulator where debugging is significantly easier than on the hardware; otherwise, single stepping is not an option. The emulator is great for small applications and debugging a small collection of functions, but it would be interesting to get feedback from Nvidia developers and others who have worked on larger applications about the debugging techniques they used. Issues like deadlock are just the tip of the iceberg of new tools and techniques to be mastered by those moving into parallel programming.
CUDA is going to make a major difference in a number of areas, from gaming to medical technology. The fact that it can often be found on a desktop is a definite plus, and the performance increase can be substantial. Still, while the SIMT architecture is amenable to only some parallel processing chores, it is definitely a platform worth investigating if you have to crunch numbers. There is now a reason other than playing games to put a great gaming board in your PC.
Little Cores, Big Ambitions
I took a look at the XMos XDK development kit (see “Multicore And Soft Peripherals Target Multimedia Applications”) last year. It is still the platform of choice for those looking to take full advantage of the chip, but the new XC-1 contains the same XS1-G4 chip and costs significantly less at $99.
The same Eclipse-based development tools come with this new platform. The main difference is in some of the peripheral support, including external connections. The peripheral ports are brought out, but the headers are not installed. Luckily we have a couple of soldering irons here in the lab, so adding standard headers and creating ribbon cables is easy.
The board has a USB interface to link it to the PC-based development tools. If you want to concentrate on software development using the built-in LEDs that encircle the chip and the on-board switches, then unpacking and plugging in the system will take minutes. Getting the software installed will take a little longer, but you should easily have the test applications up and running within an hour or two.
Getting up and running with the XC-1 is trivial with a Windows PC. Plug in the USB connection, which also powers the board, install the drivers, and the IDE should be up and running in under an hour.
The challenge comes after trying the demo applications and running through the online documentation and Quick Start Guide. Communication between tasks is done using channels. These can be mapped to the hardware channels that link cores and chips in a multicore environment. The interface ports will be a bit more familiar to embedded developers, but the XS1-G4 is designed for bit-banging with soft peripherals. It is a bit much to explain here, so check out the XMos Web site and the downloadable documentation. This approach takes advantage of the hardware task scheduling provided by the chip.
Programming the XC-1 tends to be easier than something like the GTX 260 because you use a regular C or C++ compiler for most work. If you are working on device drivers, or you want to take advantage of the scheduling or communication hardware, then the XC compiler comes into play. There are limited parallel programming and communication functions to deal with, so there is a relatively small learning curve.
XMos also has a simulator (xsim and xiss) that you can take advantage of. The chip and interface are fast enough that hands-on work is easily done with the board.
The XC-1 is the best way to check out the software and tools for the XS1-G4 chip. Pull out the soldering iron if interfacing comes into play or check out the more expensive XDK.