Programming To Survive Multicore

While many of us are just beginning to write parallel programs to use multicore, we can already see a crisis looming for those who use raw threads (p-threads or Windows threads) to implement parallelism. More abstract parallel-programming methods have

James Reinders

Feb. 28, 2008

9 min read

And there’s a lot of future to survive. We soon will face a transition from multicore processors (eight cores or less) to tera-scale processors (more than eight cores). With dual-core processors nearing three years old, and eight-core processors coming later this year, this is a timely topic to consider.

Intel’s open-source Threading Building Blocks (TBB) project handles decomposition of a workload and load-balancing, so it isn’t hard-coded into our programs. Ultimately, TBB provides the appropriate level of abstraction that will help us survive the future with parallel programs written today.

Tera-scale computing Dual-core processors from Intel arrived in 2005, with quad-core processors following in 2006 and second-generation (Hi-K in 45 nm) quad-cores in 2007. Octo-core processors from Intel are expected in late 2008. All of these devices are similar to program, and despite some differences in cache configurations, each processor has a certain number of cores all connected to the same memory.

As we move beyond eight cores, multicore gives way to tera-scale computing. Maybe it’s not exactly eight cores, but even if we categorize 16-core devices as multicore technology, we certainly should not expect 100 cores all connected to memory symmetrically and with the same memory access time.

I say “certainly” because that’s the way it is with multiprocessor computer systems. There’s nothing to suggest that a multicore processor will overcome all of the connection challenges of prior multiprocessor systems.

Processors are going to change as they get more cores. We cannot expect to write parallel programs today that will run well in the future if we program them at a low level without abstraction. The good news is that we have choices.

Picking the right abstraction We need to look for three things in a programming language for parallelism: scaling (add more cores, expect more performance), ease of debugging (avoid pesky deadlock and race conditions), and ease of coding and later maintenance.

Each is much easier to do using an abstraction for parallelism instead of raw threads (Windows threads or POSIX threads). Raw threads are the assembly language of parallel programming.

So if avoiding Windows threads and p-threads is the answer, what does that mean? For Fortran and some C programmers, it means OpenMP more than any other solution. For C++ and C programmers, it means TBB.

TBB outfits C++ for Parallelism As the name suggests, Threading Building Blocks consist of multiple pieces that together extend C++ for parallelism. TBB supports scalable parallel programming using standard C++ code. It doesn’t require special languages or compilers. The ability to use TBB on virtually any processor or any operating system with any C++ compiler makes it very appealing.

Most parallel-programming packages haven’t been nearly as complete, offering only a few algorithms and missing the many other components that are required to fully address the needs of a real program. TBB includes seven very significant blocks to build upon:

Basic algorithms (parallel_for, parallel_reduce, parallel_scan)
Advanced algorithms (parallel_while, pipeline, parallel_sort)
Concurrent containers (to replace/augment STL, which is not thread-safe)
Scalable memory allocators (much better than new/delete, malloc/free)
Portable mutual exclusion mechanisms (MuTEXs and atomic operations)
Portable, fine-grained, global timers
Access to the TBB task-stealing scheduler application programming interfaces (APIs) directly to write your own templates for algorithms if you wish

Most of us will use all but the last block regularly in writing a parallel program with TBB. While most programmers start with just the basic algorithms, they quickly appreciate how TBB offers the other blocks. But why do all other parallel packages lack TBB’s completeness? There are at least four reasons.

First, not all packages try to be portable across operating systems and processors, which would eliminate the need for portable mutual exclusion. Second, not all advanced systems are willing to expose their proprietary schedulers. Third, scalable memory allocators are hard to write and have been sold for significant gains as standalone products before TBB. And fourth, getting key data structures (containers) worked out has been left as a non-trivial task for the programmer instead of the package writer.

Continue to page 2

Coding with TBB Consider a program that creates and outputs an array where each element is equal to the average of three elements of the input array (Output\[i\] = (Input\[i-1\]+Input\[i\]+Input\[i+1\])/3.0;). Here is a function, ParallelAverage, that implements exactly that using TBB.

#include “tbb/parallel_for.h”
#include “tbb/blocked_range.h”

using namespace tbb;

struct Average \{ float* input; float* output; void operator()( const blocked_range& range ) const \{ for( int i=range.begin(); i!=range.end(); ++i ) output\[i\] = (input\[i-1\]+input\[i\]+input\[i+1\])*(1/3.0f); \} \}; // Note: The input must be padded such that input\[-1\] and input\[n\] // can be used to calculate the first and last output values. void ParallelAverage( float* output, float* input, size_t n ) \{ Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range( 0, n ), avg, auto_partitioner() );

The heart of the code is in an operator() of the class we define. Ideally, C++ would offer the ability to put this code inline inside the parallel_for template, but it does not. Other languages offer lambda functions to do that, and this is an example of why it would be nice for C++ to do the same.

C++ does not offer lambda functions, though, so the code for use by TBB in the parallel_for has to be in a class library. This is neat and tidy, and it isn’t hard to understand. But it does represent the only major shift in the original serial code when you switch to TBB.

The right level of abstraction Programmers shouldn’t focus on optimal management of a thread pool, proper distribution of tasks with load balancing and cache affinity in mind, and other issues when they’re working on expressing the parallelism in a program. TBB has the right interfaces to let us focus on our work, and it handles the low-level details very effectively.

Our simple example program would easily take an additional hundred lines of code to write using raw threads. There would be code to create threads, code to destroy thread, code to compute bounds, and code to set up and use locks. Most likely, we would write our raw threads code to figure out the amount of cores and divide up the work evenly.

However, that would be a fatal mistake for “future-proofing.” Fast-forward to a world with a 100-core processor, where some cores are four times as fast as the others. Considering the history of multiprocessor systems, this is one of dozens of possible future scenarios that will invalidate any static breakdown of even such a simple program.

The abstraction offered by TBB (or OpenMP for its domain) solves this problem by handling the decomposition and load-balancing so it’s not hard-coded into our programs. TBB takes this even further by handling nested parallelism, which is well beyond the scope of any simple raw thread implementations we are likely to write in a few hundred lines of code. And none of us would enjoy writing over and over again every time we wanted to do a simple loop in parallel.

Coexisting with other threading models TBB can coexist seamlessly with other threading packages. This is very important because it does not force us to pick among TBB, OpenMP, or raw threads for the entire program. We are free to add TBB to programs that have threading in them already.

We can also add an OpenMP directive, for instance, somewhere else in your program that uses TBB. For a particular part of a program, we will generally use one method. But in a large program, it’s reasonable to anticipate the convenience of mixing various techniques.

It’s fortunate that this is supported by TBB. Using or creating libraries is a key reason for this flexibility, particularly because libraries are often supplied by others. For example, the Intel Math Kernel Library (MKL) and Intel Integrated Performance Primitives (IPP) library are implemented internally using OpenMP. You can freely link a program using TBB with the Intel MKL or Intel IPP library.

Open-Source Availability TBB can be downloaded from http://threadingbuildingblocks.org. Intel initially released it in August 2006, with prebuilt binaries for Windows, Linux, and Mac OS X. Less than a year later, Intel provided more ports (including Solaris, FreeBSD, G5 processor support) and now is working with the community to offer additional ports. Intel still supports TBB as a product and includes it with the professional editions of its compilers. The open-source version and the supported product version have the same features.

TBB offers an efficient and effective manner to program parallelism for C++ (and C) programmers. Available for more than a year and a half now, it has allowed many products, including Autodesk’s Maya, to take more advantage of parallelism than would have been possible without TBB.

In addition, TBB offers an efficient and effective manner to program parallelism for C++ (and C) programmers. It’s definitely worth a serious look, unless you prefer to work harder than you have to—and then suffer again in a few years when tera-scale is here.