Programming multicore systems can be tough on the developer, who potentially needs to understand a zoo of unfamiliar architectural designs, porting and optimizing application code for each one. The problems of multicore (or, increasingly, many-core) development are particularly apparent on emerging non-uniform memory access architectures (NUMAs). These are truly multicore, in the sense that individual cores have separate memory spaces. Such is the nature of the IBM Cell processor (at the heart of the PS3) and various modern accelerator cards.
The approach taken by the CodePlay Sieve C++ Parallel Programming System involves a high-level abstraction of parallelism, implemented using a small set of C++ language extensions. Code to be parallelized is marked using a sieve block, and split statements are inserted (explicitly by the programmer, or implicitly by the system) to indicate how the workload should be spread across multiple cores.
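To make this concrete, here is a minimal sketch of what annotated code might look like. The keyword spellings below (`sieve` for the block, `splithere` for an explicit split point) follow Codeplay's published descriptions of the system but are assumptions rather than verified product syntax:

```cpp
// Hypothetical Sieve-annotated C++; the `sieve` and `splithere` keywords
// are assumptions based on published descriptions of the system.
void scale(float* data, int n, float factor) {
    sieve {                       // mark this region for parallelization;
                                  // side effects inside are delayed
        for (int i = 0; i < n; ++i) {
            data[i] *= factor;    // independent, data-parallel work
            splithere;            // explicit hint: the workload may be
                                  // divided across cores at this point
        }
    }
}
```

The block tells the compiler which region is safe to parallelize; the split statements tell it where the work may be divided.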
Support is provided for data parallelism (via SIMD instructions and multicore loop-parallelization techniques) and for task-level parallelism, which helps accelerate more complex algorithms. The language extensions are lightweight enough to be adopted quickly by C++ developers.
Efficiently managing data is a real issue when working with separate memory spaces. To avoid unnecessary latency, the host (or designated master) processor must deliver data to the individual cores before it is needed.
Deciding when and which data to transfer is a tricky, architecture-dependent problem. Efficient solutions involve smart double-buffering and pipelining techniques. The issue is complicated further by the nature of DMA systems, the characteristics of which may differ between cores in a heterogeneous architecture.
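As a concrete illustration of double-buffering, the sketch below overlaps the transfer of the next chunk with computation on the current one. The `dma_get` and `dma_wait` calls are hypothetical stand-ins for whatever asynchronous transfer primitives a given architecture provides; real APIs differ per platform:

```cpp
#include <cstddef>

// Hypothetical asynchronous transfer primitives (assumed, not a real API).
void dma_get(void* dst, const void* src, std::size_t bytes, int tag);
void dma_wait(int tag);  // block until the transfer tagged `tag` completes

// Double-buffering: fetch chunk i+1 while computing on chunk i, so DMA
// latency hides behind useful work instead of stalling the core.
// Assumes chunk <= 4096 elements.
void process_stream(const float* remote, float* out,
                    std::size_t n, std::size_t chunk) {
    static float buf[2][4096];
    int cur = 0;
    dma_get(buf[cur], remote, chunk * sizeof(float), cur);  // prime the pipeline
    for (std::size_t base = 0; base < n; base += chunk) {
        int nxt = cur ^ 1;
        if (base + chunk < n)          // kick off the next transfer early
            dma_get(buf[nxt], remote + base + chunk,
                    chunk * sizeof(float), nxt);
        dma_wait(cur);                 // ensure the current chunk has arrived
        for (std::size_t i = 0; i < chunk && base + i < n; ++i)
            out[base + i] = buf[cur][i] * 2.0f;  // placeholder computation
        cur = nxt;
    }
}
```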
For best performance, NUMAs like the Cell require data to be pulled in by the individual SPEs. Accelerator card-based NUMAs, on the other hand, typically respond better to having the host processor push data onto the remote processing elements.
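Using the same style of hypothetical `dma_*` primitives as above, the two models differ only in which side initiates the transfer:

```cpp
#include <cstddef>

// Hypothetical transfer primitives, as before (assumed, not a real API).
void dma_get(void* dst, const void* src, std::size_t bytes, int tag);
void dma_put(void* dst, const void* src, std::size_t bytes, int tag);
void dma_wait(int tag);

// Pull model (Cell-style): each SPE fetches its own input from host memory.
void spe_fetch_input(float* local_buf, const float* host_src, std::size_t bytes) {
    dma_get(local_buf, host_src, bytes, /*tag=*/0);  // initiated by the SPE
    dma_wait(0);
}

// Push model (accelerator-card style): the host writes input into a remote
// element's memory before launching work there.
void host_push_input(float* remote_dst, const float* src, std::size_t bytes) {
    dma_put(remote_dst, src, bytes, /*tag=*/1);      // initiated by the host
    dma_wait(1);
}
```

Which side should initiate is exactly the kind of architecture-dependent decision a portable runtime has to make on the programmer's behalf.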
For a developer using Sieve, these issues disappear. The Sieve language extensions are entirely hardware independent, and architectural issues are handled transparently by the Sieve runtime. This means application code developed via Sieve is portable across the range of architectures that support a Sieve runtime.
As the number of cores available in a standard PC steadily rises, and with massively parallel architectures on the horizon, developers can’t afford to customize an application for every new configuration of cores. Sieve is one step ahead in solving the scalability problem.
A Sieve program compiled for a given processor doesn't even need to be recompiled to run on another processor from the same family with a different number of cores. This is possible because the language avoids any explicit reference to the number or type of cores in the underlying architecture.
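The principle can be illustrated in standard C++. This is not Sieve's actual runtime, just a sketch of the same idea: the application expresses work purely in terms of the problem size, and only the runtime layer ever asks how many cores exist:

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sketch of a core-count-agnostic runtime layer (illustrative, not Sieve's):
// application code calls parallel_for with the problem size and a body;
// the number of cores is discovered here, at run time, never in user code.
void parallel_for(std::size_t n, const std::function<void(std::size_t)>& body) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;            // fall back if the count is unknown
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c)
        pool.emplace_back([&body, c, cores, n] {
            for (std::size_t i = c; i < n; i += cores)   // strided partition
                body(i);
        });
    for (auto& t : pool) t.join();
}
```

The same binary therefore adapts to a dual-core laptop or a many-core server without recompilation.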
Deadlocks, race conditions, and non-determinism are well-known enemies to developers of concurrent software. A key feature of the Sieve design is determinism. A C++ program, annotated with Sieve blocks, runs the same on multiple cores as it would on a single core. The “sieved” part of an application therefore doesn’t suffer from such problems and can be developed, tested, and debugged on a single-core machine while reaping the benefits when deployed on a multicore architecture.
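Published descriptions of Sieve attribute this determinism to delayed side effects: writes to memory declared outside a sieve block are queued and applied, in program order, when the block ends. The plain C++ below mimics that discipline by hand to show why the result cannot depend on how the body was scheduled:

```cpp
#include <cstdio>
#include <functional>
#include <queue>

int main() {
    int total = 0;                          // memory "outside" the block
    std::queue<std::function<void()>> delayed;

    // The loop body's iterations are independent and could run on any
    // number of cores: the side effects on `total` are only recorded
    // here, not performed.
    for (int i = 0; i < 4; ++i) {
        int partial = i * i;
        delayed.push([&total, partial] { total += partial; });
    }

    // At the end of the "block", the queued writes are applied in program
    // order, exactly as they would be on a single core.
    while (!delayed.empty()) {
        delayed.front()();
        delayed.pop();
    }
    std::printf("%d\n", total);             // always prints 14
}
```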
While we've seen lots of success speeding up algorithms on a variety of parallel systems, we've also found that "going parallel" is generally unpredictable. Whether code will run faster, and how great the speedup will be, is governed by several interacting phenomena, including memory bandwidth, cache behavior, and DMA scheduling.
Deciding when to parallelize is also an issue. Users may see only a small performance increase (perhaps even some performance degradation) if parallelism is requested at too fine a level of granularity, as the sketch below illustrates. Our multicore profiling toolkit can help here, showing where parallelism is working well and where it is not.
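The granularity effect is easy to reproduce in ordinary C++. Neither function below has anything to do with Codeplay's toolkit; they simply contrast one task per element, where scheduling overhead swamps the work, with one task per chunk, where the overhead is amortized:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Too fine: one asynchronous task per element, so each task does far less
// work than it costs to create, schedule, and join it.
double sum_squares_fine(const std::vector<double>& v) {
    std::vector<std::future<double>> fs;
    for (double x : v)
        fs.push_back(std::async(std::launch::async, [x] { return x * x; }));
    double s = 0;
    for (auto& f : fs) s += f.get();
    return s;
}

// Coarser: one task per chunk amortizes the overhead over many elements.
double sum_squares_chunked(const std::vector<double>& v, std::size_t chunks = 8) {
    std::vector<std::future<double>> fs;
    std::size_t step = v.size() / chunks + 1;
    for (std::size_t b = 0; b < v.size(); b += step)
        fs.push_back(std::async(std::launch::async, [&v, b, step] {
            double s = 0;
            for (std::size_t i = b; i < std::min(b + step, v.size()); ++i)
                s += v[i] * v[i];
            return s;
        }));
    double s = 0;
    for (auto& f : fs) s += f.get();
    return s;
}
```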
A particularly interesting issue arises on shared-memory machines, where multiple cores simultaneously work on distinct data. This can produce strange cache behavior: sometimes the speedup is smaller than anticipated because cache misses increase, yet it often drastically improves performance thanks to a boost in cache hits. Naively, one doesn't expect to get more than a twofold speedup on a dual-core machine.
Yet for certain kinds of problems (e.g., image convolution), we’ve seen performance increase by a factor greater than the number of cores. Profiling reveals that, by happy coincidence, the multiple cores are using shared cache extremely effectively. One of CodePlay’s aims is to further develop the Sieve system to deliberately exploit this kind of caching behavior.
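A rough sketch shows why convolution behaves this way (illustrative only, not Codeplay's implementation): each core sweeps a horizontal band of the image, and a 3x3 kernel rereads every pixel up to nine times, so once a core's band fits in its share of the cache those rereads become hits instead of memory traffic, stacking a cache win on top of the core-count win:

```cpp
// One core's share of a 3x3 image convolution: rows [row0, row1) of a
// w-pixel-wide image. Splitting the image into bands shrinks each core's
// working set; when a band (plus its halo rows) fits in cache, the
// repeated neighborhood reads below hit in cache rather than in memory.
void convolve_band(const float* img, float* out, int w,
                   int row0, int row1, const float k[3][3]) {
    for (int y = row0; y < row1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)      // every pixel is reread by
                for (int dx = -1; dx <= 1; ++dx)  // up to nine neighborhoods
                    s += k[dy + 1][dx + 1] * img[(y + dy) * w + (x + dx)];
            out[y * w + x] = s;
        }
}
// e.g., core 0 takes rows [1, h/2) and core 1 takes rows [h/2, h-1);
// each band's working set is roughly half the whole image's.
```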