Programming applications was easy in the days when one chip meant one core. Single-chip solutions remain the target of many systems, especially for mobile applications. But these days, they’re likely to include more than one processing core. Programming these platforms can be a challenge.
High-end server platforms like Intel’s six-core Xeon 7460 use lots of transistors for very large, complex architectures. Systems with even more cores on a single chip are readily available as well. Chips like the 40-core Intellasys SEAforth 40C18, the 64-core Tilera TilePro64, and the 336-core Ambric AM2045 are just the beginning (see “Are You Migrating To Massive Multicore Yet?”).
Many PCs already include high-count multicore chips in the form of graphic processing units (GPUs). They’re now being made accessible for general computing and formalized with platforms like Nvidia’s 240-core Tesla C1060 (see “SIMT Architecture Delivers Double-Precision TeraFLOPS”).
Multicore solutions are on the rise because it’s becoming harder to scale single-core processors while trying to maintain the heat and power envelope necessary to make systems practical. Multicore is no longer a scaling issue, but rather a requirement to meet growing performance requirements.
Clock speed and core count don’t tell the whole story, though. Core interconnects constitute the real programming challenge. Many multicore chips don’t employ the shared-memory approach found in symmetrical-multiprocessing (SMP) platforms like the Xeon, where multithreaded applications can typically exist without regard to the number of underlying cores.
Non-uniform memory access (NUMA) architectures preserve the SMP programming model, but scaling them to large numbers of cores can be difficult. The TilePro64, for instance, manages it with 64 cores on-chip (Fig. 1).
Still, this is one reason why other approaches, such as mesh networks, are employed when cores start numbering into the hundreds or thousands. This allows designers to throw lots of hardware at a problem, though it requires a different approach to programming.
DISTRIBUTED COMPUTING FRAMEWORKS
The OpenMP portable, scalable framework supports multiplatform, shared-memory parallel programming and targets SMP systems. It also supports C/C++ and Fortran and runs on popular platforms such as Linux and Windows. OpenMP is a thread-oriented approach that maps well to existing hardware architectures. Its core elements include thread management, synchronization, and parallel control structures.
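OpenMP’s parallel control structures boil down to a few pragmas on ordinary loops. A minimal C++ sketch follows; the function and values are illustrative, not drawn from any specific application:

```cpp
// Sketch of OpenMP loop-level parallelism. Compile with an
// OpenMP-capable compiler (e.g., g++ -fopenmp); without the flag
// the pragma is ignored and the loop simply runs serially, with
// the same result.
#include <cstdio>

long long sum_of_squares(int n) {
    long long total = 0;
    // Each thread sums a chunk of the iteration space; the
    // reduction clause combines the per-thread partial sums
    // without explicit locking.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += static_cast<long long>(i) * i;
    return total;
}
```

Whether or not OpenMP is enabled, `sum_of_squares(1000)` returns 332833500; the pragma changes only how the work is divided among cores.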
The message-passing interface (MPI) standard, maintained at Argonne National Laboratory, can operate on SMP hardware and also span various networks. Several operating systems are based on message-passing communication.
OpenMPI is an open-source implementation of the MPI-2 standard. It can operate over a range of communication systems such as TCP/IP, Myrinet, and most communication fabrics found on multicore processors. OpenMPI also can be mixed with OpenMP.
Intel’s Threading Building Blocks (TBB) is another SMP-oriented framework compatible with OpenMP (see “Multiple Threads Make Chunk Change”). TBB is available as an open-source project as well. As its name says, TBB is thread-oriented, but it tends to utilize one thread per core. Each worker thread gets its work from a job queue. The application feeds the job queues.
TBB extends C and C++ using a limited number of keywords to designate blocks of code that can be performed in parallel. The same is true for data definitions that the parallel code will be working with. These blocks are typically arrays. The data and the processing jobs can be spread across the collection of cores via the worker threads. The queues may fill, but the idea is to keep the cores working instead of idle.
Built around TBB, Intel’s Parallel Studio includes Parallel Advisor (design), Composer (coding), Inspector (debug), and Amplifier (tuning). Parallel Advisor is a static analysis tool designed to identify sections of code in which TBB support will make a difference. It can also identify conflicts and suggest resolutions of these issues. This tool is especially useful for designers who are new to TBB.
Parallel Composer now brings TBB integration to platforms like Microsoft’s Visual Studio. It handles new lambda function support and is compatible with OpenMP 3.0. Parallel debugging support is also part of the package. Its “parallel Lint” capability helps identify coding errors.
Parallel Inspector is a proactive bug finder designed to augment the typical program debugger. It identifies the root cause of defects such as data race conditions and deadlock. The tool can also be used to monitor system behavior and integrity. The system is based on Intel’s Thread Checker tool.
Parallel Amplifier utilizes Intel’s Thread Profiler and VTune Performance Analyzers to provide runtime analysis that can help identify bottlenecks. These tools are designed to simplify the use of the profiler and VTune by regular programmers.
PUBLISH OR PERISH?
OpenMPI and OpenMP distribute data for processing by using arrays or communication links, but that isn’t the only mechanism employed in parallel-programming environments, especially those that are more dynamic and apt to change over time. Likewise, fixed buffers, links, and sockets don’t always address environments where the content is known ahead of time, while the suppliers/publishers and consumers/subscribers are not. A number of implementations exist to facilitate this type of environment.
One is the Object Management Group’s (OMG) Data Distribution Service (DDS). Several commercial versions of DDS are available, such as Real Time Innovations’ RTI DDS, PrismTech’s OpenSplice DDS, and Twin Oaks Computing’s CoreDX. Object Computing’s OpenDDS is an open-source option, and Object Computing provides training and support for it.
DDS uses a publish/subscribe model familiar to many programmers, but it tends to be built on a much larger environment than a single system (Fig. 2). It has been used for applications ranging from air traffic control to industrial automation.
DDS provides a loosely coupled parallel computing environment. Individual publishers and subscribers are programmed in a conventional sequential fashion. Publishers identify the material they provide to the underlying DDS framework, which distributes the data as necessary to subscribers that request such information. In a simplified form, this is how the DDS system operates.
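Stripped of quality of service, discovery, and networking, the publish/subscribe idea can be sketched in-process. This is an analogy for the model, not the DDS API, and all names here are illustrative:

```cpp
// Minimal in-process publish/subscribe sketch. Publishers and
// subscribers never see each other directly; they only know
// topic names. A real DDS implementation adds typed topics,
// discovery of remote participants, and delivery policies.
#include <functional>
#include <map>
#include <string>
#include <vector>

class Bus {
    std::map<std::string,
             std::vector<std::function<void(const std::string&)>>> topics_;
public:
    // Subscribers register interest in a topic by name.
    void subscribe(const std::string& topic,
                   std::function<void(const std::string&)> cb) {
        topics_[topic].push_back(std::move(cb));
    }
    // Publishers hand data to the bus, which fans it out.
    void publish(const std::string& topic, const std::string& data) {
        for (auto& cb : topics_[topic]) cb(data);
    }
};

// Demo: one subscriber on "temperature"; returns what it received.
std::vector<std::string> demo() {
    Bus bus;
    std::vector<std::string> received;
    bus.subscribe("temperature",
                  [&](const std::string& d) { received.push_back(d); });
    bus.publish("temperature", "21.5C");  // delivered to the subscriber
    bus.publish("pressure", "101kPa");    // no subscribers: dropped
    return received;
}
```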
Things get a little more complex when examining the details, though, because options such as quality of service and connection reliability can affect application design. One thing DDS systems do better than most parallel-programming environments is handle transient connections, because they support best-effort delivery. In many applications, it’s sufficient to retain the latest piece of information. Still, DDS systems must deal with many of the scaling and complexity issues of any parallel-programming system.
Microsoft’s Concurrency and Coordination Runtime (CCR) and Decentralized Software Services (DSS) fit somewhere in between. CCR provides scheduling and synchronization within a subsystem. These tools were initially released with the Microsoft Robotics Studio, but have quickly moved to other .NET environments unrelated to robotics (see “Software Frameworks Tackle Load Distribution”).
CCR provides asynchronous and concurrent task management with an eye to coordination and failure handling. It uses its own message passing system. Ports and port sets are the endpoints for messages.
Like OpenMPI, CCR is designed for more tightly coupled connections. DSS, layered on top of CCR, provides a lightweight, state-oriented service model that uses representational state transfer (REST), which is also used across a range of Internet communication. In fact, XML-based communication runs nicely over TCP/IP links, though this isn’t a requirement. The DSS Protocol (DSSP) uses the XML-based Simple Object Access Protocol (SOAP).
DSS has some publish/subscribe semantics. As a result, it can advertise the availability of a service or piece of information. It also can have any number of controllers utilizing input from real-time sensors.
TURNING GRAPHICS ON ITS SIDE
These parallel programming platforms target general-purpose processing architectures. However, the multicore GPUs found in most high-performance 3D video adapters from companies such as AMD, ATI, and Nvidia are also readily available.
The ATI Stream Processing and Nvidia GeForce and Tesla platforms allow the respective GPUs to find applications beyond just video rendering. Many of these applications are graphics-related. However, several others simply use the hundreds of cores in these GPUs for other computational purposes.
GPU architectures tend to be unique since they were designed for video rendering of 3D games, but they’re general enough to handle other chores. For example, Nvidia’s single-instruction multiple-thread (SIMT) architecture uses thread-processing arrays (TPAs) of eight cores each. These TPAs are grouped in threes into thread-processing clusters (TPCs).
Nvidia developed a framework dubbed the Compute Unified Device Architecture (CUDA) to handle its SIMT-based GPUs (Fig. 3). CUDA support can be found in the company’s latest device drivers, so any PC equipped with one of its GPUs is a potential supercomputer—well, at least a little supercomputer. CUDA programs are written in C. Other programming languages like Fortran and C++ are also being added to the list.
CUDA hides much of the underlying complexity of the SIMT architecture. In fact, it’s been generalized so that it can address almost any memory-based multicore platform. CUDA now supports the Khronos Group’s OpenCL. The Khronos Group is a member-funded consortium that supports open standards such as OpenCL and OpenGL. OpenGL is a 2D and 3D graphics application programming interface (API).
Open Computing Language (OpenCL) is a standard for parallel programming that supports, but is not restricted to, GPUs. It even supports IBM’s Cell processor (see “CELL Processor Gets Ready To Entertain The Masses”) found in Sony’s PlayStation 3, as well as DSPs.
OpenCL can handle a heterogeneous environment. Therefore, a mix of x86 chips, GPUs, and DSPs could merrily crunch on loads of data. It has garnered wide support, so this scenario is actually feasible. It can even fit on mobile platforms.
Also, OpenCL has a platform model with a controlling host and multiple compute units. The compute units execute kernels, which are small chunks of code. This model is seen elsewhere with Nvidia’s SIMT architecture as well as Intel’s TBB.
Further, OpenCL uses a relaxed memory consistency model. It doesn’t guarantee consistency of common variables across a collection of workgroup items, unlike an SMP system, where a variable has one location that’s equally accessible by any core. This is because many of the target platforms feature distributed memory with a core often having its own local memory.
OpenCL puts some limitations on the programming model. For example, pointers to functions aren’t allowed. Data pointers within a kernel block are allowed, but they may not be an argument. The restrictions make it possible to transparently map the application to the wide range of architectures supported by OpenCL.
Frameworks like OpenCL are likely to be adopted to support new hardware architectures, but vendor-provided programming tools will often be the first step. Likewise, some architectures work best when the developer can exploit their features through programming tools designed specifically for that architecture.
One such example is Forth programming support for the 40-core Intellasys SEAforth 40C18. Each core has only 512 words of RAM and ROM. Each 18-bit word contains four instructions. Unlike some other multicore solutions, the SEAforth cores aren’t designed to run one large program. Instead, they run very small, cooperative programs. In fact, three cores can be used to handle the dynamic RAM interface.
The XMOS XS1-G4 has hardware scheduling of up to eight tasks per core with four cores per chip. The hardware scheduling makes it easy to write drivers for soft peripherals or handle the hard interfaces such as 32 XLink channels. These are used for communication between cores and chips.
Channel communication is so ingrained in the system that the XC compiler, an extended version of C, brings channels into the base language. Communication is explicit, and XMOS makes it a basic part of its approach to parallel programming.
Parallel programming on SMP architectures deals with virtual memory, pointers, and multithreading facilities that have been in common use for decades in languages like C, C++, and Java. Network cluster programming using TCP/IP and sockets has also been prevalent.
These programming techniques can be used in many core environments. However, explicit control and communication can make programming tasks in these environments difficult as the number of cores increases. One area in which many cores make fast work is array computation.
Programming languages like the Mathworks’ Matlab offer matrix manipulation support. Many matrix computations map very well to a range of hardware architectures, though some architectures handle some operations better than others. For example, SMP architectures in which cores have simultaneous access to all memory can easily handle random access operations, versus architectures with just local memory.
These architectures have a high latency for accessing information that isn’t local, making operations like matrix inversion a challenge. This is one reason why GPUs and clusters of cores can handle some algorithms exceptionally well while others will work very poorly.
Matlab’s array-processing support is something any runtime can provide. So while this approach is applicable to any programming language, it only addresses some parallel-programming chores. For other chores, there’s the Parallel Computing Toolbox.
The Parallel Computing Toolbox adds features such as parallel for loops, distributed arrays, and message-passing functions. Message-passing functions address MPI-style programming, but the other features highlight the deficiencies of conventional programming languages. Adding these types of parallel computing services illustrates how programming languages are changing.
In scatter-gather, a typical parallel-programming pattern, data is distributed for processing. Then the results are gathered together, often with additional processing, to combine the results. This dataflow control can be a challenge for conventional control flow languages, but it’s second nature for National Instruments’ LabView.
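The scatter-gather pattern can be sketched in conventional C++, using std::async as a stand-in for whatever distribution mechanism a real dataflow or cluster runtime would provide (the function name and chunking scheme are illustrative):

```cpp
// Scatter-gather sketch: split the input into chunks, process the
// chunks concurrently, then gather the partial results and combine
// them. chunks must be >= 1.
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

long long gathered_sum(const std::vector<int>& data, int chunks) {
    std::vector<std::future<long long>> parts;
    size_t step = (data.size() + chunks - 1) / chunks;
    for (size_t lo = 0; lo < data.size(); lo += step) {        // scatter
        size_t hi = std::min(lo + step, data.size());
        parts.push_back(std::async(std::launch::async, [&data, lo, hi] {
            return std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);
        }));
    }
    long long total = 0;
    for (auto& f : parts) total += f.get();                    // gather
    return total;
}
```

With the numbers 1 through 100 split four ways, `gathered_sum` still returns 5050; the point is that the partial sums are computed concurrently and combined in a separate gather step.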
LabView’s graphical programming language is also a dataflow language: programmers specify how data moves through the system, and sequencing is a secondary issue. That’s not to say sequential programming isn’t part of LabView. In fact, loop and conditional constructs will be part of any LabView program.
Many designers will be interested in how LabView works under the hood on a conventional processor. In the simplest case, pending operations are placed in a job queue. A thread reads the queue and performs the operation, potentially posting new jobs in the queue.
This scenario is the same one used by Intel’s TBB. As with TBB, there may be multiple worker threads, and their number tends to match the number of cores. Fewer threads leave the hardware underutilized, while more tend to result in idle threads.
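The job-queue scheme can be sketched with standard threads. This is a simplification of what TBB or the LabView runtime actually do; real schedulers add work stealing, priorities, and much more:

```cpp
// Job-queue sketch: worker threads (typically one per core) pull
// jobs from a shared queue until it is drained and shut down.
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobQueue {
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void post(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
    // Workers loop here; they exit once shutdown is called and the
    // queue is empty.
    void work() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // a job may post new jobs back onto the queue
        }
    }
};

// Demo: post njobs trivial jobs, run nworkers workers, count completions.
int run_demo(int njobs, int nworkers) {
    JobQueue q;
    std::atomic<int> completed{0};
    for (int i = 0; i < njobs; ++i)
        q.post([&completed] { completed.fetch_add(1); });
    std::vector<std::thread> workers;
    for (int i = 0; i < nworkers; ++i)
        workers.emplace_back([&q] { q.work(); });
    q.shutdown();
    for (auto& t : workers) t.join();
    return completed.load();
}
```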
Asynchronous I/O doesn’t delay the working threads. Instead, an entry is added to the queue when a background operation is complete.
In theory, job distribution and processing can be handled by a large number of cores, potentially using other hardware like GPUs. National Instruments is researching these areas now—dataflow semantics allow LabView to target more than conventional single-core and SMP platforms.
FPGA application design is naturally parallel. It also works well with graphical design tools, so it’s no surprise that LabView applications target FPGAs. LabView applications can be split across FPGAs and computing platforms.
Graphical dataflow languages like LabView aren’t common, though a few are available, such as the Mathworks’ Simulink and Microsoft’s Visual Programming Language (VPL).
The dataflow approach can be seen as a message-passing model. Implementations like LabView operate with a fine-grain resolution. Move to a coarser level of control, and the actor model emerges. Actors are objects that receive and send messages. They tend to be components rather than complete applications.
Ambric’s Am2045 Massively Parallel Processing Array 336-core chip is programmed using Java. Restrictions exist, primarily on size, because of the memory resources available within a core. Essentially, Ambric implements a message-based actor model. Each core executes an active object/actor with messages being sent and received using a straightforward channel interface.
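A minimal actor can be sketched as a thread that owns a private mailbox and communicates only by sending messages. This illustrates the model, not the Ambric or Erlang API; the names are made up for illustration:

```cpp
// Actor sketch: an actor shares no state with the outside world and
// reacts only to messages arriving in its mailbox.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template <typename Msg>
class Mailbox {
    std::queue<Msg> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(Msg m) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(m)); }
        cv_.notify_one();
    }
    Msg receive() {  // blocks until a message arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Msg m = std::move(q_.front());
        q_.pop();
        return m;
    }
};

// A tiny "doubler" actor: receives integers, replies with their
// doubles, and stops when it receives a negative value.
int run_doubler() {
    Mailbox<int> in, out;
    std::thread actor([&] {
        for (;;) {
            int v = in.receive();
            if (v < 0) return;
            out.send(v * 2);
        }
    });
    in.send(21);
    int result = out.receive();
    in.send(-1);   // poison message: tell the actor to stop
    actor.join();
    return result;
}
```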
Actors and parallel programming are old friends. Programming languages like Erlang have been used to implement robust distributed applications. Erlang was originally designed by Ericsson with an eye toward fault tolerance.
Scala is a newer programming language that addresses the actor model (see “If Your Programming Language Doesn’t Work, Give Scala A Try”). It originally was designed to run atop a Java virtual machine. Scala also implements the functional programming model.
Functional programming is a bit more than just calling functions. It’s a programming model that avoids state and mutable data. This turns out to be good for parallel execution, but is at odds with most conventional programming languages where variables are designed to be changed at will.
Languages that incorporate functional programming aspects, such as Scala, are considered impure functional languages because variable values can change. Pure functional languages can’t modify the contents of a variable. At this point, functional programming is often more of an academic than commercial concern when it comes to implementation.
One advantage of a pure functional programming language is its referential transparency. Calling any function with a set of parameter values will always generate the same result. This is true for many of the functions implemented in conventional languages where it’s possible to use a functional programming style. However, that’s the case only if they don’t retain any state or access information outside of the function that can change.
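The difference can be made concrete in a few lines. The function names are illustrative, and only the first function below is referentially transparent:

```cpp
// Referential transparency: a pure function's result depends only on
// its arguments, so calls can be cached, reordered, or replicated
// across cores freely. The impure variant touches hidden state, so
// the same call yields different results.
#include <cmath>

double pure_hypot(double a, double b) {
    return std::sqrt(a * a + b * b);   // same inputs, same output, always
}

double scale = 1.0;                     // hidden, mutable state
double impure_hypot(double a, double b) {
    scale *= 2.0;                       // side effect on every call
    return scale * std::sqrt(a * a + b * b);
}
```

Because `pure_hypot(3, 4)` is always 5, a runtime is free to run the call on any core or reuse an earlier result. The impure version returns a different value on every call, so no such freedom exists.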
Guaranteeing that a function always returns the same value for a set of parameters means the code can be replicated. This type of distribution will be critical as the number of cores rises to the thousands and system-wide shared memory becomes a special case rather than the norm.
Likewise, unchanging variables means distribution of data can occur by copying information without regard to its source. This is akin to data that’s transmitted via a message-passing environment.
Unfortunately, programming with a pure functional programming language isn’t easy, especially for developers whose background is in non-functional languages. One of the more notable pure functional programming languages is Haskell, which is named for Haskell Curry, a mathematician and logician.
The Haskell language appeared in the 1990s. Its features include pattern matching, single-assignment semantics, and lazy evaluation. Lazy evaluation allows a function to return a list whose contents have not yet been generated.
The value of the list entries is computed when they are evaluated. This leads to the concept of an infinite list. It’s similar to generator functions or objects found in conventional languages such as C++. However, the next value isn’t returned through an explicit function call but rather when a value is evaluated.
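A rough analogy in a conventional language is a generator that produces values only on demand, so the underlying sequence is conceptually infinite. Haskell does this implicitly when a list element is evaluated; here the demand is explicit, and the helper names are made up for illustration:

```cpp
// Lazy-sequence analogy: values of f(0), f(1), f(2), ... are computed
// only when demanded, so the sequence itself never has to be stored.
#include <functional>
#include <vector>

// Build a generator over the conceptually infinite sequence f(i).
std::function<long()> lazy_sequence(std::function<long(long)> f) {
    long i = 0;
    return [i, f]() mutable { return f(i++); };
}

// Force the first n elements, like `take n` on a Haskell list.
std::vector<long> take(std::function<long()> next, int n) {
    std::vector<long> out;
    for (int k = 0; k < n; ++k) out.push_back(next());
    return out;
}
```

Taking the first five elements of the sequence of squares yields 0, 1, 4, 9, 16; none of the later squares is ever computed.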
Monads are an interesting abstract data-type concept that Haskell supports to address I/O, typically an area where side effects are common in conventional programming implementations. Monads are similar to lazy infinite lists, as they generate information on demand. Monads are object/method-oriented in implementation, though, making them easier to use in many instances.
While functional programming can be challenging, it can have significant benefits for parallel programming.
Debugging needs to be addressed regardless of the parallel programming approach. Existing debuggers are simply the starting point, because most don’t address many of the features inherent in parallel programming, such as messaging, data distribution, and loading.
Tools like tracing, profiling, and optimizers will need to handle lots of data as well as provide insight into the parallel nature of the application. Tools created in academia are moving quickly to the production side. Real-time monitoring tools and declarative debuggers are just some areas where new ideas can come into play. Parallel programming will play an important role in taking advantage of the multicore hardware that’s being delivered.