Get Ready For Some Hard Work With Multicore Programming

1 of Enlarge image

Fig 1. Adapteva’s Epiphany 32-bit cores are linked via a rectangular mesh.

Fig 2. Intel’s 22-nm chip technology is used with its Knights Corner Many Integrated Core (MIC) platform.

Fig 3. The Knights Corner platform ties Intel Architecture (IA) cores via a high-speed loop.

Clusters with thousands of cores are standard fare these days. Chips with hundreds of cores can be found in the average desktop and laptop PC in the form of GPUs that are now being used for computational chores in addition to their graphical work.

Programming the current crop of server, desktop, and laptop systems is relatively straightforward. However, things get a bit more challenging as the number of cores increases by an order of one or two magnitudes. For example, servers may consist of a few multicore chips providing a symmetrical multiprocessing (SMP) system with dozens of cores supporting virtual machines (VM) allowing a system to run hundreds of VMs.

In the past, a high-speed bus or switch system would link multiple cores within a chip. But the latest architectures tend to take a non-uniform memory architecture (NUMA) approach. Chips typically have a significant amount of on-chip memory often in the form of L1, L2, and possibly L3 caches. Intel, AMD, and other companies have incorporated multiple memory controllers alongside the cores as well.

On-chip cores have fast access to their L1 caches with longer latencies as data is located farther from the core. Off-chip access requests are routed through the chip interconnect. Intel and AMD utilize multiple point-to-point connections between chips and route requests through adjacent chips if they do not have the desired information.

Unfortunately, moving to a very large number of cores requires different approaches, such as a mesh or a more general network/cluster technique. Network interfaces like Serial Rapid IO (SRIO), InfiniBand, and Ethernet have been used to implement clusters. Message packets handle communication. Latency tends to be high compared to on-chip communication.

SRIO has the lowest overhead and smallest packets, and it provides end-to-end handshaking. InfiniBand tends to handle larger messages and has been the choice for many supercomputer clusters. Ethernet has significant overhead and latency issues, but it is ubiquitous and the backbone for the Internet. All three address off-chip communication, but on-chip the designs are more varied.

SMP NUMA Chips

Many companies are working on research chips with hundreds or thousands of cores, and a few are delivering them now. Tilera’s 64-bit Tile-Gx line suits communication applications, while the company’s latest products address the cloud and server clusters (see “Single Chip Packs In 100 VLIW Cores” at electronicdesign.com). Adapteva’s 32-bit Epiphany (Fig. 1) architecture addresses embedded signal processing (see “Multicore Array Targets Embedded Applications” at electronicdesign.com).

Tilera and Adapteva both pair each processing core with a communications switch that is part of a communications mesh. Each has its own implementation and features, but they are similar as the mesh consists of multiple communication networks for carrying different types of information.

Tilera also includes caching support in this network because it implements a sophisticated cache coherency mechanism that allows cores to be grouped together into a “computational island.” These cores permit isolation so a single chip can run many operating environments at one time. Access to memory within one of these regions is the same regardless of the cores involved since the caches are distributed given the additional latency to access data associated with another core.

Adapteva’s cores have similar access latency issues and capabilities, but the cores do not have any local cache. Likewise, all cores have access to all memory within the system. Each core has 32 kbytes of storage shared between code and data. This approach puts more responsibility into the hands of the programmers, but it greatly simplifies the chip design. It also reduces complexity and power requirements. The Epiphany cores target embedded applications where a designer normally will be managing the entire system, and details such as security do not arise.

These two approaches highlight just some of the challenges that programmers encounter when dealing with a large number of cores. These systems are only efficient if most of the cores are doing useful work. A sequential algorithm running on a single core gets no advantage from its neighbors.

Intel’s Larrabee (see “Intel Makes Some Multicore Lemonade” at electronicdesign.com) has led to Intel’s 22-nm Knights Corner Many Integrated Core (MIC) platform (Fig. 2). Knights Corner has 32 1.5-GHz cores that support quad threading, providing support for 128 threads. It also provides cache coherency among the cores using a loop that was used in Larrabee. The loop connects to the Graphics Double Data Rate version 5 (GDDR5) memory controllers, I/O, and some fixed-function logic (Fig. 3).

Shared memory can be used for interprocess communication and to implement message passing systems. On the other hand, some hardware designs implement message passing in hardware, which can simplify software design. These systems tend to be easier to grow when multiple-chip systems are created.

Message Passing Chips And Systems

A message-passing mesh network ties together the 24 tiles on Intel’s Single-chip Cloud Computer (SCC). Each tile has a five-port router with one port dedicated to the two x86 IA cores. The two cores share a 16-Mbyte message buffer. Each core has its own L1 and L2 cache. Four Double Data Rate 3 (DDR3) memory controllers are spaced around the communications mesh.

The mesh network has a bandwidth of 2 Tbytes/s. Like Adapteva’s Epiphany, the SCC uses a fixed X-Y routing mechanism. The difference is that Epiphany works at a word level while Intel’s chip works at the message level. The SCC routers support two message classes and implement eight virtual channels.

SeaMicro didn’t put lots of cores onto a single chip but rather built a single system out of 256 dual-core, 1.66-GHz Intel Atom N570 chips (see “512 64-Bit Atom Cores In 10U Rack” at electronicdesign.com). A 1.28-Tbit/s toroidal fabric connects these 64-bit chips. Message passing is used to communicate between chips. The fabric also connects the processors to SATA storage controllers and Ethernet controllers.

Multiprogramming approaches

Parallel programming approaches are varied and sometimes depend upon the underlying hardware architecture. Many approaches require shared memory, while message-based solutions can be mapped to a range of architectures.

Message-passing interfaces (MPIs) include basic sockets available with most networking platforms. In their most basic form, they are character- or record-oriented pipes between tasks. Extensible Markup Language (XML) has changed the way data is shipped around by eliminating many of the issues with binary data at the expense of more overhead.

The MPI Forum provides an MPI standard library that is often used for task-level parallel programming communication. It was designed to work with everything from clusters to massively parallel machines.

The MPI specification includes processes, groups, and communicators. A process can belong to one or more groups. A communicator provides the link between processes and includes a set of valid participants. An intracommunicator works within a group. An intercommunicator works across groups. Communicators are created and destroyed dynamically.

OpenMP (open multiprocessing) is a shared-memory application programming interface (API) that’s often used with SMP systems. Support is required in the compiler, and OpenMP support is available for a range of programming languages like C, C++, and Fortran. It is simpler than MPI but lacks fine-grain task control mechanisms. Also, it doesn’t support GPUs.

The Multicore Association (MCA) offers specifications that are designed to provide portability between operating systems in addition to interoperability between systems. Its OpenMCAPI (open multicore communications API) facilitates basic multicore shared-memory communication.

Released in 2008, the MCAPI 1.0 specification defines a lightweight interprocessor communication (IPC) system that supports a range of environments including SMP, asymmetrical multiprocessing (AMP), clusters, and even ASICs and FPGAs.

MCAPI provides rudimentary message passing support, allowing developers to build more sophisticated systems on top. It can take advantage of shared-memory systems as well as other architectures. The MCA Multicore Resource Management API (MRAPI) complements MCAPI, providing application-level resource management. It is applicable to SMP and AMP systems.

Shared-memory systems lend themselves to matrix operations, and multicore platforms like GPU work well. The MathWorks’ Matlab is well known for its numeric computational support, which can now take advantage of GPUs in addition to multicore SMP systems (see “Mathworks Matlab GPU Q And A” at electronicdesign.com). Matlab can hide the underlying computation system. Some Matlab applications can get significant acceleration by running on GPUs, but many will run equally well or better on CPUs.

GPUs opened up to non-graphical display applications with CUDA for NVidia’s GPUs (see “SIMT Architecture Delivers Double-Precision Teraflops” at electronicdesign.com). NVidia’s CUDA supports programming languages like C, C++, and Fortran but, as noted, only for NVidia’s hardware.

Kronos Group’s OpenCL (open computing language) is a more generic solution that runs on a wide range of GPUs and CPUs (see “Match Multicore With Multiprogramming” at electronicdesign.com). It provides a similar architecture and functionality as CUDA.

In OpenCL terminology, a kernel is a function that can be run on a device, like a GPU, that can be called by a host program. It is akin to a remote procedure call (RPC). OpenCL can run on a CPU that is also the host, but typically OpenCL is used to take advantage of other computing resources.

An OpenCL kernel has an OpenCL program associated with it. The work done by a program is defined within the context of an ND-Range that contains work-groups that in turn contain work-items. Work-items are the same as CUDA threads. The OpenCL model is based on a single-instruction multiple-thread (SIMT) architecture found in GPUs.

Work-items perform computations. The work-group defines how work-items communicate with each other. The SIMT approach has many threads working on different data but running the same code. Not all applications map well to this architecture but many do, allowing GPUs to improve performance by a factor of 10 to 100 compared to CPUs because of the hundreds of cores found in a GPU.

A host application defines all the components using the OpenCL API, including memory allocations on the device, and supplies the kernel program, which is offered in source form and compiled at runtime. This approach provides portability, and the overhead is minor since calculations are typically being performed on large arrays of data.

Likewise, the compilation is only performed once when the kernel is set up, not on each call to a kernel. Parameters sent to a kernel are queued, enabling the host to provide data asynchronously. Results are read from the queue messages.

CUDA and OpenCL are relatively new to parallel computing and oriented toward data parallism. The Object Management Group (OMG), also purveyors of the Unified Modeling Language (UML), has done a lot to standarize parallel computing.

UML can be used to define parallel processing, but most programmers are likely to be using tools based on other OMG specifications like the Common Object Request Broker Architecture (CORBA) (see “Software Frameworks Tackle Load Distribution” at electronicdesign.com) and Data Distribution Service (DDS) (see “DDS V1.0 Standardizes Publish/Subscribe” at electronicdesign.com).

Distributed Tasks

CORBA provides a remote procedure call environment designed to work in a heterogeneous network of cooperating tasks. It uses an interface definition language (IDL) to specify how data is mapped between systems, allowing the underlying system to marshall data that may have different formats, compression, or encoding such as Little Endian versus Big Endian integers. Standard mappings are provided for all major programming languages like Ada, C, C++, Java, and even COBOL and Python with non-standard implementations for many other languages.

Also, CORBA uses object request brokers (ORB). An ORB provides a program access to a CORBA network of ORBs. ORBs additionally provide a range of services including directory services, scheduling, and transaction processing. ORBs from different sources can work together.

CORBA implementations tend to be large, addressing complex applications. Several systems provide similar services such as Microsoft’s DCOM (distributed component object model) and .NET Framework or Java’s Remote Method Invocation (RMI). Some approaches provide a lighter-weight environment more amenable to embedded systems. Others provide a hierarchical implementation that allows low-end nodes to be part of a more complex environment.

The OMG DDS represents a class of publish/subscribe approaches to data distribution that differs from the RPC approach. With DDS, a source will publish information as topics, and any number of subscribers can access these topics. Data is provided as it becomes available or is changed.

DDS can be complex like CORBA because of the range of issues involved including details like marshalling, flow control, quality of service, and security. Other publish/subscribe interfaces are often less complex than DDS because they are designed to work with specific operating systems or languages. Likewise, many approaches mix parallel communication methodologies together.

Microsoft’s Decentralized Software Services (DSS) is a lightweight, REST-style (Representational State Transfer), .NET-based runtime environment that was originally part of the Microsoft Robotic Developer Studio (see “Frameworks Make Robotics Development Easy—Or Easier, At Least” at electronicdesign.com). Of course, lightweight is a relative term as DSS is quite sophisticated and complex.

DSS is designed to run on top of the Concurrency and Coordination Runtime (CCR) that was also part of the Robotic Development Studio. It turns out that DSS is quite useful for other applications as well. DSS is Web-based, so a DSS node has a Universal Resource Identifier. DSS provides a mechanism to dynamically publish and identify services.

For robotics, another alternative to Microsoft’s DSS/CCR is the open-source ROS, Robot Operating System (see “Cooperation Leads To Smarter Robots” at electronidesign.com). ROS is also Web-based, but it moves many of the issues such as security to external network support.

Parallel Languages

Programming languages like the MathWorks’ Matlab provide parallel operation when dealing with matrix operations that can take advantage of GPUs if available. The Parallel Computing Toolbox adds language features like parallel for-loops, special array types, and parallelized numerical functions.

Intel’s Parallel Studio follows a similar approach (see “Dev Tools Target Parallel Processing” at electronicdesign.com). Its features include Thread Building Blocks designed to express C++ task-based parallelism (see “Parallel Programming Is Here To Stay” at electronicdesign.com). The latest incarnation offers functions like parallel_pipeline, which provides a strongly typed, lambda-friendly pipeline interface.

Another part of Parallel Studio is Intel’s Cilk++ software development kit (SDK). Cilk++ is a parallel extension to C++ via the Parallel Studio C/C++ compiler. It adds keywords like cilk_for, cilk_spawn, and cilk_sync. Like most parallel programming language extensions, Cilk++ is designed to allow low-overhead creation, management, and synchronization of tasks. The runtime supports load balancing, synchronization, and intertask communication.

Some programming languages like Java already incorporate advanced task management. Ada is one well established programming language that should be considered when looking into parallel programming. It has excellent task management support, and its use in applications requiring a high level of safety and reliability highlights why it can help in parallel processing applications that tend to be large and complex.

Ericsson initially developed the Erlang programming language for telephone management applications. It uses an actor model for concurrency. Erlang also provides automatic garbage collection like Java. A functional language subset has strict evaluation, single assignment, and dynamic typing. Functional programming languages have many benefits when it comes to parallel programming.

Haskell is one of the premier functional programming languages (see “Embedded Functional Programming Using Haskell” at electronicdesign.com). Parallel programming support is in a state of flux as Haskell is also used for research. One approach employs “strategies” that can be tuned for a particular application or host.

Haskell’s Hindley-Milnet global type inference provides many advantages in dealing with parallel algorithms, but functional programming’s single assignment (immutable variables), lazy evaluation, and use of monads will bend the minds of programmers used to C/C++ and Java.

Google’s Go and Scala, two languages designed from scratch along more conventional lines compared to Haskell, provide a concurrent programming environment (see “If Your Programming Language Doesn’t Work, Give Scala A Try” at electronicdesign.com). They draw some aspects from functional programming as well as other programming languages. Scala is not a superset of Java, but it is close.

Go includes lightweight threads called goroutines that communicate using pattern matching channels. Scala runs on a Java virtual machine (JVM). It includes parallel collections that have a collection interface but can process the contents using parallel semantics.

National Instruments’ LabVIEW is one of the few dataflow programming languages (see “LabVIEW 2010 Hits NI Week” at electronicdesign.com). Dataflow semantics allow computations to occur in parallel, and LabView applications map nicely to parallel hardware including FPGAs and GPUs. LabVIEW’s graphical nature tends to make the parallel operations more apparent to programmers.

There are many choices when it comes to parallel programming. Incremental enhancements like Cilk++ or OpenCL may be suitable in many instances, but developers should not overlook more radical changes.