Electronic Design

The Multicore Era Seeks A Parallel Paradigm

Scalability, simpler debugging, and easier coding are essential to developing a successful parallel-programming approach.

Parallel programming is hard. But debugging it is even harder. Unfortunately, taking advantage of multicore solutions like Intel’s 80-core TeraScale prototype will require some type of parallel-programming technique (Fig. 1).

The first challenge is to find parallelism that can be exploited. The next is using a tool to exploit the parallelism. Another goal is bug-free code. Parallel programming opens the door to a range of more complex bugs, though, and time becomes even more critical. Finally, there’s the issue of targeting the host platform with these tools.

At this point, generic solutions don’t exist because of the range of multicore hardware. Tools primarily target only one class of hardware or even one vendor’s hardware. Programmers typically push these jobs off to the operating system or runtime. Eventually, though, parallel-programming constructs will make it into mainstream programming languages. Either way, developers will need multicore solutions to take advantage of performance improvements, since single-core scaling is no longer an option in pushing the limits.

Pushing the job of managing coarse-grain parallelism onto the operating system is common and easy to do. It works well if there’s a large number of programs, or if those programs already take advantage of multiple cores. This requires no modification of the applications, but it’s of less value if there aren’t enough programs to exploit the hardware.

Server environments typically have program loads that can use the target hardware. Likewise, embedded application designers can latch onto virtual-machine (VM) products like Trango’s Hypervisor, Green Hills Software’s Integrity, VMware’s namesake, and KVM or Xen on Linux to manage multicore solutions. These tools allow for better management and debugging of programs and systems in addition to providing features like load leveling.

VM architectures potentially open up other avenues for programmers. Thin operating systems or programs running alone in a VM may be given access to features previously restricted to the operating system, such as virtual memory management and peripheral access.

Virtual memory management could enable programmers to manage memory and interprocess and intra-application communication more effectively. For multicore utilization, communication is key to good use of the system. The big question is whether programming languages or runtimes will take this approach.

After VMs, runtimes are the most common method for exploiting multicore environments. Platforms like Intel’s Threading Building Blocks (TBB) require developers to explicitly use exposed function calls to utilize the runtime.

This approach forces developers to determine the type and utilization of parallelism in an application and meld it with the runtime. In turn, the runtime will also need to manage parallelism. The functional interface can help narrow the scope for finding parallelism, though it puts the onus on the programmer to use the right function.

Usually, the interface to the runtime is implemented strictly through function or class definitions, though customizing a compiler offers advantages as well. TBB employs a typical interface, much like the following definition for the parallel_do function:

template<typename InputIterator, typename Body> void parallel_do( InputIterator first, InputIterator last, Body body );

In general, parallel processing deals with data or control parallelism. The above definition takes advantage of TBB’s C++ support and C++ templates. Specifically, TBB addresses data parallelism over large data sets, such as matrices or streams of data.
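The data-parallel idea behind an interface like parallel_do can be sketched with nothing more than standard C++ threads. The helper below, parallel_apply, is a hypothetical name and is not part of TBB. It assumes random-access iterators and a fixed thread count, and it simply hands each thread a contiguous slice of the range. A real runtime would balance the work far more carefully.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical helper (not part of TBB): applies body to every element in
// [first, last), splitting the range into contiguous chunks, one per thread.
template <typename RandomIt, typename Body>
void parallel_apply(RandomIt first, RandomIt last, Body body,
                    std::size_t num_threads = 4) {
    assert(num_threads > 0);
    const std::size_t total =
        static_cast<std::size_t>(std::distance(first, last));
    const std::size_t chunk = (total + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, total);
        // Each worker touches a disjoint slice, so no locking is needed
        // as long as body only modifies the element it is given.
        workers.emplace_back([=] {
            std::for_each(first + begin, first + end, body);
        });
    }
    // Wait for every slice to finish before returning to the caller.
    for (std::thread& worker : workers) worker.join();
}
```

Calling parallel_apply(v.begin(), v.end(), [](int& x) { x *= 2; }) on a std::vector<int> doubles every element. The programmer still has to know that the body is safe to run on independent elements, which is the cooperative burden described above.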

Microsoft’s Concurrency and Coordination Runtime (CCR), which was released with Microsoft’s Robotics Studio (see “MS Robotics Studio,” ED Online 16631), also uses a functional interface but addresses control parallelism (see “Software Frameworks Tackle Load Distribution” at www.electronicdesign.com, ED Online 18813). In this case, CCR helps optimize asynchronous communication between threads that may be distributed among multicore platforms or even across networks.
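To make asynchronous communication between threads concrete, here is a minimal sketch in standard C++. The Port class and the sum_posted_values function are hypothetical names chosen for illustration. They are only loosely inspired by the idea of posting messages between threads and do not reproduce CCR’s actual .NET API.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical message port for illustration; this is not CCR's API.
template <typename T>
class Port {
public:
    // The sender does not wait for a receiver: enqueue, then wake one waiter.
    void post(T message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(message));
        }
        ready_.notify_one();
    }

    // Blocks until a message is available, then removes and returns it.
    T receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T message = std::move(queue_.front());
        queue_.pop();
        return message;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
};

// A producer thread posts the values 1..count; the caller receives and sums them.
int sum_posted_values(int count) {
    Port<int> port;
    std::thread producer([&port, count] {
        for (int i = 1; i <= count; ++i) port.post(i);
    });
    int sum = 0;
    for (int i = 0; i < count; ++i) sum += port.receive();
    producer.join();
    return sum;
}
```

The two threads never share data directly. They coordinate only through the port, which is the style of control parallelism that a runtime of this kind tries to make cheap.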

As with any runtime, programmers must account for a mindset and an underlying architecture. They work with it all the time, since applications rarely are completely standalone or written solely by a single programmer. Consequently, there’s at least some level of black-box isolation within an application. On the other hand, complex frameworks like TBB or CCR require a good understanding of the underlying architecture.


Putting an additional level between the programmer and the base system sometimes can help, too. This is the case with Microsoft’s PLINQ (Parallel Language Integrated Query) technology, which is an extension of LINQ. PLINQ and LINQ are designed to simplify access to data sources such as SQL servers.

What sets PLINQ and LINQ apart from SQL or other interfaces like XPath and XQuery is that they provide a data-source-agnostic, type-safe query language that’s embedded in a number of Microsoft’s .NET-based languages (such as C#). Since database use is ubiquitous in many applications, parallelizing queries can significantly boost application performance.

Again, finding parallelism is a cooperative process with programmers needing to know what functions to utilize. The advantage for programmers is that they only need to learn a single query language regardless of the data source. PLINQ was designed to maintain the programming model provided by LINQ while offering additional parallel functionality.

Integrating LINQ/PLINQ functionality within the compiler has advantages in the sense that syntactic changes are easier. It wreaks havoc on portability, though, limiting the solution to Microsoft platforms. New approaches like this also mean fighting conventions like SQL with new syntactic ordering such as:

var q = from x in Y where p(x) orderby x.f1 select x.f2;

As with most programming syntax, one person’s sugar is another’s salt. Still, being able to completely embed the solution with a programming language can simplify a programmer’s job of learning a system, and parallel constructs won’t be utilized if they’re hard to use or remember.
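For readers more at home in C++, the query above can be spelled out as the three steps it names. This is only a sequential sketch of what the syntax expresses, not how LINQ or PLINQ executes it. The Record type and the run_query function are hypothetical, and the int fields are an assumption made to keep the example short.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical record type; f1 and f2 mirror the field names in the query.
struct Record {
    int f1;
    int f2;
};

// Spells out: from x in Y where p(x) orderby x.f1 select x.f2
std::vector<int> run_query(const std::vector<Record>& Y,
                           const std::function<bool(const Record&)>& p) {
    // where p(x): keep only the records that satisfy the predicate.
    std::vector<Record> matches;
    std::copy_if(Y.begin(), Y.end(), std::back_inserter(matches), p);

    // orderby x.f1: a stable sort, so ties keep their source order.
    std::stable_sort(matches.begin(), matches.end(),
                     [](const Record& a, const Record& b) {
                         return a.f1 < b.f1;
                     });

    // select x.f2: project each surviving record onto one field.
    std::vector<int> result;
    result.reserve(matches.size());
    for (const Record& x : matches) result.push_back(x.f2);
    return result;
}
```

The embedded query states the same filter, sort, and projection in one line, and it leaves the runtime free to decide how to carry them out.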

Of course, playing with syntax and semantics does allow compiler and systems designers to add features that would otherwise be hard to incorporate by staying strictly within the bounds of a current programming language definition. For example, PLINQ adds the idea of lazy evaluation in the form of infinite streams.

Using a stream within a query lets the system access only those items needed to complete the current transaction. A simple example would be a stream query that has results being returned one at a time. If the stream has already supplied the data when a result is requested, the application continues. Otherwise, it waits while the next stream element is calculated.
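The on-demand behavior can be illustrated with a small C++ sketch. LazyStream and first_match are hypothetical names, not PLINQ constructs, and the sketch is single-threaded. Each request triggers the calculation of exactly one element, and nothing beyond the last requested element is ever computed.

```cpp
#include <functional>
#include <utility>

// Hypothetical lazy stream: conceptually infinite, but element i is
// computed only when a consumer asks for it.
class LazyStream {
public:
    explicit LazyStream(std::function<long(long)> element)
        : element_(std::move(element)) {}

    // Computes and returns the next element, then advances the position.
    long next() {
        ++evaluations_;
        return element_(index_++);
    }

    // Number of elements computed so far.
    long evaluations() const { return evaluations_; }

private:
    std::function<long(long)> element_;
    long index_ = 0;
    long evaluations_ = 0;
};

// Pulls elements until one satisfies pred; nothing past it is computed.
// The caller must ensure a match exists, since the stream never ends.
long first_match(LazyStream& stream, const std::function<bool(long)>& pred) {
    for (;;) {
        const long value = stream.next();
        if (pred(value)) return value;
    }
}
```

With a stream of squares, LazyStream squares([](long i) { return i * i; }), asking for the first element greater than 50 computes the elements for indices 0 through 8 and stops at 64.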

PLINQ provides a range of parallel-processing enhancements, such as the ability to run multiple threads on a partitioned data space as well as pipelining requests. Of course, each enhancement has its own issues, such as whether physical or temporal locality of data is critical to the application or the operation being performed.
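Running multiple threads over a partitioned data space can be sketched in standard C++ as follows. The function partitioned_sum is a hypothetical helper, not PLINQ’s implementation. It assumes contiguous partitions of roughly equal size, which favors physical locality of data, and it exposes the partition count as a parameter because that choice is exactly the kind of tuning decision at issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums data by giving each task its own contiguous
// partition, then combining the partial results.
long long partitioned_sum(const std::vector<int>& data,
                          std::size_t partitions) {
    assert(partitions > 0);
    const std::size_t chunk = (data.size() + partitions - 1) / partitions;

    std::vector<std::future<long long>> partials;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        // std::launch::async asks for each partition to run on its own thread.
        partials.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin,
                                   data.begin() + end, 0LL);
        }));
    }

    // Gather step: wait for each partition and combine its partial sum.
    long long total = 0;
    for (std::future<long long>& partial : partials) total += partial.get();
    return total;
}
```

Too few partitions leave cores idle, while too many add thread overhead, so the best value depends on the data and the hardware.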

Likewise, partitioning queries can have a major impact on the resulting performance and efficiency (Fig. 2). As the number of cores, threads, and communication methods increases, so does the number of options. And regardless of whether you’re using TBB, CCR, or something else, it’s difficult to get the costs right.

The number of cores in a system may be large, but runaway computation can waste such a resource. This may not even be apparent from a user’s perspective, since a result may be delivered in a timely fashion. But developers will need more insight, including more time-oriented diagnostics.

Mainstream languages like Basic, C, C++, C#, and Java include multithreading support. However, all thread and data management is explicit. They form the basis for the parallel runtimes, but runtime designers often perform some interesting feats that most programmers would rather forget or not even want to learn about.
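What explicit thread and data management looks like in practice can be seen in a short standard C++ sketch. The function name count_with_explicit_threads is hypothetical. The programmer creates every thread, guards the shared counter by hand, and joins every thread before reading the result.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Every step is explicit: thread creation, locking of shared data, joining.
int count_with_explicit_threads(int num_threads, int increments_per_thread) {
    int counter = 0;           // Shared data, visible to all threads.
    std::mutex counter_mutex;  // Guards every access to counter.

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, &counter_mutex, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Without this lock, concurrent increments would race.
                std::lock_guard<std::mutex> lock(counter_mutex);
                ++counter;
            }
        });
    }

    // The result is only safe to read after every thread has finished.
    for (std::thread& thread : threads) thread.join();
    return counter;
}
```

Forgetting the lock or a join compiles without complaint and fails only intermittently, which is one reason debugging such code is harder than writing it.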

Research projects like Unified Parallel C add to the syntax and semantics of an existing language. Still, programmers are loath to adopt such changes unless they see widespread adoption or a platform they must use supports the tools.

Another issue is the existing infrastructure and semantics for most of the mainstream languages. For example, shared memory is the norm. Yet it’s a concept that doesn’t scale well, while pointers and references are central to languages like C or Java.

Several different approaches, such as using futures for lazy function evaluation, are similar to the PLINQ infinite stream example noted earlier. This approach is commonly used in functional programming languages like Miranda and Haskell, though these examples definitely aren’t mainstream.


Likewise, Scheme, a dialect of Lisp, employs the functions delay and force to implement the idea of futures. A function’s computation can be delayed until a result is forced, though it may only be the part of the result that’s being examined. If the result is a list and the value of the first item is forced, then only that item needs to be computed.
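The delay and force pair maps fairly directly onto a deferred future in standard C++. In the sketch below, the function names delay and force are borrowed from Scheme for illustration and are not standard C++ functions. The sketch covers a single delayed value rather than a lazily built list.

```cpp
#include <future>
#include <utility>

// delay: wrap a computation so it does not run until the result is forced.
template <typename F>
auto delay(F computation) -> std::shared_future<decltype(computation())> {
    // std::launch::deferred postpones the call until get() is first invoked.
    return std::async(std::launch::deferred, std::move(computation)).share();
}

// force: runs the delayed computation on first use; later calls reuse
// the stored result instead of recomputing it.
template <typename T>
T force(const std::shared_future<T>& promise) {
    return promise.get();
}
```

For example, auto p = delay([] { return 6 * 7; }); performs no work until force(p) is called, and a second force(p) returns the stored value without running the lambda again.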

This approach as well as other parallel-programming methods such as scatter-gather are used in a range of applications already, from database servers to disk queuing to memory caching, with the well-accepted look-ahead methods. These features need to be incorporated into programming languages, but deciding how and when is a difficult task. Various features do find their way into the mainstream eventually. For example, lambda expressions are cropping up in C# and Java (see “Lambda: Reclaiming An Old Concept,” ED Online 18099).

While researchers may be scurrying to move parallel language enhancements into the mainstream, some platforms are already there. National Instruments’ LabVIEW has been supporting parallel dataflow semantics since its inception, as well as time-based programming aspects that blend well because of LabVIEW’s graphical nature.

LabVIEW isn’t the only graphical programming language that supports dataflow semantics, but it’s one of the more mature products. It brings parallel processing semantics down to the graphical statement level. In fact, LabVIEW tends to push parallel processing to the other extreme, where hundreds of expressions may be pending evaluation.
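Dataflow semantics can be approximated in a text-based language with futures. The sketch below is plain standard C++ and has nothing to do with LabVIEW’s implementation, and the function name dataflow_example is hypothetical. Two nodes depend only on the inputs, so they may run in parallel, while a third node fires only when both of their results are available.

```cpp
#include <future>

// Dataflow graph:  a, b -> sum     -+
//                  a, b -> product -+-> product - sum
int dataflow_example(int a, int b) {
    // These two nodes have no dependency on each other, so they can run
    // concurrently as soon as their inputs a and b are available.
    std::future<int> sum =
        std::async(std::launch::async, [a, b] { return a + b; });
    std::future<int> product =
        std::async(std::launch::async, [a, b] { return a * b; });

    // The final node waits for both upstream results before it executes.
    return product.get() - sum.get();
}
```

In a graphical dataflow language this ordering falls out of the wiring. Here the programmer has to express each dependency by hand.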

Prioritizing computation tends to be more difficult compared to sequential text-based programming languages, but that’s the tradeoff. Every language has its own advantages and disadvantages, and none of them, not even LabVIEW, answers all problems equally well.

One aspect handled well by National Instruments with its LabVIEW implementation is splitting a model/program across platforms. This is critical for parallel programming because many architectures are hybrids with multiple instances of multiple-core platforms. Multiple-core platforms are normally linked by shared memory while instances are normally linked using other techniques. Many other approaches tend to fall down in this area because they address only a single architecture.
