Multiprocessing Software Tackles The 21st Century

In the best of all worlds, programmers would write applications without thinking about the target systems on which they'll run. Processor dependencies such as cache size, memory bandwidth, and others would be ignored. Multiprocessor (MP) architectural dependencies like memory sharing, number of processors, and network bandwidth also would be forgotten. And of course, tools would be supplied to automatically provide an efficient executable with minimal effort.

Today, available compiler technology can do a good job of optimizing for processor architectures. Compilers even deal well with shared-memory multiprocessor systems. But in the next century, we'll have to develop the tools to enable this level of simplicity on distributed-memory multiprocessors. Efforts are under way to ease the use of data-flow libraries. There's even a group working on standardizing this capability, which will result in application portability. The real challenge is in automatically decomposing a single application to run on multiple processors. Research has started, and usable solutions shouldn't be too far away.

One thing is sure: the development of multiprocessor architectures continues to increase dramatically. At the high end, SGI/Cray, Sun, IBM, HP, Compaq, and other vendors are all producing MP systems based on their workstation technologies. The availability of high-performance networking lets users build their own systems. A number of firms in the embedded area supply multiple processors on a single board. A few vendors, including SKY Computers, are able to efficiently connect many multiprocessor boards into high-performance-computer (HPC) systems.

Multiple Solutions, Challenges: Reading about multiprocessor architectures, however, is like alphabet soup. There are symmetric multiprocessors (SMPs), nonuniform memory access (NUMA), network of workstations (NoW), distributed shared memory (DSM), and others. There are tradeoffs in each of these architectures, and there isn't any single best solution for all MP requirements. Yet all multiprocessor systems, regardless of architectural proclivity, present significant software challenges.

Software Issues: The first issue to be addressed is maximizing the performance from each processor. The most common reason for purchasing an MP system is that a single processor doesn't have the CPU or I/O performance to solve problems fast enough. If the application can be tuned to run faster on every processor, the number of processors can be reduced, thereby decreasing the MP system's size and complexity.

Developing a multiprocessor application is seldom simple or transparent. Vendor-supplied libraries are available to support specific hardware features. They also may include shared-memory functions, a multithreading library, or a message-passing communications library. Sometimes, users can "flip a switch" and convert a uniprocessor application to an MP application. The state of the art, however, is such that this simple conversion isn't feasible in all configurations. Even when it is, the resulting performance isn't necessarily as good as expected.

Managing communications can become a significant challenge when the size of the system increases. If only a few processors are involved, it's possible to keep track of the communications manually. When there are hundreds or even thousands of processors, automation is required to supervise the configuration. In addition to the size of the problem, the configuration may change between application runs because of the availability of resources, or because hardware has failed.

Finally, there's the "ease of use" issue. This may be an overused marketing term, but the application programmer really doesn't want to deal with the complexities of an MP system. The goal is to get the application running with minimum effort and maximum performance.

CPU Performance: Maximizing the processor's performance is a relatively well-understood problem. Compilers were the first programmer productivity enhancers developed. The current technology is very advanced. Each new processor needs to have a compiler in order to be successful in the marketplace.

Achieving additional performance is possible for some applications, such as DSP and image processing, if a vector-processing library that has been tuned for the processor is provided. Getting the best performance from these libraries usually involves some amount of coding in assembly language. Fortunately, vector libraries are usually available from the vendor.

To date, these libraries haven't been standardized. An application written for one processor family doesn't port easily to another processor. The Vector Signal and Image Processing Library (VSIPL) proposed standard, funded by the Defense Advanced Research Projects Agency (DARPA), is attempting to address this issue. The effort brings together major vendors and users to define a common application programming interface (API) for signal and image processing. The expectation is that vendors will implement the standard and that users will gain enhanced portability in their applications.

A number of vendors have developed compilers that automatically vectorize. These compilers understand the processor/cache/memory architecture. They also can generate vector-optimized executables from sources that are very portable. This capacity first appeared on the Cray supercomputers. It has been implemented by Digital for the Alpha, as well. SKY offers this capability for embedded real-time applications.

Multiprocessor Support: There are two classes of software support for MP environments: shared memory and message passing. In shared-memory systems, the application can execute on multiple processors, and every processor can directly access the same data. When multiple executable images are used, each image attaches to the same data through system primitives. Alternatively, a single executable can start multiple threads, each running on a separate processor. The threads all access the same data without special system calls.
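The threaded shared-memory model described above can be sketched in a few lines of Python. This is only an illustration of the programming model, not any vendor's library: the function name `scale_in_place` and the way the data is sliced are assumptions made for the example. In the standard CPython interpreter the threads are interleaved rather than run on separate processors, so the sketch shows how the data is shared, not the speedup.

```python
import threading

def scale_in_place(data, factor, num_threads):
    """Scale every element of `data` using threads that share the same list."""
    # Ceiling division so that every element falls into some thread's slice.
    chunk = (len(data) + num_threads - 1) // num_threads

    def worker(start, stop):
        # Each thread writes directly into the shared list; nothing is copied
        # or sent. The slices do not overlap, so no lock is needed here.
        for i in range(start, stop):
            data[i] *= factor

    threads = []
    for t in range(num_threads):
        start = t * chunk
        stop = min(start + chunk, len(data))
        th = threading.Thread(target=worker, args=(start, stop))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()  # wait until every thread has finished its slice
    return data

samples = list(range(8))
scale_in_place(samples, 2, num_threads=4)
print(samples)
```

Note that the workers never call a communications routine. They simply index into `data`, which is the convenience that shared memory offers.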

These shared-memory systems often provide compiler support to generate parallel executables with relatively minor enhancements to a uniprocessor application. The enhancements are in the form of compiler directives. They help the compiler identify sections of the application which can be executed in parallel on the available processors. Like automatic vectorization, this technology is well understood and mature. The performance achieved depends on the architecture and application. Additionally, there's better support for Fortran applications than for other languages.

If shared memory isn't available, message passing must be used to exchange data among multiple executables and to coordinate them. For a number of applications, message passing also provides better performance even on shared-memory architectures. Message passing has been available as long as there have been networks. But in practice, the software support isn't as mature as that provided by the parallel compilers.

A number of libraries have been invented over the last 10 years or so to simplify building multiprocessor applications. Parallel Virtual Machine (PVM), Message Passing Interface (MPI), and, more recently, MPI/RT are all attempts to produce standards that are available on multiple platforms. Also, vendors have produced proprietary libraries that provide optimal performance for their systems. These libraries, though, aren't portable. MPI and MPI/RT should be portable, but they often don't provide the desired performance. Using the libraries is generally likened to employing "assembly language for communications." The user must program the communications in far more detail than the application itself should require.
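To give a feel for the level of detail involved, here is a minimal message-passing sketch in Python. It does not use MPI or PVM. As an assumption made to keep the example self-contained, threads stand in for separate executables and `queue.Queue` objects stand in for the network channels, and the names `worker` and `distributed_sum` are hypothetical. The workers share no data and cooperate only through explicit sends and receives.

```python
import queue
import threading

def worker(inbox, outbox):
    """Receive blocks of numbers, reply with each block's sum, stop on None."""
    while True:
        block = inbox.get()        # explicit receive
        if block is None:          # sentinel message: no more work
            break
        outbox.put(sum(block))     # explicit send of the partial result

def distributed_sum(values, num_workers):
    """Sum `values` by mailing one slice to each worker and collecting replies."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, results))
               for q in inboxes]
    for th in threads:
        th.start()
    # The programmer decides how the data is split and who receives each piece.
    for rank, inbox in enumerate(inboxes):
        inbox.put(values[rank::num_workers])
        inbox.put(None)
    # One reply is expected from each worker, in whatever order they arrive.
    total = sum(results.get() for _ in range(num_workers))
    for th in threads:
        th.join()
    return total

print(distributed_sum(list(range(100)), num_workers=4))
```

Even for a simple sum, the code has to spell out the data split, the destination of every message, the termination signal, and the number of replies to wait for.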

Communications Management: A few libraries are trying to simplify the communications problems. They typically layer on top of packages like MPI. Other packages, such as SPE, KHOROS, GEDAE, and RTExpress, provide a way to describe pipelined data flow.

An application is built by creating executables that take their inputs and outputs through the library. These executables are then configured to run on a number of processors forming a parallel module. The output of one module is fed to the input of the next. Between modules, a data-reorganization step supplied by the library partitions the data as needed, and possibly performs a transpose operation as well (see the figure).
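The data reorganization between two modules can be sketched as follows. This is a single-process Python illustration under stated assumptions, not the API of any of the packages named above: the functions `partition_rows`, `reorganize`, and `run_module` are hypothetical, and each block in a list stands for the data held by one processor of a module.

```python
def partition_rows(matrix, num_processors):
    """Split a list of rows into contiguous blocks, one per processor."""
    base, extra = divmod(len(matrix), num_processors)
    blocks, start = [], 0
    for p in range(num_processors):
        size = base + (1 if p < extra else 0)  # spread any leftover rows
        blocks.append(matrix[start:start + size])
        start += size
    return blocks

def reorganize(blocks, num_processors):
    """Gather one module's output, transpose it, and re-partition it."""
    gathered = [row for block in blocks for row in block]
    transposed = [list(col) for col in zip(*gathered)]
    return partition_rows(transposed, num_processors)

def run_module(blocks, func):
    """Apply `func` to every row of every block (one block per processor)."""
    return [[func(row) for row in block] for block in blocks]

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Module 1: two processors, each doubling the rows it holds.
stage1 = run_module(partition_rows(matrix, 2), lambda row: [2 * x for x in row])

# Reorganization: transpose so module 2 works along the other dimension,
# then spread the result over three processors.
stage2_input = reorganize(stage1, 3)

# Module 2: each processor sums the (former) columns it now holds as rows.
stage2 = run_module(stage2_input, sum)
print(stage2)
```

The two modules run on different numbers of processors and along different dimensions of the data, yet neither module contains any code for moving data. That work sits entirely in the reorganization step.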

Along with organizing the data flow, these libraries simplify configuration of a potentially large number of processors into processing modules. To start the application, the communications paths must be defined, buffers must be allocated, and a number of other bookkeeping chores must be managed.

In MPI, these chores involve a large number of calls. When managed by a data-flow library, the call total drops to just a few. Another benefit is that hardware failures can be managed more easily. Available processors are allocated to the various modules according to a simple algorithm. If a processor fails, rather than requiring a manual reconfiguration, the failed processor is simply removed from the available set. The remaining processors are then assigned according to the algorithm.
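The article doesn't say which allocation algorithm such libraries use. As one plausible sketch, the Python function below hands out processors to modules in proportion to a per-module weight. The function name `assign_processors`, the weights, and the module names are all assumptions made for the illustration.

```python
def assign_processors(available, module_weights):
    """Give each module a contiguous share of processors, sized by weight."""
    procs = sorted(available)
    if len(procs) < len(module_weights):
        raise ValueError("fewer processors than modules")
    total = sum(module_weights.values())
    assignment, start, cumulative = {}, 0, 0
    remaining = len(module_weights)
    for name, weight in module_weights.items():
        remaining -= 1
        cumulative += weight
        stop = round(len(procs) * cumulative / total)
        stop = max(stop, start + 1)               # at least one processor each
        stop = min(stop, len(procs) - remaining)  # leave one for each later module
        assignment[name] = procs[start:stop]
        start = stop
    return assignment

weights = {"filter": 1, "fft": 2, "detect": 1}  # hypothetical module weights
processors = set(range(8))

before = assign_processors(processors, weights)
print(before)

# Processor 3 fails: drop it from the available set and rerun the same rule.
processors.discard(3)
after = assign_processors(processors, weights)
print(after)
```

Handling the failure takes one line to remove the processor and one call to recompute the assignment. No module has to be reconfigured by hand.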

Automatic Communications: Some vital tools are starting to appear on the market. They can take a uniprocessor application and convert it to run on a distributed-memory architecture using a communications library like MPI. These tools are being developed by universities and national research labs with access to very large, high-performance computer systems. The scientists performing simulations on these systems face the same issues as the programmers creating radar applications. They want to work on their algorithm, not on the communications problem.

The state of the art here isn't very mature yet. These tools have had some limited success. But it still takes some effort to tune the application by inserting directives into the code to coerce the tool into producing an efficient application. When the tool succeeds, the effort expended is significantly less than what would have been required if the programmer had to add the communications calls manually. Unfortunately, for the present there are many applications in which a conversion tool doesn't help.