Electronic Design

Software Frameworks Tackle Load Distribution

Multiple-core designs can go in several directions when it comes to distributing the load

Networking and multicore chips inevitably create greater complexity, which requires solutions replete with sophisticated communication and thread management. Distributing the load is the desired result, allowing more hardware to be incorporated into a system in a coordinated fashion. The development task isn’t easy, but the use of one or more software frameworks can alleviate some of the stresses.

“Software frameworks,” a wonderfully vague term, covers a lot of ground. It’s been applied to language-based platforms like Java and Microsoft’s .NET as well as graphical interfaces, Web server platforms, and a host of other software areas.

While distributed communications and process coordination includes Intel’s Thread Building Blocks (see “Threads Make The Move To Open Source” at www.electronicdesign.com, ED Online 16538), it also includes Data Distribution Service (DDS) for Real-time Systems, the Message Passing Interface (MPI), and the Concurrency and Coordination Runtime (CCR) that’s part of the Microsoft Robotics Studio (see “MS Robotics Studio” at ED Online 16631).

In theory, all three could be used within a system since they address different issues, though developers typically try to minimize the number of frameworks within a design simply to reduce complexity. Still, the use of frameworks can significantly cut down the amount of new code required for a job. It also can address features that otherwise might not be considered, such as multilevel security and redundancy.

Likewise, these data-centric frameworks often provide scalability that’s not available with other approaches. In fact, all three are designed to scale to very large environments with thousands of nodes. While these frameworks are commonly deployed in large networks, they’re also of interest in more compact solutions with ever-increasing amounts of cores on a chip or board.

The DDS for Real-time Systems Specification comes from the Object Management Group (OMG), which brought such favorites as UML (Universal Modeling Language) and the Common Object Request Broker Architecture (CORBA). CORBA is another data-distribution system, but it employs direct connections and remote procedure calls (RPCs), whereas DDS uses a publish/subscribe architecture (Fig. 1). Companies such as Real-Time Innovations and PrismTech provide DDS implementations. Other publish-subscribe models include the Java Message Service (JMS).

Unlike direct connect systems like CORBA and Sockets, DDS information may go nowhere. The concept of lifespan means the validity of information may expire before delivery. The DDS system handles delivery, which is based on the type, called topic, of data that’s generated. The ability to specify different information and quality of service (QoS) by subscribers and publishers allows designers to create large and robust systems. The system uses a platform-independent model that makes DDS easier to deploy in heterogeneous environments.

At the heart of DDS lies the Data Centric Publish-Subscribe (DCPS) layer. This interface enables applications to describe and publish data that’s automatically distributed to subscribers, which specify the type of data they’re looking for. Policies can be used to control parameters such as resource limits and QoS. Subscribers can also track liveliness that indicates whether the publisher is still alive if it hasn’t generated any data in some time.

Data can be filtered by content or by time, which is a critical divergence from the normal explicit multithreading constructs. It provides a significant performance optimization, since tests can be performed at the source with only the requested data being sent to a subscriber. Deadlines, latency, and other timing aspects are well defined with DDS, not implicit within application code.

Like most communication systems, DDS defines support for dataflow routing, discovery, and data typing. DDS also provides data filtering, transformation, and connectivity monitoring services plus the ability to specify redundancy and replication, and delivery effort. It can also handle resource management and status notifications.

The Data Local Reconstruction Layer (DLRL) is an optional portion of the DDS specification. It provides seamless integration with the native languages like C and C++. Typical DDS runtimes are decentralized in their implementation. This allows for a more robust solution and one that’s at home with transient connections. Overall, DDS takes a relatively simple publish-subscribe model and wraps it in a rather extensive test of functions that can service a wide range of applications.

MPI is typically used in high-performance computing (HPC), though the architecture applies to a range of applications with a large number of processors. It provides a number of communication mechanisms that work on a range of platforms, from shared-memory systems to networked computers (Fig. 2). Designed for use in applications where processes cooperate with each other, it can handle groups or collections of processes as well as hierarchical process groups.

MPI implementations support the standard MPI application programming interface (API) in addition to the runtime support necessary for both message passing and distributed thread control. The system supports blocking and non-blocking messaging. It also provides fine-grain control of remote threads and processes. MPI generally is built atop sockets-based standard protocols such as TCP/IP.

Processes are grouped into objects called communicators. Communication can occur within the communicator as well as between communicators. Each process within a communicator has a rank. A communicator can have one or more contexts that partition the communication space. It also controls the scope of communication.

MPI addresses a range of communication methodologies, from basic pointto- point messaging to parallel operations such as scatter-gather distribution as well as broadcasting and summation-style operations. These operations include sum, max, and min in addition to user-defined functions.

Continue to page 2

The system can handle almost any logical or physical communication topology. Common topologies include rings and matrices. MPI provides a way of defining and controlling messaging within these communication architectures. It also supports matrix data mapping with respect to messages. This support for matrix topologies and data isn’t surprising, since MPI is often employed in scientific applications that do lots of matrix manipulation.

The latest version of MPI, MPI-2, includes a number of features such as remote memory access (RMA). RMA bypasses the send/receive protocol, permitting one-sided initiation and completion of operations. MPI-2 also added parallel I/O and dynamic process and thread management.

MPI includes a profiling interface that helps debug and tune a system. Applications are typically written in C/C++ and Fortran, and there are interfaces for a number of other languages such as Java.

Microsoft Robotics Studio is built on two main components: the CCR and the XML-based Decentralized Software Services Protocol (DSSP). The CCR is a managed code library (DLL) accessible from any language targeting the .NET 2.0 Common Language Runtime (CLR) (Fig. 3). This includes the Compact Framework, Windows CE, Windows XP, and Windows Vista in addition to the Windows server products.

The CCR addresses robotics applications that often attempt to exploit multiprocessor and multicore platforms that must deal with asynchronous operations on a regular basis. But the CCR equally applies to service-oriented applications that require a robust communication and scheduling system.

Like DDS, CCR is designed for loosely coupled systems where components are often developed and deployed independently. These systems frequently require robust failure and isolation support. The theory behind the CCR is to move programmers away from conventional, errorprone methods to explicitly synchronize access to that shared memory. Typical thread primitives, such as locks and monitors for shared memory, give way to scheduling and protocol design of the messaging system, which tends to scale much better than explicit memory management.

Ports are used to accept incoming messages. These messages are queued and passed along to arbiters that determine what to do with the data and what code will be dispatched to handle the information. Arbiters, which can be combined, are designed to handle multiple requests.

The CCR uses dispatcher objects with thread and message queues to provide synchronization and distribution of jobs. Threads are reused, and communication essentially employs a messaging model that enables the environment to be distributed more easily. The approach lends itself to object-oriented programming languages like Microsoft’s C#, where applications may have thousands of objects requiring thread services. This approach to pooling threads isn’t unique to CCR. It’s common in other frameworks, such as Java 2 Enterprise Edition (J2EE).

CCR is designed to work closely with C# and other .NET-based languages. For example, code using C# iterators can employ the yield statement that permits implicit definition of an enumerator delegate. The integration allows for implementation of CCR-style iterators using the same programming style.

None of these three systems answers all of the problems that may be encountered in a distributed multithreaded environment, but each has its place. As frameworks, one or more may be used as part of an application or set of applications. The key is to eliminate as much work for the developer while providing services that would otherwise need to be developed as part of the application. Utilization of standard frameworks provides access to developers with framework experience, as well as a support system.

General frameworks like MPI and DDS can be utilized on a range of processing platforms. But a system’s architecture often may require frameworks that are more specific. Examples include Mercury Computer Systems’ MultiCore Framework (MCF) for the IBM Cell processor (Fig. 4) and NVidia’s CUDA (Compute Unified Device Architecture) for the G8X GPUs (graphics processing unit), which are found on NVidia’s Tesla computing adapters (Fig. 5) as well as on a number of NVidia video cards.

MCF takes advantage of the multiple synergistic processor elements (SPEs) in the Cell processor. Each SPE has its own local memory. The Power processor element (PPE) has access to the large main memory. While the PPE supports virtual memory, the SPEs do not. The SPEs do the heavy lifting, so it is important to keep them doing real work. That means the care and feeding of memory is critical to efficient SPE operation.

The MCF, which runs on the SPEs and PPE, minimizes data access latency while providing overlapped communication and computation on the SPEs. It enables designers to choose the computational and storage granularity for a problem with the idea of limiting SPE code to DMA setup and computational chores. The system breaks memory up into blocks called tiles.

A tile contains data as well as a descriptor that includes details like its virtual memory address on the PPE side. The MCF manager runs on the PPE and feeds an input channel with tiles that are then distributed to worker tasks running on the SPEs.

Result tiles from the worker application are placed in an output channel that the PPE reads. The SPEs run a small, 12-kbyte kernel that manages the DMA data exchange and the worker code used on the SPE. Multiple tasks can run on a single SPE.

Continue to page 3

The system supports sparse matrices, including the ability to transpose result data. Likewise, it lets developers write applications for the SPE that do not have to deal with the memory- management complexities of the Cell architecture. Furthermore, Cell processor systems deal with stream video processing—another area where NVidia’s approach is used, but the chip architecture is much different.

NVidia’s CUDA also is tasked with optimizing a developer’s time in creating applications that take advantage of a GPU. CUDA includes a native C compiler along with FFT and BLAS libraries, a profiler, and a gdb debugger, as well as a host runtime driver. The adapters typically plug into a 16x PCI Express slot.

Sample applications address operations such as image convolution, binomial option pricing, and Sobel edge detection. CUDA provides host support for Fortran and has a MathWorks Matlab plug-in. It works with the Tesla adapters in addition to a number of NVidia graphics adapters.

CUDA has a similar memory-management issue like the Cell processors, but the interaction is more flexible. The complexity comes in the type of processors, the data flow, and their management. The chips were designed for graphics processing, so they are amenable to streaming algorithms of this type. But the architecture can handle more general algorithms as well.

Because of the limitations, though, CUDA addresses thread and memory management. For example, a fast shared memory region typically can be used for texture lookups in graphics applications. But it also can be used for general communication among threads. The architecture uses a SIMD model so groups of 32 threads, called thread blocks, run simultaneously, executing the same code but on different data.

The CUDA framework’s host processor is physically separated from the computational elements, unlike the Cell, which incorporates the PPE on the same chip as the SPEs. This means that additional worker cores can be added in the CUDA approach by adding more adapter boards.

It also means the framework must account for this capability.

AMD also has its ATI line of graphics cards and a FireStream GPU adapter on par with NVidia’s Tesla, so it is not surprising to find a similar framework to support a similar streaming type of architecture. Likewise, its video gaming roots are apparent in some target applications such as physics simulation support for games.

AMD calls its framework the Stream Computing Software Stack. Its native development tool is based on the Brook C/ C++ compiler. The Brook compiler incorporates data parallel computing ideas and targets a range of platforms. It also adds new data types such as streams plus scatter/ gather operations. The AMD incarnation addresses the BrookGPU subset that accounts for the advantages and limitations of a GPU like the FireStream.

Frameworks such as those mentioned in this article will be critical in taking advantage of multicore platforms, whether they’re heterogeneous or homogeneous, because they remove much of the management complexity of the underlying system from the programmer’s perspective.

Brook Language
Mercury Computer System
Message Passing Interface Forum
Object Management Group
Real-Time Innovations


Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.