In the past, chip manufacturers increased processor clock speeds to boost performance, moving from 100 MHz to 200 MHz and more recently into the multigigahertz range. Today, however, raising clock speeds further is not viable because of power-consumption and heat-dissipation constraints. Instead, vendors have moved to entirely new architectures with multiple processor cores on a single chip.
With multicore processors, programmers can complete more total operations than with one core alone. But to take advantage of multicore processors, programmers must reconsider how they develop applications. In short, programmers now have to work harder to achieve continued performance improvements.
Sequential programs saw performance improvements as a result of increases in processor clock speed. Upgrading to a computer with a faster CPU meant that each instruction in a series would run faster. To continue seeing performance gains with multicore systems, developers need to design applications to divide the work among cores—in essence, develop a parallel application instead of a sequential one.
An examination of multithreaded programming concepts shows how developers can take advantage of multicore processors.
Multitasking
The first key topic is multitasking. This is of paramount importance today as engineers want to surf the Web, check e-mail, listen to iTunes, and use job-related development and business tools all at the same time.
To understand multitasking, consider the analogy of a mechanic who operates a small car repair shop:
- The mechanic defines the scheduling and priority of his customers—the operating system (OS) plays this role in a computer.
- The mechanic has a car hoist that does the heavy lifting to facilitate repair—the CPU plays this role in the computer.
- The mechanic has other employees to help do the work faster—the speed or clock rate of the CPU determines how fast the work can be completed.
- The mechanic has customers with different needs—this role is played by the different applications running on a computer.
How does the mechanic decide which car to service first if several customers are waiting? It comes down to a scheduling issue.
OSs face this same dilemma of deciding which task is most important to execute, how fast the task needs to be completed, and, in general, how best to use the available system resources. A basic scheduling algorithm such as round robin, used in some OSs, defines time slices for each task and ensures that every task is treated equally.
A more sophisticated scheduling concept found in commercial OSs is preemptive multitasking, in which tasks share a CPU but a higher-priority task can jump to the front of the execution line. For example, if an ambulance needs service, the mechanic most likely will repair it first so it can answer emergency calls. While multitasking appears to be the answer to completing several actions at one time, only the most important task is guaranteed to execute in a preemptive multitasking scheme.
Now consider the car hoist, which is doing all of the heavy lifting. It enables work but, at the same time, it can become a bottleneck. At the point when too many customers are waiting in the shop at the same time, the mechanic has to turn down work because there is no way to fix all of the cars.
One way to repair more cars faster is to hire more employees. Another answer would be to add more hoists so mechanics could fix more cars simultaneously.
Adding more computing engines to perform the heavy lifting is the only way to increase overall work throughput. This was not so much a choice as a necessity for silicon manufacturers that could no longer increase CPU clock speeds.
It is estimated that clock speeds will continue to slowly climb, finally reaching an absolute limit around the 10-GHz range. Instead of fighting the inevitable, chip vendors are producing multicore chips.
OSs support multicore processors through symmetric multiprocessing (SMP), meaning that an OS may schedule threads ready to run on any available CPU. Adding a second core to a computer is analogous to the mechanic who adds a second car hoist; in other words, the overall potential for work throughput is doubled.
This means consumers can achieve true multitasking because multicore CPUs can divide the applications among cores, as shown in Figure 1.
While the OS can handle multitasking for all the applications running on a system, software developers may want to create applications that are broken into unique tasks to use the available power in a multicore system. This can be accomplished by breaking an individual application into threads.
Multithreaded Programming
The Basics
A process is the data, code, and other information required by an OS to execute instructions. It is made up of one or more threads, each representing a unique sequence of instructions that is independent of the others.
Threads share the same state of the process in which they reside and can access the same functions and data. In other words, the threads residing in the same process see the same things. Applications that are not multithreaded are executed in the main() thread.
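To make this concrete, consider a minimal sketch in C using POSIX threads (Pthreads); the worker function and its message are illustrative, not part of any particular API:

#include <pthread.h>
#include <stdio.h>

/* Worker function: runs as its own instruction sequence,
   independent of main(), but shares the process's memory. */
void *worker(void *arg)
{
    printf("Hello from worker thread %d\n", *(int *)arg);
    return NULL;
}

int main(void)
{
    pthread_t thread;
    int id = 1;

    /* Create a second thread within the same process. */
    pthread_create(&thread, NULL, worker, &id);

    /* Wait for the worker to terminate before main() exits. */
    pthread_join(thread, NULL);
    return 0;
}

Built with gcc -pthread, the program prints the worker's message before main() returns; without the join, main() could exit while the worker is still running.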
One question that arises with moving to multithreaded architectures is, "How much code overhead is all this threading going to take up in an application?" In terms of lines of code, the answer is: very little. Roughly 99% of an application will be the same as it was before, and the other 1% will be spent on thread management and synchronization. Keep in mind, however, that this 1% of code may take a large percentage of the development time.
The real work required in developing a multithreaded application is not so much creating the threads but instead managing the states of the threads and ensuring proper communications and synchronization between the threads. When the application is running, the threads can run in one of four states: Running, Ready to Run, Blocked, or Terminated. These states must be managed effectively so the correct results of an application are achieved.
Thread Management
Because threads have access to the same memory, programmers must ensure that a resource is not used by more than one thread at the same time. If simultaneous access is possible, a thread must block other threads from inadvertently using the resource via structures called locks and semaphores. An application that takes this into account is called thread safe.
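As a minimal sketch using a Pthreads mutex (the shared counter and function name are illustrative), a lock makes a shared update thread safe:

#include <pthread.h>

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter; the mutex ensures
   only one thread modifies it at a time. */
void *increment(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                       /* critical section */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

Without the lock, two threads could read the same value of counter and each write back its own increment, silently losing an update.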
When the interactions among multiple threads become hard to follow, common programming pitfalls can arise, such as the following:
- Deadlock—Threads become stuck waiting on one another and cannot proceed (see the sketch after this list).
- Race Conditions—The timing of code execution is not correctly managed, so data either is not available when it is needed or the correct data has been overwritten.
- Memory Contention—Multiple threads try to access memory at the same time.
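To illustrate the deadlock pitfall, consider a hedged sketch with two hypothetical locks: if one thread acquires lockA and then lockB while another acquires lockB and then lockA, each can end up holding one lock while waiting forever for the other.

#include <pthread.h>

static pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;

/* Thread 1 path: acquires lockA, then lockB. */
void *thread1_work(void *arg)
{
    pthread_mutex_lock(&lockA);
    pthread_mutex_lock(&lockB);   /* may block while holding lockA */
    /* ... work on both shared resources ... */
    pthread_mutex_unlock(&lockB);
    pthread_mutex_unlock(&lockA);
    return NULL;
}

/* Thread 2 path: acquires the locks in the OPPOSITE order, so each
   thread can hold one lock while waiting forever for the other:
   a deadlock. */
void *thread2_work(void *arg)
{
    pthread_mutex_lock(&lockB);
    pthread_mutex_lock(&lockA);
    /* ... */
    pthread_mutex_unlock(&lockA);
    pthread_mutex_unlock(&lockB);
    return NULL;
}

A common remedy is to make every thread acquire locks in one agreed-upon global order.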
Synchronization
Synchronization plays an important role in defining the execution order among threads in an application. The most common method of synchronization is mutual exclusion, in which one thread locks a critical section of code to prevent other threads from accessing that section until it is their turn.
To synchronize threads, low-level primitives are used. Examples of synchronization primitives are semaphores, locks, and condition variables.
Consider an example of four threads running in parallel. Assume a master thread must wait until all threads are complete before proceeding. As shown in the left pictorial of Figure 2, the threads are not properly synchronized, and the master thread inadvertently proceeds with an incorrect value. In the right pictorial, the threads are properly synchronized so the master thread waits until all other threads have completed, then executes with the proper value.
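A minimal Pthreads sketch of this pattern (the worker computation is a stand-in) shows how joining the worker threads forces the master to wait for every partial result:

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4

static int partial[NUM_WORKERS];

/* Each worker fills in its own partial result. */
void *work(void *arg)
{
    int idx = *(int *)arg;
    partial[idx] = idx + 1;   /* stand-in for real computation */
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    int ids[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        pthread_create(&workers[i], NULL, work, &ids[i]);
    }

    /* Synchronization point: the master blocks here until all
       four workers have terminated. */
    int sum = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(workers[i], NULL);
        sum += partial[i];
    }
    printf("sum = %d\n", sum);  /* correct only because of the joins */
    return 0;
}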
Multithreaded Programming Strategies
When combined with multithreaded programming, optimization techniques can deliver tremendous performance improvements on multicore systems. Three example strategies include task parallelism, data parallelism, and pipelining.1
Task Parallelism
Task parallelism is one of the most intuitive ways to break up an application. Each function or task is taken individually, and the application is written in a multithreaded fashion based on the separation of tasks.
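As an illustrative sketch (the two task functions are hypothetical placeholders), two independent tasks can each be given their own thread:

#include <pthread.h>

void *process_audio(void *arg)  { /* task 1: e.g., filter samples */ return NULL; }
void *update_display(void *arg) { /* task 2: e.g., redraw the UI   */ return NULL; }

int main(void)
{
    pthread_t t1, t2;

    /* Each functional task gets its own thread; on a dual-core
       system the OS can schedule them on separate cores. */
    pthread_create(&t1, NULL, process_audio, NULL);
    pthread_create(&t2, NULL, update_display, NULL);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}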
Data Parallelism
Data parallelism is a programming technique for splitting a large data set into smaller chunks that can be operated on in parallel. After the data has been processed, it is combined back into a single data set. With this technique, programmers can restructure a computation that typically would not be capable of using multicore processing power so that it efficiently uses all of the processing power available.
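For instance, consider a sketch assuming a large array to be scaled in place (the array and chunk sizes are illustrative): each thread processes its own chunk, and the combined result is simply the fully processed array.

#include <pthread.h>

#define N 1000000
#define NUM_THREADS 4

static double data[N];

typedef struct { int start, end; } chunk_t;

/* Each thread scales only its assigned chunk of the array. */
void *scale_chunk(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    for (int i = c->start; i < c->end; i++)
        data[i] *= 2.0;
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    chunk_t chunks[NUM_THREADS];

    /* Split the data set into equal chunks, one per thread. */
    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].start = t * (N / NUM_THREADS);
        chunks[t].end   = (t + 1) * (N / NUM_THREADS);
        pthread_create(&threads[t], NULL, scale_chunk, &chunks[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}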
Pipelining
Pipelining is a technique for improving the performance of serial software tasks. It is the process of dividing a serial task into concrete stages that can be executed in assembly-line fashion.
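As a hedged sketch of a two-stage pipeline in Pthreads (the stage computations are placeholders), a one-slot mailbox guarded by a mutex and condition variables hands results from stage 1 to stage 2, so stage 1 can work on item i+1 while stage 2 processes item i:

#include <pthread.h>
#include <stdio.h>

/* One-slot mailbox connecting stage 1 to stage 2. */
static int slot, slot_full = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t can_put = PTHREAD_COND_INITIALIZER;
static pthread_cond_t can_get = PTHREAD_COND_INITIALIZER;

/* Stage 1: produce items in assembly-line fashion. */
void *stage1(void *arg)
{
    for (int i = 0; i < 10; i++) {
        pthread_mutex_lock(&m);
        while (slot_full) pthread_cond_wait(&can_put, &m);
        slot = i * i;            /* stand-in for stage 1 work */
        slot_full = 1;
        pthread_cond_signal(&can_get);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

/* Stage 2: consume each item as it comes off the line. */
void *stage2(void *arg)
{
    for (int i = 0; i < 10; i++) {
        pthread_mutex_lock(&m);
        while (!slot_full) pthread_cond_wait(&can_get, &m);
        printf("%d\n", slot);    /* stand-in for stage 2 work */
        slot_full = 0;
        pthread_cond_signal(&can_put);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}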
Libraries and Tools for Multithreading
Libraries and tools can assist with multithreading by simplifying some of the low-level details. Advantages include greater productivity in the current development cycle and, in some cases, more scalable code later.
Threading Libraries
Threading libraries give developers a way to ease their transition to parallel code. An example is a programming model called OpenMP, which uses compiler directives to mark serial code for parallel execution.2
Originally released as an application programming interface (API) in 1997, OpenMP is a platform-independent set of compiler directives that supports Fortran, C, and C++. Among its capabilities is the parallelization of for loops. To illustrate this point, consider a code example from Multi-Core Programming by Akhter and Roberts.3 This loop converts each 32-bit RGB pixel in an array into an 8-bit grayscale value.
/* Each pixel is independent, so OpenMP can split the
   loop iterations across threads. */
#pragma omp parallel for
for (i = 0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned char)
        (pRGBBitmap[i].red   * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue  * 0.114);
}
The only modification to this loop is the pragma inserted immediately before it, built with an OpenMP-aware compiler (for example, gcc with the -fopenmp flag). OpenMP has an advantage over native threading options such as the Windows threading API or Pthreads in that the runtime chooses an appropriate number of threads for the target system.
Graphical Programming
Graphical programming tools can increase developer productivity by handling many of the multithreading intricacies in the background. With graphical programming, developers can see the parallelism in the code.
An example is NI’s LabVIEW, a fully compiled, graphical programming language based on structured data flow. Data flow is represented between function blocks via connections in a block diagram. These connections also play an important role in multithreaded programming by serving as data dependencies and synchronization elements.
Similar to how OpenMP uses an appropriate number of threads to execute parallel code, LabVIEW also offers thread scalability depending on the underlying hardware. For example, on a system with more cores, the LabVIEW compiler creates more system threads to better utilize all the available cores.
By moving to a higher-level abstraction, developers may feel like they are losing the low-level threading control. This is a valid concern, so it is important that graphical tools offer a means for explicit threading in addition to any abstraction that is provided.
In LabVIEW, for example, specific structures are provided that spawn unique threads. In addition, those threads can be assigned to a specific processor core using an affinity assignment, and the thread pool itself can be managed at the application level.
Debugging Tools
Debugging plays a critical role in any development effort, but it deserves special mention in regard to multithreaded programming and the challenges that can arise. In particular, multicore hardware uses a shared-memory architecture, which can present interesting cache-contention issues that previously were not a problem.
Debugging tools can give developers insight into thread activity. This can be very useful on multicore systems for distinguishing the thread activity running on the different cores.
Conclusions
As hardware architectures move to multicore processors, multithreaded programming is becoming an even more critical technique for software developers to master. After mastering the basics, programming strategies including task parallelism, data parallelism, and pipelining may be implemented to further optimize performance.
Findings from the Landscape of Parallel Computing Research project at the University of California, Berkeley, expressed it best: "Since real-world applications and hardware architectures are inherently parallel, so too should be the software programming model."4 For software developers today, this means that while understanding threads is important, programming models and tools that express parallelism natively should be considered to assist in multithreading code and to provide scalability benefits for future applications.
References
1. "Multicore Programming Fundamentals Whitepaper Series," National Instruments, www.ni.com/multicore
2. OpenMP and Supported Compilers, www.openmp.org
3. Akhter, S. and Roberts, J., Multi-Core Programming: Increasing Performance Through Software Multi-Threading, 2006.
4. Asanovic, K., Bodik, R., et al., "The Landscape of Parallel Computing Research: A View from Berkeley," Technical Report No. UCB/EECS-2006-183, University of California, Berkeley, Dec. 18, 2006.
About the Author
Jeff Meisel is the product marketing manager for National Instruments LabVIEW Real-Time.
Mr. Meisel started his career at NI in 2004 as a member of the Engineering Leadership Program.
He is an active member of IEEE and holds a B.S. in computer engineering from Kansas State University. National Instruments, 11500 N. Mopac Expwy., Austin, TX 78759, 512-683-8795,
e-mail: [email protected]
December 2007