Potentially substantial performance gains from the use of multithreading and multiprocessing architectures have captured the attention of designers of consumer devices and other electronic products. Multithreading uses cycles when the processor would otherwise sit idle to process instructions from other threads. Multiprocessing, on the other hand, introduces additional independent processing elements in order to execute threads or applications concurrently. Embedded applications running on multiprocessor and multithreaded architectures, just like conventional applications, require interrupt service routines (ISRs) to handle interrupts generated by external events.
One key challenge for designers implementing these new technologies is avoiding the situation where one thread is interrupted while modifying a critical data structure, enabling a different thread to make other changes to the same structure. Conventional applications overcome this problem by briefly locking out interrupts while an ISR or system service modifies crucial data structures.
In a multithreaded or multiprocessing application, this approach isn't sufficient because of the potential for a switch to a different thread context (TC), or access by a different processing element that's not impeded by the interrupt lockout. A more comprehensive approach is required, such as disabling multithreading or halting other processing elements while the data structure is being modified.
Manufacturers of consumer devices and other embedded computing products are eagerly adding new features, such as Wi-Fi, VoIP, Bluetooth, and video. Historically, increased feature sets have been accommodated by ramping up the processor's clock speed. In the embedded space, this approach rapidly loses viability because most devices are already running up against power consumption and real-estate constraints that limit additional processor speed increases. Cycle-speed increases drive exponentially greater power consumption, making high cycle speeds unmanageable for more and more embedded applications.
In addition, processors are already so much faster than memory that more than half the cycles in many applications are spent waiting while the cache line is refilled. Each time there's a cache miss or another condition that requires off-chip memory access, the processor needs to load a cache line from memory, write those words into the cache, update the translation lookaside buffer (TLB), write the old cache line into memory, and resume the thread. MIPS Technologies stated that a high-end synthesizable core taking 25 cache misses per thousand instructions (a plausible value for multimedia code) could be stalled more than 50% of the time if it must wait 50 cycles for a cache fill.
Multithreading solves this problem by using the cycles that the processor would otherwise waste while waiting for memory access. It can then handle multiple concurrent threads of program execution. When one thread stalls waiting for memory, another thread immediately presents itself to the processor to keep computing resources fully occupied.
Notably, conventional processors can't use this approach because switching from one thread context (TC) to another takes a large number of cycles. For this approach to work, multiple application threads must be immediately available and "ready-to-run" on a cycle-by-cycle basis. MIPS accommodates this requirement by incorporating multiple TCs, each of which can retain the context of a distinct application thread (Fig. 1).
In a multithreaded environment such as the MIPS 34K processor, performance can be substantially improved: when one thread waits for a memory access, another thread can use the processor cycle that would otherwise be wasted.
Figure 1 shows how multithreading can speed up an application. With just Thread0 running, only five of 13 processor cycles are used for instruction execution; the rest are spent waiting for the word to be loaded into cache from memory. In this case, with conventional processing, efficiency is only 38%. Adding Thread1 makes it possible to use five additional processor cycles that were previously wasted. With 10 of 13 processor cycles now used, efficiency improves to 77%, a 100% speedup over the base case. Adding Thread2 fully loads the processor, executing instructions on all 13 cycles for 100% efficiency, a 160% speedup (2.6 times the throughput of the base case).

MULTIPROCESSING APPROACH
Multiprocessing, on the other hand, combines multiple processing units (each capable of running a separate concurrent thread) into a single system. Often, they're combined on a single die, as is the case in ARM's MPCore multiprocessor.
In the MPCore's symmetric multiprocessing (SMP) configuration, the individual processor cores are connected using a high-speed bus. They share memory and peripherals using a common bus interface. Generally, the SMP system runs a single instance of the real-time operating system (RTOS) that manages all "n" of the processor cores. The RTOS ensures that the n highest-priority threads are running at any given time.
The primary software challenge in a multiprocessor system is partitioning the design and adding tasks. The primary hardware challenge is finding the right infrastructure to ensure high-bandwidth communications among processors, memory, and peripherals.
An SMP system can be scaled by adding cores and peripheral devices to execute more tasks in parallel. In an ideal world, moving from one processor to n processors would increase system throughput by a factor of n. Generally speaking, such an approach allows multiprocessing to be quite scalable and often simplifies the design.
Intel states that it is more power-efficient to have multiple small cores each run individual threads than to have a single large processor run multiple threads. A multicore design also enables cores to share or duplicate processor resources, such as cache. The resulting efficiencies permit multicore designs to boost simultaneous performance without a corresponding increase in power.
IMPORTANCE OF INTERRUPTS
Interrupts are critical in a conventional embedded application because they provide the primary, and in many cases, the only means for switching from one thread to another. Interrupts fulfill exactly the same role in multithreading and multiprocessing applications as they do in a conventional application. However, there's an important difference to note: In a multithreaded or multiprocessing application, changes from one thread to another occur not only through interrupts, but also as a result of the system's ability to run multiple, independent thread contexts concurrently using spare CPU cycles or additional processors.
It's absolutely essential to avoid the situation where one thread is modifying a critical data structure, while a different thread is making other changes to the same structure. This could easily result in the data structure being left in an inconsistent state, with potentially catastrophic results.
Typically, there are two approaches to address this concern, one used by the majority of RTOSs, and the other used by a few. The more popular approach is to briefly lock out interrupts while a system service, called by an application via the service API, modifies critical data structures inside the RTOS. This reliably prevents any other program from jumping in and making uncoordinated changes to the critical area being used by the executing code. This approach is called the "Unified Interrupt Architecture," because all interrupt processing is performed at one time, in a single, "unified" interrupt service routine (ISR).
Another approach is not to disable interrupts in system service routines, but rather (by rule or convention) not to allow any asynchronous access to critical data structures by ISRs or other service calls. Service-call access to critical data structures from an ISR is "deferred" to a secondary routine we denote "ISR2," which gets executed along with application threads under scheduler control.
This approach also reliably prevents interference with the actions of an executing system service call—it doesn't allow any threads or ISR2 routines, which might make system service calls, to execute until processing of critical data structures is completed. This approach is called a "Segmented Interrupt Architecture," because it breaks up the processing required in response to an interrupt into multiple (usually two) "segments" executed at different priorities.
TWO INTERRUPT ARCHITECTURES
Table 1 provides a list of symbols used to represent the processing performed in each type of RTOS interrupt architecture. Table 2 depicts the functional components of the unified interrupt architecture in two different cases.
The total system overhead is greater in the RTOS with a segmented interrupt architecture. In both the non-preemption and preemption cases, the segmented interrupt architecture RTOS introduces additional overhead of (1*CS + 1*CR + 2*S + 1*CC) (Fig. 2). The most wasteful case is the non-preemptive one, since the unified interrupt RTOS simply returns to the point of interrupt if a higher-priority thread wasn't made ready by the ISR processing.
Another performance benefit of the unified RTOS approach is that only the interrupted thread's scratch registers need to be saved/restored in this case. This isn't possible with a segmented interrupt RTOS, since it doesn't know what the ISR2 portion of the ISR will do during the actual interrupt processing. Hence, segmented interrupt RTOSs must save the full thread context on every interrupt. In the non-preemptive case, the CS and CR performance is much slower in the segmented interrupt RTOS, although this additional overhead hasn't been factored into this comparison.
The segmented interrupt architecture is claimed to have an advantage with regard to its response to interrupts. The whole idea behind never disabling interrupts is to make interrupt response faster. However, while it sounds good, several practical problems crop up with this approach.
Although the segmented interrupt RTOS doesn't disable interrupts, the hardware itself does when processing other interrupts. Therefore, the worst-case L in the segmented interrupt RTOS is actually the time interrupts are locked out during the processing of another interrupt. Also, interrupts could be locked out frequently in an application if the segmented interrupt RTOS uses a trap or software interrupt to process RTOS service requests. In such cases, the hardware will lock out interrupts while processing the trap. Finally, the application itself might have interrupt lockout so that it can manipulate data structures shared among multiple threads. All of these issues make "L" a non-zero value and largely defeat the purpose of designing an RTOS with the claim of L approaching zero.
SPECIAL INTERRUPT CHALLENGES
Briefly locking out interrupts while an ISR or system service modifies crucial data structures inside the RTOS will reliably prevent any other program in a conventional application from jumping in and making uncoordinated changes to the critical area being used by the executing code.
This approach, though, isn't sufficient in a multithreaded or multiprocessing application, since there's the potential for a switch to a different TC. Or, a memory reference instruction may be executed by a different processor core that's not impeded by the interrupt lockout. As a result, it might operate on the critical area. A variety of methods can be used to disable multithreading/multiprocessing while the data structure is being modified.
Multithreading and multiprocessing meet the demands for consumer, networking, storage, and industrial device applications when it comes to high performance, with only minor increases in cost and power consumption. The multithreading and multiprocessing approaches have their own advantages and strengths, and there's no reason the two approaches can't be combined for "the best of both worlds."
With relatively simple exceptions, application code can run unchanged when moving from conventional to multithreaded or multiprocessor applications. Multithreading makes it easy and inexpensive to use the CPU cycles that conventional RISC processors often waste, while multiprocessing exploits economies of scale via multiple cores.