Master On-Chip Embedded Multiprocessor Coherence

Without a doubt, embedded systems-on-a-chip (SoCs) are becoming "software-rich,"¹ and they're incorporating more and more processors on one chip. The driving forces behind these changes are advances in fabrication technology (propelled by Moore's Law), short time-to-market pressures, greater design complexity, and the need to amortize high-cost ASIC fabrication through design reuse.

There's also the economic benefit of higher performance with backward-compatibility to a single-threaded model of computation (the so-called Von Neumann model). That model has long plagued general-purpose computing. Now, such a performance benefit becomes applicable to high-throughput, software-rich embedded SoCs. Examples include high-end set-top boxes, smart phones, automotive media centers, and printer/copier stations.

Current high-end embedded SoCs are mostly heterogeneous. The processors on these SoCs communicate through noncoherent, shared memory using some form of message passing. The classic RISC/DSP combination in a third-generation cell phone communicating through a dual-ported SRAM and interrupts represents a good example of these simple schemes.

When sheer clock-speed scaling ran out of steam, maintaining this single-threaded programming abstraction forced general-purpose uniprocessor designers to resort to dual- or quad-processor coherent systems. The same will happen for these software-rich, high-performance embedded systems—with slight modifications.

Future high-performance SoCs will be hierarchical and heterogeneous systems of processors with coherent clusters of homogeneous multiprocessors embedded in the hierarchy. Some of this transition already has been observed in one specific high-performance embedded market: networking (in the form of coherent network multiprocessors).²,³

The exact nature of future embedded chip multiprocessors (CMPs) is debatable (heterogeneous versus heterogeneous with hierarchical homogeneous processors). But for many of them, shared memory with coherence will be an important issue.

Definition And Basics

A multicore shared-memory system with caches is considered to be cache-coherent if the value returned by any Load (issued by a processor) is always the value of the latest Store to that memory location. To address the ambiguity of the term "latest Store," we're forced to take a small diversion into memory models. We use the help of a common memory model like sequential consistency (SC), where the results of any execution of a parallel program on an SC system make it possible to construct a global serial order of all operations (mainly Loads and Stores) to a location. Then coherence implies:

The order of Loads and Stores from each processor appears in the system's global serial order in the same way in which they were issued to the memory system by that processor.
The value returned by each read from a processor in the system is the value written by the last write to that location in the global serial order.
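The two conditions above can be checked mechanically for a candidate global serial order. The following is a minimal sketch, not a definitive checker: the tuple layout `(processor, kind, address, value)` and the function names are assumptions made for illustration, and memory is assumed to start at 0.

```python
from typing import List, Tuple

# Hypothetical operation record: (processor id, "St" or "Ld", address, value)
Op = Tuple[int, str, str, int]

def respects_program_order(serial: List[Op], programs: List[List[Op]]) -> bool:
    """Condition 1: each processor's operations appear in the order it issued them."""
    for pid, program in enumerate(programs):
        if [op for op in serial if op[0] == pid] != program:
            return False
    return True

def loads_see_latest_store(serial: List[Op], initial: int = 0) -> bool:
    """Condition 2: each Load returns the value of the last Store before it."""
    memory = {}
    for _pid, kind, addr, value in serial:
        if kind == "St":
            memory[addr] = value
        elif memory.get(addr, initial) != value:
            return False
    return True

def is_coherent(serial: List[Op], programs: List[List[Op]]) -> bool:
    return respects_program_order(serial, programs) and loads_see_latest_store(serial)

# Processor 0 stores 1 to x and reads it back; processor 1 observes 0, then 1.
programs = [
    [(0, "St", "x", 1), (0, "Ld", "x", 1)],
    [(1, "Ld", "x", 0), (1, "Ld", "x", 1)],
]
good = [programs[1][0], programs[0][0], programs[0][1], programs[1][1]]
# Placing the Store first makes processor 1's Load of 0 return a stale value.
bad = [programs[0][0], programs[1][0], programs[0][1], programs[1][1]]
```

Here `is_coherent(good, programs)` holds, while `is_coherent(bad, programs)` fails on the second condition.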

Therefore, the term "global serial order" is a product of the memory consistency model (memory model for short) implemented by the system (informally termed Weak, Strong...). The memory model is analogous to the instruction set architecture (ISA) for single processors, which defines the operational contract between the compiler and the hardware (Fig. 1).

In the same way, the memory model defines the contract between the programmer and the memory system for a multiprocessor or, more generally speaking, a multithreaded system. Hence, multithreaded languages like Java also have a defined memory model. In this article, most occurrences of multiprocessing can be substituted with multithreading.

SC, total store ordering (TSO), and processor consistency (PC) are some of the common memory models at the machine level (from strong to weak). Stronger implies that more constraints are imposed on the parallel memory-system implementer, which makes the tasks performed by the parallel middleware or system-library writer a bit simpler.

Another way to look at coherence is that it's the weakest form of memory consistency, since it doesn't restrict memory operations any more than what is necessary to provide a reasonable memory system from a single-processor point of view. Informally, stronger models help the programmer by ensuring that a parallel memory system guarantees more than just "Reads return the value from the latest Store." These added guarantees are typically used to form efficient synchronizing constructs between threads or processors.
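To make the difference between a stronger and a weaker model concrete, the sketch below enumerates the outcomes of the classic store-buffering test. Each of two processors stores to one location and then loads the other. The event names and the `outcomes` function are assumptions for illustration. The weaker case is modeled, TSO-style, as a per-processor store buffer that may drain to memory after the processor's later Load.

```python
from itertools import permutations

def outcomes(store_buffers: bool):
    """Enumerate (r0, r1) for:  P0: St x=1; r0 = Ld y    P1: St y=1; r1 = Ld x"""
    # Per processor: "St" puts the Store in its buffer, "Fl" drains it to memory,
    # "Ld" performs the Load of the other processor's location.
    events = [(kind, p) for p in (0, 1) for kind in ("St", "Fl", "Ld")]
    written = {0: "x", 1: "y"}
    loaded = {0: "y", 1: "x"}
    results = set()
    for order in permutations(events):
        pos = {e: i for i, e in enumerate(order)}
        legal = True
        for p in (0, 1):
            # Program order, and a Store must be issued before it can drain.
            if pos[("St", p)] > pos[("Ld", p)] or pos[("St", p)] > pos[("Fl", p)]:
                legal = False
            # Without store buffers, the Store reaches memory before the same
            # processor's Load is performed.
            if not store_buffers and pos[("Fl", p)] > pos[("Ld", p)]:
                legal = False
        if not legal:
            continue
        memory = {"x": 0, "y": 0}
        regs = {}
        for kind, p in order:
            if kind == "Fl":
                memory[written[p]] = 1
            elif kind == "Ld":
                regs[p] = memory[loaded[p]]
        results.add((regs[0], regs[1]))
    return results
```

Without store buffers, the result in which both Loads return 0 never appears. With them, it does, which is why software on weaker systems needs extra ordering guarantees to build synchronization.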

To achieve coherence, a system must have a few essential properties. For one, Writes to a particular memory location must be serialized at some point in the system. Note that serialization is a logical concept. For some high-performance speculative implementations, it's only a guideline for retiring transactions at commit. This is similar to out-of-order processors, which maintain a temporary state and an "architectural state" separated by a commit point.

Another property of coherent systems is Write propagation, which implies that a Write needs to eventually propagate to all agents that care about the new value. The third important property (a result of the memory model rather than coherence) is Write atomicity, which implies that a Write needs to be propagated in its entirety to all processors in the system after it's serialized.

We will mention only the most common way of classifying coherence protocols. This classification is based on the stable states of the caches in the system. The common states are referred to as "MOESI": Modified, Owned, Exclusive clean, Shared clean, and Invalid. The terms are self-explanatory, and details are readily available in textbooks.⁴

Related to state-based protocol classification is whether the protocol is update- or invalidate-based. In an invalidate-based coherence protocol, the invariant maintained in the system is that only a single owner of a cache line exists in the system. In an update-based system, all copies of the cache line are updated on a Write.
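A simplified sketch of how the MOESI states move under an invalidate-based protocol is shown below. The `CacheLine` class and its method names are hypothetical. The model tracks a single line across a handful of caches and leaves out data movement, transient states, and write-backs.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

class CacheLine:
    """States of one cache line across n caches (simplified, hypothetical model)."""

    def __init__(self, n):
        self.states = [State.INVALID] * n

    def load(self, cid):
        if self.states[cid] is not State.INVALID:
            return                                  # hit: no transition
        holders = [i for i, s in enumerate(self.states) if s is not State.INVALID]
        if not holders:
            self.states[cid] = State.EXCLUSIVE      # only copy in the system, clean
            return
        for i in holders:
            if self.states[i] is State.MODIFIED:
                self.states[i] = State.OWNED        # dirty copy stays responsible for the data
            elif self.states[i] is State.EXCLUSIVE:
                self.states[i] = State.SHARED
        self.states[cid] = State.SHARED

    def store(self, cid):
        for i in range(len(self.states)):
            if i != cid:
                self.states[i] = State.INVALID      # invalidate every other copy
        self.states[cid] = State.MODIFIED

    def has_single_writer(self):
        """A Modified or Exclusive copy must be the only valid copy."""
        exclusive = [s for s in self.states if s in (State.MODIFIED, State.EXCLUSIVE)]
        valid = [s for s in self.states if s is not State.INVALID]
        return not exclusive or len(valid) == 1

    def summary(self):
        return "".join(s.value for s in self.states)
```

For three caches, a Load by cache 0 gives `EII`, a Load by cache 1 gives `SSI`, a Store by cache 2 gives `IIM`, and a later Load by cache 0 gives `SIO`. An update-based protocol would instead push the new value to the other copies on the Store.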

Serialization

Many older symmetric-multiprocessing (SMP) (non-CMP) systems used a bus to broadcast transactions to all agents in the system. Therefore, the agents could "snoop" their state and then take the proper actions to invalidate or update their copies of the data item. The overlap between the different phases of a transaction was minimal and restricted to in-order slip (pipelining).

But because buses are limited in speed and don't scale well in bandwidth, these rigid snoopy schemes evolved into a couple of newer coherence schemes. At the high end (but still relevant for embedded CMPs, albeit for different reasons), directory-based schemes are common. When there's a low degree of multiprocessing, snoopy "virtual bus" schemes often are the preferred routes.

Snoopy virtual-bus serialization uses specialized higher-performance interconnects, especially in the request phase of a transaction, such as a tree of switches or hierarchical rings (Fig. 2). In these systems, the interconnect is responsible for creating the global serial order while moving from a limiting physical-bus-based interconnect to higher-performance (e.g., serial) point-to-point signaling links.

Directory-based schemes,⁵ on the other hand, perform the serialization at a new construct called a directory. This directory, which usually resides in the memory module, holds the state of the various cache lines in the system. In general, these systems are a great deal less dependent on the network for serialization and ordering compared to snoopy schemes (virtual or otherwise). Because messages aren't broadcast in directory schemes, they can scale to much larger systems.
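The scaling argument can be illustrated by counting who receives an invalidation on a Write. This is a minimal sketch under simplifying assumptions: the `Directory` class and the `snoop_targets` helper are hypothetical names, and only the sharer list is modeled, not the full per-line state.

```python
class Directory:
    """Per-line sharer tracking at the memory module (hypothetical, simplified)."""

    def __init__(self):
        self.sharers = {}   # address -> set of cache ids holding a copy

    def load(self, cid, addr):
        self.sharers.setdefault(addr, set()).add(cid)

    def store(self, cid, addr):
        """Serialize a Write here; return the caches that must be invalidated."""
        targets = sorted(self.sharers.get(addr, set()) - {cid})
        self.sharers[addr] = {cid}
        return targets

def snoop_targets(num_caches, cid):
    """A snoopy scheme broadcasts to every other cache, holder or not."""
    return [c for c in range(num_caches) if c != cid]

directory = Directory()
directory.load(0, 0x40)
directory.load(3, 0x40)
unicast = directory.store(0, 0x40)     # only cache 3 holds another copy
broadcast = snoop_targets(16, 0)       # all 15 other caches are disturbed
```

In a 16-cache system with two sharers, the directory sends one targeted invalidation where a broadcast would reach 15 caches.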

Another trend affecting on-chip coherence is that next-generation SoCs (with multiple processors) are following a methodology of separating communications from computation, for reasons of complexity mitigation. This has resulted in design methodologies based on networks-on-a-chip (NoCs),⁶ and the movement from circuit-switched to packet-switched NoCs.⁷ Any on-chip coherence scheme needs to heed this important move in deep-submicron SoCs and layer the coherence protocol on a packet-switched substrate.

Embedded SoCs have added issues with cost, low power, real-time operation, intellectual-property (IP) ownership, and possibly heterogeneous processors. Consequently, selecting a coherence scheme for them is a bit different than for their general-purpose counterparts. Low power results in lower system cost, which is a sensitive factor for SoCs. Moreover, if a SoC is used in a mobile application, low power certainly becomes a necessity.

Just as it took a while for caches to break into the DSP world (cycle-accurate processor and system simulators were the key tools that helped accelerate this transition), the same is true for coherence. To port software to a real-time system, a coherence/SoC designer must ensure that a sufficiently cycle-approximate (and fast) simulator is available for the application/middleware port. The problem is a bit more severe in high-performance embedded SoCs, since programmers are exposed to the hardware more than in a general-purpose multiprocessor. In the latter, a restricted set of "system" (middleware, libs, operating system) programmers are exposed to this interface.

IP ownership is a unique feature of embedded SoCs. Most general-purpose CMP vendors' designs don't incorporate any outside IP at the memory-bus level (the level at which coherence is relevant). But outside IP is routine for an embedded-SoC integrator, so much so that even the interconnect (e.g., OCP-IP)⁸ in many high-performance embedded SoCs is an IP block acquired from an outside IP vendor. Moreover, a high-performance embedded SoC could sometimes benefit from heterogeneous ISA cores sharing the same memory coherently (say, a RISC core and a DSP).

Looking at these trends, the relevance of snoopy virtual-bus coherence schemes to CMPs should be obvious: CMPs need only limited scalability and offer lots of on-chip bandwidth and point-to-point signaling, while the schemes themselves bring less overhead and low latency. But it's interesting that directory schemes, which are generally considered applicable only to large server-class machines, are also relevant to embedded SoCs (with possible modification). That's because they can work with unordered interconnects, heterogeneous ISAs, lower-power unicast transactions, etc.

While the first generations of embedded CMPs may opt for just a snoopy virtual-bus scheme, more interesting hybrid snoopy-directory schemes are likely to be the next trend in embedded coherence. That's because designers will come to appreciate the modularity benefits of directory-based schemes.

Deadlock/Livelock

In addition to choosing the method of serialization and type of coherence protocol, cache-coherence protocol designers must guarantee that the protocol is deadlock- and livelock-free, given limited resource/buffer constraints. This is particularly relevant in packet-switched, interconnect-based coherence.

There are two types of deadlocks—interconnect and protocol. Both generally occur due to buffer constraints in a packet-switched interconnect. Protocol deadlocks should be carefully considered when designing coherence protocols (Fig. 3a). Common schemes that prevent deadlocks include separating a transaction's request path from the reply/response path, and guaranteeing that a cache or a memory agent responds to a request in any state.

To accomplish the first scheme, designers usually use virtual channels⁷ (Fig. 3b). Transactions flowing in any virtual channel follow a FIFO order, and a blocking event in the stream causes a backpressure that can be traced all the way to the source. Hence, as long as the sinks (of transactions) make forward progress, so does the system.
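The effect of separating requests from replies can be seen in a small simulation. This is a sketch under stated assumptions, not a model of any particular interconnect: two agents flood each other with requests, every consumed request must produce a reply, and the `simulate` function, its parameters, and the two-phase stepping are all hypothetical.

```python
from collections import deque

def simulate(virtual_channels, capacity=2, requests_each=4):
    """Return True if every request is answered, False if the system deadlocks."""
    # Input buffers of the two agents. Without virtual channels, requests and
    # replies share one FIFO and its capacity; with them, each class has its own.
    req_in = [deque(), deque()]
    rsp_in = [deque(), deque()] if virtual_channels else req_in
    to_send = [requests_each, requests_each]
    answered = [0, 0]

    while answered != [requests_each, requests_each]:
        progress = False
        # Phase 1: each agent greedily injects requests into its peer's buffer.
        for agent in (0, 1):
            peer = 1 - agent
            while to_send[agent] and len(req_in[peer]) < capacity:
                req_in[peer].append("req")
                to_send[agent] -= 1
                progress = True
        # Phase 2: each agent services at most one message at the head of its input.
        for agent in (0, 1):
            peer = 1 - agent
            if rsp_in[agent] and rsp_in[agent][0] == "rsp":
                # Replies are pure sinks: consuming one needs no further resource.
                rsp_in[agent].popleft()
                answered[agent] += 1
                progress = True
            elif (req_in[agent] and req_in[agent][0] == "req"
                  and len(rsp_in[peer]) < capacity):
                # A request is consumed only if its reply has somewhere to go.
                req_in[agent].popleft()
                rsp_in[peer].append("rsp")
                progress = True
        if not progress:
            return False   # every agent is blocked on a full buffer
    return True
```

With a shared FIFO, both buffers fill with requests and neither agent can emit a reply, so `simulate(False)` reports a deadlock. With separate channels, `simulate(True)` completes because the reply sinks always drain.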

Livelock in a distributed system takes place when agents remain active, yet forward progress halts. At the processor, this shows up as the program counter being stuck at some Load/Store. It frequently occurs when multiple caches repeatedly try, and fail, to gain ownership of a cache line. If a global serial order is properly established in the system, each agent can handle requests in that order. The global serial order itself must be established in a fair manner. Various resources (ports, buses, buffers) all need to be fairly allocated to the multiple threads/processors.

Another concern related to livelock prevention is flow control. A system's flow control limits resource allocation. If done in an ad hoc manner, it could result in livelock. A common case is overuse of retries or negative acknowledgements (NACKs) while responding to a request.
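The role of fairness can be sketched with a toy arbiter in which every cache requests ownership of the same line in every round, one request is granted, and the rest are NACKed and retried. The `contend` function and its fixed-priority and round-robin policies are assumptions chosen for illustration.

```python
def contend(num_caches, rounds, fair):
    """Grant one ownership request per round; NACK the others.

    Returns (grants, nacks) as two lists indexed by cache id.
    """
    grants = [0] * num_caches
    nacks = [0] * num_caches
    next_winner = 0
    for _ in range(rounds):
        if fair:
            winner = next_winner                          # round-robin rotation
            next_winner = (next_winner + 1) % num_caches
        else:
            winner = 0                                    # fixed priority: cache 0 always wins
        for cid in range(num_caches):
            if cid == winner:
                grants[cid] += 1
            else:
                nacks[cid] += 1                           # loser retries in the next round
    return grants, nacks
```

Under fixed priority, the losing caches keep retrying and collecting NACKs without ever completing their Store. Round-robin allocation gives each cache a bounded wait.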

Other Considerations

Beyond deadlock and livelock, designers should consider the following issues:

Cache hierarchies and DMA: Issues of deadlock surface as transactions traverse the cache hierarchy. Usually, one can adopt the same mechanism used in the broader protocol to keep the requests and replies on separate (virtual or real) channels/FIFOs.

Another issue concerns determining the level at which to enforce coherence (L1 Cache, L2 Cache, or L3 Cache). Where will the I/O enter/extract cache lines from the coherence domain? Solutions that involve issues of inclusion are usually very application- or system-specific. Hints can be supplied to the coherence system for prefetching and data placement by incorporating the hints into the coherence system's transaction set. The obvious example is in routing, where the headers of an incoming IP packet need to be matched against a table to determine the destination buffer/interface for that packet. These headers can be placed close to the lower cache levels by coloring transactions with hints, such as Read/Write Hit/Miss policies.

Synchronization and barrier operations: Many ISAs offer various atomic primitives that must be mapped onto the coherence system. Load-linked/store-conditional (LL/SC), for instance, is a common atomic primitive in modern ISAs.⁴ This form of atomicity is prone to livelock if not implemented correctly, and can lead to deadlock. Weaker memory systems require a safety net, called barriers, to force a certain behavior (usually between Stores and Loads issued from a processor or thread) during sensitive code sequences. This is generally achieved by inserting special barrier instructions supported by the ISA. The coherence system may need to respond to these operations by dynamically stalling certain transactions to support their behavior.
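The retry behavior of LL/SC can be sketched in software. The `Memory` class below is a toy model with hypothetical names. It assumes one reservation per processor that is cleared by any Write to the reserved address, and the `interfering_writes` parameter stands in for another processor writing the same word between the load-linked and the store-conditional.

```python
class Memory:
    """Toy memory with load-linked / store-conditional (hypothetical model)."""

    def __init__(self):
        self.data = {}
        self.links = {}   # cpu id -> address on which it holds a reservation

    def load_linked(self, cpu, addr):
        self.links[cpu] = addr
        return self.data.get(addr, 0)

    def store(self, addr, value):
        self.data[addr] = value
        # Any Write to the address clears every reservation on it.
        self.links = {c: a for c, a in self.links.items() if a != addr}

    def store_conditional(self, cpu, addr, value):
        if self.links.get(cpu) != addr:
            return False          # reservation lost: the caller must retry
        self.store(addr, value)
        return True

def atomic_increment(mem, cpu, addr, interfering_writes=0):
    """Retry LL/SC until the increment commits; return the number of retries."""
    retries = 0
    while True:
        old = mem.load_linked(cpu, addr)
        if interfering_writes > 0:
            # Stand-in for another processor writing the word inside the window.
            mem.store(addr, old + 100)
            interfering_writes -= 1
        if mem.store_conditional(cpu, addr, old + 1):
            return retries
        retries += 1
```

If the interference never stopped, the loop would never exit, which is the livelock risk: the coherence system has to ensure that some contender's store-conditional eventually succeeds.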