Looking back over the past 10 years or so, semiconductor process technology more or less kept pace with the demand for functionality in large-scale processor-based ICs. When the next-generation set-top box IC needed more horsepower, a move from, say, a 180-nm process to 130 nm would provide the necessary boost by adding gates and the ability to run faster clocks. But that next-generation chip would still carry a single processor.
Things have changed dramatically in the last few years. Simply put, silicon scaling no longer meets functionality requirements. Thus, designers turned to multiprocessor architectures, which significantly up the ante in terms of processing power. The number of processors per chip is taking off, already exemplified several years ago by Cisco’s 192-processor engine for its CRS-1 network router (Fig. 1).
With the rise in processing power and complexity comes a host of issues that point largely toward the software side of the system equation. Writing software for a single-processor system is a relatively simple task, as a purely sequential approach will do the trick. But there’s little point in multiple processing engines if you’re not planning to have them execute instructions in parallel. How parallelism is imposed is the crux of the matter. Missteps can result in dire consequences, creating debug nightmares.
Fortunately for those looking to move to multiprocessor architectures in their system-on-a-chip (SoC) designs, tools and methodologies are beginning to appear. Designers can take steps to ensure that their parallelized application code won’t cause memory-access deadlocks, race conditions, or other faults that crash one or more processors or even their entire systems.
HOW MULTICORE LOOKS TODAY
Looking at a generic example of a multicore SoC can illustrate both the complexity of the devices and the programming challenges (Fig. 2). In a hypothetical transition from a 130-nm SoC with a single processor to a multicore implementation at 65 nm, designers would have roughly four times as many transistors to work with.
Multicore architectures ramp up complexity in ways beyond simply having multiple processors. The availability of more gates brings added memory, which is required to handle the increasingly large amounts of data, high-resolution video streams, and other content. The increased bandwidth means more I/Os to deal with all the data. More complex control processing is required by a myriad of network stacks and more elaborate user interfaces. “Designs are using more CPUs,” says Chris Rowen, CEO of Tensilica. “But that has only limited potential because of the way control paths are written.”
When considering multicore SoCs, an important distinction must be made between control-plane and data-plane processing. “In the data plane, there’s strong interest in integrating more functionality,” says Rowen. “Chips no longer process only audio, video, or wireless baseband, but rather they process all of them. Meanwhile, there’s growing complexity in each of these various functions. This puts a lot of pressure on a more programmable solution.”
Efforts to make the most of multiple processors often run aground on the shoals of memory access conflicts. “The old paradigm for multicore designs using shared memory was if things were happening in parallel, you’d want them to touch the memory at different address spaces,” says Limor Fix, general chair of the 45th Design Automation Conference and associate director of Intel Research Pittsburgh.
“The idea is for parallel threads not to interfere with each other, and to minimize the number of clocks required for the shared memory,” says Fix. “If each of the parallel computations is touching a different area in memory, there’s less collision and less locking of the shared memory.”
The problem lies in the fact that visibility into the design is extremely limited. “Typically, when working with RTL simulation models of processors, software debug relies on the general-purpose registers of the processor,” says Jim Kenney, product marketing manager at Mentor Graphics. “These registers are usually exposed at the top level for tracing in the waveform window of a logic simulator.”
Making matters worse is the fact that there may be only one debug port for several processors. With all processors executing instructions concurrently, it’s very difficult to control the speed of any given processor.
Debugging is made even harder due to the absence of determinism. “With multiple processors, you don’t control what’s running,” says Michel Genard, vice president of marketing at Virtutech. Rerunning code often is of no value because the results can be different each time, making bugs hard to pin down. Then there’s the notion of “Heisenbugs,” or changes introduced by probe insertion that alter the system’s behavior.
Fortunately, there are ways around these issues, most of which come in the form of “virtualization” or “virtual platform” technology. Many benefits can be derived from virtual platforms (see “Multicore Design Benefits from Virtual Prototyping,” www.electronicdesign.com, ED Online 18637).
Once a virtual platform is assembled from hardware models, many of the issues concerning software debugging are addressed. The designer gains a great deal of control over the system, hence a return to a more deterministic scenario. The system configuration is easily varied in terms of the number and speed of cores as well as the software loads on each.
Virtual hardware offers a good amount of visibility in terms of memory, processor registers, and device states. In addition, when you synchronize the processors, you can synchronize everything at once. It also affords much more control over system execution.
When debugging requires a global system stop, all processors stop simultaneously with no “skid” effect. When one processor is stepped through instructions, others can be made to sit and wait. Cores can be slowed or stopped entirely, communication latencies can be increased, and timing disturbances from breakpoints disappear.
Having said all of that, a sticking point for those wishing to assemble virtual platforms can be the models themselves. Where do they come from? What level of abstraction should they embody?
If your multicore design is starting from a good amount of legacy RTL, as most do, one answer to model creation comes from Carbon Design Systems, whose tools compile RTL into an executable software image. Compilation can be done on a block-by-block basis, on subsystems, or even on an entire system.
According to Carbon’s Bill Neifert, CTO, the models enable visibility into what’s happening in the system. “We provide some RTL simulator-like features,” says Neifert. “You can look at waveforms and see conflicts between processors contending for resources.”
Virtual platforms are also used by HW and SW development teams to determine applicable use cases for the system. Such is the case at Freescale Semiconductor, where extensive investigation of use cases is critical to the company’s multicore SoC design.
“We spend a lot of time with our various teams, including marketing, verification, validation, software, hardware, and development tools, to decide on the priorities for the use cases,” says J.T. Yen, Freescale’s verification manager. “Then we take those use cases and drive them back out into the teams to make sure the hardware architecture is meeting those use cases.”
Virtutech’s Simics 4.0 is a virtualization environment that enables such use-case exploration. Version 4.0, released this month, adds APIs that support more use cases as well as a repository of thousands of models accrued since the initial release of Simics.
Further, Simics 4.0 is itself a multithreaded application that enables, in a chicken-and-egg scenario, designers of multicore SoCs to leverage all of the cores available on their computing resources (laptop or multiway server) to boost simulation speeds and scalability. This capability, embodied in Virtutech’s Simics Accelerator, enables one Simics session to simulate several machines in parallel (Fig. 3).
Another option for platform creation comes from CoWare's ESL 2.0 toolset. With CoWare’s tools, multicore SoC designers can debug and benchmark the platform-level performance of their IP and subsystem RTL at a cycle-accurate level of abstraction.
JUMPING THE HURDLES
Taking the virtual-platform route has its advantages as just outlined, but there are also barriers to success. Building a virtual platform can be a laborious process that must be undertaken in parallel with the design process itself. Then there are the issues with interoperability of hardware models among various commercial flows.
Imperas is a relatively new entity that’s taken a somewhat different approach to its entry into the virtual-platform arena. Out of the chute, the company made a major technology donation that carries the promise of an open-source infrastructure for virtual platforms.
“When we started the company, we were targeting how to program multicore SoCs,” says Simon Davidmann, Imperas’ president and CEO. “But what we found was challenges in debugging. There was no broad simulation infrastructure to support it. The key is a modeling technology that would enable models to work together no matter who makes them.”
To that end, Imperas made three technology components freely available through its Open Virtual Platforms Web site at www.ovpworld.org, as well as at SourceForge. The first is C-language modeling application-programming interfaces (APIs) for processor, peripheral, and platform modeling.
The second is an open-source library of models written to the APIs. The models can be obtained as either pre-compiled object code or as source-code files. At present, the library comprises processor models of ARM, MIPS, and OpenRISC OR1K devices, with others to follow. Also available is a wide range of component and peripheral models. In addition, there are several example embedded platforms written in C, C++, and SystemC.
Rounding out the trio is a free OVP reference simulator that runs processor models at up to 500 MIPS. Called OVPsim, the simulator comes with a GNU debugger (GDB) interface.
OVPsim can be called from within other simulators through a C/C++/SystemC wrapper. It also can encapsulate existing instruction-set simulator (ISS) processor models (Fig. 4).
DEALING WITH COMPLEXITY
When it comes to the language used for writing embedded code for multicore SoCs, some designers feel that the existing paradigm is entirely broken. In other words, writing software in sequential fashion using C or C++ can no longer be a pragmatic approach. These days, new, fundamentally parallel languages and methodologies are required (see “Programming Multicore Platforms: What’s Really Going On?” ED Online 18639).
“Finding new design-entry languages that address parallelism is a long-term goal and is at least five to 10 years from being realized,” says Frank Schirrmeister, director of product marketing for system-level solutions at Synopsys. Today’s users, says Schirrmeister, are better served by virtual platforms with analysis and debug capabilities geared for multicore platforms.
Such languages and methodologies may eventually be forthcoming. But for now, a great deal of legacy sequential software is being transformed into parallel code, however laborious that process may be. Meanwhile, tools are available that can help determine where opportunities for parallelism lie in sequential code.
One such tool is Critical Blue’s Cascade, which synthesizes reprogrammable coprocessors that accelerate native binaries or C/assembler source code. Recently, the company extended Cascade into a multicore version that does the same thing, only with the addition of cross-core software partitioning, task-dependency analysis, and verification capabilities (Fig. 5).
“Multicore architectures are not new, but in the past they were usually created for a specific purpose,” says David Stewart, Critical Blue’s founder and CEO. “What we’re seeing now is multicore for the masses in the form of hardware architectures that can be used for multiple SoCs. That means reprogramming, and that comes down to software.”
In Stewart’s view, Multicore Cascade is a pragmatic approach that can help make today’s programming languages and techniques viable for multicore architectures. “When we only generated a single co-processor, we were extracting instruction-level parallelism,” he says. “Now, we are extracting task-level parallelism. But it goes beyond that and into analysis of where dependencies are in the code and what the benefits are of breaking those dependencies.”
Functional verification of multicore SoCs is largely accomplished using processor-based tests. Verification engineers use full-functional, signoff-accurate processor models derived from RTL to drive bus cycles out to the rest of the design's IP. This method can be used for block-level verification or as a final simulation to ensure that the hardware will come out of reset and execute code.
The downside of processor-based testing, though, once again lies in the limited visibility for software debugging. Typical debug flows provide only a view of the processors’ general-purpose registers.
“The only interactive view with which to determine why a C test isn’t running properly on hardware is the waveform view,” says Mentor Graphics’ Kenney. “It’s hard to correlate misbehavior in the waveform view with where it’s happening in the source code for the test.”
Mentor Graphics’ attempt at a solution for this problem comes in the form of Questa Codelink, an extension to the Questa functional verification environment. “What we’ve done is to build a classical source-level software debugger into the ModelSim Questa environment and connected them both to the RTL processor models used in verification,” says Kenney.
With Questa Codelink, users have the advantage of interactive, graphical debug. In a source-code view, they can see breakpoints for an unlimited number of processors. Registers are displayed, and variable values can be tracked. A cursor in the source-code view is in lock-step with a cursor in the debugger’s waveform window. Moving the cursor in either window takes the other window’s cursor to the corresponding point.
According to Russ Klein, Mentor’s project manager for Questa Codelink, an important aspect of the tool for multicore SoC developers is the non-intrusiveness of the process. “You can see what’s going on with each of multiple processors without introducing any timing errors,” says Klein. “You can see it all concurrently, with full visibility into the states of each of them at exactly the same time. The ability to step backward through the code to the point where synchronization errors occur is also very powerful.”