When it comes to multiprocessing, what’s good for the hardware goose is not necessarily good for the software gander. The ideal hardware architecture for a multicore design is a heterogeneous (asymmetric) single instruction-set architecture (ISA) that combines high- and low-complexity cores to achieve lower power and higher throughput, somewhat mitigating Amdahl’s Law.
Now imagine that Amdahl’s Law (used to find a system’s maximum expected improvement when only part of the system is improved) was of no concern and we had unlimited die sizes. The ideal multicore from a programming perspective would be homogeneous (symmetric), so no dependence would build up on a specific ISA. Courtesy of IBM, Sony, and Toshiba, the Cell microprocessor has a heterogeneous architecture, though it isn’t a single ISA. Yet programming the device can be rather arduous, leaving you with code that’s heavily architecture-dependent. According to Dave Haas, principal architect at Raza Microelectronics, you should be careful not to pigeonhole yourself into a given vendor or architecture when you can avoid it, making homogeneous architectures a safer bet when given a choice.
Regardless of the best approach, there’s a limited number of options for today’s embedded and general-purpose system designers. If you’re in the embedded space, several of the multicore choices are heterogeneous. If you live in a general-purpose world, you might only be able to get a homogeneous multicore.
DECISIONS • When it comes to multiprocessing, several tradeoffs exist that squeeze the most performance out of your transistors (see the table). For example, there’s the thread-versus-core tradeoff. According to Kevin Kissell, MIPS principal architect, you must start by analyzing your system to determine which applications can be decomposed into a number of constituent tasks or threads.
“Parallelization of monolithic applications is often possible, but seldom easy, and it’s generally easier for a big scientific code than a small embedded real-time application,” says Kissell. And to save on area, consider utilizing a more thread-heavy architecture. The idea is to maximize the performance per watt and choose an architecture that will saturate the memory and power envelope.
“To the extent that a single-threaded core cannot keep its pipeline fully utilized because of delays from memory and slow functional units, multithreading can extract throughput with a relatively modest increase in area, and in many cases the payback is superlinear,” he says.
For instance, you might achieve 30% more throughput for 15% more area in the CPU and cache subsystem. “This can be converted into a power optimization if that recovery of lost bandwidth allows the multithreaded core to run at a lower frequency than an equivalent single-threaded core, and still meet performance targets,” says Kissell.
So if your application doesn’t require significant amounts of shared data or instructions, a distributed memory scheme is probably the best candidate. “Each processing element’s memory can be sized to its dedicated tasks,” Kissell says, “and one can use different processor frequencies, different processor models, and even different processor architectures for the different processing elements to achieve the best area/power/performance values.”
But if there’s an abundance of code and/or data sharing, a symmetric configuration may be your best bet. According to Kissell, this approach “adds complexity and loses a bit of peak performance relative to a distributed memory model, because there will be some contention for the shared memory array, and because a cache-coherency protocol must be used among the cores to ensure that they all see the same values at each memory location, despite the presence of caches.”
But according to Chuck Moore, senior fellow for Advanced Micro Devices, end users may have misaligned expectations about multicore technology.
“Multicore is very good for throughput and responsiveness, but given that most applications are still serial, these actually won’t speed up on multicore,” says Moore. “Over time, there will be an increasing number of parallel applications available, but this is going to take more time than people seem to realize.”
DIFFERENT VIEWS • When it comes to multiprocessing, all “coaches” believe their team has the best strategy for winning (see “Multicore My Way” at www.electronicdesign.com, ED Online 14631). Take AMD and Intel, which have gone public about their opposite approaches to next-generation cores. Intel believes homogeneous cores are the way to go, while AMD believes the future lies in heterogeneous cores.
“Multicore solutions of tomorrow will be heterogeneous,” says AMD’s Moore. “They will initially involve the use of architecturally compatible cores with varying capabilities, but will grow to include more special-purpose and power-efficient hardware that is accessed through well-defined APIs (application programming interfaces).”
Intel and Vivace Semiconductor also have radically different views of the embedded space. “Intel’s Embedded and Communications Group estimates the percentage of multicore designs that will utilize asymmetric multiprocessing (AMP) in the next three to four years of all Embedded and Communications Group-deployed multicore platforms to be about 10%,” says Edwin Verplanke, platform solution architect with Intel’s Embedded and Communications Group.
“Once the core count meets 32 and beyond, the adoption of AMP may grow,” Verplanke adds. “Some of our customers have proprietary, often real-time operating systems that are not SMP-capable (symmetric multiprocessing). Those customers may be interested in running specific functions on separate cores. Those functions could include forwarding engines, cryptography, pattern matching, etc.”
This is in stark contrast with what Cary Ussery, president and CEO of Vivace Semiconductor, believes. He says that AMP makes up about 90% of all embedded multicore designs. Should it surprise us that two professionals from different organizations hold exactly opposite views of the market (see “Symmetric Multiprocessing Vs. Asymmetric Processing” at www.electronicdesign.com, Drill Deeper 17693)? Or is this just another example of an industry segment plagued with semantic problems (see “The Semantics Of Multiprocessing,” Drill Deeper 17694)?
SYSTEM OPTIMIZATION • Once you’ve chosen the architecture for your next system, assuming a multiprocessing environment, you’ll likely need to review your code to determine how to naturally take advantage of multiple cores and/or threads.
Heterogeneous multiprocessing requires an up-front understanding of how to best partition your application code to exploit the available threads/cores. In other words, how can your application best be broken up into smaller pieces? Homogeneous multiprocessing generally has no such requirement, since the operating system will handle most of the partitioning based on some basic task definitions and up-front tweaks.
Part of parallelism today is virtualization and knowing when to use it. According to Intel, if your legacy code has low performance requirements, it may be a good candidate for virtualization. But Rick Hetherington, Sun’s CTO of Microelectronics for the Niagara program, offers a slightly different opinion.
“It doesn’t make sense to virtualize a single core,” says Hetherington. Of course, Sun’s perspective is likely more relevant in the general computing space. The embedded space allows for virtualization of even a single core when the complexity permits it.
If you’re new to a multiprocessing environment, consider trying out incremental “what-if” scenarios to find bottlenecks and candidates for parallelization. You may also find the need to port your code to a standard operating system that’s designed to take advantage of multiprocessing architectures, such as Linux.
If porting millions of lines of code isn’t an option, a hypervisor may be your best bet. Another approach is to offload common tasks from cores, such as data encryption and decryption. This will free up the core for more general-purpose tasks.
MULTICORE’S FUTURE • Anant Agarwal, professor at the Massachusetts Institute of Technology and CTO of semiconductor startup Tilera, said at this year’s Multicore Expo in Santa Clara that the tools to program and debug multicore ICs are in the “dark ages.” Apparently, quite a few unemployed cores and threads are out there looking for work. But the problems aren’t just related to tools.
“First-generation multicore processors have been a simple integration of a group of cores into an SoC (system-on-a-chip),” says Dan Bouvier, director of Solutions Architecture for AMCC. This has translated to rather poor performance scaling due to the overhead required to handle multiprocessing and memory bottlenecks.
“The forthcoming generation of multicore processors will need more attention toward interprocessor dynamics and how they impact the software deployment and performance,” says Bouvier. “The primary challenge in integrating upper-layer (above layer 3) accelerators in asymmetric multiprocessor subsystems is the lack of standard, agreed-to APIs.”
Such a standard exists for computer graphics in OpenGL, which defines a cross-language, cross-platform API for producing 2D and 3D graphics. Unfortunately, with no tool standards built around open-source APIs driven by industry experts across multiple segments, we have to work with what’s available today and perhaps rethink our design strategies.
“The programming model and software stack are the key enablers (or inhibitors) for taking multicore to the next level,” says AMD’s Moore. “By working closely with our software colleagues, we will come up with solutions that offer tremendous value to our customers.”
And what’s happening on the software front? “There is a fundamental shift in multiprocessor design, with an associated change in the software paradigms and models used, as multicore, coherence, and formal interprocessor communication schemes are adopted,” says John Goodacre, program manager for multiprocessing at ARM.
So not only is this shift causing a general rift in the embedded community, it also forces the systems engineer to rethink the decision process. “There are principle changes across the hardware and software as SoC designers consider the move from ARM plus DSP to multicore plus DSP plus accelerators plus RISC and the challenges of memory coherence, consistency, and task synchronization,” says Goodacre.
Part of this fundamental shift by systems engineers must be to rethink their design approach. If it’s a bottom-up approach in which the processing requirements are determined based on performance, memory, and other system-related parameters, it could spell disaster downstream.
“If you are thinking about business application software, you think from the top down, from the software to the hardware. In the embedded space, we still think from the bottom up, which creates slow development processes and missed opportunity because the product gets out too late,” says Michel Genard, vice president of marketing for Virtutech.
According to Genard, around 50% of embedded designs never see the light of day because of this flawed way of thinking, and designs are driven based on performance parameters and not business requirements. “Instead, we need iterative hardware and software development that speeds overall time-to-market,” says Genard.
To improve your chances, consider a system-level virtualizing approach to the software rather than a component-level approach (see the figure). When done right, Genard notes, this approach “provides the speed, scalability, and control necessary for successful concurrent software/hardware development.”
HERE TO STAY • As more silicon is delivered with multicore architectures, marketing departments for companies large and small will continue to find uses for them. Therefore, we must embrace multicore and continue to research the ideal architecture/software mix to stay on the leading edge.
To that end, it would appear all paths for multicore lead to parallel programming, along with more sophisticated architecture solutions for intra-processor communications and the promise of software transactional memory. According to Agarwal, we must change how cores are connected and determine the ideal size of resources, arguing that distributed meshes and smaller cache sizes are the wave of the future.
With so many problems plaguing multicore, there’s a huge potential for startups to take the lead and maybe one day find a job for all of those unemployed threads and cores.