Optimized Processor Blocks Eliminate The Gamble With RISC For SoC Designs

The latest RISC CPUs, DSP cores, and software-definable processors give designers easy-to-integrate blocks that deliver top-notch performance.

Dave Bursky

May 1, 2000

Implementing a system-on-a-chip (SoC) has gotten easier year after year. Those chips often house at least one CPU and, in many cases, a companion digital-signal processor. Until recently, though, the CPU and DSP blocks available to designers were somewhat rigid. Their feature sets were fixed, and they usually had a predetermined physical layout. That layout enabled the core providers to guarantee the processor's performance, but it kept SoC designers from truly optimizing the chip layout. Fixed-size and fixed-shape blocks restrict the topology of the rest of the chip.

It's a new world now. The latest core releases have been optimized for SoC designs. Many are still available in "hard" form, with the physical layout predefined. But an increasing percentage come in synthesizable form, which lets the SoC designer control the implementation. The burden then falls on the designer or the synthesis tools to do the best job possible in implementing the core. The synthesized core's performance, or clock speed, can often be 10% to 50% slower than the core supplier's optimized version.

The choice facing designers is quite broad. Basic CPU cores are available in word sizes ranging from 4 to 64 bits. In the DSP arena, 16-bit engines are the most popular. Straddling the processor and DSP worlds are some merged CPU/DSP engines, along with most of the recently released very-long-instruction-word (VLIW) processor cores.

Memories And Peripherals Also

Of course, a CPU or DSP core doesn't stand alone. Many companies provide fixed or compilable memory blocks that can be used for on-chip caches or other blocks. Plus, a lot of intellectual-property (IP) suppliers can deliver the peripheral functions for the desired I/O interfaces and control functions.

Along with those processors, memory, and I/O functions, the on-chip bus that interconnects all of the blocks has become crucial. As the operating speed of every block increases, the rate at which data can move from block to block is rapidly becoming a performance-limiting factor.

The bus must be robust enough to tie into a wide variety of IP blocks with little to no additional logic. It then becomes the "universal on-chip backplane" to which all of the IP blocks can be tied.

Many large suppliers of application-specific ICs (ASICs) and application-specific standard products (ASSPs) have developed just such bus structures. Companies like IBM, LSI Logic, Motorola, Philips, and others offer silicon backplanes that are either developed in house or based on licensed IP.

Both IBM and Motorola, for example, have independently developed on-chip buses and specifications that let third-party blocks connect gluelessly to the bus. The specifications define how those third-party-designed blocks can be implemented. IBM's approach, dubbed CoreConnect, works with the superscalar PowerPC 440 core. With it, system designers can craft SoCs around a processor that hits 720 MIPS when clocked at 400 MHz. Its bus models and IP are available for licensing. Information can be found on the web site: www.ibm.com/microelectronics.

Like most other ASIC suppliers, LSI Logic spent much of its past crafting proprietary on-chip buses. The company is now taking an industry-standard approach, however, focusing on using the latest version of the advanced-microcontroller bus-architecture (AMBA) interface. Advanced RISC Machines (ARM) designed AMBA as the main core interface and coprocessor interconnect structure for its own family of ARM RISC processor cores.

Currently, close to 30 companies have licensed the ARM and AMBA interfaces, making AMBA one of the most widely used interfaces. LSI Logic is busy unifying the bus interfaces on all of the processor cores and high-end peripheral products that it offers, such as its MIPS, ZSP, and ARM cores. Offering AMBA interfaces speeds system design. The company also hopes to leverage IP developed by other ARM suppliers, which can simply be "bolted" onto the AMBA interface.

Two Buses Defined

The AMBA 2.0 interface specification actually defines two buses. The high-speed processor interface is referred to as the advanced high-performance bus (AHB). The second, slower interface targets peripheral support. It's called the advanced peripheral bus (APB).

Both buses are single-edge clocked and they multiplex the address and data lines. But the AHB interface can be defined with bus widths as large as 1024 bits. It also will support up to 16 masters. The bus boasts a split-transaction protocol and a burst-access mode with programmable-block sizes.
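The burst mode is easiest to picture as the address sequence a bus master drives. The AMBA 2.0 AHB specification defines incrementing (INCR) and wrapping (WRAP) bursts of 4, 8, or 16 beats; a wrapping burst stays inside an aligned block whose size is the number of beats times the transfer size. The Python sketch below models only that address arithmetic, under those assumptions. It ignores handshaking and split responses, and it does not enforce the rule that incrementing bursts must not cross a 1-kbyte boundary. The function name is illustrative, not part of any ARM deliverable.

```python
def ahb_burst_addresses(start: int, beats: int, size_bytes: int, wrapping: bool) -> list:
    """Address sequence for an AHB-style burst.

    start      -- address of the first transfer (aligned to size_bytes)
    beats      -- transfers in the burst (4, 8, or 16 for fixed-length bursts)
    size_bytes -- bytes per transfer (4 for a 32-bit word)
    wrapping   -- True for a WRAP burst, False for an INCR burst
    """
    if start % size_bytes:
        raise ValueError("start address must be aligned to the transfer size")
    if not wrapping:
        return [start + i * size_bytes for i in range(beats)]
    block = beats * size_bytes                  # wrap boundary = total burst length
    base = start - (start % block)              # aligned start of the wrapping block
    return [base + (start - base + i * size_bytes) % block for i in range(beats)]

print([hex(a) for a in ahb_burst_addresses(0x38, 4, 4, wrapping=True)])
print([hex(a) for a in ahb_burst_addresses(0x38, 4, 4, wrapping=False)])
```

For a four-beat word burst starting at 0x38, the wrapping version returns 0x38, 0x3C, 0x30, 0x34, which is the order a cache-line fill wants when the critical word sits mid-line; the incrementing version simply continues to 0x40 and 0x44.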

Other players are offering alternative bus standards. Sonics Inc., Mountain View, Calif. (www.sonicsinc.com), has crafted a silicon backplane that can be used with IP from multiple sources. It allows data transfers of 640 Mbytes/s when clocked at 80 MHz. A guaranteed latency ensures that real-time deadlines can be met. (See "Tool Suite Is Strong Medicine For SoC Design Headaches," Electronic Design, Sept. 7, 1999, p. 37.)
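The Sonics figure implies a datapath width, although the article doesn't state one. Assuming one transfer completes on every clock and decimal units (1 Mbyte = 10^6 bytes), 640 Mbytes/s at 80 MHz works out to 8 bytes, or 64 bits, per cycle. The helper below is only that arithmetic:

```python
def implied_width_bits(bandwidth_bytes_per_s: int, clock_hz: int) -> int:
    """Datapath width needed if one transfer completes on every clock cycle."""
    bytes_per_cycle, remainder = divmod(bandwidth_bytes_per_s, clock_hz)
    if remainder:
        raise ValueError("bandwidth is not a whole number of bytes per cycle")
    return bytes_per_cycle * 8

# Sonics figure quoted above: 640 Mbytes/s at 80 MHz (decimal units assumed)
print(implied_width_bits(640_000_000, 80_000_000))  # 64
```

If the backplane needed more than one cycle per transfer, the same bandwidth would require a proportionally wider path.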

For the last couple of years, the Virtual Socket Interface Alliance, Los Gatos, Calif. (www.vsi.org), has been working on a silicon backplane bus-interface standard. Earlier this year, the alliance's committee released the first version of that standard, On-Chip Bus version 2, rev. 1.0 (OCB 2 1.0). It defines a standard virtual-component interface (VCI) for both IP suppliers and system-chip designers/integrators.

The group actually defined an interface signal set and a simple logic "wrapper." The wrapper can be designed to make any peripheral or add-on circuit functions compatible with the VCI standard. In the latest version of the on-chip bus specification, VCI-compatible virtual cores can operate with on-chip buses of varying protocols and performance levels.
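To make the wrapper idea concrete, here is a minimal Python model of an adapter that translates generic read/write requests into a peripheral's native register calls. The Request, Response, LegacyTimer, and Wrapper names and the register map are all hypothetical. They do not reproduce the actual VCI signal set; they only illustrate the role the wrapper plays between a standard interface and an unmodified block.

```python
from dataclasses import dataclass

@dataclass
class Request:
    """Generic bus-side request (hypothetical fields, not the VCI signal set)."""
    address: int
    write: bool
    data: int = 0

@dataclass
class Response:
    data: int = 0
    error: bool = False

class LegacyTimer:
    """A peripheral with its own ad hoc, index-based register interface."""
    def __init__(self):
        self.count = 0
        self.reload = 0

    def read_reg(self, index: int) -> int:
        return (self.count, self.reload)[index]

    def write_reg(self, index: int, value: int) -> None:
        if index == 0:
            self.count = value
        else:
            self.reload = value

class Wrapper:
    """Logic 'wrapper': maps standard requests onto the peripheral's native calls."""
    REG_MAP = {0x0: 0, 0x4: 1}          # bus byte offset -> native register index

    def __init__(self, base: int, core: LegacyTimer):
        self.base = base
        self.core = core

    def handle(self, req: Request) -> Response:
        index = self.REG_MAP.get(req.address - self.base)
        if index is None:
            return Response(error=True)  # unmapped address: report, don't guess
        if req.write:
            self.core.write_reg(index, req.data)
            return Response()
        return Response(data=self.core.read_reg(index))
```

Only the wrapper knows the timer's quirks; anything that speaks the generic request format can use the timer without modification, which is the point of a standard interface.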

The available choice of CPUs and DSPs spans a wide range of performance and power options. The largest amount of activity is clustered around the latest 32-bit RISC CPUs, as well as the 16-bit or scalable DSP building blocks. But 8- and 16-bit embedded-controller cores remain in demand, too, including the venerable Z80 from Zilog, the 65C02 and 65C816 from The Western Design Center, and the 80C51 controller originally developed by Intel, which is available from many sources.

The market is somewhat confusing, though, because the cores are available through many sources: independent IP developers, large design-tool suppliers, ASIC manufacturers and foundries, and even traditional semiconductor suppliers. In many cases, the rights to use the cores can be purchased. The cores are then transferred to the manufacturer of the designer's choosing. But more often, to get a processor core from an ASIC supplier or a foundry, the designer must fabricate the chips with that manufacturer. That's not necessarily bad, but it does restrict the ability to price shop.

If anything could be called traditional in this fast-paced area of technology, it's the traditional RISC CPU core supplier. Manufacturers such as ARM, MIPS, or Sun Microsystems provide licensable blocks of IP. ARM offers half a dozen core options with increasing levels of performance, ranging from the ARM7TDMI and ARM7T, to the ARM9 and 9E, and ARM10. The span even includes Intel's SA110, SA1100, and SA1500 StrongARM processors, which Intel acquired when it purchased the assets of Digital Equipment Corp. These cores provide computational throughputs ranging from about 30 MIPS for the ARM7TDMI to over 300 MIPS for the StrongARM. Similarly, designers considering the MIPS family can select from a wide range of cores, from the low-end MIPS16 cores to the high-end 64-bit versions, and thus choose the performance level that best meets their system needs.

Initially, all of the RISC and DSP core vendors offered "hard" versions of their cores. "Hard" cores have been completely synthesized, placed, routed, and optimized by the supplier. When added to an SoC design, they can be treated simply as an object.

As mentioned earlier, offering a hard design does allow the core supplier to guarantee all aspects of the core—speed, power, and area. Hard cores also are rigid, though. They have a fixed physical shape and are implemented with specific process rules. So as faster or more economical processes become available, they can't easily move from process to process. The final chip design would have to be manufactured only by processes approved by the core vendor.

Such manufacturing restrictions have pushed both core and ASIC suppliers to develop "soft" processor versions. These versions are often supplied in the form of an RTL/HDL file that can be integrated into the rest of the SoC design. After that's done, the entire design can be synthesized with the design rules supplied by the target foundry or ASIC supplier.

During the synthesis and layout processes, designers can make various choices that may affect the final placed-and-routed core. Initially, they might tell the synthesizer to optimize the implementation for one or more aspects, like low power, maximum clock rate, smallest chip area, or some combination of these aspects.

Of course, there are drawbacks to this approach. Perhaps foremost in the designer's mind is the issue of functionality once the core is synthesized and laid out. A lot of finger-pointing can take place, with each side blaming the other for mistakes when the SoC or core doesn't work. The issues of performance, power consumption, and area also come into play. The synthesis and most of the placement-and-routing tools usually run automatically with little human intervention. Their results, then, tend to be inferior to those achieved with hand-optimized hard cores, but they can still be adequate for the desired application.

With the popularity of the ARM and MIPS CPU cores, more companies are forming to make their own versions of software-compatible implementations. Main core suppliers also have been inspired. They're adding core features or versions to better address evolving market needs.

To more easily move its core from process to process, LSI Logic developed a "soft" version of its licensed ARM7TDMI. The company can now offer higher-speed options than were possible with the hard core. That's because the soft core could be implemented on higher-performance processes.

Synthesizable versions of ARM's processor cores have also started to appear from ARM itself. In addition to revamping the ARM7 series for synthesis, ARM has developed synthesis-ready higher-performance versions, the 9 and 9E, which offer throughputs in the 80- to 100-MIPS range. The 9E also includes DSP-optimized instructions and support logic. Such features could eliminate the need for a side-by-side DSP core on the chip or a standalone DSP chip in the system.

The company's forthcoming ARM10 will offer the Thumb instruction mode, much like the ARM7TDMI and ARM9TDMI cores. Optional building blocks will include a vector floating-point unit, which contains a multiply-accumulate pipeline made up of a seven-stage ALU pipe and a five-stage load/store pipe. The design also includes a cached integer core with large on-chip caches. The integer unit will execute the ARM version-5 instruction set and the Thumb instructions.

The version-5 set adds some operations to the command repertoire. The CPU will be able to count leading zeros, thanks to a new ALU instruction. Designed for process-independence, the core is initially targeted at 0.25-µm processes. It'll be able to clock at up to 300 MHz and deliver over 400 MIPS. The optional vector unit can deliver a sustained throughput of 600 MFLOPS.
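Count-leading-zeros is mostly used to normalize fixed-point values and to locate the highest set bit in a mask, jobs that otherwise take a bit-by-bit loop. The Python sketch below shows the operation's semantics on a 32-bit word; returning 32 for an all-zero input matches ARM's CLZ. The normalize helper is a hypothetical example of a typical use, not an ARM library routine.

```python
def clz32(x: int) -> int:
    """Count leading zeros in a 32-bit unsigned word (32 when x == 0)."""
    if not 0 <= x < 1 << 32:
        raise ValueError("x must fit in 32 unsigned bits")
    return 32 - x.bit_length()

def normalize(x: int):
    """Shift x left until its most significant set bit reaches bit 31.

    Returns the normalized word and the shift count (the block exponent).
    With a CLZ instruction this takes one operation instead of a loop."""
    shift = clz32(x)
    return (x << shift) & 0xFFFFFFFF, shift

print(clz32(1), clz32(0x80000000), clz32(0))    # 31 0 32
print(normalize(0x00010000))                    # (2147483648, 15)
```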

A newcomer to the ARM camp, picoTurbo, has jumped into the core arena with two implementations: the pT100 and pT110. Both can execute the instruction set of the older ARM 4T. They'll be available for use with 0.25- or 0.18-µm design rules.

The pT110 adds configurable instruction and data caches, each ranging from 2 to 64 kbytes, to the basic CPU core. Both cores employ a five-stage pipeline and a 32-bit Wallace-tree multiplier-accumulator. The 0.25-µm implementations can run at clock speeds of up to 100 MHz. When implemented with the smaller design rules, core speeds will increase to 250 MHz.
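A Wallace tree speeds multiplication by summing partial products with layers of carry-save (3:2) adders, which never propagate a carry, and leaving a single carry-propagate addition for the very end. The Python sketch below models that arithmetic, assuming unsigned operands. It treats each partial-product row as a Python integer, so one bitwise carry-save step stands in for a full adder in every bit column at once. Folding the accumulator in as one more row shows how a multiplier-accumulator can reuse the same tree. It is not picoTurbo's circuit, and the function names are illustrative.

```python
def partial_products(a: int, b: int, width: int) -> list:
    """One shifted copy of a for every set bit of b (unsigned operands)."""
    return [a << i for i in range(width) if (b >> i) & 1]

def carry_save_add(x: int, y: int, z: int):
    """A row of full adders: three addends in, a sum row and a carry row out.

    x + y + z == s + c, but no carry ripples across bit positions."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def wallace_mac(acc: int, a: int, b: int, width: int = 32) -> int:
    """Compute acc + a * b, feeding the accumulator into the tree as one more row."""
    if not (0 <= a < 1 << width and 0 <= b < 1 << width and acc >= 0):
        raise ValueError("operands must be unsigned and fit in width bits")
    rows = partial_products(a, b, width) + [acc]
    # Each level turns every group of three rows into two until two remain
    while len(rows) > 2:
        grouped = len(rows) - len(rows) % 3
        next_rows = []
        for i in range(0, grouped, 3):
            next_rows.extend(carry_save_add(*rows[i:i + 3]))
        next_rows.extend(rows[grouped:])     # 0 to 2 leftover rows pass through
        rows = next_rows
    # Only here does a carry-propagate adder run (modeled by Python's sum)
    return sum(rows)

print(wallace_mac(10, 1234, 5678) == 10 + 1234 * 5678)   # True
```

The number of rows shrinks by roughly a third per level, so the tree depth grows with the logarithm of the operand width rather than linearly, which is why the structure suits a fast single-cycle MAC.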

The company actually is eyeing low-power and very area-efficient systems as its target. The pT100 core, for example, will consume just 0.45 mW/MHz and occupy an area of only 0.9 mm².

Soft cores are making their mark in the MIPS processor camp, too. MIPS Technologies' most recent core is the MIPS64 5Kc, a synthesizable 64-bit processor that implements the MIPS64 instruction-set architecture (ISA). It's backward-compatible with the MIPS32 ISA. DSP instructions have been added, so single-cycle 16-bit multiply-accumulate operations can be done. The processor also can count leading zeros or ones.

Configuration Options

The 5Kc can be configured for cache size, ranging from 4 to 16 kbytes, as well as the number of ways for set associativity, the use of a TLB, scratchpad memory, and register-file size. When clocked at 375 MHz, the core can deliver a throughput of about 450 MIPS at 0.18-µm design rules. Excluding the caches, it consumes about 1.4 mW/MHz and occupies an area of 2 mm².
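Per-MHz power ratings are easier to compare once they're turned into totals. The sketch below applies the 5Kc figures quoted above, assuming power scales linearly with clock frequency at a fixed supply voltage; the helper name is illustrative.

```python
def core_figures(clock_mhz: float, mips: float, mw_per_mhz: float) -> dict:
    """Derive total core power and MIPS/MHz from datasheet-style numbers.

    Assumes power scales linearly with clock frequency at a fixed supply."""
    return {
        "power_mw": clock_mhz * mw_per_mhz,
        "mips_per_mhz": mips / clock_mhz,
    }

# MIPS64 5Kc figures quoted above (0.18-um design rules, caches excluded)
fig = core_figures(clock_mhz=375, mips=450, mw_per_mhz=1.4)
print(round(fig["power_mw"]), round(fig["mips_per_mhz"], 2))  # 525 1.2
```

So the core logic alone would draw roughly half a watt at full speed, before any cache power is added, while delivering about 1.2 instructions per clock.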

Several 32-bit versions of Lexra cores also can execute MIPS instructions. Due to patent issues, though, the cores don't execute the unaligned load and store instructions in hardware. The LX5280 version of the original LX4180 implementation has enhancements that extend the instruction set and architecture. They now include on-chip resources and instructions to handle DSP operations more efficiently (Fig. 1).
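The instructions at issue are the MIPS unaligned load/store pairs (LWL/LWR and their store counterparts), which merge bytes from two adjacent aligned words. The article doesn't say how Lexra-based systems handle unaligned accesses; one common approach is a software sequence, or an exception handler, that composes the value from two aligned loads plus shifts. The Python sketch below models that combine step for a big-endian 32-bit load. It is an illustration under those assumptions, not Lexra code.

```python
def load_word_aligned(memory: bytes, addr: int) -> int:
    """Aligned 32-bit big-endian load, the only kind assumed to exist in hardware."""
    if addr % 4:
        raise ValueError("aligned loads only")
    if addr + 4 > len(memory):
        raise IndexError("load past end of memory")
    return int.from_bytes(memory[addr:addr + 4], "big")

def load_word_unaligned(memory: bytes, addr: int) -> int:
    """Compose an unaligned load from two aligned loads, two shifts, and an OR."""
    offset = addr % 4
    base = addr - offset
    if offset == 0:
        return load_word_aligned(memory, base)
    hi = load_word_aligned(memory, base)        # word holding the first bytes
    lo = load_word_aligned(memory, base + 4)    # word holding the remaining bytes
    shift = 8 * offset
    return ((hi << shift) | (lo >> (32 - shift))) & 0xFFFFFFFF

mem = bytes(range(16))
print(hex(load_word_unaligned(mem, 1)))   # 0x1020304
```

The cost is two memory accesses plus a few ALU operations per unaligned load, which is why the hardware instructions exist in the first place.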

To better support DSP computations, designers at Lexra added a dual multiplier-accumulator that can perform two 16-bit or one 32-bit multiply every cycle. A second instruction pipeline also was thrown in to increase the instruction-level parallelism. That allows the superscalar core to simultaneously execute memory operations and DSP commands, improving inner-loop performance in DSP algorithms.
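A dual 16-bit MAC doubles inner-loop throughput by packing two samples into each 32-bit operand and updating two accumulators at once. The Python sketch below models that behavior for a dot product, the core of most filter loops. The lane layout (low halfwords in lane 0, high halfwords in lane 1) and the function names are assumptions for illustration, not Lexra's instruction definitions, and the accumulators here never overflow, whereas real hardware uses guard bits of finite width.

```python
def to_s16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pack(lo: int, hi: int) -> int:
    """Pack two int16 values into one 32-bit operand."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def dual_mac16(acc0: int, acc1: int, a_packed: int, b_packed: int):
    """One 'cycle' of a dual MAC: both halfword lanes multiply and accumulate."""
    a0, a1 = to_s16(a_packed), to_s16(a_packed >> 16)
    b0, b1 = to_s16(b_packed), to_s16(b_packed >> 16)
    return acc0 + a0 * b0, acc1 + a1 * b1

def dot_product(x: list, h: list) -> int:
    """Dot product of two even-length int16 lists, two products per step."""
    acc0 = acc1 = 0
    for i in range(0, len(x), 2):
        acc0, acc1 = dual_mac16(acc0, acc1, pack(x[i], x[i + 1]), pack(h[i], h[i + 1]))
    return acc0 + acc1          # combine the two partial sums at the end

print(dot_product([1, -2, 3, 4], [5, 6, -7, 8]))  # 4
```

An N-tap loop thus needs N/2 MAC steps instead of N, and the two partial sums are combined once after the loop.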

A second register and ALU were added to increase the resources available to compilers for DSP algorithms. MIPS has not endorsed Lexra's DSP enhancements, since it has its own plans to offer DSP support. Even so, the 36 Radiax DSP instructions offered by Lexra greatly improve the speed of algorithms that rely heavily on multiply-accumulate operations.

Aside from the companies that have specifically set themselves up to offer CPU or DSP cores, some standard silicon providers are willing to license some of their key cores to large customers. Motorola, Hitachi, Infineon, and STMicroelectronics are just a few. Motorola, for example, has licensed its MCore RISC processor to several companies. It requires very little chip area and is available either as a hard core or in synthesizable form. Licensees already include Atmel, which will use it in smart-card applications. Lucent has plans for it in communications systems as well, while Yamaha wants to use it for consumer products. Universities also are taking part, including the VLSI Design Education Consortium coordinated by the University of Tokyo in Japan.

The synthesized form of the M210 core also has been mapped onto a large FPGA. Although the FPGA version is useful for functional verification, it typically runs at only one-quarter to one-tenth the speed of a hand-packed core, or 10 to 12 MHz. But further optimization should be able to increase the speed to between 15 and 20 MHz.

Bus Standard On The Web

In addition to the core, Motorola has developed a standard on-chip bus definition that designers can download from its web site. Other company processors have been licensed, too, such as the ColdFire, the 68HC11, and even the M68000. In some cases, the processor licenses aren't just sold, but exchanged for IP to help Motorola build up its library of building blocks for SoC designs.

The M200 core can deliver a throughput of between 20 and 30 MIPS. The next core design, the M300, has a performance level of between 90 and 100 MIPS. That allows it to handle more complex tasks.

At the high end of the current family is the M500, which the company hopes to sample later this year. That chip will deliver a throughput of roughly 500 to 600 MIPS. Aside from the base-model M300, Motorola will put out versions that pack caches, a floating-point coprocessor, a memory-management unit, and more. The web site has specifics on this: www.motorola.com/mcore.

The company feels that distribution is one area that could open up licensing interest. Licensing cores lets distributors offer value-added design services. These services then help customers get chips to market a lot more quickly.

Totally Optimized ASICs

In the world of ASICs, letting customers "have it your way" and optimize everything, all the way down to specifics like cache sizes and register widths, would impose too much overhead on most core suppliers. But engineers who must fully optimize the features, and even the instruction set, of the processor they need should consider the technology offered by ARC Cores and Tensilica. Both companies have developed tool suites and processor architectures that can easily be modified, almost by pointing and clicking, through a Windows-like graphical user interface.

ARC's solution is a basic 32-bit RISC processor architecture with a four-stage pipeline and a memory controller (Fig. 2). Designers can add caches or a math coprocessor to the architecture. They also can modify it by increasing or decreasing the number of register files, expanding or reducing the bus widths, adding new instructions, or deleting some commands. The processor includes 16 base-case instructions with 12 additional variations, giving programmers a set of 28 arithmetic and logical instructions.

Also available are 16 dual-operand and 53 single-operand instructions. With them, designers can create application-specific instruction extensions. Core performance depends entirely on the process used to fabricate the chip and on the complexity of the final SoC solution. The basic core requires only about 8400 gates and can run at clock speeds of up to about 160 MHz when fabricated on a 0.25-µm process. More advanced processes will deliver faster processors.

The company has finished developing its latest enhancements to the basic core, the ARC 3. It now includes an integrated set of modules for data movement and processing, together with tools that make it easy for programmers to tune the performance.

The tool suite houses a software library of DSP functions, along with a debug capability to handle multiple ARC cores in the same system and hardware breakpoints and watchpoints. Added DSP hardware extensions include a multiply-accumulate block, X-Y memories, saturating add/subtract functions, and instruction-cache options.

When the core homes in on applications like third-generation cell phones, designers can leverage a process such as the Coolib 1.2-V technology from Xemics S.A., Neuchatel, Switzerland (www.xemics.ch). Without the cache, it's then possible to achieve power consumption of about 0.05 mW/MHz, with a top clock speed hitting around 40 MHz.

The key to the ease of development for ARC is the Windows-based tool suite that permits designers to configure everything. When new instructions are added and undesired ones are removed, the tools will even update the compilers and assemblers. That way, designers can rapidly develop the software needed by the application.
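The idea that the assembler follows the hardware configuration can be sketched in a few lines: derive the assembler from the same opcode table that describes the configured core, so removed instructions are rejected and new ones are accepted automatically. Everything in the Python sketch below (BASE_ISA, the opcode values, the 6-bit-opcode/three-register encoding, and the "mac" extension) is hypothetical and is not ARC's instruction format or tool interface.

```python
BASE_ISA = {"add": 0x00, "sub": 0x01, "and": 0x02, "or": 0x03}

def configure(base: dict, add=None, remove=()) -> dict:
    """New opcode table: unwanted instructions dropped, extensions appended."""
    table = {name: op for name, op in base.items() if name not in remove}
    for name, opcode in (add or {}).items():
        if opcode in table.values():
            raise ValueError(f"opcode {opcode:#x} already in use")
        table[name] = opcode
    return table

def make_assembler(table: dict):
    """Generate an assembler that accepts exactly the configured instructions.

    Hypothetical encoding: 6-bit opcode followed by three 5-bit register fields."""
    def assemble(line: str) -> int:
        mnemonic, operands = line.split(maxsplit=1)
        if mnemonic not in table:
            raise ValueError(f"{mnemonic!r} is not in this configuration")
        rd, rs, rt = (int(r.strip().lstrip("r")) for r in operands.split(","))
        if not all(0 <= r < 32 for r in (rd, rs, rt)):
            raise ValueError("register number out of range")
        return (table[mnemonic] << 26) | (rd << 21) | (rs << 16) | (rt << 11)
    return assemble

# Drop "or", add a hypothetical multiply-accumulate extension, regenerate the tool
isa = configure(BASE_ISA, add={"mac": 0x10}, remove=("or",))
asm = make_assembler(isa)
print(hex(asm("mac r1, r2, r3")))  # 0x40221800
```

Keeping one table as the single source of truth is what lets the hardware and the software tools stay in step as the instruction set changes.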

The Xtensa design system from Tensilica also lets designers configure and deconfigure a high-performance 32-bit RISC processor. Available as a synthesizable core, this processor leverages a Windows-based design tool suite. Targeted at processes of 0.25 µm and smaller, the Xtensa processor contains about 25,000 gates and employs a five-stage pipeline. It can deliver over 220 MIPS when clocked at 200 MHz.

The base architecture consumes a very small section of the SoC silicon. When fabricated on a 0.18-µm process, it uses an area of just 0.7 mm². Even though it's a bit more rigid in its ability to alter features, the processor does allow the addition of caches. It supports on-chip coprocessors, a 16-bit multiplier-accumulator, and many other I/O solutions. (See "Tool Suite Enables Designers To Craft Customized Embedded Processors," Electronic Design, Feb. 8, 1999, p. 33.) When implemented with 0.18-µm design rules, the processor ends up with power consumption of less than 0.4 mW/MHz if powered by a 1.8-V supply.

The DSP world mimics many of the same core activities. A number of traditional DSP chip suppliers also offer their cores for use in custom designs. Among those holding up their DSP-optimized cores as building blocks to tackle different performance ranges are Texas Instruments, Lucent/Motorola (StarCore), Infineon, Analog Devices, Philips, and Zilog. Additional companies have developed DSP building blocks specifically as IP. Some of these include The DSP Group, 3DSP, Improv Systems, Massana, ZSP (now part of LSI Logic), and BOPS.

Wide DSP-Block Performance Range

These companies will license their cores to either foundries or to system design companies that plan to embed the cores into larger SoCs. The range of performance spanned by the DSP blocks starts at about 20 MIPS or so for the low-end 16-bit integer engines to well over 200 MIPS for the higher-end coprocessor engines.

The MIPS performance measurement with DSP blocks doesn't really tell the full story, because one instruction may actually cause many other smaller operations to take place in parallel. Say the block has multiple ALUs or multiplier-accumulators that can each perform their computations in parallel with the other blocks. So rather than millions of instructions/s, many companies prefer a rating in terms of millions or billions of operations/s (MOPS or BOPS) to provide a better sense of the throughput available to tackle different applications.
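Dividing a quoted rate by the clock frequency shows how much parallel work each cycle must carry, which is what MOPS and BOPS ratings capture and MIPS ratings hide. The sketch below applies that division to figures quoted in this article; the 100-MIPS, four-operations-per-instruction case is purely hypothetical.

```python
def per_cycle(rate_per_s: float, clock_hz: float) -> float:
    """Average work per clock implied by a quoted instruction or operation rate."""
    return rate_per_s / clock_hz

def mops(mips: float, ops_per_instruction: float) -> float:
    """Millions of operations/s when each instruction launches several operations."""
    return mips * ops_per_instruction

print(per_cycle(4_000_000_000, 200_000_000))   # 20.0  (BOPS2010, 16-bit data)
print(per_cycle(1_300_000_000, 100_000_000))   # 13.0  (Jazz processor)
print(per_cycle(200_000_000, 100_000_000))     # 2.0   (FILU-200, instructions)
print(mops(100, 4))                            # 400   (hypothetical core)
```

A core rated at 100 MIPS whose instructions each drive two multiplies and two adds is doing 400 million operations/s, which a MIPS figure alone would understate fourfold.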

Some of the more interesting "roll-your-own" DSP capabilities come from BOPS, Improv Systems, LSI Logic (ZSP), and Massana. BOPS has adopted a business model that's similar to what ARM has established—a licensing fee, royalty arrangements, and some service and consulting. The ManArray architecture is a highly repetitive array architecture of processing elements. It includes three levels of parallelism and a single-cycle switching fabric for moving data between processing elements.

The heart of the array is the BOPS2010, a configurable DSP core that comes in three versions. The simplest supports fixed-point, another handles 32-bit IEEE 754-compatible floating-point computations, and the third does both. When clocked at 200 MHz, the core can deliver 4 billion operations/s (for 16-bit computations), and up to 8 BOPS for computations on 8-bit data. The floating-point version can perform 240 million floating-point MAC operations/s. That high throughput would let the ManArray perform a 256-point complex integer fast Fourier transform in just 1824 clock cycles, or require just 19 cycles for a real, 16-tap FIR filter (16-bit data).
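For reference, here is what a real 16-tap FIR filter on 16-bit data computes: each output is the sum of 16 products of coefficients and recent samples. The Python sketch below is a scalar, direct-form model assuming Q15 fixed-point coefficients, full-precision accumulation, then rounding and saturation back to 16 bits. It does not show how the ManArray spreads those 16 multiply-accumulates across its processing elements to reach the quoted 19-cycle figure.

```python
def fir_q15(samples: list, taps: list) -> list:
    """Direct-form FIR: y[n] = sum(h[k] * x[n - k]), int16 data, Q15 coefficients.

    Products are summed at full precision, then rounded and saturated back to
    16 bits, as a fixed-point DSP would do."""
    history = [0] * len(taps)              # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]
        acc = sum(h * s for h, s in zip(taps, history))
        y = (acc + (1 << 14)) >> 15        # round the Q30 sum back to Q15
        out.append(max(-32768, min(32767, y)))
    return out

# Impulse response of a 16-tap filter whose coefficients are all 0.5 (16384 in Q15)
print(fir_q15([32767] + [0] * 15, [1 << 14] * 16))
```

Every output needs all 16 multiply-accumulates, so the filter's speed depends directly on how many MACs a core can perform in parallel each cycle.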

For programming the operations, company designers crafted an indirect VLIW scheme. At the first level of parallelism, it leverages long instruction words without some of the bit overheads that those instructions might imply. And with multiple processor blocks in the ManArray, the array can leverage the single-instruction, multiple-data nature of signal and image processing. Multiprocessing capabilities are made possible by adding ManArray blocks on the SoC or employing multiple ManArray-based chips to get more concurrent horsepower.
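One way to get long-instruction-word parallelism without fetching the long word every cycle is to store the long words ahead of time inside each processing element and invoke them with a short instruction that carries only a slot number. Broadcasting that short instruction to every element then gives SIMD operation as well. The Python model below is a hypothetical sketch of that general idea; the class, the method names, and the operation format are invented and are not BOPS' instruction set.

```python
import operator

class ProcessingElement:
    """One PE with a small register file and a private store of long instruction words."""

    def __init__(self, registers):
        self.regs = list(registers)
        self.vliw_memory = {}                  # slot number -> tuple of operations

    def load_vliw(self, slot, ops):
        """Store a long instruction word once, ahead of time (e.g., before a loop)."""
        self.vliw_memory[slot] = tuple(ops)

    def execute(self, slot):
        """Run every operation in the stored word; all read the pre-instruction state."""
        old = list(self.regs)
        for dest, fn, src_a, src_b in self.vliw_memory[slot]:
            self.regs[dest] = fn(old[src_a], old[src_b])

def broadcast_execute(pes, slot):
    """A short instruction carrying only a slot number drives every PE (SIMD)."""
    for pe in pes:
        pe.execute(slot)

# Two operations packed into one stored word: r2 = r0 + r1 and r3 = r0 * r1
WORD = [(2, operator.add, 0, 1), (3, operator.mul, 0, 1)]
pes = [ProcessingElement([1, 2, 0, 0]), ProcessingElement([3, 4, 0, 0])]
for pe in pes:
    pe.load_vliw(0, WORD)
broadcast_execute(pes, 0)
print([pe.regs for pe in pes])   # [[1, 2, 3, 2], [3, 4, 7, 12]]
```

Inside a loop, only the short instruction is fetched each iteration, so instruction-memory bandwidth stays low even though every element executes several operations per cycle.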

Although its architecture is considerably different, the ZSP approach from LSI Logic does leverage aspects of RISC CPU design. In doing so, it creates a high-performance, fixed-point, superscalar DSP building block. Those 16-bit integer blocks contain dual 16-bit multiplier-accumulators and dual ALUs. Multiple integer blocks can be co-integrated. They then form a small SIMD array, which speeds computations. The acquisition of ZSP provides LSI Logic with a high-speed compute block that can be used in many communications applications, such as base stations and voice-over-IP systems.

Support For Multiple Blocks

A newcomer to the market has made its mark by crafting a VLIW-like architecture that can support multiple compute blocks. These blocks can be aimed at DSP applications. At the heart of this concept, Improv Systems put a repetitive compute block called the Jazz processor. That processor shares space with a programmable-system-architecture (PSA) platform, which is specifically targeted at SoC designs. Think of the PSA platform as an SoC that contains multiple instances of the compute block, or Jazz processor, as well as shared on-chip memory and programmable I/O blocks.

Each Jazz processor, in turn, contains multiple computational units (CUs) that can be ALUs, MACs, shifters, etc. Also found within it are multiple memory interfaces to local data memories and a task-control unit. The CUs include enhanced operations for DSP algorithms, like saturation, rounding, and absolute value.
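Saturation, rounding, and absolute value matter because plain two's-complement arithmetic wraps around on overflow, turning a loud positive sample into a large negative one. The Python sketch below shows the 16-bit semantics these operations typically have on DSPs. The function names are illustrative, and round-half-up is one common rounding mode among several.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturate16(x: int) -> int:
    """Clamp to the int16 range instead of letting the value wrap around."""
    return max(INT16_MIN, min(INT16_MAX, x))

def wrap16(x: int) -> int:
    """What a plain 16-bit adder would produce (two's-complement wraparound)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def add_sat16(a: int, b: int) -> int:
    return saturate16(a + b)

def abs_sat16(a: int) -> int:
    # -32768 has no positive int16 counterpart, so it saturates to 32767
    return saturate16(abs(a))

def round_shift(x: int, n: int) -> int:
    """Arithmetic right shift by n bits, rounding half up rather than truncating."""
    if n <= 0:
        return x
    return (x + (1 << (n - 1))) >> n

print(wrap16(30000 + 10000), add_sat16(30000, 10000))  # -25536 32767
```

In audio, the wrapped result is a loud click, while the saturated result is mild clipping, which is why DSP instruction sets make saturation a single-cycle operation.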

A typical combination of compute units on a Jazz processor might include three ALUs, one MAC, one shifter, one counter, and one byte-swap unit. In addition, there would probably be a 32-bit datapath and a 32-word-deep task queue. The processor also employs a 240-bit-wide instruction word (Fig. 3). Rather than store full 240-bit-wide instructions, however, designers will actually use a compressed mode to reduce the memory image. When clocked at 100 MHz, the processor block can deliver a throughput of 1.3 BOPS.
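The article doesn't describe Improv's compression format. A common way to shrink VLIW code is to drop empty (NOP) slots and store a small mask recording which slots are present. The Python sketch below illustrates that general technique under a hypothetical split of the 240 bits into eight 30-bit slots; the slot layout and names are assumptions.

```python
SLOTS = 8          # hypothetical split: 8 slots x 30 bits = 240 bits
SLOT_BITS = 30
NOP = 0

def compress(slots: list):
    """Return (mask, fields): bit i of mask is set when slot i holds a real op."""
    if len(slots) != SLOTS:
        raise ValueError(f"expected {SLOTS} slots")
    mask, fields = 0, []
    for i, op in enumerate(slots):
        if not 0 <= op < (1 << SLOT_BITS):
            raise ValueError("slot value does not fit in SLOT_BITS")
        if op != NOP:
            mask |= 1 << i
            fields.append(op)
    return mask, fields

def expand(mask: int, fields: list) -> list:
    """Rebuild the full-width instruction, filling unused slots with NOPs."""
    remaining = iter(fields)
    return [next(remaining) if (mask >> i) & 1 else NOP for i in range(SLOTS)]

def compressed_size_bits(fields: list) -> int:
    """Stored size: the SLOTS-bit mask plus one field per non-NOP slot."""
    return SLOTS + SLOT_BITS * len(fields)

word = [0x1234, NOP, NOP, 0x55, NOP, NOP, NOP, NOP]
mask, fields = compress(word)
assert expand(mask, fields) == word
print(compressed_size_bits(fields), "bits instead of", SLOTS * SLOT_BITS)  # 68 bits instead of 240
```

Because typical code keeps only a few of the many compute units busy in any given cycle, most slots are NOPs, and skipping them saves far more memory than the mask costs.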

VLIW processors targeted at media processing have almost become the "in" thing to design. Several other VLIW engines are available as cores, including Philips' TriMedia processor, Infineon's Carmel DSP, and Fujitsu's FR500. Philips has implemented a number of application-specific chips for HDTV and set-top boxes using the TriMedia engine.

The second generation of the Carmel core was released by Infineon earlier this year. It doubles the number of possible user-definable instruction extensions with its customizable long-instruction-set architecture. As for Fujitsu's latest, it certainly isn't lagging behind. The FR500 will focus on graphics and multimedia applications. It will be able to handle high-speed floating-point computations.

Although probably not intended for use in large multicore arrays, the Massana FILU-200 DSP coprocessor core provides a loosely coupled coprocessor block that can be co-integrated with a RISC or CISC processor core. The FILU-200 block is a rudimentary processor that can deliver a throughput of 200 MIPS when clocked at 100 MHz.

The DSP coprocessor consists of three units: program control, address generation, and computational. In a typical implementation, the FILU block would be "surrounded" by program RAM and ROM, some data RAM, perhaps a Sin/Cos lookup-table ROM, a coefficient RAM, and some type of host interface (Fig. 4). Within the computational unit, Massana designers integrated dual multiplier-accumulators, each with its own accumulators and barrel shifter (see inset to Fig. 4).

With this approach, the company hopes to start a new design paradigm. In past processor-plus-DSP approaches, designers would usually develop the code for the DSP block and the companion CPU block independently. Then, they would integrate the software for the two blocks. In Massana's approach, the CPU host and FILU block are thought of as a single processor with a lot of math resources. Code is developed for this "unified" processor. When that code executes, the MIPS-intensive DSP functions are offloaded from the host engine to the FILU coprocessor. The company also has developed a library of preprogrammed DSP functions that can be tied into the host processor's software through C function calls.

Dedicated DSP cores from 3DSP and The DSP Group are two examples of full-function blocks that can be licensed and integrated. The DSP Group already has several generations of its DSP core licensed to both ASIC manufacturers and system design companies. The first two generations of cores, the Oak and Pine, were aimed at speech processing. They have throughputs in the 20 to 30 MIPS range.

A newer core, the Teak, comes in two versions. The one with a single multiplier-accumulator is the Teaklite. It can deliver about 40 MIPS. The other version, the Teak core, has a dual MAC for high-performance applications. It provides double the number of multiply-accumulates. At the high end of its DSP core family, the company offers the Palm DSP core, which can deliver a throughput of about 100 MIPS. Heavily used in audio-channel processing in cell and cordless phones, the Oak and Pine cores are widely licensed to both system design companies and ASIC suppliers like LSI Logic and Philips. The cores were actually acquired by Philips when it purchased VLSI Technology.

Attacking the higher-performance end of the DSP core market, the SC140 StarCore engine was jointly developed by Lucent and Motorola. It can deliver about 400 MIPS in its first implementation. The core includes a compute block with an array of four multiplier-accumulators. These can make short work of compute-intensive algorithms.

Claiming to deliver the equivalent of 3.2 billion RISC-equivalent instructions/s or 600 million MAC operations/s, the SP-5 core developed by 3DSP Inc. still manages to keep power consumption to less than 300 mW. Employing an approach the company calls superSIMD, or superscalar instructions/multiple data, the core has a novel memory-to-register-file capability. It also contains the load/store architecture found in many DSP architectures.

It's fully synthesizable and is based on a VHDL RTL description. When implemented with 0.18-µm design rules, the core occupies an area of merely 1.5 mm² and consumes about 150 mW.

For applications that don't demand such high computational throughput, 3DSP offers the SP-3 DSP core. This version provides about half the throughput of the SP-5. The lower throughput is a result of a reduction in resources. The core packs a single MAC instead of the dual MAC computational block.

Of course, the vendors that offer standalone DSP chips also do custom-chip development with select customers. These include Analog Devices, Infineon, Lucent, Motorola, Philips, and Texas Instruments, to name a few. At this point, Texas Instruments hasn't licensed its DSP cores, but it does do a lot of joint custom designs. The StarCore DSP engine developed by both Motorola and Lucent Technologies is another instance in which each company will work with customers to create custom solutions.

When integrating multiple cores on a single chip, the biggest challenge has to be getting them to interoperate and communicate with each other. The longer it takes to iron out problems, the less chance that the market window for the product will still be there.

"System Platforms" Available

To make it easier to hit the market, designers at VLSI Technology, along with some other companies, developed chips that are loosely referred to as "system platforms." They're usually highly integrated solutions that contain every possible function related to the target application: cell phones, MP3 audio, digital cameras, etc. Because the chip has all of the commonly used blocks and then some, it's a known working solution. This alleviates some of the configuration troubleshooting.

When pinning down a specific version of the chip to an application, designers would first "deconfigure" the chip. This meant removing logic functions that won't be needed and adding in any custom functions that couldn't be implemented with the logic pre-integrated on the platform chip. Palmchip has done something like this with its CoreFrame architecture. In it, there are predesigned generic solutions for the disk drive and other markets. The PalmPak SoC platform contains all of the peripheral functions typically needed: an ARM CPU, bus controller, memory-access controller, power-management logic, interrupt controller, programmable I/O lines, UART, and DMA controller.

As mentioned previously, VLSI Technology created several system-platform chips based on ARM CPU cores and targeted at the wireless communications market. Philips has taken that approach further with its NAPA architecture, which is designed to create or emulate a platform chip that typically contains both a CPU and a DSP core, along with a memory controller and various peripheral functions (Fig. 5).

The basic premise starts with a dual-processor architecture, in which each processor controls a peripheral bus through a bridge controller. The system can be partitioned further with additional bridge circuits that tie the processor blocks to the peripheral sections. Special "tunnel" circuits between blocks provide a high-speed buffer that allows pieces of the system to be separated from each other.
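The article doesn't detail how NAPA's tunnel circuits are built. Assuming, for illustration only, that a tunnel behaves like a bounded first-in, first-out buffer that stalls its sender when full, the decoupling it gives two subsystems can be modeled in a few lines of Python. The `Tunnel` class, its `depth` parameter, and the `push`/`pop` methods are hypothetical names, not part of any Philips specification.

```python
from collections import deque
from typing import Optional

class Tunnel:
    """Hypothetical model of a buffer between two subsystems: a bounded FIFO.

    The sending side sees backpressure (push returns False) when the buffer
    is full; the receiving side sees None when it is empty. Neither side
    needs to know how fast the other runs.
    """

    def __init__(self, depth: int):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        self.depth = depth
        self._fifo = deque()

    def push(self, word: int) -> bool:
        """Sender side: enqueue a word; return False (stall) if the buffer is full."""
        if len(self._fifo) >= self.depth:
            return False
        self._fifo.append(word)
        return True

    def pop(self) -> Optional[int]:
        """Receiver side: dequeue the oldest word, or None if the buffer is empty."""
        return self._fifo.popleft() if self._fifo else None

    def __len__(self) -> int:
        return len(self._fifo)

if __name__ == "__main__":
    t = Tunnel(depth=4)
    print([t.push(w) for w in range(5)])  # [True, True, True, True, False]
    print([t.pop() for _ in range(5)])    # [0, 1, 2, 3, None]
```

Because each side interacts only with the buffer, the two subsystems can sit on separate boards or run at different rates, which is the property the breadboarding goal relies on.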

That separation is necessary to realize the NAPA project's final goal: a breadboarding system built around a card cage, in which the system subsections reside on individual cards that plug into a rack/backplane. This permits probes and logic analyzers to be attached. The system functions (peripheral I/O, CPUs/memory, and custom functions) can be implemented on FPGA-filled cards that simply plug into the rack for quick prototyping. The prototype can then be deconfigured to create the final system-on-a-chip.

Suppliers Of RISC And DSP Cores

3DSP Inc.
(949) 260-0156
www.3dsp.com

Advanced RISC Machines Inc.
(408) 570-2200
www.arm.com

American Microsystems Inc.
(208) 233-4690
www.amis.com

Analog Devices Inc.
(800) 262-5643
www.analog.com

Arasan Chip Systems Inc.
(408) 985-9495
www.arasan.com

ARC Cores
(408) 360-2120
www.arccores.com

Atmel Corp.
(408) 441-0311
www.atmel.com

Billions of Operations Per Second (BOPS) Inc.
(650) 330-8407
www.bops.com

Cadence Design Systems Inc.
(408) 943-1234
www.cadence.com

Cypress Semiconductor Corp.
(408) 943-2600
www.cypress.com

Epson Electronics America Inc.
(408) 922-0200
www.eea.epson.com

Eureka Technology
(650) 960-3800
www.eurekatech.com

Faraday Technology Corp.
(408) 235-8888
www.faraday-usa.com

Fujitsu Microelectronics Inc.
(800) 866-8608
www.fujitsumicro.com

Hitachi Semiconductor
(408) 433-1990
www.hitachi.com/semiconductor

IBM Corp., Microelectronics Division
(800) 426-0181
www.ibm.com/microelectronics

Improv Systems Inc.
(978) 927-0555
www.improvsys.com

Infineon Technologies
(408) 501-6000
www.infineon.com

Infinite Technology Corp.
(972) 437-7800
www.infinite-tech.com

InSilicon Corp.
(408) 570-1000
www.insilicon.com

Integrated Silicon Systems Ltd.
(408) 441-1248
(44) 28 9050-4000
www.iss-dsp.com

Intel Corp.
(408) 765-8080
www.intel.com

Kawasaki LSI USA Inc.
(408) 570-0555
www.klsi.com

Lexra Inc.
(781) 899-5799
www.lexra.com

LSI Logic Corp.
(800) 574-4286
www.lsilogic.com

Massana Inc.
(408) 871-1415
www.massana.com

Mentor Graphics Corp./Inventra
(408) 436-1500
www.mentor.com

Metaflow Technologies Inc.
(858) 452-6608
www.metaflow.com

MIPS Technologies Inc.
(650) 567-5000
www.mips.com

Mitsubishi Electronics Inc.
(408) 730-5900
www.mitsubishi.com

Motorola Inc.
(512) 895-2000
www.digitaldna.motorola.com

National Semiconductor Corp.
(408) 721-5000
www.national.com

NEC Electronics America Inc.
(408) 588-6000
www.nec.com

OKI Semiconductor Corp.
(800) 832-6654
www.okisemi.com

Palmchip Corp.
(408) 487-8696
www.palmchip.com

Patriot Scientific Corp.
(858) 674-5000
www.ptsc.com

Philips Semiconductors
(408) 991-3622
www.philips.semiconductors.com

PicoTurbo Inc.
(408) 586-4720
www.picoturbo.com

Sand Microelectronics
(see InSilicon Corp.)

SIS Microelectronics Inc.
(561) 989-3213
www.sismicro.com

STMicroelectronics Inc.
(602) 485-6100
www.stmicroelectronics.com

Sun Microsystems Inc.
(408) 544-0417
www.sun.com

Tensilica Inc.
(408) 986-8919
www.tensilica.com

Texas Instruments Inc.
(214) 995-3333
www.ti.com

The DSP Group
(408) 986-4300
www.dspgroup.com

Toshiba America Electronic Components Inc.
(408) 526-2626
www.toshiba.com/taec

UTMC Microelectronic Systems
(719) 594-8035
www.utmc.com

VAutomation Inc.
(603) 882-2282
www.vautomation.com

Virtual Chips
(see InSilicon Corp.)

The Western Design Center
(480) 962-4545
www.westerndesigncenter.com

Zilog Inc.
(800) 662-6211
www.zilog.com

About the Author

Dave Bursky

Technologist

Dave Bursky is the founder of New Ideas in Communications, a publication website featuring the blog column Chipnastics – the Art and Science of Chip Design. He is also president of PRN Engineering, a technical writing and market consulting company. Before founding these organizations, he spent about a dozen years as a contributing editor to Chip Design magazine. Concurrently with his Chip Design work, he was the technical editorial manager at Maxim Integrated Products. Before joining Maxim, he spent more than 35 years as an engineer for the U.S. Army Electronics Command and as an editor with Electronic Design magazine.