Every engineer wants a microprocessor to fit an application like a glove. But nobody wants to design a micro for a specific application. Too expensive. Too much work. But today, engineers can get a tailored microprocessor without paying the design price. ASIC and system-on-a-chip (SoC) designers can configure or extend an existing CPU core to meet their application requirements.
Instead of a fixed RISC microprocessor, engineers can tailor a scalable core with add-on features. Some cores also are extensible. Developers can add new instructions, new functions, and new interfaces to them.
Adding a few key CPU resources or instructions can deliver huge dividends in performance, cost, and throughput efficiency, especially for embedded or telecom applications. One method for designers is to monitor their initial application code, find the expensive processing and inner loops, and then increase efficiency by adding special features, instructions, or coprocessor blocks.
These features can include complex math-processing blocks and complex multicycle instructions. These extensions all play under the umbrella of a standard microprocessor core complete with a software development chain that includes compiler, assembler, and debugger, as well as a full set of standard HDL-based, EDA tool chains.
Moreover, these configurable/extensible cores may play a major role in ASIC multiprocessing, especially for data-flow applications like telecom and video. These applications typically consist of "n" data streams that are processed in multiple process stages.
For example, in telecom, multiple lines are pumping in packets, which need stage-by-stage processing—identification, checking, and packet assembly/disassembly. This processing can be implemented as an array of processors.
On the X or horizontal axis, a line enters the array and passes through multiple processor stages. For multiple lines, these stages form processor layers in the Y dimension. Each Y layer consists of processors tailored to that stage's specific task. The resultant array provides staged processing for multiple lines, with each stage delivering high-efficiency processing.
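The staged-array idea can be sketched in plain C. This is a hypothetical illustration only — the stage functions and tags below are invented stand-ins for the tailored cores at each Y layer:

```c
#include <stdint.h>

/* Sketch of the processor array described above: each input line flows
   through a fixed sequence of processing stages.  Each stage function
   stands in for a processor layer tailored to that stage's task. */
enum { N_STAGES = 3 };

typedef uint32_t (*stage_fn)(uint32_t pkt);

static uint32_t identify(uint32_t pkt) { return pkt | 0x100; } /* tag packet type   */
static uint32_t check(uint32_t pkt)    { return pkt | 0x200; } /* mark checksum ok  */
static uint32_t assemble(uint32_t pkt) { return pkt | 0x400; } /* mark reassembled  */

static stage_fn stages[N_STAGES] = { identify, check, assemble };

/* Pass one packet through every stage in order, as a packet on one
   line would pass along one row of the processor array. */
uint32_t process_line(uint32_t pkt) {
    for (int s = 0; s < N_STAGES; s++)
        pkt = stages[s](pkt);
    return pkt;
}
```

In hardware, each stage would run on its own tailored core, with packets from many lines flowing through the stages concurrently.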
Today's configurable/extensible cores include ARC's Tangent A4 and A5, and Tensilica's Xtensa architectures.
Not Rocket Science
The first thing to understand is that configuring or extending a microprocessor core isn't rocket science. It's much easier than one would think. For one thing, the cores aren't super-sophisticated RISCs with long, complex pipelines and multi-issue superscalar execution. Rather, like ARC's Tangent and Tensilica's Xtensa, these cores are basic RISCs with relatively simple architectures and pipelines. They tend to:
- Have short pipelines, four or five stages;
- Be simple scalar machines of one instruction issue per clock;
- Be simple Harvard or von Neumann implementations;
- Have a core set of basic instructions;
- Provide a larger set of additional instructions to add;
- Have an expandable core register set;
- Provide lots of bit manipulation and branch instructions;
- Support I and D caches and memory options.
Today, most designers understand the basic RISC design methods and architectures. These soft cores fit right in. Most engineers will feel right at home adding defined features and instructions. Both ARC and Tensilica provide interactive architecture design environments that make it easy to add predefined features, like general registers or instructions. That can be done via pull-down menus.
Adding a simple instruction doesn't require a computer architect's expertise. For starters, the architectures are simple four- or five-stage pipelined RISCs. Most new instructions, which generally use existing datapath structures, can execute in a single "execute" cycle. Therefore, it's generally a matter of defining operations within a cycle.
The key to these configurable/extensible cores lies in two factors. First, because they're soft cores, all changes and additions can be compiled in to create a new expanded synthesizable processor core. Second, changes to the core architecture are accompanied by equivalent support in the software and hardware design tool chains. New registers and instructions are reflected in the core's assembler, compiler, and debugger. Tensilica, for instance, generates intrinsic functions to represent the additional instructions or instruction blocks added. Developers can use the "instruction" or "block" in early code to test it without waiting for the final silicon.
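This early-testing workflow can be illustrated in C. The intrinsic name `sat_add16` is invented for illustration — until the extended core (or its simulator) is available, the new "instruction" can be modeled as an ordinary function with the same signature the generated intrinsic would have:

```c
#include <stdint.h>

/* Hypothetical example: a developer adds a 16-bit saturating-add
   instruction to the core.  The tool chain would expose it to C code
   as an intrinsic; this portable model lets application code call the
   "instruction" and be tested before final silicon. */
static inline int16_t sat_add16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen to detect overflow   */
    if (sum > INT16_MAX) return INT16_MAX;   /* clamp on positive overflow */
    if (sum < INT16_MIN) return INT16_MIN;   /* clamp on negative overflow */
    return (int16_t)sum;
}
```

When the real core is generated, calls to the intrinsic compile to the single added instruction; the C model and the hardware should produce identical results.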
Both ARC and Xtensa target embedded applications. Both implement short 16-bit instruction forms for code compactness, supplementing their wider base ISAs (32-bit for ARC, 24-bit for Xtensa). They also implement built-in automatic loop control, with loop instructions that set up a loop count and the inner loop's boundaries (start, end).
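A fixed-count inner loop like the one below is the kind of kernel this loop hardware targets. The C source is ordinary; mapping it onto the loop instruction is the job of the core's compiler:

```c
#include <stdint.h>

/* With zero-overhead loop control, the hardware holds the loop count
   and the loop-end address, so the decrement/compare/branch at the
   bottom of each iteration costs no extra cycles.  This plain-C
   kernel is a typical candidate for that mapping. */
int32_t sum16(const int16_t *buf, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++)   /* candidate for hardware loop control */
        sum += buf[i];
    return sum;
}
```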
For instruction configurability and extensibility, both ISAs are structured to support instruction expansion. Each ISA implements multiple layers of instructions using multiple op-code fields (op code, sub-op code, etc.) in each instruction word. ARC implements one sub-op code field. Xtensa uses two.
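The layered op-code scheme can be sketched as bitfield extraction. The field positions and widths below are invented for illustration, not the actual ARC or Xtensa encodings:

```c
#include <stdint.h>

/* Illustrative sketch of layered op-code fields in a 32-bit
   instruction word.  Reserving a major opcode whose sub-opcode field
   selects among extension instructions is what lets these ISAs absorb
   new instructions without redesigning the decoder. */
#define OP_SHIFT    27u           /* top 5 bits: major opcode */
#define SUBOP_SHIFT 22u           /* next 5 bits: sub-opcode  */
#define FIELD_MASK  0x1Fu

static inline uint32_t major_op(uint32_t word) {
    return (word >> OP_SHIFT) & FIELD_MASK;
}

static inline uint32_t sub_op(uint32_t word) {
    return (word >> SUBOP_SHIFT) & FIELD_MASK;
}

/* Build a word from its fields -- the inverse of the decode above. */
static inline uint32_t encode(uint32_t op, uint32_t sub) {
    return ((op & FIELD_MASK) << OP_SHIFT) | ((sub & FIELD_MASK) << SUBOP_SHIFT);
}
```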
Playing The Language Card
But these cores are more than just tinker toys waiting for assembly. With the ARC and Tensilica cores, developers can add their own instructions, which can be complex. They can add multiple instructions, up to 256 for the ARC and up to 400 for the Xtensa architectures. Engineers and programmers have taken advantage of this to tailor processing according to their needs.
Both ARC and Tensilica provide the hooks for developers to define their own new instructions in HDL implementations—Verilog or VHDL. ARC favors a full language approach for its designers who get templates and aids to develop their own Verilog definitions. The developers write their own HDL code, then integrate it with the ARC core code for a full synthesis using standard EDA synthesis tools and tool chains.
Tensilica takes a more restrained language view. The company defines its own HDL, the Tensilica Instruction Extension Language (TIE). This Verilog-like language enables developers to define their own instructions in a text file, but the Tensilica tools integrate those descriptions with the full-core HDL for synthesis. Also, the tools create an intrinsic C function for each instruction so that developers can test the instructions in a code setting.
In general, HDL-based hardware design can be a problem for programmers who think of statements as being sequential, not parallel as in HDLs, and who are unfamiliar with datapath architectures. But adding an instruction, which generally uses existing datapath, is very doable for programmers. ARC and Tensilica report that programmers have easily adapted to creating instructions as needed.
New instructions can have CPU implications. Most added instructions fit easily in the pipeline as single-cycle executions. But an instruction that takes more than one cycle can disrupt the pipeline. Tensilica, for example, supports multicycle instructions that lock the pipeline until the new instruction completes.
Another potential problem is interrupts. Both ARC and Tensilica allow high-priority interrupts to preempt such long instructions, halting them to take the interrupt. On return, the instruction is resumed or restarted, depending on its state and the architecture.
Developers can also add function blocks, such as interfaces or special peripherals, that feed into or support a special register. They can define the function block in HDL and add the special register interface. In the running target, the code interfaces with the register (using special Load and Store instructions in the ARC) to access the function block or interface. The only interlock necessary is to ensure that when the CPU does an access to the register in an instruction, the register is available.
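The register interlock described above can be sketched in C. The register layout and busy bit are invented for illustration; on ARC, the accesses would use the special Load/Store instructions rather than the plain volatile accesses shown here:

```c
#include <stdint.h>

/* Sketch of driving an added function block through a special register
   pair.  The control register's busy bit provides the interlock: the
   CPU waits until the block can accept data before writing. */
#define FBLOCK_BUSY 0x1u

void fblock_write(volatile uint32_t *ctrl, volatile uint32_t *data,
                  uint32_t value) {
    while (*ctrl & FBLOCK_BUSY)
        ;                          /* spin until the register is available */
    *data = value;                 /* hand the value to the function block */
}
```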
Another valuable configurable/extensible feature is the addition of condition codes. ARC lets designers add condition codes and test off of them. For the ARC ISA, this is especially important, as all ALU instructions are conditional, predicated on an assigned condition code. Thus, designers can add an instruction and have it execute or not (predication) based on a new condition code. The logic behind the condition code, the logic conditions that set or reset the condition code, can be defined in HDL.
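Predication on a custom condition code can be modeled in C. The "in range" condition below is an invented example of the kind of flag logic a designer would define in HDL:

```c
#include <stdint.h>

/* Toy model of predicated execution: an ALU operation fires only if
   its assigned condition code is true, as in the ARC ISA.  In
   hardware, the condition logic would be designer-defined HDL and the
   predicate check would cost no branch. */
typedef struct { uint32_t acc; } cpu_t;

static int cc_in_range(uint32_t x) { return x >= 10 && x <= 100; }

/* Add-if-condition: executes only when the condition code is set,
   so the caller needs no explicit branch. */
void add_if(cpu_t *cpu, uint32_t x) {
    if (cc_in_range(x))        /* predicate, evaluated by hardware on ARC */
        cpu->acc += x;
}
```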
ARC was the first microprocessor core successfully designed to be configurable and extensible. It fields a simple four-stage pipelined RISC, tailored for low- to mid-range, low-power applications. Its ISA is orthogonal, designed for compact coding. For example, all ALU instructions are conditional: they execute based on whether a specified condition code is set. Moreover, designers can define their own special condition codes (up to 16).
ARC's soft core is configurable with a number of architectural resource add-ons, and extensible—developers can add new instructions. The RISC architecture is scalar with a 32-bit ALU and an expandable register file of 32 to 64 registers (Fig. 1). The CPU supports I and D caches, and on-chip memory. These are configurable resources, add-ons to the RISC core. The Load/Store unit contains a register scoreboard to track registers waiting to be written into for delayed Loads. The Load/Store unit can be configured to support multicycle extension instructions.
Engineers and programmers can add up to 128 16-bit instructions and 128 32-bit new EXTENDS instructions to the core's ISA (for the ARC5). They can also add up to 16 new condition codes, an Auxiliary Register Set (another general register file), configuration registers, a separate Memory Controller, a separate Load/Store unit, a separate Interrupt unit, Host debug resources (JTAG, TAP controller, trace support), and Power Management.
Power Management includes Sleep Mode (halts pipeline, disables RAM) and Clock gating (switches off all nonessential clocks when CPU is halted or in the Sleep Mode). ARC also defines an Auxiliary Data and I/O space with its own Load/Store instructions for an additional memory and peripherals.
Xtensa is a spiritual descendant of the pioneering MIPS RISC architecture. It's designed for compact, easy layout, and it uses the coprocessor model to add functionality to the CPU. Xtensa builds on a basic five-stage pipeline of fetch, decode, access, execute, and writeback (Fig. 2). Built to accommodate an MMU-managed memory with I and D caches, it supports a processor interface and a peripherals interface.
The soft core is both configurable and extensible. It's built around an ALU with 32 32-bit general registers and a five-stage pipeline. It uses a unique register window scheme that makes 16 of the general-purpose registers visible. That register file extends from 32 to 64 registers. The architecture supports both I and D caches, with an MMU-based memory, all configurable options to the basic RISC.
Other configurable options include another 111 defined instructions, 16-bit and 32-bit multipliers, an FPU, a Write Buffer (up to 32 entries), interrupts and exceptions, a processor interface (PIF) for memory, a local peripheral interface (XLIM), up to four 32-bit timer/counters, and on-chip debug features (JTAG port, comparators, trace facilities, etc.).
One major add-on block is the Vectra vector engine, a sophisticated 128-bit SIMD unit. It integrates four 16-bit multipliers, four 40-bit accumulators, four 40-bit (or eight 20-bit) ALUs, 16 160-bit vector registers, 16 16-bit scalar registers, and four 112-bit alignment registers. It can do four MACs or eight adds per cycle. At 200 MHz, Vectra does a 10-point complex FFT in 55.5 µs, or a 128-tap, 16-element FIR filter in 3.1 µs.
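A scalar C model conveys what one Vectra-style MAC step does per cycle. This is a behavioral sketch only — lane count matches the four multipliers above, and a 64-bit C integer stands in for each 40-bit accumulator:

```c
#include <stdint.h>

/* Behavioral model of one SIMD MAC step: four 16x16 multiplies
   accumulated into four wide accumulators, as the Vectra engine does
   in a single cycle. */
enum { LANES = 4 };

void mac4(int64_t acc[LANES], const int16_t a[LANES], const int16_t b[LANES]) {
    for (int i = 0; i < LANES; i++)
        acc[i] += (int32_t)a[i] * b[i];   /* one MAC per lane */
}
```

A 128-tap FIR filter would invoke such a step repeatedly, which is why per-cycle MAC count dominates the filter timings quoted above.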