Intel and Analog Devices' joint DSP design effort has produced a new, flexible instruction set architecture (ISA). Analog Devices' (ADI's) BlackFin implementation delivers a 300-MHz, 16-bit DSP that supports dual-MAC execution and low-power operation.
This is the latest of the fourth-generation DSPs that have emerged to power today's networked, Internet-driven applications. Its competitors include ADI's TigerSHARC, Agere/Motorola's StarCore, and Texas Instruments' C6x. The new Micro Signal Architecture (MSA) will give them a run for their money. The 16-bitter can scale to 1 GHz and beyond.
A 16-bit, dual-MAC DSP architecture, MSA builds on ADI's high-performance VLIW and SIMD DSP architectures, and on Intel's memory management, power management, performance monitoring, and SIMD extensions. The resultant DSP implements dynamic power reduction, a memory management unit (MMU), and performance monitoring.
MSA delivers an innovative multi-instruction ISA that supports high-density 16-bit instructions, 32-bit immediate instructions, and 64-bit DSP (packed) instructions. It can execute two 16-bit MACs, two 32/40-bit arithmetic operations, a 32/40-bit shift or rotate, or four 8-bit video instructions per pipelined cycle.
This fourth-generation DSP targets high-performance, midrange 16-bit DSP applications. It packs enough memory on-chip—308 kbytes—for many tasks. In addition, the MSA supports low-power operation for portables and Internet appliances. Under software control, the core voltage and clock rates can be varied to cut power.
A balanced architecture, MSA supports both high code density and a simplified ISA. The keys to MSA's flexible instruction design are:
- Load/store architecture: work from registers
- 16-bit basic instruction: high code density
- Extended 32-bit instruction: large immediates
- Combined instructions: multi-issue instructions
The DSP was designed around 16-bit instructions for high code density, and most control instructions are 16-bitters. But for operations that need larger immediate values or more fields, the ISA was extended to a 32-bit instruction.
For DSP operations, the designers added a 64-bit multi-issue instruction. A composite, this instruction is made up of two 16-bit instructions and a 32-bit instruction. This combination can specify complex DSP operations with two data loads, but it only takes one instruction fetch. Even better, it can make use of the same decode logic already implemented for the standard 16-bit and 32-bit instructions. The decoder takes in a 64-bit-wide chunk and can issue one, two, or three instructions per cycle.
For speed, the DSP is pipelined with eight stages. Two stages execute the dual MACs that feed into dual 40-bit accumulators. The pipeline can start a dual-MAC instruction every cycle, for an effective throughput of two MACs per cycle.
This DSP core breaks down into separate addressing and execution sections. The addressing section incorporates dual data address generators (DAGs), supported by a pointer register file of eight 32-bit registers and an addressing register file. The latter has four entries. Each entry contains a set of four 32-bit registers for indexing, modification, length, and base address. These four entries support four addressing contexts, minimizing interrupt context saves. The execution section consists of two 16- by 16-bit multipliers, two 32/40-bit ALUs, quad 8-bit video ALUs, a 40-bit barrel shifter, and dual 40-bit accumulators.
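The role of the index, modification, length, and base registers can be illustrated with a small model of circular (modulo) addressing. This is a minimal sketch, not the DAG hardware's documented behavior: the class `AddressingContext`, the function `post_modify`, the wraparound rule, and the convention that a length of 0 means linear addressing are all assumptions made for illustration.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


@dataclass
class AddressingContext:
    """Hypothetical model of one entry in the addressing register file."""
    index: int   # current address
    modify: int  # signed post-modify step, in bytes
    length: int  # buffer length in bytes (0 = linear addressing; an assumption)
    base: int    # start address of the circular buffer


def post_modify(ctx: AddressingContext) -> int:
    """Return the current address, then advance the index with wraparound."""
    addr = ctx.index
    nxt = ctx.index + ctx.modify
    if ctx.length:
        if nxt >= ctx.base + ctx.length:
            nxt -= ctx.length   # stepped past the end: wrap to the start
        elif nxt < ctx.base:
            nxt += ctx.length   # stepped before the base: wrap to the end
    ctx.index = nxt & MASK32
    return addr


# Walk a 16-byte buffer in 4-byte steps; the fifth access wraps to the base.
ctx = AddressingContext(index=0x1000, modify=4, length=16, base=0x1000)
addresses = [post_modify(ctx) for _ in range(6)]
```

Keeping four such contexts in hardware means an interrupt handler can use its own circular buffer without first saving and restoring the interrupted loop's addressing state.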
This is a load/store architecture. The next set of operands for the dual-MAC operations is fetched as two 32-bit words from the L1 memory (D cache, scratchpad RAM) and loaded into 32-bit data registers. These furnish the next X and Y values to the DSP execution units on the next cycle. For dual MACs, the 16-bit operands are grouped in 32-bit sets: two X and two Y 16-bit values.
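A dual-MAC step on packed operands can be modeled in a few lines. The sketch below assumes that the low halves of the X and Y words feed one accumulator and the high halves feed the other, and that the 40-bit accumulators saturate on overflow; the article does not specify either detail, and all function names are hypothetical.

```python
ACC_MIN = -(1 << 39)      # most negative value of a signed 40-bit accumulator
ACC_MAX = (1 << 39) - 1   # most positive value


def to_s16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack(hi: int, lo: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack(word: int) -> tuple[int, int]:
    """Split a 32-bit word into signed (high, low) 16-bit halves."""
    return to_s16(word >> 16), to_s16(word)


def saturate40(value: int) -> int:
    """Clamp to the signed 40-bit range (saturation is an assumption)."""
    return max(ACC_MIN, min(ACC_MAX, value))


def dual_mac(acc0: int, acc1: int, x_word: int, y_word: int) -> tuple[int, int]:
    """One dual-MAC step: two 16x16 multiplies, two accumulations."""
    x_hi, x_lo = unpack(x_word)
    y_hi, y_lo = unpack(y_word)
    return saturate40(acc0 + x_lo * y_lo), saturate40(acc1 + x_hi * y_hi)


def dot_product(x_words: list[int], y_words: list[int]) -> int:
    """Dot product over packed vectors, consuming two samples per step."""
    acc0 = acc1 = 0
    for x_word, y_word in zip(x_words, y_words):
        acc0, acc1 = dual_mac(acc0, acc1, x_word, y_word)
    return acc0 + acc1


# [1, 2, 3, 4] . [5, 6, 7, 8], two samples per 32-bit word
result = dot_product([pack(1, 2), pack(3, 4)], [pack(5, 6), pack(7, 8)])
```

Each loop iteration corresponds to one pair of 32-bit loads and one dual-MAC issue, which is why a four-sample dot product takes two steps here.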
Also, for higher processing bandwidth, the hardware performs SIMD operations, i.e., the same operation passed through the four 8-bit video ALUs. This tactic speeds up video pixel processing by a factor of four. The ALUs also accomplish dual 16-bit ALU or 32-bit ALU operations and shifts.
On-chip memory has two levels or stages. Level one interfaces with the CPU. It has a 16-kbyte instruction cache, 32-kbyte data cache, and 4-kbyte scratchpad SRAM. These memories have a two-cycle access. They can load two 32-bit data words and one instruction to the core per clock cycle. The second level of larger SRAM functions as a unified memory (I and D). The L1 caches can be configured as SRAM, or mixed cache and SRAM. They also support cache locking.
To speed accesses, the hardware supports relaxed ordering between loads and stores. Loads can take precedence. Also, there are two write queues: one from the CPU to L1 memory and one from L1 memory to the system interface. Addressing is byte and word level.
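The fourfold gain comes from applying one operation to four pixels packed in a single 32-bit word. The article does not name the individual video operations, so the sketch below uses an unsigned saturating add purely as an illustration; the function name `quad_add_sat_u8` is hypothetical.

```python
def quad_add_sat_u8(a: int, b: int) -> int:
    """Add four packed unsigned 8-bit lanes, clamping each lane at 255."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= min(lane_sum, 0xFF) << shift  # saturate instead of wrapping
    return result


# Lanes (high to low): 0x10+0x01, 0xFF+0x01, 0x80+0x80, 0x01+0x02
brightened = quad_add_sat_u8(0x10FF8001, 0x01018002)
```

In hardware the four lanes are processed in the same cycle; the Python loop only models the per-lane arithmetic.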
Designed for C/C++ coding, the ISA supports two software stacks (user, system), held in the scratchpad RAM for fast access. Plus, unlike many DSPs, the MSA supports I/D MMUs for memory protection. It supports emulation, system, and user execution modes. For coding simplicity, the assembler implements an algebraic notation.
The ISA implements a Conditional Register Move to help eliminate branches. The instructions include word/bit manipulation, and Shift and Rotate. Also, the condition codes can be offloaded and restored. For DSP processing, the ISA supports zero-overhead looping with a Zero-Overhead-Loop instruction that sets the loop limits for inner-loop processing. For multiprocessing, the ISA supplies a Test-and-Set instruction, which accesses a byte and tests it; if the byte is 0, the instruction sets its MSB. In either case it returns the byte.
Unlike some DSPs, all MSA instructions are interruptible. This is part of the design strategy to minimize the CISC-like features of classic DSP architectures, and to implement a RISC-like design to minimize design complexity. Because the architecture is pipelined, however, an interrupt will introduce a bubble in the pipeline that takes eight cycles to fill.
The core interrupt controller maintains an interrupt vector table (or Event Vector Table) with an entry for each interrupt or exception. Each of the 16 entries holds the address of the interrupt service routine (ISR) for that event. The events include Emulation, Reset, NMI, Exceptions, Hardware Error, Timer, and eight general interrupts.
In this DSP, no special set of "shadow" registers exists to minimize interrupt overhead for DSP processing. If an interrupt may lead to inner-loop processing, the DSP context must be saved to let the CPU reinitiate the current processing loop. But there's some backup for loop control (one level) and for loop addressing (three levels). A complete context switch (saving all of the registers) takes about 100 cycles:
- Save/restore 44 registers: 88 cycles
- Return: three cycles
- Pipeline abort: eight cycles
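The tally above can be checked with simple arithmetic. In this sketch, the two cycles per register (one to save, one to restore) are inferred from 88 cycles for 44 registers, and the 300-MHz clock used for the time conversion is the figure quoted for the BlackFin implementation; the function names are hypothetical.

```python
def context_switch_cycles(registers: int = 44,
                          cycles_per_register: int = 2,
                          return_cycles: int = 3,
                          pipeline_abort_cycles: int = 8) -> int:
    """Sum the three components of a full context switch."""
    return (registers * cycles_per_register
            + return_cycles
            + pipeline_abort_cycles)


def cycles_to_microseconds(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


total_cycles = context_switch_cycles()                    # 88 + 3 + 8
total_us = cycles_to_microseconds(total_cycles, 300.0)    # at 300 MHz
```

The components add up to 99 cycles, which matches the "about 100 cycles" figure and corresponds to roughly a third of a microsecond at 300 MHz.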
Core interrupt latency can run from 15 to 100 cycles, depending on how many registers are saved. That's a bare-bones number. For an RTOS that also handles task bookkeeping, timer overhead, and other maintenance chores, latency can run to 256 cycles.
This is the first DSP to have special hardware-performance registers. With them, the DSP can monitor specific conditions, such as cache misses or function activation. There are three hardware-performance register/counters. Software addresses them as memory locations, alongside the other memory-mapped registers (MMRs).
For debugging, the chip uses a JTAG background debug port, supported by eight hardware breakpoints (six instruction, two data) and three frequency breakpoints (breaks on clock frequency). The hardware tracks the last 16 branches in an Execution Trace Buffer, which can generate an interrupt when it fills, enabling software to track program traces.
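The trace mechanism can be pictured as a fixed-depth buffer that signals software when it fills. The sketch below is an illustrative model only: the class `ExecutionTraceBuffer`, the choice to record (source, target) address pairs, and the idea that the handler drains the buffer are assumptions, and the callback stands in for the interrupt.

```python
from collections import deque
from typing import Callable, Optional

Branch = tuple[int, int]  # (source address, target address); an assumed format


class ExecutionTraceBuffer:
    """Hypothetical model of a 16-entry branch trace buffer."""

    def __init__(self, depth: int = 16,
                 on_full: Optional[Callable[[list[Branch]], None]] = None):
        self.depth = depth
        self.entries: deque[Branch] = deque(maxlen=depth)
        self.on_full = on_full  # stands in for the buffer-full interrupt

    def record_branch(self, source: int, target: int) -> None:
        """Log one taken branch; hand the contents to software when full."""
        self.entries.append((source, target))
        if len(self.entries) == self.depth and self.on_full:
            self.on_full(list(self.entries))
            self.entries.clear()  # assume the handler drains the buffer


trace_log: list[Branch] = []
etb = ExecutionTraceBuffer(on_full=trace_log.extend)
for i in range(20):
    etb.record_branch(i, i + 100)
```

Draining the buffer each time it fills lets software assemble a program trace longer than the 16 branches the hardware holds at once.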
For low power, the DSP implements chip power management. It incorporates standard power-down modes. Even better, it implements dynamic power management by software control of the DSP core's voltage and chip frequency. The DSP requires a supplemental support chip to control the core voltage.
To cut dissipation, the software can drop the chip clock frequency through the on-chip PLL. The software can also lower chip voltage from 1.5 V to 0.9 V. But lower voltages require lower chip frequencies. At 0.9 V, the frequency must be dropped to 100 MHz. Thus, the software must lower the clock frequency first, and the voltage next. Power falls as the square of the voltage (P = V²/R). Combined, the two reductions produce a roughly tenfold drop in dissipation. This is a millisecond-level control, as it takes about 1000 cycles to change frequency.
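The size of the saving can be estimated with a first-order model. The sketch below assumes that dynamic power scales with the square of the voltage times the clock frequency, a common CMOS approximation that the article does not state, and it takes 1.5 V and 300 MHz as the full-speed reference point. Leakage and other fixed losses are ignored, and the function name is hypothetical.

```python
def relative_power(voltage: float, freq_mhz: float,
                   ref_voltage: float = 1.5,
                   ref_freq_mhz: float = 300.0) -> float:
    """Dynamic power relative to the reference point, assuming P ~ V^2 * f."""
    return (voltage / ref_voltage) ** 2 * (freq_mhz / ref_freq_mhz)


low_power = relative_power(0.9, 100.0)   # fraction of full-speed power
reduction = 1.0 / low_power              # how many times lower
```

Under these assumptions the low-power point draws about 12% of full-speed dynamic power, a reduction of a little over eight times, which is in line with the roughly tenfold figure quoted above.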