Handsets employing 3G wireless, and beyond, are expected to have a large number of new features and functions, such as MPEG-4 video and audio. These compute-intensive tasks place major demands on digital signal processors (DSPs)—the IC workhorse of conventional wireless and mobile communications designs. Requirements include higher performance, considerably lower power consumption, small silicon area, low cost, architecture flexibility, and faster time-to-market.
Facing these demands, designers can easily conclude that power-hungry DSPs with conventional instruction-based architectures don't suit next-generation mobile and wireless handsets. This calls for a paradigm shift to a new class of IC—the adaptive-computing machine (ACM).
To illustrate the basic design tradeoffs and deficiencies a DSP poses, consider the baseband processor of a wireless handset. The baseband processor comprises a general-purpose RISC microprocessor, one or more DSPs, and one or more ASICs. To help minimize design cost (and the cost of finished silicon), it's prudent to execute as much functionality as possible on the DSP. But there's never as much DSP processing power as needed. This lack of power will become more problematic for the growing computational demands of future 3G and 4G handsets.
Consequently, design decisions are made about which portions of the design execute best in the RISC microprocessor, DSP, and ASIC portions of the system. The goal is to perform the least amount of the design in an ASIC because of its long, difficult, and costly design, test, and fabrication process. If errors happen in the implementation, or in the original algorithm, an ASIC provides no ability to correct problems in the finished design without significant delays for redesigning, retesting, and processing new silicon.
Additionally, errors occurring early in the design process significantly affect system performance, IC cost, total development cost, and ultimately, time-to-market. Fortunately, an ASIC supplies more processing power than a DSP for the specific algorithm implemented. An ASIC also is more efficient in terms of power consumption than a DSP. So, allocating tasks to an ASIC can prolong battery life.
There are DSP design considerations too. The goal is to load the DSP with as much computation as possible. But during the design-partitioning stages, a designer can only estimate the computational requirements (MIPS) each algorithmic element will demand from the DSP. The exact level of MIPS is unknown until the C- or assembly-language programs are written.
Limited Choices Available: Two design choices exist if the estimate is low, or if the DSP doesn't have the predicted processing power. One is to include a second DSP, and the other is to assign more of the design to the ASIC. Both options lead to an IC redesign. Furthermore, a DSP lacks the power or flexibility to perform certain algorithms effectively, such as motion estimation, the discrete cosine transform (DCT), and Viterbi decoding. These are usually assigned to an ASIC.
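The partitioning problem can be sketched as a simple budget check. The task names, the MIPS figures, and the DSP_BUDGET_MIPS value below are hypothetical placeholders, not measurements from a real handset, and the greedy rule is only one illustrative way to decide what spills over to the ASIC.

```python
# Hypothetical MIPS estimates per algorithm. Real figures are unknown
# until the C- or assembly-language programs have been written.
DSP_BUDGET_MIPS = 100  # assumed capacity of a single DSP

estimates = {
    "vocoder": 40,
    "channel_equalizer": 35,
    "viterbi_decoder": 45,
}

def partition(estimates, budget):
    """Greedily keep the cheapest tasks on the DSP; spill the rest to the ASIC."""
    on_dsp, on_asic, used = [], [], 0
    for name, mips in sorted(estimates.items(), key=lambda kv: kv[1]):
        if used + mips <= budget:
            on_dsp.append(name)
            used += mips
        else:
            on_asic.append(name)
    return on_dsp, on_asic, used

on_dsp, on_asic, used = partition(estimates, DSP_BUDGET_MIPS)
print(on_dsp, on_asic, used)
```

If any estimate later proves too low, the same check has to be rerun, and a task that no longer fits forces one of the two redesign options described above.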
Although DSPs are becoming more powerful, frequently they still can't keep pace with the computing demands placed on them. In many cases, ASICs may be the only choice available to implement most 3G air interface requirements. However, many 3G techniques, like adaptive encoding and modulation schemes, aren't well suited for ASIC implementations.
These techniques don't constitute a single algorithm, which would make for a compact ASIC design. Instead, they involve a wide range of algorithms, selectable according to the type and volume of transmitted traffic and the channel conditions existing at each instant in time. Supporting these multiple algorithms dramatically increases the amount of ASIC silicon required.
The adaptive-computing machine referred to earlier introduces a more efficient way to implement 3G (and beyond) wireless and communications designs. Its architecture uses CMOS silicon more efficiently to extend computing performance beyond conventional methods. In adaptive computing, algorithms are mapped directly onto dynamic hardware resources. This technique gives the designer the most efficient use of hardware in terms of cost, silicon area, performance, and power consumption.
In an adaptive-computing chip, the vast majority of gates are employed to solve the computational problem. The control circuitry consumes a minority of the chip's total gates. In contrast, a DSP uses about 5% to 10% of its gates for actual tasks, while the other 90% to 95% go toward the control overhead of decoding and managing instructions.
Basically, the adaptive-computing chip adapts its architecture on the fly, thousands of times per second. This lets a wireless handset perform different functions and execute a variety of protocols (Fig. 1). The adaptive architecture brings into existence the exact hardware implementation an algorithm requires, for as long or short a time period as required, clock cycle by clock cycle if necessary. This approach yields a 10- to 100-times performance increase over a DSP, with only one-half to one-tenth the power consumption.
The ability of the adaptive-computing chip to both spatially and temporally segment (SATS) itself further distinguishes adaptive computing from a DSP. SATS is the process of adapting dynamic hardware resources to rapidly perform various portions of an algorithm in different segments of time (temporal) and in different locations (spatial) on the fabric of the adaptive-computing chip.
Adaptive computing accomplishes the same amount of processing as a DSP (or more) in less time, and it consumes considerably less power because the silicon is used for a shorter time. This equates to higher performance, as the following example illustrates. Pseudocode for a programming language loop construct (PLLC) is used in this instance with the instructions: piReadAddr (generate next address), piRead16at (read from input), and add16 (accumulate input).
A DSP relies on serial operations to execute the PLLC (xaddr=, x=, y=, and so on), working through the loop body until the sequential order goes back to the top of the loop. Each instruction is repeated until the loop condition is met. For instance, if a design calls for this dataflow to loop three times, the result would be nine sequential operations and nine clock cycles (Fig. 2).
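A minimal sketch of the serial loop follows. The three Python functions are stand-ins named after the pseudocode instructions. Their bodies, the 16-bit wraparound, and the one-operation-per-clock-cycle counting are assumptions made for illustration.

```python
def pi_read_addr(i):
    """Stand-in for piReadAddr: generate the next input address."""
    return i

def pi_read16_at(samples, addr):
    """Stand-in for piRead16at: read one 16-bit value from the input."""
    return samples[addr] & 0xFFFF

def add16(acc, x):
    """Stand-in for add16: accumulate with 16-bit wraparound (assumed)."""
    return (acc + x) & 0xFFFF

def run_sequential(samples):
    """Execute the loop body strictly in order, counting one op per step."""
    y, ops = 0, 0
    for i in range(len(samples)):
        xaddr = pi_read_addr(i)
        ops += 1
        x = pi_read16_at(samples, xaddr)
        ops += 1
        y = add16(y, x)
        ops += 1
    return y, ops

total, ops = run_sequential([10, 20, 30])
print(total, ops)  # prints: 60 9
```

Three passes through a three-instruction body give the nine sequential operations cited above.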
Adaptive computing reduces these operations by overlapping the iterations. Three iterations of the loop are accomplished in five clock cycles, slicing off four clock cycles from the DSP sequential operations (Fig. 3). While this particular savings may appear minor, these loops operate many times through numerous iterations. So if this particular loop operated 100 times, a DSP would incur 300 sequential operations, whereas adaptive computing would perform 70 to 80.
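The overlap can be sketched as a three-stage schedule in which a new iteration starts every clock cycle. This is an assumed model for illustration (one stage per instruction, one cycle per stage), not a description of the actual hardware mapping.

```python
STAGES = ["piReadAddr", "piRead16at", "add16"]

def overlapped_schedule(iterations, stages=STAGES):
    """Return, per clock cycle, the (iteration, stage) pairs that are active.

    Iteration i enters at cycle i, so its stage s runs at cycle i + s.
    """
    n_cycles = iterations + len(stages) - 1 if iterations else 0
    schedule = [[] for _ in range(n_cycles)]
    for i in range(iterations):
        for s, name in enumerate(stages):
            schedule[i + s].append((i, name))
    return schedule

sched = overlapped_schedule(3)
for cycle, active in enumerate(sched, start=1):
    print(cycle, active)

sequential_cycles = 3 * len(STAGES)
print(len(sched), sequential_cycles)  # prints: 5 9
```

Under this model, three iterations finish in five cycles instead of nine. The model captures only this simple overlap. It does not attempt to reproduce the 70-to-80 figure quoted for 100 iterations, which would require parallelism beyond starting one iteration per cycle.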
Design Flow: The DSP- and adaptive-computing-based design flows have some common elements and several fundamental differences. The generic design flow is applicable to any portion of an overall wireless system, like a vocoder subsystem. System decomposition, reference-system design written in C/C++, and the first simulation framework are similar design steps for both a DSP- and an adaptive-computing-based design (Fig. 4).
In the DSP-based design, the initial and highly accurate floating-point system is first converted to a less-accurate, 16-bit fixed-point implementation. The precision of floating point usually isn't necessary. In most cases, 16-bit fixed-point operation is adequate for signal processing. The cost, size, and power consumption of a fixed-point DSP are all considerably less than a floating-point equivalent. Inevitably, due to the inherent reduced precision of fixed-point math, portions of the 16-bit DSP design must be converted to 32-bit fixed-point operation (double precision) to increase precision (Fig. 4, again).
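To see why some portions end up in double precision, consider a dot product computed two ways. The sketch below assumes the common Q15 format (a signed 16-bit value with 15 fractional bits), which the text does not name, and the input values are arbitrary. Python integers don't overflow, so accumulator overflow is not modeled.

```python
def to_q15(x):
    """Quantize a float in [-1.0, 1.0) to a signed 16-bit Q15 integer, saturating."""
    return max(-32768, min(32767, int(round(x * 32768))))

def dot_narrow(a, b):
    """Truncate every product back to 16-bit Q15 before accumulating."""
    acc = 0
    for x, y in zip(a, b):
        acc += (x * y) >> 15
    return acc / 32768

def dot_wide(a, b):
    """Accumulate full-width Q30 products and scale once at the end."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc / (1 << 30)

a = [to_q15(0.01)] * 100
b = [to_q15(0.01)] * 100
exact = 0.01 * 0.01 * 100

err_narrow = abs(exact - dot_narrow(a, b))
err_wide = abs(exact - dot_wide(a, b))
print(err_narrow, err_wide)
```

For these inputs the 16-bit path loses part of every product to truncation, and the loss adds up over 100 terms. The wider accumulator is limited mainly by the quantization of the inputs themselves.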
The problem is that 32-bit operations run much slower on the 16-bit machine. For example, a two-number multiply in 16-bit fixed-point math takes three instructions on a 16-bit machine. However, the same multiply with 32-bit precision—on a 16-bit machine—takes as many as nine instructions. Also at this design stage, other techniques such as shifting are applied to maintain the necessary precision. In effect, these adjustments aimed at retaining greater precision are more of an art than a science. The result is usually a 16/32-bit hybrid solution, which undergoes a second simulation in the design flow.
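The extra work in a 32-bit multiply on a 16-bit machine comes from splitting each operand into halves. The sketch below is one textbook decomposition for unsigned values, offered as an assumption about how such a routine might look. It makes no claim about the instruction counts of any particular DSP.

```python
MASK16 = 0xFFFF

def mul32_via_16(a, b):
    """Multiply two unsigned 32-bit values using only 16x16-bit multiplies."""
    a_hi, a_lo = (a >> 16) & MASK16, a & MASK16
    b_hi, b_lo = (b >> 16) & MASK16, b & MASK16

    # Four partial products replace the single multiply of the 16-bit case.
    lo_lo = a_lo * b_lo
    hi_lo = a_hi * b_lo
    lo_hi = a_lo * b_hi
    hi_hi = a_hi * b_hi

    # Shift each partial product into position and sum.
    return (hi_hi << 32) + ((hi_lo + lo_hi) << 16) + lo_lo

a, b = 0x12345678, 0x9ABCDEF0
product = mul32_via_16(a, b)
print(product == a * b)  # prints: True
```

One multiply becomes four, plus the shifts and additions needed to combine them, which is why the wider operation costs several times as many instructions.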
Resource mapping comes next. Ideally, all clock cycles on every processing unit are used for DSP operations. Anywhere from one to eight compute units are simultaneously available in a DSP for this purpose.
Multiply-accumulate (MAC) in one clock cycle is the most often used operation. During this design step, the bus system poses a bottleneck in many instances. The issue is that all compute units need to store many operands and results in one clock cycle, over one bus, into one memory. Often this data connectivity cannot be achieved with a DSP. The consequence of this bottleneck is unused compute units, or "dead" silicon, which continues to dissipate power.
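A rough model shows how a shared bus leaves compute units idle. The figures here are assumptions for illustration: each MAC needs two operands per cycle, and the bus delivers four words per cycle. They are not taken from any specific DSP.

```python
def busy_units(compute_units, bus_words_per_cycle, operands_per_mac=2):
    """Number of MAC units that the bus can feed in each clock cycle."""
    return min(compute_units, bus_words_per_cycle // operands_per_mac)

idle = {}
for units in (1, 2, 4, 8):
    busy = busy_units(units, bus_words_per_cycle=4)
    idle[units] = units - busy
    print(units, busy, idle[units])
```

With these numbers, adding compute units beyond two buys nothing: in an eight-unit configuration, six units sit idle every cycle while still drawing power.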
At the third simulation stage, a number of key determinations are made to ensure that the design has the targeted performance, buses are available, all computing units are fully used, and so forth. Usually, some operations remain too slow because C language doesn't always map efficiently to a DSP's architecture. This is where assembly-language optimization comes in. C-language instructions are rewritten in assembly language so that the design can map more closely to the DSP architecture.
Differing Design Flows: The design flows for a DSP and an adaptive-computing machine begin to differ at the 16-bit fixed-point conversion. At this stage, adaptive computing starts to pay major design benefits. Bit accuracy becomes a moot point because adaptive computing isn't a fixed 16-bit machine that must simulate 32-bit instructions. On the contrary, it is a variable-precision machine that reduces the limitations on bit widths. A design doesn't have to be 16 bits. It can be 1, 8, 16, 24, or 32 bits or larger, offering considerable flexibility over the highly restrictive DSP 16-bit fixed-point conversion.
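The effect of choosing the bit width per operation can be illustrated by quantizing one value at several widths. The quantize helper and the sample value are hypothetical. The sketch models only the numerical precision, not the hardware that would implement each width.

```python
def quantize(x, bits):
    """Round x in [-1.0, 1.0) to a signed fixed-point value with `bits` total bits."""
    scale = 1 << (bits - 1)
    q = int(round(x * scale))
    q = max(-scale, min(scale - 1, q))  # saturate to the representable range
    return q / scale

x = 0.3
errors = {bits: abs(x - quantize(x, bits)) for bits in (8, 16, 24, 32)}
for bits, err in errors.items():
    print(bits, err)
```

Each added bit halves the worst-case rounding error, so a designer can pick the narrowest width that meets the accuracy target for each part of the algorithm.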
If certain operations require higher precision, that particular information will be specified in the high-level language (HLL), and no performance penalty is associated with requesting the higher bit precision. Also, the cost of going from 16 to 32 bits is much lower than with a DSP. For example, with an ACM, structures handling either 16- or 32-bit multiplications can be created. Each structure is equally efficient with virtually no design penalties. The program performs at the same speed whether it's a 32- or 16-bit execution.
The difference is that more computational resources are used in the ACM, but not more time. This form of conversion brings smoother simulation. Even if errors occur in the conversion process, simulation remains simpler than with a DSP because increasing precision doesn't generally affect the time necessary to execute an algorithm. By using adaptive computing at the resource-mapping stage of a design, the designer has ample resources and the resulting performance, as these resources operate simultaneously and asynchronously.
Inherently, a DSP executes one instruction at a time and a series of sequential instructions to perform a given function. On the other hand, adaptive computing performs the same function considerably faster by executing many groups of HLL operations as a single algorithmic block.
Further, the interconnect structure between resources on an adaptive-computing chip is nonblocking, as compared to a DSP where bus contention issues are a major performance concern. This means that bandwidth is available whenever two resources (nodes) must connect, thereby contributing to higher performance.