Leverage Superscalar DSPs In Digital Communications Systems

Next-generation digital communications designs are crying out for higher throughput, lower system cost, and the flexibility needed to handle multiple, changing standards and features. But these requirements exceed what traditional uniscalar digital-signal processors (DSPs) can deliver. To understand why, it helps to take a detailed look at the design objectives for these next-generation end systems.

System cost: Reducing this factor, which has many components, is a universal objective of any OEM. The benchmark used to measure it is the "minimum cost per channel" required to implement the necessary DSP functions. Factors to take into consideration include the DSP devices, external memory, and support logic, as well as the cost of power supplies and cooling. To reduce system cost, DSPs must be able to perform multiple functions for multiple channels. This "multichannel processing" requires very high DSP throughput, but must not adversely affect the power dissipation and memory required to implement DSP functions.

Size and capacity: Infrastructure equipment is frequently limited by the size of an existing chassis or hardware footprint. The need for additional capacity and features, however, continues to rise. This impacts the system design in terms of DSP package size, number of channels per DSP, number of DSPs per board, support devices required, power, and thermal dissipation. Package size, device count, power, and thermal dissipation must all be kept to a minimum, while the number of channels handled by each DSP must be as high as possible.

Flexibility: Enter the dynamic world of digital communications. Features, standards, and requirements shift constantly. But, changes to the system should not require the customer—the service provider—to keep investing in new hardware. Thus, system designers must build hardware that can meet the current and future demands of communications systems—hardware that can be configured on-the-fly in the field or the factory. This impacts the DSP system by requiring programmable, RAM-based devices that have plenty of horsepower for new functions.

Development cost and time: Possibly the most difficult design objective to measure is the cost of developing a DSP system that includes both hardware and software components. The task of the DSP system may include speech coding, error correction, modem functions, and/or network functions. These must be implemented in software for execution on a programmable DSP. Even when a third party or the DSP vendor supplies some of the software, the OEM is usually responsible for any additional features, tying the software modules together for the particular system implementation, and for testing system performance. This is the most time-consuming development task for a multichannel DSP system. And, risk is directly related to the length of this task. Next-generation DSPs must address this time-to-market issue with advanced development tools and high-level-language (HLL) compatibility.

To summarize, the overall objective of the DSP system designer in digital communications is to lower the cost per channel, increase the capacity, and retain or reduce the system's size and power dissipation. At the same time, the designer has to make the system flexible enough to accommodate future needs, and minimize development time and cost. These objectives drive the architecture design requirements of next-generation DSPs.

DSPs: Can They Hack It?

In the last year or two, several new, more efficient DSP architectures have appeared that borrow design techniques and architectures from the realm of high-performance microprocessors. These new designs employ aspects of superscalar microprocessors. Plus, they're reprogrammable, can handle multiple functions in less space, and deliver greater reliability for less cost and lower power than previous-generation solutions. But, they vary in the way they address system cost and development time. Examples of these next-generation DSPs include the RISC-superscalar (RISC-SS) and very-long-instruction-word (VLIW) architectures, each of which can deliver about 10 times the performance of many traditional DSPs.

To sustain a high throughput, which is a combination of architectural efficiency and processor speed, VLIW architectures employ multiple execution units that operate in parallel (Fig. 1). Instructions that direct the action of each execution unit are combined together for execution in the same cycle, forming a "very-long instruction word." This is very similar to the way the old bit-slice microcoding was done.

Multiple execution units are not the only factor contributing to high throughput. This approach also must work at high speed. So, a long, visible pipeline is often employed. But, a long pipeline creates long latencies when a change of context is required, and a visible pipeline doesn't check for data or resource dependencies. These factors contribute to the difficulty in programming a VLIW machine. Further, the "very-long instruction words" executed every cycle adversely impact code density.

When VLIW techniques are applied to DSPs, the result can be a processor that's more "general purpose" than "special purpose." The VLIW machine is a load/store architecture with a register file. It contains execution units for both logic and math functions. These features contribute to this architecture's HLL-compiler efficiency. The efficiency of this compiler, as well as related development tools, is critical to the software development of any multiple-execution device, especially a VLIW machine.

Finally, to sustain the machine's maximum throughput, the VLIW implementation must move enough data per cycle that all the execution units remain busy. This requires wide buses and frequent memory accesses—both of which dissipate power and add cost.

Summing up the features associated with the typical VLIW processor, you end up with the following list:

High speed coupled with parallel execution
Good compiler efficiency
Poor code density
High power dissipation
Difficult to program by hand
High speed = long pipeline = long latencies
Scalable

DSP circuits built with RISC-SS architectures also employ multiple execution units that operate in parallel to achieve high throughput (Fig. 2). However, RISC-SS approaches use a fixed-length instruction word coupled with scheduling performed in hardware. The fixed-length word contributes to a high code density, which minimizes the system memory required and thereby decreases power dissipation and cost. Removing the scheduling burden from the programmer or compiler makes the machine easy to program. The pipeline is hidden from the programmer and managed by hardware using a fixed set of "rules." If the pipeline is kept short, and some degree of caching and branch prediction is employed, latencies can be minimized.
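To make the idea of hardware scheduling by fixed rules concrete, here is a minimal sketch in C. It does not describe any real RISC-SS device. It assumes a hypothetical in-order machine that issues up to `width` consecutive instructions per cycle and closes a group at the first instruction that reads or overwrites a register written earlier in the same group. The `Instr` type, the register numbering, and the grouping rule itself are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-operand instruction: dst = op(src1, src2). */
typedef struct {
    int dst;
    int src1;
    int src2;
} Instr;

/* Nonzero if `next` reads the register written by `prev` (read-after-write)
 * or writes the same register (write-after-write). */
static int conflicts(const Instr *prev, const Instr *next)
{
    return next->src1 == prev->dst ||
           next->src2 == prev->dst ||
           next->dst  == prev->dst;
}

/* Counts the cycles needed to issue `n` instructions in program order,
 * grouping up to `width` consecutive independent instructions per cycle. */
size_t issue_cycles(const Instr *code, size_t n, size_t width)
{
    assert(width >= 1);
    size_t cycles = 0;
    size_t i = 0;
    while (i < n) {
        size_t group = 1;                 /* first instruction always issues */
        while (group < width && i + group < n) {
            int blocked = 0;
            for (size_t k = 0; k < group; k++) {
                if (conflicts(&code[i + k], &code[i + group])) {
                    blocked = 1;
                    break;
                }
            }
            if (blocked)
                break;                    /* dependency closes the group */
            group++;
        }
        i += group;
        cycles++;
    }
    return cycles;
}
```

Under these assumptions, the programmer writes plain linear code and the grouping falls out of the rule. Four independent instructions issue in two cycles on a two-wide machine, while a chain in which each instruction consumes the previous result takes four cycles regardless of width.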

Like VLIW approaches, the RISC-SS implementations use a load/store architecture with a register file that serves as source and destination for the execution units. As stated earlier, this is a major factor that contributes to HLL-compiler efficiency. The features that a RISC-superscalar architecture offers are, in a nutshell:

High speed coupled with parallel execution
Good compiler efficiency
Good code density
Low power dissipation
Easily programmed by hand
Easily scaled or tailored to an application

DSP Software Challenges

When designing a system around one of the next-generation superscalar DSPs, many software-related issues must be addressed. Some of those issues include what to expect from a parallel processor, control-code efficiency, code density, and HLL-compiler efficiency.

Generally, a communications system is analyzed by a high-level simulation tool for performance and complexity. Frequently, the result of this simulation is a system model described either in part or entirely in C. Once the hardware and software partitioning has been done, a target DSP can be selected for executing major portions of the software.

To see how this all comes together, let's examine a DSP system design using a model of the DSP algorithm written in C code. The task will be to implement the algorithm represented in the HLL description as assembly code. This assembly code executes a multichannel, real-time communications function on the target superscalar DSP hardware.

Parallel Processing

Both VLIW and RISC-SS machines are parallel processors that employ multiple execution units which work in parallel in the same data path. This parallelism permits the processors to achieve a high throughput, often quoted in millions of instructions per second (MIPS). Not all tasks are parallel tasks, however, and scheduling independent tasks on a machine with a unified data path is close to impossible. The system designer must therefore evaluate the following aspects when considering DSPs based on VLIW or RISC-SS techniques:

The "architectural efficiency" (defined as how well the DSP can execute a specific task) of these DSPs excels in tasks with a high degree of parallelism. A speech coder and many classical DSP tasks, such as filtering, have this quality.
Conversely, the architectural efficiency of these DSPs will greatly diminish when executing a serial or control task. An example of a serial task is a convolutional encoder used for error correction, where a bit stream must be processed in time order. A typical control task is the parsing of control bits to determine the need for retransmission of a data packet.
More hardware isn't always better. There's a point of diminishing returns in a parallel processor because not all tasks have a high degree of parallelism. In a digital communications system where there's a mix of signal processing, bit processing, and control functions, a maximum utilization of one or two operations per cycle can be expected. Additional execution units will not improve this efficiency if implemented in the same data path.
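To illustrate why the convolutional encoder mentioned in the list above is a serial task, here is a minimal sketch in C of a rate-1/2, constraint-length-3 encoder. The generator polynomials (7 and 5 octal) are a common textbook choice and an assumption here, not something this article specifies, and the function name is hypothetical. Each output pair depends on the shift-register state left by the previous bit, so the loop iterations cannot simply be handed to separate execution units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parity (XOR of the three low bits) of a small value. */
static uint8_t parity3(uint8_t v)
{
    return (uint8_t)((v ^ (v >> 1) ^ (v >> 2)) & 1u);
}

/* Encodes `n` input bits (one bit per byte, value 0 or 1) into 2*n output
 * bits. Assumed generators: G0 = 111b (7 octal), G1 = 101b (5 octal).
 * The shift register starts at zero; no tail bits are appended. */
void conv_encode_r12_k3(const uint8_t *in, size_t n, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    uint8_t state = 0;                        /* the two previous input bits */
    for (size_t i = 0; i < n; i++) {
        /* reg = [current bit, previous bit, bit before that] */
        uint8_t reg = (uint8_t)(((in[i] & 1u) << 2) | state);
        out[2 * i]     = parity3(reg & 0x7u); /* G0 = 111b */
        out[2 * i + 1] = parity3(reg & 0x5u); /* G1 = 101b */
        state = (uint8_t)(reg >> 1);          /* carried into the next bit */
    }
}
```

The `state` variable is the serial dependency: it is written at the end of one iteration and read at the start of the next, which is the kind of bit-by-bit processing in time order that limits architectural efficiency on a parallel machine.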

Control-Code Processing

Every DSP application is a mix of signal and control processing. The former is the classical "multiply/accumulate" repetitive task, which is block-oriented. Historically, it's within this type of task that DSPs are designed to excel. A control-processing task is decision-oriented, typified by the "if-then-else" decision construct. Rough estimates say that the typical DSP system is 80% signal processing and 20% control processing. In a multichannel environment, however, the control-processing portion rises above 20%.
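The effect of the control portion on a parallel machine can be estimated with Amdahl's law, which the article does not name but which fits the argument. The C sketch below assumes that the signal-processing fraction parallelizes perfectly across the execution units while the control fraction runs serially. Both are simplifying assumptions, and `amdahl_speedup` is a hypothetical helper.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `parallel` of the work
 * is spread across `units` execution units and the rest stays serial. */
double amdahl_speedup(double parallel, int units)
{
    assert(parallel >= 0.0 && parallel <= 1.0 && units >= 1);
    return 1.0 / ((1.0 - parallel) + parallel / (double)units);
}
```

With the 80/20 split quoted above, two units give a speedup of about 1.67 and four give 2.5, and no number of units can reach 5. If the control portion grows beyond 20%, that ceiling drops further.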

Control processing typically degrades a DSP's throughput, since most DSPs aren't optimized to excel at control operations. These next-generation DSPs, with their roots in microprocessor architectures, have been tailored to handle multichannel systems. The DSP system designer, though, should concentrate on the several key aspects that follow when trying to decide which DSP implementation to apply to their system.

Interrupt structure: In a dynamic, multichannel environment, the DSP architecture must support a flexible, multilevel interrupt structure. This typically requires a minimum of three levels of interrupt nesting. The program control of the processor's interrupt level and support for the change of context also require attention.

Latency: Latencies occur in pipelined architectures when there's a change of program flow. Such latencies can happen because of interrupt handling, as well as the conditional branching frequently found in control code. Conditions in the machine that are non-interruptible also will cause response delays. Features such as branch prediction, caching, and short pipelines can reduce latencies and improve performance.

I/O overhead: The processor's input/output overhead also can affect performance. This I/O overhead occurs when a change of context is required for the processor to service an I/O port (interrupt). It can be greatly reduced if the processor has and uses direct-memory-access (DMA) hardware support. The DMA hardware services the port without interrupting the processor, so the pipeline remains intact until the DMA transfer is complete. The DMA support must work in a non-cycle-stealing fashion and have no major restrictions. To service an I/O interrupt without the support of DMA, the maximum latency for entry and exit of an interrupt service routine must be understood.

Control-code efficiency: The efficiency of the control code can directly impact chip performance. Both the RISC-SS and VLIW DSP architectures have multiple arithmetic logic units (ALUs) that can operate in parallel, offering good control-code processing. A state-machine benchmark can be used to gauge how efficiently the control code runs. If the application requires bit manipulation, such as parsing data, a separate benchmark should be used to exercise the DSP's bit-banging ability.

Availability and fit of an RTOS: To ensure that the applications can run efficiently, another major factor that should be examined is the availability and fit of a real-time operating system (RTOS). With the increase in popularity of DSPs, a number of third-party vendors offer RTOSs for various target devices. In general, both DSP architectures are well suited to an RTOS due to their load/store architecture, high control-code processing efficiency, multilevel interrupt structure, and HLL support. In addition to checking with the DSP vendors as to the availability of an RTOS, designers should delve deeper to determine the software's feature set and overhead to the processor. Also, inquire whether the OS can be tailored to the specific application. Check the size of the kernel and key modules, and whether the modules can be added or removed easily (that allows the RTOS to be optimized for a particular application). Even a system manufacturer that plans to create its own "scheduler" for the DSP can probably learn something from the commercially available operating systems.

Dense Code Saves Memory

As mentioned earlier, minimizing the off-chip memory system helps lower system cost. Thus, the program code's efficiency and density become key to minimizing the memory footprint (the instruction memory space) required by the application code to implement a specific function. Superscalar devices have one common trait with respect to such a yardstick: they can trade code size for speed.

For example, both the RISC-SS and VLIW architectures can use a technique called "loop unrolling." In that scheme, the loop construct is replaced with repetitive straight-line code. The amount of code required increases, but the overhead of executing the loop (the counter update, the test, the branch, and any pipeline delay the branch causes) is eliminated, permitting the code to run faster. Unrolling can also reduce pointer manipulation within the loop.
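The trade-off can be sketched in C. The two functions below produce the same result. The second is unrolled by four, an arbitrary factor chosen for illustration, and assumes the length is a multiple of four. It occupies more instruction memory but executes one loop test and branch per four elements instead of one per element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rolled form: one loop test and branch per element. */
void vec_add_rolled(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int16_t)(a[i] + b[i]);
}

/* Unrolled by four: larger code, one loop test and branch per four elements.
 * Assumes n is a multiple of 4. */
void vec_add_unrolled4(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        out[i]     = (int16_t)(a[i]     + b[i]);
        out[i + 1] = (int16_t)(a[i + 1] + b[i + 1]);
        out[i + 2] = (int16_t)(a[i + 2] + b[i + 2]);
        out[i + 3] = (int16_t)(a[i + 3] + b[i + 3]);
    }
}
```

Full unrolling removes the loop entirely at the cost of code size proportional to the vector length. On a parallel machine, the four independent statements in the unrolled body also give the scheduler more work that can issue in the same cycle.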

VLIW and RISC-SS processors offer users the ability to make this memory/speed trade-off in varying degrees. When assessing benchmarks from a DSP vendor, both the execution cycles and the program memory should be evaluated together. In general, a VLIW architecture will require more instruction memory than a RISC-SS for the same function.

HLL compilers also offer many benefits to designers who want to develop their applications using C or another high-level language. The most obvious benefit is a quick, easy port from a C model to assembly code. This can decrease the time-to-market, and gives the C-model code a high degree of portability. By changing compilers, the C code can be compiled into a different processor's assembly-code instruction set. That, in turn, makes it much easier to switch target DSPs without rewriting all the software. Also, code maintenance is much easier if the code is in C rather than in assembly code.

Although porting a real-time application is easier with an efficient C compiler, it's unlikely that software development will end with compiled code. In a multichannel system, DSP throughput comes at a premium, so some level of hand optimization will typically be done as well.

Compiler Friendly

Superscalar DSP architectures are compiler-friendly due to their load/store architecture, large register file, stack support, and other hardware resources. Proof of compiler efficiency (a function of both the DSP architecture and the compiler implementation) can be obtained by examining several C benchmarks. When evaluating C benchmarks for DSP devices, the original C source, assembly code, program size, and cycle count should be obtained for the identical algorithm. Also, factor in the changes to the C source required to obtain the benchmark results, since they could impact the code's portability. Ideally, the vendor's coding recommendations for optimal compiler performance should also be obtained.

One area of divergence between the VLIW and RISC-SS processors has to do with their assembly-programmer friendliness. VLIW offers the challenges of scheduling multiple tasks, varying instruction latencies, and a visible pipeline. RISC-SS performs the scheduling in hardware, so the programmer only needs to craft linear code that will be executed in order. A hidden pipeline in the RISC-SS architecture resolves the data and resource dependencies in the instruction sequence.

This generally makes the RISC-SS machine easier to program at the assembly level. In addition, the compiler and debugger must be able to accommodate a mix of C and assembly code, as well as some hand coding.

Running The Algorithms

As a very simple example of an algorithm targeted for a next-generation processor, a vector-multiply routine illustrates the possible steps required to optimize the routine for implementation. The basic routine merely multiplies the corresponding elements of two vectors and accumulates the products so that one sum is output. That entire operation represents a single pass of a cross-correlation routine. The examples use the C compiler and the RISC-SS ZSP16400 DSP chip offered by ZSP, together with a generic, uniscalar processor model.

As previously stated, many DSP systems start as a high-level model. In this case, the sample algorithm is implemented as a C program (Listing 1). The original source C code that performs the multiplication and accumulation of two vectors (one pass of a simple dot product) compiles into an assembly-code block. To execute, this block requires 2033 cycles because the compiler could not take advantage of the DSP chip's resources.

This code is unmodified ANSI C, which can be compiled for execution on any target processor. The result is extremely inefficient DSP assembly code that takes no advantage of the fixed-point DSP target. That's because ANSI C has many features that make it incompatible with a fixed-point DSP architecture. For example, it has no fixed-point data type and no way to assign variables to specific registers. Thus, the compiler doesn't directly take advantage of the special features of the DSP hardware. And, it incurs a large overhead for function calls.
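Listing 1 is not reproduced here. As a rough idea of what such unmodified ANSI C might look like, the following is a minimal sketch only; the function name, argument types, and vector handling are assumptions and not the article's actual listing.

```c
#include <assert.h>

/* Plain ANSI C: multiply corresponding elements and accumulate one sum.
 * Nothing here tells the compiler about fixed-point formats, MAC hardware,
 * or which registers should hold the accumulator. */
long vector_mac(const short *x, const short *y, int n)
{
    long sum = 0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        sum += (long)x[i] * (long)y[i];
    return sum;
}
```

Code in this style ports to any target with a C compiler, which is its strength. The cost, as described above, is that the compiler has no hint that the loop body corresponds to the DSP's multiply-accumulate hardware.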

To better match the fixed-point DSP architecture to the HLL, the C code can be modified to produce more optimal DSP assembly code (assuming that the compiler can support such modifications). The C program can be modified to add an intrinsic function that maps the algorithm's multiply-accumulate (MAC) function to the DSP's dual-MAC instruction, making more efficient use of the DSP's hardware (Listing 2).

Also, a fixed-point data type (q15), which uses the upper 16 bits of a product, has been added, along with an accumulator data type that assigns the results of the MAC function to a specific register pair. The resulting DSP assembly code shows almost a four-fold cycle reduction versus the C code in Listing 1. But, this code still doesn't take full advantage of the architectural features of the DSP hardware.
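Listing 2 is not reproduced here, so the following is only a sketch, in portable C, of the general pattern described: a Q15 fixed-point type plus a multiply-accumulate helper standing in for a compiler intrinsic. The names `q15_t`, `acc_t`, and `mac_q15` are hypothetical and are not the ZSP compiler's actual types or intrinsics. On a real DSP the helper would map to a hardware MAC instruction rather than being computed in C.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;  /* Q15: value = raw / 32768, range [-1, 1) */
typedef int32_t acc_t;  /* 32-bit accumulator (a register pair on a 16-bit DSP) */

/* Stand-in for a MAC intrinsic: acc + a*b, with the product kept at full
 * 32-bit precision (Q30). Saturation is not modeled. */
static acc_t mac_q15(acc_t acc, q15_t a, q15_t b)
{
    return acc + (acc_t)a * (acc_t)b;
}

/* Dot product of two Q15 vectors. The Q30 sum is shifted back to Q15,
 * keeping its most significant fractional bits. Assumes the sum stays
 * within [-1, 1) and that >> on a negative value is an arithmetic shift. */
q15_t dot_q15(const q15_t *x, const q15_t *y, int n)
{
    assert(n >= 0);
    acc_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = mac_q15(acc, x[i], y[i]);
    return (q15_t)(acc >> 15);
}
```

As a usage example under these assumptions, the vectors {0.5, 0.25} and {0.5, 0.5} are stored as {16384, 8192} and {16384, 16384}, and `dot_q15` returns 12288, which is 0.375 in Q15. Changes of this kind make the source less portable, which is why the benchmark advice above asks for the modified C source as well as the cycle counts.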

By having the compiler use two aspects of the processor chip's architecture, rather than modifying the algorithm's code, a further optimization of the compilation can be achieved. For comparative purposes, Listing 3 shows a hand-coded DSP assembly program for the dot product. This program requires just 95 cycles to execute when optimized for the ZSP16400 architecture. That's less than 1/20th the number of cycles required by the basic compilation, and would permit many more computations to be done in a fixed amount of time.

The first aspect of the compilation optimization uses the processor's low-overhead looping construct. The second, a feature called "link registers," utilizes a specific set of DSP registers as data pointers to prefetch data. With both in use, the same C code used in Listing 2 compiles to assembly that executes in less than twice the number of cycles required by the hand-coded assembly of Listing 3.

This was a very simple example. More complex algorithms may require restructuring of the code for better optimization, including loop unrolling. The overall objective, however, is optimal compiler efficiency in execution time and memory usage, combined with maximum source portability. Selecting the best architectural approach can be done once all the pluses and minuses of RISC-SS and VLIW are understood. The guidelines outlined here should provide a good starting point. Still, each application is different, and those differences must be taken into account.