Choosing The Right DSP For Real-Time Embedded Systems

Each DSP and data-acquisition system requirement presents the designer with a unique combination of delivery, cost, packaging, and performance goals. These factors must be weighed and balanced to arrive at an optimum architecture and design approach. One of the first and most critical decisions is choosing the appropriate DSP for the system, because this selection affects every aspect of the engineering cycle. This choice can make the difference between a project's success and its failure.

When picking a DSP, it's important to examine both the hardware and software factors. We will first take a look at the hardware factors involved in this decision.

External bus architectures: External buses are extremely important for supplying the DSP with two vital resources: program code and data. Depending on the processor and the application, either one or both of these may be the limiting factor or bottleneck in the design and could become even more critical than the raw computational speed of the device.

The bus architectures of three popular DSPs are listed in Table 1: the Analog Devices ADSP21160 and two devices from Texas Instruments (TI), the TMS320C6701 and TMS320C6203.

The C6203 is the only device with two independent 32-bit buses that can operate in parallel for simultaneous cycles on both. This is extremely useful if program code can't fit within internal program memory. One bus can be employed for external program memory fetches while the other performs external data fetches.

Because the 21160 has an external 64-bit data bus, it supports single-cycle fetches of external 64-bit data words and 48-bit instruction words, both of which would otherwise require two fetches over a 32-bit bus. As we shall see, program fetch cycles for the C6701 over a single 32-bit bus can be quite critical.

The address bus for the C6701 is limited to 24 bits, which is more appropriate for an embedded, dedicated DSP than for a general-purpose device with many different types of peripherals. For open-architecture, embedded, board-level DSP products, the extensive memory space of the backplane bus favors devices with at least 32 address bits.

Memory architectures: Internal memory is another important hardware factor to consider. The internal memory resources for the three processors are summarized in Table 2. The 21160 provides a generous 512 kbytes of internal RAM that can be allocated freely for either 48-bit program code, or for data words of 16, 32, or 64 bits. This memory is divided internally into two equal memory blocks that may be accessed in parallel for simultaneous program and data transfers.

The processor uses a single-instruction/multiple-data (SIMD) architecture with two identical arithmetic engines executing in parallel from a single, common instruction word. The two internal RAM banks are connected to these two CPU engines over independent data paths. This arrangement doubles the data processing throughput for many applications.

Also featured is a 32-word program cache. Once program code has been loaded into the cache, both internal RAM memory blocks are free for data fetches over the associated internal buses. In this case, two operand fetches plus an instruction fetch can all occur within a single 10-ns cycle. Each arithmetic section supports three concurrent floating-point arithmetic operations, resulting in a peak processing rate of 600 MFLOPS.

Both the C6701 and the C6203 utilize very-long-instruction-word (VLIW) engines, each with eight arithmetic elements capable of operating in parallel. Unlike the SIMD model of the 21160, every element executes its own 32-bit instruction. For this reason, the internal program memory is organized in 256-bit words (8 by 32 bits) to support the execution of these VLIW instructions once per clock cycle. For the 167-MHz C6701, six of the eight elements are floating-point units, resulting in the 1000-MFLOPS peak rating. All eight fixed-point elements of the C6203, operating at 300 MHz, produce a peak rating of 2400 MIPS.
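The peak ratings quoted above follow from simple arithmetic: clock rate times the number of parallel units times the operations each unit completes per cycle. The short sketch below reproduces the three figures. The function name and its parameters are ours, chosen for illustration, and the 100-MHz clock for the 21160 is taken from its 10-ns cycle time. The C6701 product comes out at 1002, which rounds to the quoted 1000 MFLOPS.

```python
def peak_rate_mops(clock_mhz, units, ops_per_unit=1):
    """Peak rate in millions of operations per second (illustrative helper)."""
    return clock_mhz * units * ops_per_unit

# 21160: two SIMD engines, three concurrent floating-point operations each,
# one 10-ns cycle (100 MHz).
adsp21160_mflops = peak_rate_mops(100, units=2, ops_per_unit=3)

# C6701: six of the eight elements are floating-point units, 167 MHz.
c6701_mflops = peak_rate_mops(167, units=6)

# C6203: all eight fixed-point elements, 300 MHz.
c6203_mips = peak_rate_mops(300, units=8)

print(adsp21160_mflops, c6701_mflops, c6203_mips)
```

These are peak numbers only. They assume every unit is kept busy on every cycle, which real code rarely achieves.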

One drawback of the C6000 family is that in order to benefit from these tremendous processing rates, program code must execute from internal memory. Otherwise, eight 32-bit instructions need to be fetched over an external 32-bit bus for each VLIW instruction. On the C6701 with only one external bus for both data and code, this can substantially reduce the processing rate.
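A rough sketch shows how severe this penalty could be. It counts the bus transfers needed per instruction word and then scales the peak rate accordingly. The estimate rests on assumptions that are ours, not the vendors': one 32-bit external transfer per CPU clock, no cache hits, and no competing data traffic. Real external memory adds wait states, so treat the result as an optimistic bound rather than a measured figure.

```python
def bus_transfers(word_bits, bus_bits):
    """Number of bus transfers needed to move one word (ceiling division)."""
    return (word_bits + bus_bits - 1) // bus_bits

def fetch_bound_rate(peak_rate, transfers_per_instruction):
    """Upper bound on throughput when every instruction is fetched externally.

    Assumes one bus transfer per CPU clock, which is optimistic.
    """
    return peak_rate / transfers_per_instruction

vliw_bits = 8 * 32                              # eight 32-bit instructions
transfers = bus_transfers(vliw_bits, 32)        # over a 32-bit external bus
c6701_peak_mflops = 6 * 167                     # six floating-point units at 167 MHz
c6701_bound = fetch_bound_rate(c6701_peak_mflops, transfers)

print(transfers, c6701_bound)
```

The same helper confirms the earlier bus comparison: a 48-bit instruction word needs two transfers over a 32-bit bus but only one over a 64-bit bus.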

But unlike other forms of VLIW instruction devices, the C6000 family optimizes internal program memory utilization through efficient packing of the execution word and conditional execution on every instruction. Another unique feature of the C6000 family is the program cache controller. It operates on a 128-kbyte block of program memory, not just on a few words like the 21160. Large blocks of critical routine code can be loaded into the cache by enabling the cache during the first access to external memory. This can dramatically reduce costly access to external program memory. It's one of the most critically important features of the device.

DMA engines: An important hardware resource for all DSPs is the DMA controller. This device can dramatically reduce the processing load on the processor by handling all of the significant data movement tasks.

On the 21160, the fourteen DMA channels are somewhat specialized as they're dedicated to moving data between internal RAM and specific hardware resources (Table 3). Six DMA channels are dedicated to link ports, and four are allocated for the serial ports. The remaining four channels connect to the processor's external port for accessing other processors, external memory, and external I/O peripherals. These DMA channels have excellent connectivity to the interrupt system, and they support two-dimensional DMA as well as DMA chaining for automatic linked transfers.

All four DMA channels of both C6000 devices are general purpose with the ability to move data to and from internal data memory, internal program memory, internal peripherals (such as the serial ports), and external memory and devices.

Special features of the C6000 DMA channels include support for 8-, 16-, and 32-bit transfers, full-duplex-mode transfers, and extensive support for framed data structures from the serial port controllers, which is useful in many telecom applications.

Serial ports: All three processors feature two or three serial ports (Table 4). Each of these full-duplex synchronous ports is tightly coupled to the interrupt system and the DMA controller engine to move data efficiently into and out of internal RAM.

The C6000 family features an exceptionally versatile serial controller called the multichannel buffered serial port (McBSP). Its I/O has been optimized for a wide range of popular serial communication processing tasks. Internal logic supports configurable serial word sizes from 8 to 32 bits and up to 128 TDM time slots within a frame.

Interprocessor communication: Moving data efficiently between processors is essential in many applications, especially when the main processing task must be partitioned among multiple processors. A traditional method for handling this partitioning is the use of a stream-oriented pipelined approach with data moving in and out of each DSP. Here, the appropriate algorithm is applied on the way through.

Another method is to connect the processing nodes in mesh, star, or other geometric configurations. A third approach employs a shared memory resource that all DSPs can access as required. Performance of the shared model is usually limited by the fact that only one processor can utilize the shared memory at a time.

The 21160 features six byte-serial link ports, each supporting dedicated 100-Mbyte/s interprocessor transfers for either input or output (Table 5). These link ports facilitate many different classes of standard multiprocessor configurations, in both two- and three-dimensional arrays.

One major benefit of these links is that they may be operated in parallel because they utilize a private, dedicated data path and are independent of any external bus activity. A second bonus is that DMA channels can be assigned to handle the data movement automatically as a background task, freeing the main CPU in the DSP for more important chores. Additionally, the 21160 features a "cluster" bus that allows multiple DSPs to read and write to each other's internal memory over the 64-bit external bus.

Aside from the serial ports, the C6000 devices have no interprocessor links like those in the 21160. These products must instead rely on external circuitry to handle these transfers.

Now, let's examine the software evaluation factors.

Fixed versus floating point: Fixed-point DSPs overwhelmingly dominate the market over floating-point devices in both sheer numbers and dollar share—by at least ten to one! The reasons are fairly well understood.

Fixed-point devices are less expensive, draw less power, and occupy less silicon real estate than their floating-point counterparts. In high-volume applications, these factors dominate the decision. The recurring cost of the end product is the most critical factor. The engineering development effort can be quite large, but it can be amortized over a large number of units.

Floating-point devices are much more forgiving of software because they can handle enormous variations in numerical values and still provide extremely accurate results. This significantly cuts down the software development effort and allows the software designer to work in a higher-level language.

Fixed-point processors require that very careful attention be paid to scaling, so overflow and underflow during calculations will be avoided. Care must be taken to maintain as much of the dynamic signal range as possible. The fidelity of the overall signal path depends on the useable range of the most poorly scaled point in the flow.
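The scaling problem is easy to demonstrate with 16-bit fractional arithmetic. The sketch below uses the common Q15 convention (one sign bit, 15 fractional bits), which is our choice for illustration and not tied to any of the three devices. All helper names are hypothetical. It shows what happens when two in-range values are added: an unprotected sum wraps around to a negative number, a saturating sum clips, and pre-scaling both inputs by one half keeps the result correct at the cost of one bit of dynamic range.

```python
Q15_MIN, Q15_MAX = -32768, 32767

def saturate(v):
    """Clip an integer to the 16-bit signed range."""
    return max(Q15_MIN, min(Q15_MAX, v))

def wrap16(v):
    """Model a 16-bit register that silently wraps on overflow."""
    return ((v + 32768) & 0xFFFF) - 32768

def to_q15(x):
    """Convert a float in [-1, 1) to Q15, saturating at the limits."""
    return saturate(int(round(x * 32768)))

def q15_mul(a, b):
    """Q15 multiply: the 30 fractional bits of the product are shifted back to 15."""
    return saturate((a * b) >> 15)

a, b = to_q15(0.75), to_q15(0.5)

wrapped = wrap16(a + b)             # overflow wraps to a large negative value
saturated = saturate(a + b)         # overflow clips to full scale instead
prescaled = (a >> 1) + (b >> 1)     # inputs halved first: represents 1.25 / 2
product = q15_mul(a, b)             # represents 0.75 * 0.5

print(wrapped, saturated, prescaled, product)
```

A floating-point device needs none of this bookkeeping, which is the development saving described above.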

Given enough time, testing, and optimization, the fixed-point processor can be made to work quite well for many applications. Some applications, like image and video processing, inherently thrive on fixed-point processing. In these applications, the dynamic range of the signals is well-defined on an intensity scale and floating-point performance would add little benefit.

Floating-point devices are extremely popular with government and defense contractors who need to create a relatively low number of complex, high-performance systems on time and within budget. In these cases, the development and integration costs represent a significant portion of the sale, and floating-point DSPs can help contain these costs.

In addition, it's very likely that the system will have to be maintained, upgraded, and enhanced during its lifetime, which could be as long as 10 to 15 years. Trying to upgrade hand-optimized, fixed-point assembly code is much more costly and unpredictable than working on a floating-point device in a higher-level language, such as C.

Regarding the DSP candidates at hand, the 21160 and C6701 designs both deliver floating-point performance, while the C6203 is a fixed-point device.

Execution speed and benchmarks: As mentioned above, critical algorithms can sometimes dominate the DSP choice. The weight of this factor obviously depends on the percentage of time that the DSP performs a given algorithm. For this reason, benchmarks are especially important for DSPs dedicated to one or two limited tasks, a common scenario in high-volume, embedded applications.

For example, the DSP in an active noise-cancellation device, such as one found in a jet aircraft, may spend nearly all of its time performing an adaptive filter algorithm. On the other hand, a DSP on a PCI card for computer telephony in a desktop computer might be juggling a large number of diverse and asynchronous tasks. These tasks could include handling an analog-to-digital converter (ADC) and a digital-to-analog converter (DAC), performing tone decoding and generation, processing demodulation and filtering algorithms, updating an operator screen, and responding to keyboard commands.

Take a look at some published benchmarks for two popular algorithms, the fast Fourier transform (FFT) and finite-impulse-response (FIR) digital filter (Table 6).

The 21160 figures in the left subcolumn show the actual execution time for the algorithm. If the application involves two data streams with identical processing, then the SIMD architecture can effectively slash the time per stream in half, as revealed in the right subcolumn. Due to its strong legacy from the original 21060 SHARC, the 21160 is especially well optimized for FFTs and radar applications.

Because all three processors have dual arithmetic engines, two multiply-accumulate (MAC) cycles, the fundamental operation for FIR digital filters, can be computed simultaneously. For this reason, each tap of an FIR filter takes one half the clock period, as shown in the table. The details and techniques involved in creating these benchmarks are beyond the scope of this article. More information is available directly from TI and Analog Devices or on their web sites, which cover a wide range of benchmarks. Because new code and application notes are continually becoming available for these devices, it's best to check these sources frequently.
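To make the MAC operation behind the FIR figures concrete, here is a minimal direct-form FIR filter. It is a plain reference sketch, not the optimized vendor code used for the published benchmarks, and the names are ours. The inner loop performs one multiply-accumulate per tap. The second helper turns the "two MACs per clock" observation into a time per tap.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: each output is the sum of taps[k] * x[n - k]."""
    history = [0.0] * len(taps)          # most recent sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]     # shift the delay line
        acc = 0.0
        for h, s in zip(taps, history):
            acc += h * s                 # one MAC per tap
        out.append(acc)
    return out

def tap_time_ns(clock_mhz, macs_per_cycle=2):
    """Time per tap when macs_per_cycle MACs complete every clock."""
    return 1000.0 / clock_mhz / macs_per_cycle

# Feeding in a unit impulse returns the tap values themselves.
impulse_response = fir_filter([1.0, 0.0, 0.0, 0.0], [0.5, 0.25, 0.25])
print(impulse_response, tap_time_ns(100))
```

At 100 MHz with two MACs per clock, each tap costs half of the 10-ns cycle.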

Software tools: The 21160 is supported by Analog Devices' VisualDSP integrated software development environment, which includes project management, a debugger, an editor, and an ANSI C compiler. Also available are a C runtime library for both single and multiprocessing DSP applications, math functions with numerical C extensions for vector, matrix, and array processing tasks, and a full instruction simulator.

The DSP Collaborative initiative from Analog Devices extends the VisualDSP tool set by enabling independent third-party companies to add value using a published set of application programming interfaces. These include real-time operating systems, emulators, high-level language compilers, and multiprocessor tools.

TI offers eXpressDSP tools for the entire C6000 family, including an integrated development environment called Code Composer Studio. It features a built-in editor and interactive debugger with extensive trace, breakpoint, and profiling capabilities. TI's Optimizing C Compiler and Optimizing Assembler maximize performance by utilizing as many of the eight arithmetic elements in parallel as possible.

Unique to the DSP industry, the C6000 assembler also offers many modes of optimization. One elegant and very useful feature allows the implementation of virtual register management. Here, the programmer uses a symbolic name for the registers, and the assembler allocates the actual hardware registers to optimize code packing.

When handling multiple streams of data, the compiler frequently can optimize the code to calculate as many as eight streams of data in parallel. This will support the many telephony applications targeted by the C6000 architecture.

Other components of the eXpressDSP initiative are a real-time kernel called DSP/BIOS, an algorithm standard known as XDAIS, and an extensive list of compliant third-party offerings for libraries, real-time operating systems, and code-generation packages.

Putting It All Together

Armed with this comparative information, we'll take a look at three different DSP applications to see how each processor can be deployed in real-world systems.

Computer telephony echo cancellation: Widespread use of computer telephony has opened up many applications for DSPs. Suppose we had to design an echo-cancellation system capable of handling 64 voice channels sent over two E-1 digital telephony streams. Digital echo cancellation relies on adaptive digital filter algorithms to remove annoying echoes prevalent in virtually all analog phone channels from reflections in the lines. Because most telephone networks still rely on some kind of analog links to the end user, this is an extremely important function, especially for analog modems and fax transmissions.
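As an illustration of the adaptive-filter idea, the sketch below uses the normalized least-mean-squares (NLMS) update, a common textbook choice. We don't know which algorithm TI's library actually uses, so NLMS here is our assumption, as are the function name, the step size, and the short four-tap echo path in the demo. The filter learns a model of the echo path from the far-end signal and subtracts its echo estimate from the returning signal.

```python
import random

def nlms_echo_canceller(far_end, mic, num_taps, mu=0.5, eps=1e-8):
    """Return (residual, weights) after adapting an FIR model of the echo path."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps                    # most recent far-end sample first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * s for w, s in zip(weights, history))
        e = d - echo_estimate                     # what is left after cancellation
        norm = eps + sum(s * s for s in history)  # input power normalization
        step = mu * e / norm
        weights = [w + step * s for w, s in zip(weights, history)]
        residual.append(e)
    return residual, weights

# Demo with a hypothetical echo path and a reproducible noise-like far-end signal.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.0, 0.6, -0.3, 0.1]
mic = [sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(far_end))]

residual, weights = nlms_echo_canceller(far_end, mic, num_taps=len(echo_path))
print([round(w, 3) for w in weights])
```

In this noise-free demo the learned weights settle on the echo path itself. A production canceller must also cope with near-end speech and much longer filters, which is where the processing load comes from.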

Each E-1 stream is a time division multiplexed (TDM) signal carrying 32 voice channels with 8-bit samples for each channel sent sequentially in a frame of 32 time slots. The frame rate is 8 kHz, resulting in a serial-bit rate of 2.048 Mbits/s. With variable data word sizes and up to 128 TDM slots, the three multichannel buffered serial ports of the C6203 can be easily configured to match the time slot and framing characteristics of the E-1 lines. All of the serial ports are full duplex. Therefore, unprocessed input data streams and echo-cancelled output data streams can share the same port.
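The E-1 numbers can be checked in a few lines, and the same frame structure tells us how to separate the channels once a serial port has delivered the samples. In this sketch the function names are ours, and the demultiplexer assumes the samples arrive as a flat list in time-slot order, frame after frame.

```python
def tdm_bit_rate(slots_per_frame, bits_per_slot, frame_rate_hz):
    """Serial bit rate of a TDM stream in bits per second."""
    return slots_per_frame * bits_per_slot * frame_rate_hz

def demux_frames(stream, slots_per_frame):
    """Split interleaved TDM samples into one list per time slot."""
    return [stream[slot::slots_per_frame] for slot in range(slots_per_frame)]

e1_rate = tdm_bit_rate(slots_per_frame=32, bits_per_slot=8, frame_rate_hz=8000)

# Two frames of dummy samples numbered 0..63 stand in for received data.
channels = demux_frames(list(range(64)), slots_per_frame=32)

print(e1_rate, channels[0], channels[31])
```

Each per-channel list is then an ordinary 8-kHz sample stream that an echo canceller can process independently.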

Echo-cancellation algorithms, available from TI for the C6201 fixed-point processor, are well characterized for various "tail" lengths relative to the internal memory and processing speed. The tail length is the maximum echo delay that can be cancelled. With 50% more speed and four times the internal data memory of the C6201, the C6203 can handle up to 69 channels with a tail length of 48 ms. Using two of the serial ports for full-duplex operation to handle 64 channels is a very good fit for the 69-channel maximum.
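A quick sizing check relates these figures. Assuming the adaptive filter must span the whole tail at the 8-kHz channel sample rate (our assumption about how the tail is covered), the tail length fixes the number of filter taps per channel. The helper name is hypothetical.

```python
def tail_taps(tail_ms, sample_rate_hz):
    """Filter taps needed to span an echo tail of tail_ms milliseconds."""
    return tail_ms * sample_rate_hz // 1000

taps_per_channel = tail_taps(48, 8000)    # 48-ms tail at 8 kHz
required_channels = 2 * 32                # two E-1 streams of 32 channels
spare_channels = 69 - required_channels   # margin against the quoted maximum

print(taps_per_channel, required_channels, spare_channels)
```

The small margin of spare channels is what makes the single-processor design a good fit rather than an over-provisioned one.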

Although the 21160 and the C6701 could be used for this application, their significantly lower processing speeds wouldn't let a single processor handle all 64 channels. Additionally, the floating-point precision of these devices would mostly be wasted because the signal-to-noise performance for this application isn't as critical, and the fixed-point accuracy of the C6203 is quite sufficient.

Radar array processor: For the second application, we have to perform real-time radar processing for a large, ground-based phased array radar system. Each element of the array must have its phase electronically controlled to "steer" the radar beam pattern. Received signals from every element have to be processed and then combined to form a single, composite high-resolution image.

As previously mentioned, the original architecture for the 21160 was inspired by radar processing requirements that heavily rely on performing extremely fast FFTs. With SIMD processing, the 100-MHz 21160 executes a 1024-point complex FFT in 90 µs, which is even faster than the C6701 operating at a higher clock rate of 167 MHz.

In radar processing, sophisticated signal-enhancement techniques can be used to extract very small targets, provided that the accuracy of the calculations is maintained throughout. For this reason, the extra precision offered by floating-point processors, like the 21160, proves to be a major advantage over fixed-point designs, such as the C6203.

Because there will be many processors acting in parallel to meet the processing demands from the many elements, the six 100-Mbyte/s interprocessor link ports are another advantage for the 21160 in this application. They can be used to merge signals from each of the many antenna elements very effectively to form the final image. With no built-in multiprocessing resources, the C6701 would require a significant amount of external hardware to handle these transfers.

Another reason to select the 21160 for this project is the wealth of radar software already developed for the original 21060 SHARC processor, which is fully code-compatible with the 21160.

Remote signal intelligence receiver: The third system involves a software radio requirement for a remote, standalone receiver system capable of receiving, classifying, demodulating, and storing a wide range of known and unknown transmissions. At the front end of this system is an antenna, an RF downconverter, a wideband ADC, and a digital receiver. It's important to be able to pick out a very weak signal in the presence of other large signals. For this reason, a floating-point processor is ideal. The scope of this remote receiver strongly suggests implementing a single processor, if possible.

At 1000 MFLOPS, the C6701 offers the most floating-point processing power of the three devices. Furthermore, the absence of multiprocessing support on the C6701 isn't a factor in this single-processor system. The large number of different demodulation and classification algorithms can be efficiently developed from floating-point routines. This provides the necessary accuracy without the lengthy software development effort to handle the scaling and optimization that a fixed-point processor would require.

The C6701 first computes an FFT using wideband output from the ADC to determine the frequencies of signals present. Next, it tunes the digital receiver to the signals of interest, again performing a spectral analysis to try to classify the modulation technique. The most probable demodulation algorithm is then performed on the narrowband signal and the output is analyzed for meaningful content. Several different demodulation schemes might have to be tested until a useful signal is obtained. If the intelligence gleaned is useful and important, it can be saved on a local hard disk along with a time and date stamp for later retrieval.
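The first step, finding which frequencies are occupied, can be sketched with a small radix-2 FFT and a simple threshold on the magnitude spectrum. This is an illustrative sketch only: the function names, the threshold rule, and the two-tone test signal are ours, the input length must be a power of two, and a fielded receiver would use an optimized library FFT with windowing and averaging.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def detect_bins(samples, threshold_ratio=0.5):
    """Return positive-frequency bins whose magnitude is near the strongest peak."""
    spectrum = fft(samples)
    mags = [abs(v) for v in spectrum[:len(samples) // 2]]
    peak = max(mags)
    return [k for k, m in enumerate(mags) if m >= threshold_ratio * peak]

def bin_to_hz(k, sample_rate_hz, n):
    """Center frequency of FFT bin k for an n-point transform."""
    return k * sample_rate_hz / n

# Test signal: two tones placed exactly on bins 5 and 12 of a 64-point FFT.
n = 64
samples = [math.cos(2 * math.pi * 5 * i / n) + 0.8 * math.cos(2 * math.pi * 12 * i / n)
           for i in range(n)]
occupied = detect_bins(samples)
print(occupied)
```

The detected bins, converted to frequencies with the ADC sample rate, are what the processor would use to tune the digital receiver.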

For additional savings of valuable development time, numerous third-party software library tools for the C6000 family help complete the list of required analysis and demodulation functions.

Every application will require that the system designer assign the appropriate weighting factors to the strengths and weaknesses of each DSP device under consideration. By identifying these factors early in the design cycle, and by picking the right DSP for the job, the project is much more likely to hit high marks in cost, performance, and delivery.