Factors To Consider When Choosing The Right DSP For The Job

The recently introduced Texas Instruments TMS320C67x and the well-established Analog Devices ADSP-2106x SHARC processors are the two highest-performance floating-point DSPs on the market today.³ Which of the two provides the higher system performance? As we shall see, the answer depends on the kind of task you're trying to perform. Keep in mind that Analog Devices (ADI) will be releasing its next-generation SHARC processors, and Texas Instruments (TI) has an aggressive plan to increase the speed of the 'C67x range.

System engineers must select the device that provides the most effective solution to meet the requirements of their DSP application. While the obvious step is to compare the raw processing power of the two processors, this comparison will give little indication of expected system performance, especially in highly demanding multiprocessing applications.

Choosing the most suitable DSP platform, from a systems perspective, requires an analysis of many aspects of the application. First, the I/O data rates and channel density must be reviewed to determine the bandwidth in and out of the system.

The next step involves the mapping of DSP algorithms to DSP devices. This may be complex, and requires an understanding of I/O data paths, memory management, interprocessor communication capability, and synchronization mechanisms. While the resolution of these issues determines the best technical solution, other factors also require consideration. For example, time-to-market is influenced by the availability of third-party library support, and the characteristics of the development tools accompanying each processor.

A comparison of the two components logically begins with an analysis of the features of each device. Rather than a comprehensive feature list, this section summarizes the features that differentiate the performance of each (see the table). Full specifications are available in the data sheets provided by each vendor. As a detailed specification was not available for the 'C67x at the time of this writing, some parameters (e.g. power consumption) are not addressed here.

From the table, it is clear that the 'C6701 outperforms the 21060 in single-processor, low- and medium-bandwidth configurations. Even with a conservative estimate of the sustained computational capacity of the 'C6701, its raw performance exceeds that of the 21060 by more than five to one.

However, the 21060, although less powerful, has other distinct advantages. Applications requiring large internal memory resources, either program or data, benefit from a configurable internal memory that is four times that of the 'C6701. In addition, multiprocessing applications can take advantage of the efficient native multiprocessing support of the 21060 processor. Finally, the 21060 has a higher cumulative I/O bandwidth than the 'C6701.

Of course, the 'C6701 has substantial I/O bandwidth and, with the assistance of external hardware, it may also be used effectively in multiprocessing architectures. This is investigated in the multiprocessing section.

Local Memory Support Is Key
It is clear that the SHARC gains the upper hand when it comes to internal memory capacity. However, it is rare that an entire application and its associated data can be accommodated in internal memory for either of these devices. It is, therefore, worth investigating the external memory options available in each case--and considering the performance.

High-Performance Memory
Algorithm developers often want high-performance external memory; in some circumstances, it is critical to the application. For example, high performance is required when code must be executed directly from external memory, and when critical variables (e.g. filter tap coefficients) are stored externally due to a lack of internal resources. Both the SHARC and the 'C67x support high-performance external memory.

A SHARC processor is easily interfaced to asynchronous SRAM (ASRAM), accessible in a single 25-ns clock cycle. Of course, ASRAM is both expensive and low in density, with a practical maximum capacity of 512k by 32 per cluster in most commercial-off-the-shelf (COTS) implementations.

The 'C67x directly supports SBSRAM, SDRAM, and ASRAM as high-performance resources. SBSRAM is currently available at 133 MHz, supporting an access every two 6-ns clock cycles of the DSP. It will likely be available at 166 MHz by the time the DSP is shipping, allowing for single-cycle access. The pipeline delay of SBSRAM should be taken into account in throughput considerations, as it adds another three cycles to each first access. The consequence is that critical sections of code must be run from internal DSP memory, because the processor needs more than eight clock cycles to load a single 256-bit instruction from any external memory. As with ASRAM, SBSRAM is expensive and low in density, with a typical allocation of approximately 128k by 32 per DSP in COTS 'C6x boards.

In summary, the SBSRAM interface of the 'C67x gives it a major performance advantage when accessing external memory: four times the throughput of a SHARC accessing ASRAM. However, this can only be realized for multiple consecutive external accesses, where the pipeline delay becomes negligible. Furthermore, in cases where consecutive instructions must be accessed from external memory, the theoretical performance of the 'C67x can be reduced from 1328 to 166 MIPS. The SHARC sustains its 40-MIPS rate whether it executes from internal or external memory.
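The figures in this summary can be checked with a few lines of arithmetic. The sketch below uses Python purely as a calculator. It assumes a 32-bit external data bus on both devices, one bus-wide word per access, and eight instructions per 256-bit 'C67x fetch; these are working assumptions rather than data-sheet values, and the function names are illustrative only.

```python
# Back-of-envelope check of the external-memory figures quoted above.
# Assumptions (not taken from a data sheet): 32-bit external bus on both
# devices, one word per access, 8 instructions per 256-bit 'C67x fetch.

BUS_BYTES = 4  # assumed 32-bit external data bus


def bandwidth_mbytes_per_s(access_time_ns: float) -> float:
    """Peak sequential bandwidth for one bus-wide word every access_time_ns."""
    return BUS_BYTES / access_time_ns * 1000.0  # bytes/ns -> Mbytes/s


def external_mips(clock_mhz: float, fetch_bits: int, instr_per_fetch: int) -> float:
    """Instruction rate when every fetch must cross the external bus."""
    accesses_per_fetch = fetch_bits // (BUS_BYTES * 8)
    return clock_mhz * instr_per_fetch / accesses_per_fetch


sharc_asram = bandwidth_mbytes_per_s(25.0)  # one access per 25-ns cycle
c67x_sbsram = bandwidth_mbytes_per_s(6.0)   # one access per 6-ns cycle

print(f"SHARC ASRAM : {sharc_asram:.0f} Mbytes/s")
print(f"'C67x SBSRAM: {c67x_sbsram:.0f} Mbytes/s")
print(f"Ratio       : {c67x_sbsram / sharc_asram:.1f} to 1")
print(f"'C67x peak  : {166 * 8} MIPS")
print(f"'C67x from external memory: {external_mips(166, 256, 8):.0f} MIPS")
```

Under these assumptions the ratio comes out slightly above four to one, and fetching eight instructions over eight bus accesses leaves one instruction per cycle, which is where the 166-MIPS figure comes from.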

High-Density Memory Support
In data-driven applications (e.g. imaging and radar), the DSP requires high-density memory for temporary storage of data. Usually, memory access is sequential due to the correlated nature of the data.

With the addition of some external logic, the SHARC can be interfaced to low-cost, bulk DRAM, with one or two 25-ns wait states. It is fairly typical to find COTS configurations with 64 Mbytes or more of DRAM per cluster. The 'C67x, on the other hand, supports a glue-less connection to SDRAM.

As with SBSRAM, there is a pipeline latency of three cycles, but sequential accesses take two 6-ns clock cycles. Paging and refresh delays also need to be considered as these will result in non-deterministic delays of ten cycles or more. In spite of this, SDRAM clearly has an advantage over DRAM when making sequential accesses to large sets of data.
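To put rough numbers on that advantage, the following Python sketch compares the time to read a sequential block under the timings given above. It assumes each SHARC wait state adds one full 25-ns cycle to an access, and it lumps SDRAM paging and refresh into a single optional stall count; both are simplifications, and the function names are hypothetical.

```python
# Rough timing model for sequential reads from bulk memory.
# Assumptions: a SHARC DRAM access takes (1 + wait_states) 25-ns cycles;
# a 'C67x SDRAM burst costs 3 cycles of latency plus 2 cycles per word,
# with paging/refresh represented by a caller-supplied stall count.


def sharc_dram_time_ns(words: int, wait_states: int) -> float:
    """Time for `words` sequential DRAM accesses on the SHARC."""
    return words * (1 + wait_states) * 25.0


def c67x_sdram_time_ns(words: int, stall_cycles: int = 0) -> float:
    """Time for one sequential SDRAM burst of `words` on the 'C67x."""
    return (3 + 2 * words + stall_cycles) * 6.0


block = 1024  # words in the block; chosen only for illustration
for ws in (1, 2):
    print(f"SHARC DRAM, {ws} wait state(s): {sharc_dram_time_ns(block, ws) / 1000:.1f} us")
print(f"'C67x SDRAM, no stalls      : {c67x_sdram_time_ns(block) / 1000:.1f} us")
print(f"'C67x SDRAM, 10 stall cycles: {c67x_sdram_time_ns(block, 10) / 1000:.1f} us")
```

For a block of this size, an occasional ten-cycle stall barely changes the SDRAM total. The non-determinism matters more for short or scattered accesses than for long sequential ones.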

What About Multiprocessing?
The "subsystem" (device and local memory) comparison presented above does not address system performance concerns associated with multiprocessor implementations using either of the devices.

If multiprocessing is necessary to meet either the real-time demands of the application or high I/O rates, DSP system performance becomes more relevant than device features. System performance considers algorithm and data distribution in addition to interprocessor communication capability.

Data storage and distribution: Whether you use a SHARC or 'C67x platform, it is a good practice to decouple the flow of data from the actual processing algorithms. This can be done using DMA coprocessors to manage data flow between subsystems by transferring large blocks between intermediate buffers. This is particularly important for the 'C67x where optimized inner loops running on the DSP cannot be interrupted to service I/O or manage data.

By decoupling data structures, these software pipelines will be allowed to run to completion, ensuring peak performance. Of course, if extremely low latency is a requirement, 'C67x loops must be unrolled at the expense of code size. Even then, the memory pipeline of the processor results in a latency when switching tasks (i.e., an 11-instruction latency to flush the pipeline and vector to the new address). In contrast, inner loops on the SHARC processor are interruptible, making it easier to balance low-latency I/O performance with optimum CPU performance.
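One common way to achieve this decoupling is a pair of intermediate "ping-pong" buffers: the DMA engine fills one while the CPU processes the other. The Python sketch below models only the buffer hand-off. It is not DSP code, the two steps run one after the other rather than concurrently, and the names `ping_pong` and `kernel` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def ping_pong(blocks: Iterable[List[float]],
              kernel: Callable[[List[float]], float]) -> Iterator[float]:
    """Alternate two buffers: one is refilled while the other is processed."""
    buffers: List[List[float]] = [[], []]
    fill = 0                     # index of the buffer the "DMA" writes next
    source = iter(blocks)

    first = next(source, None)   # prime the pipeline with the first block
    if first is None:
        return
    buffers[fill] = list(first)

    for block in source:
        ready, fill = fill, 1 - fill
        buffers[fill] = list(block)    # stand-in for a DMA block transfer
        yield kernel(buffers[ready])   # inner loop runs to completion
    yield kernel(buffers[fill])        # drain the last filled buffer


# Usage: three incoming blocks, with a plain sum standing in for the algorithm.
results = list(ping_pong([[1, 2], [3, 4], [5, 6]], sum))
print(results)
```

The processing kernel only ever sees a complete buffer, so it never has to pause mid-loop to service I/O. The cost is one block of added latency, which is the trade-off against block size noted above.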

When it comes to distributing data around a multiprocessor system, the SHARC supports this directly through Link Ports and broadcast capabilities of the cluster architecture. The 'C67x relies on the DSP board architecture to provide a flexible communication system, with external DMA facilities to move data between DSP subsystems.

If the software developer is used to mapping algorithms directly to standard nodal topologies as a method of distributing the algorithm (e.g. mesh or hyper-cube), the SHARC probably remains the processor of choice as it supports these physical topologies through Link Port connections. However, if a 'C67x platform is selected with a DSP RTOS that supports a virtual network between tasks, the standard topologies can still be implemented in abstraction from the hardware layer.
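As an illustration of such a mapping, a hypercube connects each node to every node whose ID differs from its own in exactly one bit. The Python sketch below computes those neighbors. It assumes six Link Ports per processor (true of the ADSP-21060, though not stated in this article), and the pairing of bit position with Link Port number is a hypothetical convention chosen for the example.

```python
from typing import List

LINK_PORTS = 6  # assumed number of Link Ports per SHARC


def hypercube_neighbors(node: int, dimensions: int) -> List[int]:
    """Return the nodes whose ID differs from `node` in exactly one bit.

    Entry i of the result is the neighbor reached through (hypothetical)
    Link Port i, so each dimension of the cube uses one port.
    """
    if dimensions > LINK_PORTS:
        raise ValueError("not enough Link Ports for this many dimensions")
    if not 0 <= node < (1 << dimensions):
        raise ValueError("node ID out of range")
    return [node ^ (1 << bit) for bit in range(dimensions)]


# Usage: node 5 (binary 101) in a 3-dimensional, 8-node hypercube.
print(hypercube_neighbors(5, 3))
print("Largest hypercube with one port per dimension:", 1 << LINK_PORTS, "nodes")
```

With one port per dimension, six ports allow a hypercube of up to 64 nodes. On a 'C67x platform, the same neighbor table could be kept by an RTOS virtual network instead of being wired into the hardware.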

Interleaving code and data: If algorithms are run from internal memory, it's easy to predict the data I/O throughput for both processors. If algorithms are run from external memory, a more careful analysis may be required.

Because the SHARC only supports asynchronous external memory, it's still easy to predict the throughput when code is run from external memory, even if this code is interleaved with data on the cluster bus.

If algorithms are run from SBSRAM on the 'C67x, code is burst into internal memory at 666.7 Mbytes/s (assuming a 166.7-MHz memory bus), with a three-cycle initial latency to fill the pipeline of the external memory. If these code accesses are interleaved with SDRAM data accesses, for example, prediction becomes complex due to paging and refresh-cycle latencies, and performance is poor. Generally, code would not be executed from external memory. For large algorithms, it is more efficient to run the processor with the cache enabled, allowing execution from internal memory.
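The effect of the three-cycle initial latency on the burst rate depends on how much code is loaded per burst. The Python sketch below estimates the effective bandwidth for a given burst length, assuming a 32-bit bus, one word per cycle after the initial latency, and no interleaved data accesses. The function name and default arguments are illustrative.

```python
def effective_mbytes_per_s(words: int,
                           clock_mhz: float = 166.7,
                           latency_cycles: int = 3,
                           bus_bytes: int = 4) -> float:
    """Effective bandwidth of one burst of `words` bus-wide transfers.

    A burst occupies latency_cycles + words bus cycles; bytes per
    microsecond is numerically the same as Mbytes/s.
    """
    if words <= 0:
        raise ValueError("burst must contain at least one word")
    cycles = latency_cycles + words
    return words * bus_bytes * clock_mhz / cycles


for burst in (8, 64, 1024):
    print(f"{burst:5d}-word burst: {effective_mbytes_per_s(burst):6.1f} Mbytes/s")
```

A burst of only eight words loses more than a quarter of the peak rate to latency, whereas a long burst comes within a fraction of a percent of it. Interleaved SDRAM data accesses would break up the bursts and push the result toward the short-burst case.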

Interprocessor messaging: The efficient passing of semaphores and low-latency messages is integral to any multiprocessing system. The SHARC supports these through multiprocessor memory space within a cluster, and Link Port connections between clusters and DSP boards. As with data flow, the 'C67x relies on external resources provided on the DSP board. For example, Spectrum includes DPRAM and QPRAM in the dual and quad 'C67x implementations of the FastTrack architecture. This memory connects directly to the external bus of all the processors. It provides a low-latency path between subsystems.

Of course, interrupts provide the lowest-latency mechanism for interprocessor signaling and synchronization. Whether you choose a SHARC or 'C67x, you should make sure that the DSP carrier board supports interprocessor interrupts.

Pumping Data In...And Out
By definition, DSP applications are required to move digitized waveforms in and out of the system. Due to the diverse nature of the real-world signals, this data varies in bandwidth, resolution, and number of channels--and it is impossible to generalize the I/O processing requirements. Let us consider a few "typical" scenarios:

Single processor as target: It's safe to say that if a single DSP is the target of all input data, system considerations are similar whether you select a 'C67x or a SHARC processor as the DSP.

Assuming that the application can run from internal memory, a 'C67x is more efficient in managing a single high-bandwidth stream than the SHARC due to the high performance of the external memory interface (EMIF). An image recognition system, for example, may have a high-bandwidth pixel stream that is processed by a small correlation kernel residing in internal memory.

The SHARC, on the other hand, is more effective using its DMA resources to manage multiple medium-bandwidth channels, assuming the data is available on the Link Ports. In applications where I/O data transfers from the I/O port to local DSP memory are interleaved with processor data accesses (local memory to internal registers), there is a trade-off between data-block size and real-time response, no matter which DSP is selected.

High-bandwidth multiprocessors: If the MFLOP requirement of the application is excessive (due to high bandwidth), either way you're going to need a multiprocessor solution. This section investigates these high-end applications.

Due to its inherent multiprocessor support, the SHARC fits these applications. The network can easily be scaled to suit the I/O processing requirements. The availability of off-board Link Port connections makes scaling just as easy across multiple DSP boards as it is across DSPs on a single board. Additional features, such as the capacity to broadcast data throughout a cluster, make distribution of the input data easy.

The 'C67x, unlike its floating-point predecessor, the 'C40, has no native multiprocessing support. It has been left to the DSP board vendors to innovate effective methods of achieving inter-processor communications. Spectrum's FastTrack architecture is an example of this, using a specialized ASIC to bridge each DSP to a common PCI backbone. This allows for a distributed-memory architecture, with each DSP having the ability to pump data to the local memory of any other DSP on the same board. However, it's more difficult to distribute the data across multiple boards.

It is considered poor practice to use the system bus (VME, PCI, cPCI, etc.) for high-bandwidth data. Consequently, a number of I/O buses (e.g. FPDP and Raceway) are available to support multiple slave DSP boards networked to an I/O master. The connection between the I/O bus and the DSP carrier board is often implemented using open-standard interconnects, e.g. PMC modules. If tighter coupling between the I/O and DSPs is required, this may be achieved by connecting the I/O directly to the external memory bus via a local mezzanine such as the Processor Expansion Module (PEM).

Whether a 'C67x or a SHARC is selected, there are numerous DSP network topologies available to support the I/O data flow requirements of most applications. For example, both Spectrum's Morocco SHARC platform and FastTrack 'C67x architectures support PMC-based I/O streamed to all of the local processing nodes. This allows the engineer to swing the incoming stream between two or more processors to spread the load. In both cases, the throughput is limited by the performance of the local PCI bus rather than any DSP capabilities. In distributed I/O applications, SHARC CEM interfaces (240 Mbytes/s) and 'C67x PEM interfaces (333 Mbytes/s) are provided to allow low-latency, high-bandwidth connections to the local cluster or EMIF of each DSP.

Low-bandwidth multiprocessors: There are two instances where applications require multi-DSP configurations with low data rates. First, there are applications with algorithms so computationally intensive that even a low I/O bandwidth exceeds the processing capability of a single processor. Second, in applications with multiple I/O channels, it is often convenient to distribute the I/O processing across a network of DSPs.

In the first instance, the 'C67x may offer a better solution because it will reduce the number of DSPs required in the system due to the higher CPU core performance. In the second example, with limited channel count, either a 'C67x or SHARC may be appropriate. Once the channel count demands a multiboard solution, the SHARC may be preferable due to its inherent support for interboard communications through Link Ports.

Finally, both the SHARC and the 'C67x support two TDM serial ports. Most DSP board vendors make these available to the user for direct connection to their I/O circuits.

You may still be asking "which technology will get my application to market first?" Good question. No matter which DSP you select, the majority of your design cycle will be spent developing software. If the support tools are good, your application will likely be a success even if you did not select the optimum DSP. If the tools are inappropriate, the best DSP will lose its advantage. It's worth considering if the tools support your specific application. If they do, the assistance provided by third parties should significantly reduce your time-to-market.

Multitasking support: It is easy to conceive of mapping multiple tasks to multiple DSPs in a SHARC network, especially if we consider a single task per processor. In simple pipelines or array processing applications, the SHARC may be the processor of choice due to its support for separate tasks or algorithms at each node of a multidimensional array. However, many 'C67x (and SHARC) applications may require multiple tasks multiplexed onto each DSP. In such cases, a DSP-based RTOS provides the developer with a scheduling kernel to simplify development. Some of these RTOS kernels (e.g. 3L's Diamond) have very low overhead, and provide other features (e.g. intertask communications independent of the underlying hardware).¹

The 'C67x processor will likely be the target of multi-instance applications (e.g. modems). Once again, development can be simplified through the selection of an appropriate RTOS to manage context switches and multiple data streams. Diamond and others (e.g. Eonic's Virtuoso) are currently supported on SHARC, and will be available on the 'C67x shortly.⁴

Third-party software: The availability of optimized function libraries (e.g. Imaging, Math and Signal Processing) allows developers to concentrate on their own applications rather than on time-consuming hand coding of commonly used building blocks. It usually takes a year before third-party library support for any DSP processor is available, and the 'C67x will probably be no exception. TI does a good job of keeping an up-to-date web site with free code examples; this is a useful resource.⁶ The SHARC, as a mature product, currently has optimized library support from companies like Wideband Computers.⁷

Both TI and ADI supply a solid suite of DSP development tools. If there is any difference, it's in the way that TI's 'C67x tools focus on code optimization, while the strength of ADI's Visual-DSP lies in its multiprocessor support.

For example, the natural development methodology using TI Tools is as follows:

1. Develop the application in C.

2. Write inner loops in linear assembly language.

3. Use the assembly optimizer to take full advantage of the chip's VLIW architecture.

Without the assembly optimizer to assist the developer, the 'C67x would be crippled.

In contrast, the architecture of a single SHARC DSP is simple, and no optimization tools are required to maximize CPU performance. The management of multiple tasks on different clusters is complex. However, Visual-DSP simplifies C code development in this multiprocessing environment through a sophisticated linker that supports shared memory and multiprocessor linking. Additionally, flexible overlay support allows the development of code that can be moved between overlays and non-overlay memory without rework.

Classifying By Application
Both the 'C67x and SHARC are inherently targeted at some DSP market segments, purely because they are both floating-point processors. These applications include those with wide dynamic ranges or poor signal-to-noise ratios. Examples include remote sensing and medical imaging, precision control, and some communications applications.

It would be easy if we could classify the 'C67x or the SHARC according to application (e.g. DSP X works for sonar and DSP Y is best for medical imaging). Unfortunately, this is seldom possible.

Let's take sonar as an example. Within sonar, we may get a simple replica correlation application running on a DSP connected to a single hydrophone and an alarm. A towed-array sonar system, on the other hand, may have a few hundred sonar pods feeding into a meshed array of DSPs running multiple beam-forming algorithms. In the first instance, the best solution may be one or two 'C67x processors, while 64 SHARCs may be more appropriate for the sonar array-processing application.

Floating-point DSPs are also selected as development platforms for fixed-point applications. This is due to the ease of coding during the proof-of-concept phase. In this light, both processors may be used as the springboard for any fixed-point application--as with floating-point, it is impossible to generalize here.

In conclusion, it is more appropriate to select the DSP platform according to the multiprocessing, I/O, and support requirements discussed above than to attempt to classify the applications. Here, we have investigated various aspects of single and multiprocessor implementations using both the 'C67x and SHARC processors. While the 'C67x appears to have a performance edge in single-processor implementations, both have their strengths in multiprocessing applications. The most suitable platform depends on data flow, memory requirements, array topology, and algorithm characteristics.

References:

1. 3L World Wide Web Site (http://www.ThreeL.co.uk).

2. Concurrent Programming, Alan Burns and Geoff Davies, Addison-Wesley, 1993.

3. Analog Devices ADSP-2106x World Wide Web Site (http://www.Analog.com).

4. Eonic Systems Inc. World Wide Web site (http://www.Eonic.com).

5. "Sophisticated tools bring real-time DSP applications to market," J.H. Meyer, Military & Aerospace Electronics, January 1998.

6. Texas Instruments 'C67x World Wide Web Site (http://www.TI.com).

7. Wideband Computers Inc. World Wide Web Site (http://www.Wideband.com).