The shift to larger data words and higher sampling rates in audio applications is creating challenges for the audio developer. Here’s some advice on how to deal with the transition.
By Louis Bélanger, Lyrtech and Gerard Andrews, Texas Instruments
The demand for high-fidelity audio is on the rise, evidenced by the dramatic growth of musical instrument and professional audio products moving to 24-bit words and 192 kHz sampling rates. Traditionally, audio applications have been implemented on 16- and 24-bit fixed-point architectures, which begin to break down as sampling rates increase. Handling audio at 24-bit, 192 kHz requires a wide dynamic range and introduces strict signal-to-noise ratio (SNR) design constraints. Implementing audio effects and pre/post-processing algorithms within such demanding conditions requires efficient use of available system resources as well as development tools that enable developers to quickly optimize code implementations without having to resort to extensive and time-consuming hand-coding in assembly. For these reasons, many developers are turning to 32-bit floating-point DSP architectures to meet their cost, performance and time-to-market goals.
One of the main challenges in digital audio processing is the preservation of dynamic range. Dynamic range is a measure of the breadth of values that can be represented when processing an audio signal. To understand the effect of dynamic range, consider attempting to represent filter coefficients in a 16-bit fixed-point format. When working with a fixed-point architecture, the position of the decimal point is fixed and assumed by the developer. There are 65,536 steps available with 16 bits and, to simplify this example, assume each step corresponds to a change in value of .00001. In this way, the numbers from .00001 to .65536 could be represented to a granularity of .00001.
Here lies the limitation of dynamic range when applied across a wide range of values. At the high extreme, it is possible to represent small changes in amplitude: the difference between .65535 and .65536 is an incremental change of roughly 0.0015 percent. At the low extreme, however, the same one-step change is significant in relative terms: the difference between .00001 and .00002 is an incremental change of 100 percent. The fixed representation of small numbers is effectively loaded with leading zeros. In other words, only relatively large changes in value can be represented at the low extreme, because the step size is fixed regardless of the magnitude of the value.
Of course, the more bits used to represent a number, the wider the range of numbers that can be represented. For example, with 24 bits, 16,777,216 different values are available. However, even with this many bits, a fixed-point representation can still limit audio quality.
A concrete example of where dynamic range plays a critical role is a filter for a subwoofer. Subwoofers can have a cut-off frequency as low as 30 to 40 Hz, even at CD-quality sampling rates of 44.1 kHz, which means that the subwoofer filter is working at approximately 1/1000th the sampling frequency. As a consequence, the coefficients for this filter will vary greatly, possibly from very small numbers, such as .00001, to larger numbers, such as .99997.
This example illustrates how dynamic range is sacrificed at the low end to represent both small and large numbers in the same fixed-point representation. In other words, this is a sacrifice in relative accuracy. Because of the way the amplitude is represented in a fixed-point format, there are fewer bits available to represent small changes. Maintaining the stability of such a filter is difficult when it is implemented on a fixed-point architecture, because of the dynamic range limitations that arise from having to work with numbers that are both so large and so small.
Developing with Floating-Point
Floating-point architectures eliminate the breadth limitations of dynamic range introduced by fixed-point architectures. Because the decimal point can float, all of the available bits can be used to represent small changes relative to the actual magnitude of the value. Using floating-point, an equivalent incremental change can be captured when representing both large and small numbers, because leading zeros are eliminated and all the bits can be used effectively.
Due to the automatic scaling of numbers in floating-point processing, many audio developers prove out their designs or algorithms using a floating-point architecture. In this way, they can quickly test different implementations without having to worry about manually scaling the numbers. In the past, once the floating-point design was completed, the developer then spent several months converting the design to a fixed-point architecture to reduce system cost.
Recent advances, however, have lowered the cost differential between fixed- and floating-point architectures, making it appealing for developers to go to production with the floating-point device used to design the initial system. Doing so eliminates the time-consuming and labor-intensive task of converting between fixed- and floating-point. The realized cost savings, as well as the ability to reduce total time-to-market by several months, often result in a lower overall system cost.
Additionally, fixed-point implementations are extremely difficult to modify. Often, the code has been written by hand in assembly and therefore future modifications can result in the need to re-optimize code. With floating-point, developers can implement algorithms directly in C and let optimizing compilers take over the task of fine-tuning code for performance.
The C Advantage
Today’s DSP floating-point architectures are designed with cost and ease-of-development in mind. For example, the latest floating-point DSPs from Texas Instruments are based on a very long instruction word (VLIW) architecture. This architecture, which was developed in conjunction with the accompanying optimizing compiler, features a highly orthogonal instruction set, making the DSP an optimal compiler target. With independent functional units and an orthogonal instruction set, these VLIW machines are designed to maximize parallelism, reduce overhead and reduce the complexity of the programming task. By abstracting code and moving away from direct assembly implementations, optimizing compilers and other development tools can take on much of the burden of maximizing overall performance.
It is important to note that in some cases an optimizing compiler can create even more efficient code than a developer hand-coding assembly. Several types of optimizations, such as software pipelining and loop unrolling, involve extremely complex register allocation that would be time-consuming for a developer to code by hand. Additionally, these optimizations are highly dependent on adjacent code and must be fine-tuned whenever application code is modified, making hand-optimized code extremely difficult to modify and maintain. An optimizing compiler, on the other hand, is able to manage the complexity of such optimizations and can transparently retune them each time code is modified.
Another advantage of implementing algorithmic code in C is that optimizations can be automatically implemented without substantial code retuning. When an algorithm is hand-coded, it needs to be rewritten to take advantage of new features, whereas C code can be simply re-compiled for the new target.
Improving audio quality by increasing the sampling rate requires a corresponding increase in processing resources. For example, doubling the sampling rate from 48 kHz to 96 kHz results in twice as many samples to be processed in the same amount of time.
It is important to note, however, that there is more involved in selecting an architecture than merely the number of MIPS available for processing. Doubling the processing load also doubles I/O and memory throughput. Additionally, doubling the number of transactions on the system bus can actually require more than double the bandwidth, because of increased contention for system resources. Bus and memory contention also can create scheduling and optimizing issues for developers. Thus, an audio DSP requires an architecture that facilitates data movement, so that the DSP core can easily be kept fed, reducing pressure on the programmer to manage resource utilization.
Because audio codecs, filters and other algorithms process large amounts of data in similar fashions, it is possible to introduce flexibility to the memory management scheme, accelerating commonly implemented operations. Consider the fact that the majority of audio processing algorithms, including reverb, chorus, echoes, flanging and wavetable synthesis, among others, require the use of delay lines.
An echo, for example, uses samples from the past to create this well-known sound effect. When processing an algorithm, the DSP must usually implement a series of delay lines for different portions of the signal processing paths. Traditionally, a direct memory access (DMA) engine is used to facilitate efficient access to these delay lines. A DMA engine not only arbitrates contention between concurrent requests for memory accesses, it can also pre-fetch data before it is actually needed (potentially reducing effective read time to zero) and offload data pointer management from the DSP (i.e., the DMA engine manages data pointers, instead of consuming cycles on the DSP to accomplish this task).
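A delay line is typically implemented as a circular buffer of past samples. The following is a hypothetical sketch of an echo built on such a buffer; the names, buffer length, tap depth and feedback gain are illustrative, not taken from any particular DSP library:

```c
#include <assert.h>
#include <string.h>

#define DELAY_LEN 4096   /* illustrative buffer length, in samples */

/* A delay line as a circular buffer of past samples */
typedef struct {
    float buf[DELAY_LEN];
    int   write;         /* index of the next sample to write */
} delay_line;

static void delay_init(delay_line *d) {
    memset(d->buf, 0, sizeof d->buf);
    d->write = 0;
}

/* Push one input sample; return the sample from 'taps' samples ago
 * (taps must be less than DELAY_LEN). */
static float delay_process(delay_line *d, float in, int taps) {
    d->buf[d->write] = in;
    int read = d->write - taps;
    if (read < 0) read += DELAY_LEN;         /* wrap around the buffer */
    d->write = (d->write + 1) % DELAY_LEN;
    return d->buf[read];
}

/* Simple echo: mix the delayed signal back in at a reduced gain */
static float echo(delay_line *d, float in, int taps, float gain) {
    return in + gain * delay_process(d, in, taps);
}
```

With the core managing these reads and writes itself, every sample costs pointer arithmetic and memory traffic; this is exactly the bookkeeping a DMA engine can take over.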
DMA engines improve performance by increasing the efficiency of accessing contiguous blocks of memory. However, when multiple delay lines share a common DMA engine, DMA parameters must be reconfigured each time data from a different delay line is required. Thus, to achieve maximum performance, developers would normally dedicate a separate DMA channel per delay line.
High-quality audio systems, however, process multiple channels of audio. Additionally, more complex effects implementing multitap delays require more delay lines that by themselves could consume all available DMA channels. Compounding the memory burden is the fact that there may be several processing algorithms and effects in use at the same time, as well as data movement tasks, such as passing data from one effect or codec to another.
When several delay lines share a DMA channel, memory access can begin to appear “random” in nature. A multitap filter, for example, might require samples T, T-4, T-9, T-17 and T-22. This lack of consistency reduces the efficiency of the DMA engine.
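The scattered access pattern of such a multitap filter looks like the following sketch. The tap offsets match the example above; the gains, history layout and function name are illustrative assumptions:

```c
#include <assert.h>

#define NTAPS 5

/* Tap offsets from the example: samples T, T-4, T-9, T-17 and T-22 */
static const int   tap_offset[NTAPS] = { 0, 4, 9, 17, 22 };
/* Illustrative tap gains */
static const float tap_gain[NTAPS]   = { 1.0f, 0.6f, 0.4f, 0.25f, 0.15f };

/* 'hist' holds past samples indexed by time; 't' is the current sample
 * index (t >= 22 so every tap lands inside the history). Each iteration
 * is one non-contiguous memory read, which is what defeats a standard
 * block-oriented DMA engine. */
static float multitap(const float *hist, int t) {
    float acc = 0.0f;
    for (int i = 0; i < NTAPS; i++)
        acc += tap_gain[i] * hist[t - tap_offset[i]];
    return acc;
}
```

A table-guided DMA engine can be handed exactly this list of offsets once, then gather all five samples per output without per-access reconfiguration.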
Audio-enhanced DMA engines, such as the dMAX engine in TI's C672x family of DSPs, address the inefficiencies delay lines introduce to DMA engines by enabling developers to program multiple “random” accesses through the use of table-guided FIFO transfers. This results in an extremely efficient implementation, since developers can program the DMA engine to make several accesses with a single configuration of the DMA engine. Furthermore, this reduces the load on the DSP core by significantly reducing the number of interrupts issued by the DMA engine. For example, a standard DMA engine will interrupt the DSP core six times for a six-tap filter, while an audio-enhanced DMA engine will trigger a single interrupt. This frees the DSP to dedicate itself completely to processing audio data. An audio-enhanced DMA engine also may implement dual engines to further increase parallelism and performance by enabling concurrent data accesses.
For example, a Schroeder Reverb algorithm might typically require twelve or more different delay lines. When a Schroeder Reverb implemented on TI’s TMS320C6727 was recoded to utilize the on-chip dMAX engine, DSP utilization dropped from 20 percent to 5 percent, a 4X improvement in performance.
The instruction set and instruction pipeline also play a critical role in determining how efficient an audio implementation will be. A full range of single- (32-bit) and double-precision (64-bit) operations enable developers to optimize the number of cycles required for an algorithm. Architectures limited to only double-by-double operations force developers to consume more cycles and potentially more memory. For example, a double-by-double operation may require seven cycles compared to four cycles for a single-by-double operation.
MIMD-based (multiple instruction, multiple data) DSPs, such as VLIW DSP architectures, are essential for efficient multi-channel audio processing. Multiple computational units operating in parallel on multiple data are able to achieve high processing efficiencies at lower operating frequencies, reducing power consumption. The more processing units in the pipeline, and the more flexible their implementation, the greater the parallelism possible and the more easily processing can be distributed among the available units. When each unit is independent of the others, there are fewer dependencies between units and a wider range of configurations is possible, empowering optimizing compilers with more freedom to generate more efficient code.
At the core of the technology enabling mass-market professional audio applications is the availability of 32-bit floating-point DSP architectures designed specifically to address the processing, data, I/O and development bottlenecks faced by audio engineers. Increasing bit width and providing cost-effective floating-point functionality eliminate the dynamic range difficulties that would otherwise have to be addressed when converting to a fixed-point architecture. With architectures optimized around the use of the C language, developers can focus on designing production code rather than proof-of-concept code that must then be painstakingly hand-optimized and re-optimized whenever code is modified. Finally, backed by architecture innovations that increase code efficiency, such as a streamlined memory architecture and optimized instructions, floating-point enables developers to quickly and cost-effectively bring professional audio quality to a whole new range of consumer applications.
Louis N. Bélanger is the Executive Vice President - Product Development for Lyrtech Inc., as well as a co-founder of the company. He can be reached at (418) 877-4644.
Gerard Andrews is the World Wide Marketing Manager for TI's Pro Audio business. He can be reached at (281) 274-3879.