Audio Codecs—The Entertainment-DSP Connection

Standard compression algorithms and proprietary post-processing code challenge DSP chipmakers to deliver more MIPS at fewer watts.

Don Tuite

June 23, 2005

There are audio codec ICs that comprise an analog-to-digital converter (ADC) and a digital-to-analog converter (DAC) around some processing hardware, but they are the tip of the proverbial iceberg. These days, most engineers think of audio codecs in terms of compression and decompression algorithms that run on DSP platforms. In the range of audio frequencies, "voice" codecs suit digital telephony, while "audio" codecs (as the term is generally understood) fit entertainment audio.

In that sense, audio codecs evolved from Dolby Labs' noise-suppression technologies into Dolby's and others' more comprehensive compression schemes. They also emerged from the Moving Picture Experts Group (MPEG), a working group of the International Organization for Standardization (ISO) that develops standards for coded representation of digital audio and video.

While voice-codec technology is somewhat static (no pun intended), audio-codec technology continues to evolve (Fig. 1). For one thing, there's a trend toward ever more channels of surround sound. On top of that, add in the technologies for simulating multichannel audio in binaural systems and post-production techniques to replicate the "presence" of specific live concert venues. Then there's the challenge of doing all that processing not with DSP engines running off the fat power supplies inside AV receivers or DVD players, but off the batteries in a cell phone or PDA. Combine it all, and you have an interesting story about two different but converging application areas: personal audio and home theater.

PERSONAL AUDIO Randy Cole, chief technology officer for Texas Instruments' portable audio/infotainment business, notes that the most ubiquitous audio codec in the personal audio space is MP3 (Fig. 2). Once limited to PCs and portable media players, it's now closing in on the cell phone industry as makers continuously search for new functions that induce end users to upgrade every six months.

MP3 is a standard published by the International Organization for Standardization (ISO). It is the third in a series of MPEG-1 audio codecs developed by MPEG. MPEG-1 audio has three layers, each of which builds on the previous layer. So, layer 3 incorporates layers 1 and 2. Out of all this, MPEG-1, layer 3, popularly known as MP3, offered a low enough data rate that it could be used in portable applications.

Over the past decade, MP3 has been the dominant audio codec for downloading music. But Apple's iPod has thrown the spotlight on a newer MPEG development called Advanced Audio Coding, or AAC. MPEG began working on AAC in the mid-1990s as part of its next-generation MPEG-2 effort, incorporating the best new ideas developed at AT&T, Dolby, Fraunhofer, and Sony. Originally, it was to be backward-compatible with MP3, but that goal could not be achieved.

Because the industry was heavily invested in MP3 and no one would commit resources to make a large amount of audio content available for the new scheme, AAC languished. That is, it languished until Apple picked the MPEG-4 version of AAC for the iPod. (The next MPEG group after MPEG-2 was MPEG-4, skipping 3. MPEG-4 AAC, which is what Apple uses, is an enhanced version of MPEG-2 AAC with a lower data rate and improved quality.)

Proprietary coders exist alongside the MP3 and AAC standards. They have some penetration in PCs and personal media, but they're less important in the cellular space because the cellular makers prefer standard encoders and their fixed royalties. One familiar proprietary encoder is Windows Media Audio (WMA). It is primarily used in PCs, where it competes with MP3 and AAC. It is flexible in terms of data rate (low to high, with corresponding differences in quality). Now, there also is WMA-Pro, the multichannel version, and Microsoft recently announced a lossless WMA.

The other significant proprietary encoder is Dolby Digital, also known as AC-3. That's the coder used for DVDs and (in the U.S.) digital TV. Until recently, it ran at data rates too high for the Internet or cell phones. Coming to the rescue, though, is a new version that drives the data rate lower.

According to Mohsin Imtiaz, marketing manager for TI's performance audio business, the primary codecs in the home-theater arena are Dolby and DTS. Dolby has announced Dolby Digital Plus, which targets HD DVD and broadcast. But there's crossover with portable standards such as MP3, AAC, and WMA. For next-generation DVD, Microsoft is pushing WMA.

ANATOMY OF A CODEC Let's take a codec apart. For the whole story, take a look at a paper presented last October at the Audio Engineering Society's annual convention that described Dolby Digital Plus. (To read the paper, go to www.elecdesign.com and enter Drill Deeper 10564 for a link to the paper's Web site.)

The paper says the new codec is based on the earlier version of Dolby Digital, called AC-3. Dolby Digital Plus, or Enhanced AC-3 (E-AC-3), preserves the metadata carriage, filterbank, and framing structure. Data rates now range from 32 kbits/s to 6.144 Mbits/s. The data-rate control has resolution down to a third of a kilobit per second, with a sample rate of 32 kHz and a six-block transform frame. (Data-rate resolution is proportional to sample rate and inversely proportional to frame size.)
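The proportionality in that parenthetical can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name is hypothetical, and it assumes that the frame length can be adjusted in steps of one 16-bit word and that each transform block covers 256 samples. The absolute figures depend on that assumed step size; the scaling with sample rate and frame size is the point.

```python
def data_rate_step_bps(sample_rate_hz, blocks_per_frame,
                       granularity_bits=16, samples_per_block=256):
    """Smallest data-rate increment, in bits per second.

    Adding one granularity unit to every frame raises the data rate by
    granularity_bits / frame_duration. The 16-bit granularity and the
    256 samples per block are assumptions made for this sketch.
    """
    samples_per_frame = blocks_per_frame * samples_per_block
    return granularity_bits * sample_rate_hz / samples_per_frame


# Six-block frame at 32 kHz: roughly 333 bits/s under these assumptions.
print(round(data_rate_step_bps(32_000, 6), 1))
# Higher sample rate gives a coarser step.
print(data_rate_step_bps(48_000, 6))
# Shorter (one-block) frame gives a coarser step still.
print(data_rate_step_bps(48_000, 1))
```

Doubling the sample rate doubles the step, and halving the number of blocks per frame doubles it again, which is the relationship the paper describes.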

E-AC-3 preserves the AC-3 frame structure of six 256-coefficient transforms but permits shorter frames comprising one, two, and three 256-coefficient transform blocks. As a result, audio transport can operate at greater than 640 kbits/s with certain DVDs that limit the amount of data per frame.

E-AC-3 can support today's 5.1, 6.1, or 7.1 channels, all the way up to a movie theater's 13.1 channels. The main audio program bit stream plus up to eight additional substreams are multiplexed into a single E-AC-3 bit stream. Matrix subtraction-induced coding artifacts are eliminated through channel substitution. E-AC-3 carries seven more independent bit streams than AC-3.

Coding efficiency also was increased with a new filterbank, better quantization, enhanced channel coupling, spectral extension, and a technique called "transient pre-noise processing."

When audio with stationary characteristics is present, the filterbank adds a second-stage DCT after the existing AC-3 filterbank. This converts the six 256-coefficient transform blocks into a single 1536-coefficient hybrid transform block with increased frequency resolution. This increased frequency resolution combines with six-dimensional vector quantization (VQ) and gain adaptive quantization (GAQ) to improve the coding efficiency for "hard to code" signals, such as pitch pipe and harpsichord.
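To see why a second-stage transform helps with stationary audio, consider the rough sketch below. It assumes the second stage is a six-point type-II DCT applied across the six blocks for each of the 256 frequency bins; the function names, output ordering, and lack of scaling are illustrative choices, not details taken from the paper.

```python
import math


def dct_ii(x):
    """Unscaled type-II DCT of a short sequence."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
        for k in range(n)
    ]


def hybrid_transform(blocks):
    """Combine six 256-coefficient blocks into one 1536-coefficient block.

    For each frequency bin, the six values (one per block) are run
    through a six-point DCT. This is an assumed structure for the
    second stage, used here only for illustration.
    """
    n_bins = len(blocks[0])
    out = []
    for b in range(n_bins):
        across_time = [block[b] for block in blocks]
        out.extend(dct_ii(across_time))
    return out


# A perfectly stationary signal: all six blocks are identical.
block = [math.sin(0.1 * i) for i in range(256)]
hybrid = hybrid_transform([block] * 6)
print(len(hybrid))
# Only the first of every six outputs is non-zero for stationary input.
print(max(abs(v) for i, v in enumerate(hybrid) if i % 6 != 0) < 1e-9)
```

When the six blocks barely change, almost all of the energy lands in one of every six coefficients, leaving fewer significant values to quantize.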

VQ is used for frequency bands requiring lower accuracies. GAQ is more efficient when there's a need for higher-accuracy quantization. Also, coding efficiency gets a boost through channel coupling with phase preservation. Where AC-3 used a high-frequency mono composite channel for high frequencies on each channel, adding phase information and encoder-controlled spectral amplitude processing lets the mono composite channel handle lower frequencies, decreasing the effective bandwidth encoded and increasing coding efficiency.

Spectral extension replaces the upper-frequency transform coefficients with lower-frequency spectral segments translated up in frequency. The spectral characteristics of the translated segments are matched to the original through spectral modulation of the transform coefficients.
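A much-simplified sketch of this idea follows. It assumes the encoder sends one target RMS value per regenerated band and that the decoder matches it with a single gain per band; the real spectral shaping is more elaborate, and every name here is hypothetical.

```python
import math


def spectral_extension(coeffs, start, band_edges, target_rms):
    """Regenerate transform coefficients at and above bin `start`.

    coeffs:      decoded coefficients; bins >= start are ignored.
    band_edges:  (lo, hi) bin ranges covering the regenerated region.
    target_rms:  per-band RMS of the original upper band, assumed to
                 be sent by the encoder as side information.
    """
    out = list(coeffs[:start]) + [0.0] * (len(coeffs) - start)
    for (lo, hi), rms in zip(band_edges, target_rms):
        # Translate a low-frequency segment up into this band.
        segment = [out[(i - start) % start] for i in range(lo, hi)]
        seg_rms = math.sqrt(sum(v * v for v in segment) / len(segment))
        # Scale the copy so its level matches the original band.
        gain = rms / seg_rms if seg_rms > 0 else 0.0
        for i, v in zip(range(lo, hi), segment):
            out[i] = v * gain
    return out


low_band = [1.0, -1.0, 2.0, -2.0]
decoded = low_band + [0.0] * 4
print(spectral_extension(decoded, 4, [(4, 6), (6, 8)], [0.5, 1.0]))
```

Only the band levels need to be transmitted for the upper frequencies, which is where the data-rate saving comes from.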

To improve audio quality at low data rates, E-AC-3 uses transient pre-noise processing. This post-decoding process minimizes pre-noise error through time-scaling synthesis techniques that reduce pre-noise duration and, therefore, the audibility of transient artifacts. Metadata calculated by the encoder and transmitted in the E-AC-3 bit stream provides the parameters for post-decoding, time-scaling synthesis processing, which employs auditory scene analysis.

POST PROCESSING The proprietary algorithms for post processing in audio codecs are as important as compression standards like Dolby Digital Plus, which are the same for any licensee. In this arena, the algorithms operate on the information carried by the multichannel standards, turning the home theater into any kind of listening venue, from a vast cathedral to an outdoor rock concert to an intimate jazz club.

According to Analog Devices' SigmaDSP product manager Thomas Irrgang, post processing is all about the OEMs' quest to achieve a "signature sound." Possibly the first to do this successfully was THX. Another post-processing firm is SRS, whose algorithms include TruSurround (a presence enhancement for binaural systems), TruSurround XT, and TruBass.

In the TV space lies BBE, with BBE 3D and BBE MP, a post processor that recovers MP3 encoding losses. There's also BBE Viva, a specialized algorithm for TVs, where the stereo speakers are typically too close together for good binaural listening.

Bass enhancement is becoming important in portable systems and TVs, where there isn't room for large speaker drivers. Stepping up to the plate is MaxxBass from WAVES, perhaps the most popular bass-enhancement algorithm, because it enhances the subjective level of bass in the material being processed without adding any low-frequency energy.

Dolby Labs is prominent in post processing with its Virtual Speaker and Dolby Headphone algorithms. Dolby says its technology duplicates the sonic signature of multiple speakers, including reflections, while providing crosstalk cancellation to keep the surround cues for each ear from being cancelled by the other speaker's cues. Virtual Speaker and Dolby Headphone were originally developed by Lake DSP in Australia. Lake is now a Dolby subsidiary.

Of course, there is no free lunch. Algorithms such as Virtual Speaker only emulate the presence of missing speakers for a relatively small volume of the room called the "sweet spot." Outside the sweet spot, the surround information collapses. The sound isn't bad, but just seems like it's coming from a conventional binaural speaker setup, which is the case.

To recreate physical spaces, Panasonic's Web site says that its Hall Mode "reproduces the reverb effects for a spacious sound spreading around you," and it's "(p)articularly effective when enjoying orchestral performances in a concert hall." One reviewer of a Yamaha A/V receiver said Yamaha's Concert Hall "does add an extra dimension—a sense of height to the sound field, which the Concert Hall mode even lets you adjust to taste. With a favorite stereo recording of Mahler's Fourth Symphony, the Concert Hall mode was gorgeously lifelike."

ROOM CORRECTION The next step beyond virtualization in the post-processing realm is room correction. This feature began to appear in high-end multichannel systems two or three years ago, and it has migrated down to the middle of the price/performance spectrum. It can be critical to user satisfaction in an apartment's home-theater system, where it's impossible to set up the left and right speakers symmetrically or where one wall is acoustically different from its opposite. Multichannel systems benefit from it the most. It may have a small effect in two-channel setups, but binaural systems are generally less sensitive to misadjustments and misalignment.

Room correction involves putting the system into its TEST mode, placing a mike at the preferred listening position, running a sequence of test sounds that reveal information about the room acoustics and the limitations of the speakers themselves, and adjusting gains and equalizations according to proprietary algorithms. One interesting area of crossover for room-tuning technology is in automotive applications. Expensive cars have been acoustically tuned for best performance from OEM sound systems for years, but it's been a fairly labor-intensive and subjective manual process.

Car makers have begun to adopt automated listening-space tuning. Beyond the rich listening experience, one very important aspect of such tuning turns out to be acoustic echo cancellation. This means eliminating feedback from hands-free speakers into microphones mounted in sun visors or overheads.
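The measure-then-adjust procedure behind room correction, whether in a living room or a car, can be illustrated with its simplest piece: level matching. The sketch below is not any vendor's algorithm. It assumes the same test signal is played through each speaker in turn, that the level at the listening position is read in dB, and that louder channels are trimmed down to the quietest one. The channel names and readings are made up.

```python
def channel_trims_db(measured_db):
    """Per-channel gain trims, in dB, that equalize measured levels.

    measured_db maps a channel name to the level measured at the
    listening position for the same test signal. Each channel is
    trimmed down to the quietest one, so no channel is boosted.
    """
    target = min(measured_db.values())
    return {ch: target - level for ch, level in measured_db.items()}


def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


# Hypothetical readings from an asymmetrical room.
readings = {"left": 75.0, "center": 76.5, "right": 78.0}
trims = channel_trims_db(readings)
print(trims)
print({ch: round(db_to_linear(g), 3) for ch, g in trims.items()})
```

Commercial systems go well beyond this, adding per-band equalization and other proprietary processing, but the loop of measuring at the listening position and deriving a correction is the same.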

CELL PHONES AND PERSONAL MEDIA CONVERGE MP3 and AAC data rates are adequate for personal media players and PCs. But for cellular telephony, the data rate must be lower. And, obviously, streaming audio over a cellular link has different requirements than downloading music files for later playback.

The 3G cell-phone standards set by the Third Generation Partnership Project (3GPP), which adopted AAC as the audio codec standard, provided for both of these applications. The latest release (release 6) of the standard allows the use of either of two codecs. One is an enhanced version of AAC called AAC-Plus or High Efficiency AAC (HE-AAC).

Recently, a second set of enhancements known as Enhanced AAC Plus or HE-AAC, Version 2 was added. The other option is "AMR Wideband Plus," which is an enhanced speech encoder. AMR wideband is a very popular speech codec for GSM cell phones. The Plus version extends it to handle music.

Peter Frith, VP of new product development at Scotland's Wolfson Microelectronics, which makes the hardware version of audio codecs, notes that once cell phones could play back MP3, either as downloaded music or as high-quality ringtones, users wanted a relatively hi-fi playback system in the phone. As a result, cell-phone manufacturers now expect even low-power portable DACs to be capable of a 100-dB signal-to-noise ratio.
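For a sense of scale, the textbook formula for an ideal N-bit quantizer driven by a full-scale sine wave, SNR = 6.02N + 1.76 dB, puts 100 dB slightly beyond what 16 bits can deliver. The sketch below applies only that idealized relationship. Real DACs are also limited by analog noise, and the function names are illustrative.

```python
import math

DB_PER_BIT = 20 * math.log10(2)        # about 6.02 dB
SINE_OFFSET_DB = 10 * math.log10(1.5)  # about 1.76 dB


def ideal_snr_db(bits):
    """Ideal quantization SNR for a full-scale sine wave."""
    return DB_PER_BIT * bits + SINE_OFFSET_DB


def bits_for_snr(snr_db):
    """Ideal resolution, in bits, needed to reach a given SNR."""
    return (snr_db - SINE_OFFSET_DB) / DB_PER_BIT


print(round(ideal_snr_db(16), 2))    # ideal 16-bit converter
print(round(bits_for_snr(100), 2))   # bits implied by 100 dB
```

In other words, a 100-dB requirement asks a battery-powered part for a little more than ideal 16-bit performance.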

At the moment, media-capable cell phones just happen to be able to play back MP3. What's next may be personal media players that bundle phone capability along with Bluetooth and Wi-Fi. In between, the PDA phone has already emerged into this space, providing some capability for playback of video or sound files.

IS THAT A CHOIR IN YOUR POCKET? Frith also says that in the past, designers provided cell phones with very simple ringtones or with MIDI ringtone capability. With MIDI, the hardware solution has generally been a separate MIDI decoder chip.

Typically from Yamaha, that chip decodes the MIDI file, converts the result into PCM audio files, and plays them back through a DAC. That DAC may be integrated into the same chip or it may be discrete. An alternative way to do that would be to put a software MIDI decoder into the phone's processor.

The interest in playing back higher-quality ringtone sound has led to some phones for the Japanese market using MP3 files for their ringtones. Those are handled just like MP3 files for a music device. They're processed through the hi-fi audio system.

In most phones today, that hi-fi system is a separate device, comprising a stereo DAC, a stereo ADC, headphone and loudspeaker drivers, mike preamps, and so on. But companies like Wolfson now look to integrate it all so that the processor can deal with voice playback and ringtones, voice signals, connection to Bluetooth, and more (Fig. 3).

NEED MORE INFORMATION?

Analog Devices
www.analog.com

BBE
www.bbesound.com

Dolby Laboratories
www.dolby.com

Fraunhofer IIS
www.iis.fraunhofer.de

High-Definition Multimedia Interface (HDMI)
www.hdmi.org

Moving Picture Experts Group (MPEG)
www.chiariglione.org/mpeg

SRS
www.srslabs.com

Texas Instruments
www.ti.com

Third Generation Partnership Project (3GPP)
www.3gpp.org

THX
www.thx.com

WAVES/MaxxBass
www.maxxbass.com

Wolfson Microelectronics
www.wolfson.co.uk