A Guide to Choosing the Right Signal-Processing Technique

March 29, 2023
From audio beamforming to blind source separation, this article discusses the pros and cons of the different techniques for signal processing in your device design.

What you’ll learn:

  • What signal-processing techniques are available? 
  • How do the different signal-processing techniques work?
  • Tips on choosing the right signal-processing technique for your application.

Noise is all around us—at work and at home—making it difficult to pick out and clearly hear one voice amid the cacophony, especially as we reach middle age. Electronic devices have the same issue: Audio signals picked up by their microphones are often contaminated with interference, noise, and reverberation. Signal-processing techniques, such as beamforming and blind source separation, can come to the rescue. But what’s the best option, and for which applications?

Intelligible speech is crucial for a wide variety of electronic devices, ranging from phones, computers, hearing-assistance devices, and conferencing systems to transcription services, car infotainment, and home assistants. But a one-size-fits-all approach isn’t the way to get the best performance out of such widely different devices.

Factors such as the number of microphones and the size of the microphone array affect which signal-processing technique is most appropriate. The choice requires weighing not just the performance you need, but also the situations in which the application must work and the physical constraints of the product you have in mind.

Audio Beamforming

Audio beamforming is one of the most versatile multi-microphone methods for emphasizing a particular source in an acoustic scene. Beamformers can be divided into two types, depending on how they work: data-independent or adaptive.

One of the simplest data-independent beamformers is the delay-and-sum beamformer, in which the microphone signals are delayed to compensate for the different path lengths between a target source and the individual microphones. When the signals are then summed, the target source arriving from the chosen direction combines coherently, while signals arriving from other directions are expected to combine, to some extent, destructively.
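
To make the idea concrete, here is a minimal delay-and-sum sketch in Python. It assumes a linear array and a far-field (plane-wave) source; the function and parameter names are illustrative, not taken from any particular product.

```python
import numpy as np

def delay_and_sum(mics, positions, doa_deg, fs, c=343.0):
    """Minimal delay-and-sum beamformer for a far-field source.

    mics:      (n_mics, n_samples) microphone signals
    positions: (n_mics,) mic positions along a linear array, in meters
    doa_deg:   target direction of arrival, degrees from broadside
    fs:        sample rate, Hz
    c:         speed of sound, m/s
    """
    # Relative arrival delay of a plane wave from doa_deg at each mic
    delays = positions * np.sin(np.deg2rad(doa_deg)) / c
    delays -= delays.min()                 # make all delays non-negative
    out = np.zeros(mics.shape[1])
    for sig, d in zip(mics, delays):
        shift = int(round(d * fs))         # nearest-sample alignment
        # Advance this channel so the target wavefront lines up, then sum
        out[:len(sig) - shift] += sig[shift:]
    return out / len(mics)
```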

However, in many consumer audio applications, these beamformers offer little benefit because they need the wavelength of the signal to be small compared with the size of the microphone array. They work well in top-of-the-range conferencing systems with microphone arrays 1 m in diameter containing hundreds of microphones, enough to cover a wide range of wavelengths. But such systems are expensive to produce and therefore only suitable for the business conferencing market.

Consumer devices, on the other hand, usually contain just a few microphones in a small array, so delay-and-sum beamformers struggle: the wavelengths of speech are large compared with the array. A delay-and-sum beamformer the size of a normal hearing aid, for example, can't give any directional discrimination at low frequencies, and at high frequencies its directivity is limited to front/back discrimination.
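
Back-of-the-envelope numbers make the point; the 10-mm mic spacing below is a typical hearing-aid figure assumed for illustration, not taken from the article.

```python
c = 343.0   # speed of sound, m/s
f = 500.0   # a low speech frequency, Hz
d = 0.010   # mic spacing of a hearing-aid-sized array, m (assumed)

wavelength = c / f                   # ~0.69 m, far larger than the array
max_phase = 360.0 * d / wavelength   # ~5 degrees end to end
print(f"{wavelength:.2f} m, {max_phase:.1f} deg")
```

With barely 5 degrees of phase difference across the array, the summed output is almost identical for every arrival direction, so the beamformer has essentially no directivity at this frequency.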

Another problem is that sound doesn't travel in straight lines: a given source reaches the microphones along multiple paths, each with differing amounts of reflection and diffraction. Simple delay-and-sum beamformers therefore aren't very effective at extracting a source of interest from an acoustic scene. But they're very easy to implement and do give a small amount of benefit, so they were often used in older devices.

Adaptive Beamformers

Adaptive beamformers are a more advanced class of technique. One example is the minimum variance distortionless response (MVDR) beamformer, which passes the signal arriving from the target direction without distortion while minimizing the power at the beamformer's output. The effect is to preserve the target source while attenuating noise and interference.
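
The standard closed-form MVDR solution is w = R^{-1} d / (d^H R^{-1} d), where R is the noise covariance and d the steering vector toward the target. A minimal per-frequency-bin sketch follows; the variable names are illustrative.

```python
import numpy as np

def mvdr_weights(R_noise, steering):
    """MVDR weights for one frequency bin.

    R_noise:  (n_mics, n_mics) noise-plus-interference covariance
    steering: (n_mics,) steering vector toward the target
    The returned w satisfies w^H @ steering == 1 (distortionless)
    while minimizing the output power w^H @ R_noise @ w.
    """
    r_inv_d = np.linalg.solve(R_noise, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

# Per bin, the beamformer output is y = w.conj() @ x for a mic snapshot x.
```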

Such a technique can work well in ideal laboratory conditions, but in the real world, microphone mismatch and reverberation can lead to inaccuracy in modeling the effect of the source location relative to the array. The result is that these beamformers often perform poorly because they will start cancelling parts of the target source.

A voice activity detector can be added to address the target-cancellation problem: the beamformer's adaptation is turned off while the target source is active. This typically works well when there's just one target source. However, if there are multiple competing speakers, the technique has limited effectiveness.
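
In practice, this often amounts to freezing the noise-covariance update while the detector reports speech. A sketch under that assumption (the VAD itself is external, and the forgetting factor is an illustrative value):

```python
import numpy as np

def update_noise_cov(R, snapshot, target_active, alpha=0.95):
    """Recursive noise-covariance estimate, frozen while the target talks.

    R:             (n_mics, n_mics) running noise covariance
    snapshot:      (n_mics,) multichannel STFT snapshot for one bin
    target_active: bool from an external voice activity detector
    """
    if target_active:
        return R    # freeze adaptation to avoid cancelling the target
    return alpha * R + (1 - alpha) * np.outer(snapshot, snapshot.conj())
```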

In addition, MVDR beamforming—just like delay-and-sum beamforming and most other types of beamforming—requires calibrated microphones, as well as knowledge of the microphone array geometry and the target source direction. Some beamformers are very sensitive to the accuracy of this information and may reject the target source because it doesn’t come from the indicated direction.

Many modern devices use another beamforming technique, adaptive sidelobe cancellation, which tries to null out sources that don't come from the direction of interest. Such cancellers are state-of-the-art in modern hearing aids, allowing the user to concentrate on sources directly in front of them.

The significant drawback, though, is that you must be looking at whatever you’re listening to. That may be awkward if your visual attention is needed elsewhere—for example, when you’re looking at a computer screen and trying to discuss what you see with a colleague.

Blind Source Separation

An alternative approach to improving speech intelligibility in noisy environments is blind source separation (BSS). Time-frequency masking BSS estimates the time-frequency envelope of each source and then attenuates the time-frequency points that are dominated by interference and noise.
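
A minimal sketch of the masking step, assuming the per-source envelopes have already been estimated by the blind front end (which is the hard part and is omitted here):

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_mask(mixture, target_env, noise_env, fs, eps=1e-12):
    """Attenuate time-frequency points dominated by interference.

    mixture:    1-D mixture signal
    target_env: (n_freqs, n_frames) estimated target magnitude envelope
    noise_env:  (n_freqs, n_frames) estimated interference + noise envelope
    """
    _, _, X = stft(mixture, fs=fs, nperseg=512)
    # Soft ratio mask: near 1 where the target dominates, near 0 elsewhere
    mask = target_env / (target_env + noise_env + eps)
    _, y = istft(X * mask, fs=fs, nperseg=512)
    return y
```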

Another type of BSS uses linear multichannel filters. The acoustic scene is separated into its constituent parts using statistical models of how sources generally behave. BSS then calculates a multichannel filter whose output best fits these statistical models. In doing so, it intrinsically extracts all of the sources in the scene, not just one.
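
Once such a filter has been computed, applying it is just a per-frequency matrix multiply. A sketch, with the estimation of the demixing matrices (the statistical-model fitting) left out:

```python
import numpy as np

def demix(X, W):
    """Apply per-frequency demixing filters from a BSS algorithm.

    X: (n_freqs, n_frames, n_mics) multichannel STFT of the mixture
    W: (n_freqs, n_sources, n_mics) demixing matrices, e.g. fitted by an
       independent-vector-analysis-style optimizer (not shown here)
    Returns (n_freqs, n_frames, n_sources): every source, not just one.
    """
    # For each frequency bin f, multiply each frame by that bin's matrix
    return np.einsum('fsm,ftm->fts', W, X)
```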

The multichannel filter method can handle microphone mismatch and will deal well with reverberation and multiple competing speakers. It doesn’t need any prior knowledge of the sources, the microphone array, or the acoustic scene, since all of these variables are absorbed into the design of the multichannel filter. Changing a microphone, or a calibration error, simply changes the optimal multichannel filter.

Because BSS works from the audio data rather than the microphone geometry, it’s a very robust approach that’s insensitive to calibration issues and can generally achieve much higher separation of sources in real-world situations than any beamformer. And, because it separates all sources irrespective of direction, it can be used to automatically follow a multi-way conversation.

This is particularly helpful for hearing-assistance applications, where the user wishes to follow a conversation without having to interact with the device manually. BSS also can be very effective when used in VoIP calling, home smart devices, and in-car infotainment applications.

BSS Drawbacks

However, BSS is not without its problems. For most BSS algorithms, the number of sources that can be separated depends on the number of microphones in the array. In addition, because it works from the data, BSS needs a consistent frame of reference.

As a result, the technique is limited to devices with a stationary microphone array, such as a tabletop hearing device, a microphone array for a fixed conferencing system, or a phone or tablet used for video calling while held steady in your hands or resting on a table.

When there’s background chatter, BSS will generally separate the most dominant sources in the mix, which may include the annoyingly loud person at the next table. So, to work effectively, BSS needs to be combined with an ancillary algorithm to determine which of the sources are the sources of interest.

On its own, BSS separates sources very well, but it doesn’t reduce the background noise by more than about 9 dB. To obtain really good performance, it must be paired with a noise-reduction technique.

Many noise-reduction solutions use artificial intelligence (AI) to analyze the signal in the time-frequency domain and identify which components are due to speech and which are due to noise; Zoom and other conferencing systems take this approach, for example. It can work well with just a single microphone. The big problem with the technique, though, is that it extracts the signal by dynamically gating the time-frequency content, which can lead to unpleasant artifacts at poor signal-to-noise ratios (SNRs), and it may introduce considerable latency.
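
For intuition, here is a crude single-channel time-frequency gate of the kind described above (the threshold and percentile are illustrative). The hard binary mask is exactly what produces audible artifacts at poor SNRs.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(x, fs, threshold_db=6.0):
    """Crude time-frequency gating for single-channel noise reduction."""
    _, _, X = stft(x, fs=fs, nperseg=512)
    mag_db = 20 * np.log10(np.abs(X) + 1e-12)
    # Estimate a per-frequency noise floor from the quietest frames
    floor_db = np.percentile(mag_db, 10, axis=-1, keepdims=True)
    # Hard gate: keep only bins that rise clearly above the floor
    mask = (mag_db > floor_db + threshold_db).astype(float)
    _, y = istft(X * mask, fs=fs, nperseg=512)
    return y
```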

A low-latency noise-suppression algorithm combined with BSS, on the other hand, gives up to 26 dB of noise suppression and makes products suitable for real-time use—with a latency of just 5 ms and a more natural sound with fewer distortions than AI solutions. Hearing devices, in particular, need ultra-low latency to keep lip sync; it’s extremely off-putting for users if the sound they hear lags behind the mouth movements of the person they are talking to.

The number of electronic devices that need to receive clear audio to work effectively is rising every year, and the advent of Bluetooth Low Energy will accelerate this process. Now, more than ever, choosing the right signal-processing technique for each different application is vital.
