Auto Electronics

Voice Recognition Facilitates Multitasking and Focus on the Road

Voice-recognition systems are already an integral part of hands-free dialing in Bluetooth-enabled products and GM's OnStar system, but the number of commands and the ability to handle vocabulary and dialect variations are limited. More powerful algorithms, sophisticated hardware and ongoing research could make continuous speech recognition a reality in the future.


For any voice/speech-recognition system, it is not just a matter of hearing, but of understanding and then taking the proper action. The entry point for vehicles has been hands-free voice input for cell phones, especially with Bluetooth or OnStar systems. Cell phone connectivity is also the fastest-growing area for voice-activated command and control, since hands-free operation has become a legal requirement in many countries and states. As shown in Figure 1, the basic components of this system are rather straightforward.

Bluetooth technology is a small, initial step on the voice-recognition (VR) path because the commands are simple and limited. However, a trend that has emerged is products for customers who just want a hands-free add-on module for the vehicle. When navigation systems, with various addresses and more complex commands and dialects, come into the picture, the degree of difficulty increases. Sophisticated VR systems can also provide input to control passenger-compartment temperature and radio volume, and input for driver authentication and theft detection.

HARDWARE AND SOFTWARE

There are two approaches to voice recognition: speaker independent and speaker dependent. Early VR systems used isolated-word, speaker-dependent speech recognition. This meant training the system with the voice input of a specific user and using only single words. The user could not combine “call” and a name into a single phrase.

The trend today requires suppliers of speech-recognition hardware to provide a path from entry-level VR to complex systems that include VR as part of the processing. Customers want the software infrastructure to be in place so they can pursue their product roadmap without changing from one processor type to another. Other functions that a sophisticated processor could be called on to handle include:

  • Bluetooth connectivity to a local phone;
  • speech recognition;
  • text to speech;
  • noise and echo cancellation through single or array microphones;
  • GPS location management;
  • dead reckoning;
  • MP3 playback from a CD drive/Flash Card/HDD;
  • block decoding of CD drive data; and
  • voice-activated control.


THE VR SYSTEM

In today's vehicle, the radio head unit space is valuable. Systems such as telematics, hands-free module box, and navigation all target this same location. According to Bob Pinteric, operations manager of Freescale Semiconductor's Infotainment, Multimedia, and Telematics Division, the goal is to integrate everything into one box. With this in mind, semiconductor companies like Freescale have developed advanced processor platforms as shown in Figure 2.

Freescale's MPC5200B operates at 466 MHz with 885 MIPS. The processor has a programmable serial control module that can be configured to perform a number of system functions including UART, SPI and others. The unit provides a high-speed UART connection to a Bluetooth chip that communicates with the cell phone. Another I/O provides the interface to AC97 for the microphone. A speaker could also interface to an AC97 driver. However, a more advanced system would have a CAN bus communicating with the radio that would drive the speakers.

To provide a single-package development system, companies have partnered to supply the hardware, the various layers of software and the development tools (for both hardware and software). For example, Freescale's Media5200 development platform has an audio subsystem based on the Realtek ALC658 AC97 codec. Trial software in the system allows users to start development with the software they have chosen or to evaluate other software. The development tool provides full connectivity to the car radio, connecting two microphone inputs, Bluetooth technology, stereo, radio, phone, media-oriented systems transport (MOST), video, and aux and line inputs. In addition, the tool connects advanced graphics and has a DSP for advanced audio. Hardware development can be performed using Freescale's Lite5200B evaluation board (EVB), which provides a starting point that users can add to or connect in a variety of configurations.

Partnerships are essential to provide complete development support. As shown in Table 1, Freescale's mobileGT partnership provides a platform: a software ecosystem built on Freescale silicon. The Bluetooth software stack comes from one of three partners. Note that these partnerships are not normally exclusive, so other silicon suppliers frequently have partnerships with some of the same software suppliers. The VR software drives most of the MIPS requirement for a Bluetooth phone; on the MPC5200B, this is about 100 MIPS.

Another reference design with sample code is the result of a combined effort between Analog Devices Inc. (ADI) and ScanSoft. The design uses ADI's Blackfin chipset and ScanSoft's voice-activated dialing commands to demonstrate how the technology provides voice-activated dialing. Numbers and names are stored in an address book, and the user can say short phrases such as “call Mark at home” or “call Jeff at the office”; the system looks up and dials the number without the user having to remember it. This is different from systems that step through a series of voice responses.
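The “recognize, then look up” step can be sketched in a few lines. The names, places and numbers below are invented for illustration, and plain-text matching stands in for the acoustic front end a production system would use:

```python
# Hypothetical address book; every entry is invented for illustration.
ADDRESS_BOOK = {
    ("mark", "home"): "555-0101",
    ("jeff", "office"): "555-0102",
}

def lookup_number(utterance):
    """Map a recognized phrase such as 'call Mark at home' to a number."""
    words = utterance.lower().split()
    for (name, place), number in ADDRESS_BOOK.items():
        if name in words and place in words:
            return number
    return None                 # no confident match: prompt the user again
```

With this, `lookup_number("call Jeff at the office")` returns the stored number, so the driver never has to remember the digits.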

Based on advances in semiconductor technology, products providing a much higher level of performance are economically available, so the most advanced speech recognition can be developed for the vehicle. According to Mark Gill, director of the telematics/infotainment systems product line at Analog Devices, VR uses about one-fourth to one-fifth of the MIPS available on the Blackfin processor to run these types of speech-recognition algorithms. Since Blackfin processors generally run at 400 MHz to 500 MHz, this leaves performance for other functions users might want, including voice entry and connection to Bluetooth, as well as echo cancellation and noise reduction during the phone call. Voice commands could also be used in the navigation system for destination entry, and the processor could support the other aspects of navigation as well. Figure 3 shows the MIPS allocation for many of these functions. Speech recognition always uses external memory for the required databases; for voice-activated dialing, the memory is in the area of 1 Mb.
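As a rough sanity check on those numbers, assuming about 1 MIPS per MHz and the quoted one-fourth to one-fifth VR share (both are simplifying assumptions, not measured Blackfin figures):

```python
def mips_headroom(clock_mhz, vr_share):
    """Remaining MIPS after VR, assuming roughly 1 MIPS per MHz.

    Both the 1-MIPS-per-MHz rule of thumb and the VR share are
    simplifying assumptions for this back-of-envelope estimate.
    """
    total_mips = clock_mhz * 1.0
    return total_mips - total_mips * vr_share
```

A 500 MHz part with VR taking one-fifth of the cycles still leaves about 400 MIPS for navigation, echo cancellation and audio.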

Table 1. Partners in Freescale's mobileGT development platform.

  Required Components                          mobileGT Partners
  Voice Recognition & Text to Speech           IBM ViaVoice, Fonix, ScanSoft
  Echo Cancellation & Noise Reduction          Clarity, Wavemaker
  Bluetooth Stack (Communication to Phone)     ESI, Stonestreet, OpenInterface


ALGORITHMS

In addition to A/D and D/A converters and a lot of processor MIPS, VR requires a combination of software algorithms. “The advances in the silicon side have to match with the advances in the software side,” said Jeff Foley, marketing manager, Embedded Speech at ScanSoft. ScanSoft and ADI have partnered to bring these advancements together.

Advanced techniques for developing VR algorithms include hidden Markov models (HMMs) and artificial neural networks (ANNs). Both use a training phase to improve recognition. Dynamic time warping (DTW) is an approach that places fewer requirements on the VR hardware: it finds an optimal match between two sequences by warping them non-linearly in time. This sequence alignment is often used in the context of HMMs. HMM, ANN and DTW are the basic techniques software companies use today to generate VR algorithms.
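A minimal DTW sketch, assuming one scalar feature per frame (a real recognizer would align multi-dimensional acoustic feature vectors such as cepstral coefficients):

```python
def dtw_distance(a, b):
    """Cost of the optimal non-linear time alignment of sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j]: best cost aligning the first i frames of a with the first j of b
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # a frame stretched
                                 cost[i][j - 1],      # b frame stretched
                                 cost[i - 1][j - 1])  # frames matched
    return cost[n][m]
```

A stored template and a faster or slower repetition of the same word yield a small distance; for example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0 because the stretched middle frame is absorbed by the warp.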

Much of the technology from the speech side ensures that the grammar is in place to support voice-activated dialing. ScanSoft's embedded development system shows the technology required to provide this level of VR capability. Its VoCon 3200 is designed specifically for embedded solutions, and automotive is simply one of the end applications. Table 2 provides an example of the tools ScanSoft provides to ease and expedite development.

Software suppliers such as ScanSoft constantly shrink the software footprint and increase the power and accuracy of voice recognition with better algorithmic techniques. The improvements come from acoustic and language modeling. The acoustic side involves recognizing speech by breaking it down into individual phonemes, the pieces of words. The language side requires determining how words are put together and when they should be used: the grammar. Finally, the user interface must express this so a specific task can be completed as quickly and easily as possible. Figure 4 shows how these aspects come together. One of the improvements of the last few years is the ability to use the VR system without training it for new users. The ADI/ScanSoft reference design provides this speaker-independent VR.
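The language-side idea, deciding which words may legally follow which, can be illustrated with a toy phrase grammar; the productions and vocabulary here are invented for illustration and are far simpler than a production grammar:

```python
# Toy command grammar: non-terminals in angle brackets expand into
# sequences of words or further non-terminals.
GRAMMAR = {
    "<command>": [["call", "<name>", "at", "<place>"],
                  ["dial", "<name>"]],
    "<name>": [["mark"], ["jeff"]],
    "<place>": [["home"], ["office"]],
}

def matches(words, start="<command>"):
    """True if the word list is a valid sentence of the grammar."""
    def expand(symbols, rest):
        if not symbols:
            return not rest                      # both exhausted: accepted
        head, tail = symbols[0], symbols[1:]
        if head in GRAMMAR:                      # non-terminal: try productions
            return any(expand(prod + tail, rest) for prod in GRAMMAR[head])
        return bool(rest) and rest[0] == head and expand(tail, rest[1:])
    return expand([start], [w.lower() for w in words])
```

Here `matches("call Mark at home".split())` is True while "call home" is rejected, so the recognizer only has to choose among utterances the grammar allows, which is what keeps accuracy high.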

Another important system aspect is text-to-speech software, which provides feedback to the user through speech synthesis and is part of the ADI/ScanSoft reference design. Pre-recording everything that could be said is not practical given all the possible names and destinations. To address this, a natural-speech text-to-speech engine has been scaled to fit within the reference design, allowing a more natural conversation between the user and the system rather than confirmation from a display, which would require taking the driver's eyes off the road.

Table 2. Software development tools from ScanSoft in the VoCon 3200 embedded development system.

  Grammar and Pronunciation Editing and Analysis Suite:
    · Grammar Editor
    · Grammar Creator Tool
    · Grammar Compiler Tool
    · Context Compiler Tool
    · Spelling Tree Compiler Tool
    · Dictionary Compiler Tool
    · Model Compiler Tool
    · Vocabulary Verifier Tool
    · Context Verifier Tool
    · Confusability Tool
    · User Dictionary Editor

  Recognition Analysis Suite:
    · Recognition Test Tool
    · Audio Data Collector
    · Log Importer Tool
    · Log Extractor Tool
    · Batch Recognition Tool
    · Batch Userword Training Tool
    · Batch Speaker Adaptation
    · Enrollment Tool
    · Scoring Tool
    · Sound Tool
    · Speech Verifier Tool


ECHO AND NOISE CANCELLATION

Echo and noise cancellation are an important part of any speech implementation in an automobile, ensuring a clear input into the microphone. The process of separating speech from noise exploits the differing characteristics of the two and does not depend on the type of noise. Specific improvements can then be made to tune out the types of noise unique to a particular vehicle's environment.

As shown in Figure 5, ScanSoft's VoCon AEC performs acoustic echo cancellation and noise reduction in the frequency domain. For acoustic echo cancellation, the echo return loss is 40 dB minimum, and for noise reduction, the typical noise suppression is 10 dB while preserving the best speech quality. The algorithm has been developed for Texas Instruments' TMS320C54xx, Infineon's TriCore 19xx and ADI's Blackfin 535, and is portable to other DSPs.
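Spectral subtraction is one common frequency-domain noise-reduction technique; the sketch below is illustrative only and is not VoCon AEC's proprietary algorithm. It subtracts a per-bin noise magnitude estimate while keeping each bin's phase:

```python
import cmath

def dft(frame):
    """Naive DFT for illustration; a DSP implementation would use an FFT."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def spectral_subtract(frame, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude per bin, keeping the phase."""
    cleaned = []
    for bin_value, nmag in zip(dft(frame), noise_mag):
        # Clamp to a small fraction of the original magnitude so the
        # subtraction never produces a negative (unphysical) magnitude.
        mag = max(abs(bin_value) - nmag, floor * abs(bin_value))
        cleaned.append(cmath.rect(mag, cmath.phase(bin_value)))
    return idft(cleaned)
```

With a zero noise estimate the frame passes through unchanged; with a non-zero estimate, bins dominated by steady noise are attenuated while the speech-carrying bins survive.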

SPEAKING OF THE FUTURE

In the future, users can expect a VR interface that better understands them, handling more variation in what is said by anticipating the kind of question a user would pose in a specific circumstance. Other VR applications will appear in entertainment, including the ability to select a specific song, artist, album or type of music from an MP3 playlist. ScanSoft has already demonstrated this capability.

For one of its next-generation systems, Magneti Marelli Electronic Systems will add voice control of the automobile navigation system using ScanSoft's VoCon 3200 embedded recognition system. Drivers will be able to enter a destination by voice from a large directory list, such as the 70,000 city names within Germany.

Further into the future, dictating a memo or a text message and then connecting to full dictation technology will be possible. The technology exists today, so it is really a matter of how long it will take to be designed into the vehicle. It could happen this decade.




ABOUT THE AUTHOR

Randy Frank is president of Randy Frank & Associates Ltd., a technical marketing consulting firm based in Scottsdale, Ariz. He is an SAE and IEEE Fellow and has been involved in automotive electronics for more than 25 years. He can be reached at [email protected].
