Speech Recognition Comes Of Age

Speech recognition is a re-emerging technology. It survived the word-spotting '70s, the template-matching '80s, the trainable but unreliable cell-phone voice tags, and the Hidden Markov Model-based dictation programs of the '90s. Yet it found a market only in Interactive-Voice-Response-enabled call centers. Today, we talk to Interactive-Voice-Response systems when we make airline reservations, track packages, or call our local mortgage broker. In simple transactions, these systems perform well. They also save money for service providers.

These systems don't endear themselves to users, however. In addition, they suffer high opt-out rates for any complicated transactions. With a few exceptions in vertical markets and among the typing-impaired, dictation systems are shelfware. They've failed to enhance the capabilities that exist in the mouse-screen-keyboard interfaces found in computers.

In handheld devices, the value proposition is different. Cell phones are shrinking and PDAs have fewer and fewer keys. Fingers aren't scaling down at the same pace. Plus, device interfaces are difficult to decode and even harder to use.

Embedded speech systems have the potential to offer intuitive, usable interfaces in cell phones and PDAs. Computing power and memory will keep expanding in these devices. In the meantime, advances in algorithms, smart engineering, and human-machine interfaces can save time and torture for the small-device aficionado. As speech systems have come of age, so has their appeal to handheld-device users, manufacturers, and service providers. They're starting to put these systems to work.

Earlier this year, for instance, Samsung launched the i700 Pocket PC and wireless-phone combination device. With it, the company released the first speaker-independent name-recognition system. This system enables a voice-activated phone book and dialer. The embedded speech algorithm can find any name in the user's contact list. It simply matches the speaker's acoustics with a model that was generated from the text entered into the contact list. The user can use the lookup capability to make a phone call or simply to pull up all of the information for the named contact. Since that original release, this interface has been offered on more than 10 cell-phone models and in several languages.

What "secret sauce" makes these speaker-independent applications different from earlier speech recognizers? Part of the story is in maturing algorithms. Hidden Markov Model-based speech recognition has supplanted the quirky template-matching that was provided in "dial by name, but after training" phones. At the heart of the system is a phonetic recognizer that "knows" how sounds are represented acoustically. It can therefore search a list for the closest-sounding entry.
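The "closest-sounding entry" search can be illustrated with a minimal sketch. It is not the recognizer described above: a real Hidden Markov Model system scores the incoming audio against acoustic models of each phoneme, whereas this example assumes the audio has already been decoded into a phoneme sequence and substitutes a plain edit distance for the acoustic score. The contact names, the phoneme spellings in LEXICON, and the function names are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical contact list with hand-written phoneme spellings.
# In a real device these would be generated from the typed contact names.
LEXICON: Dict[str, List[str]] = {
    "John Smith": ["JH", "AA", "N", "S", "M", "IH", "TH"],
    "Joan Smythe": ["JH", "OW", "N", "S", "M", "AY", "DH"],
    "Maria Lopez": ["M", "AH", "R", "IY", "AH", "L", "OW", "P", "EH", "Z"],
}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences.

    Stand-in for an acoustic score: each inserted, deleted, or
    substituted phoneme costs 1.
    """
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def closest_contact(heard: Sequence[str], lexicon: Dict[str, List[str]]) -> str:
    """Return the contact whose phoneme spelling is nearest to what was heard."""
    return min(lexicon, key=lambda name: edit_distance(heard, lexicon[name]))


if __name__ == "__main__":
    # A slightly misrecognized "John Smith" (final TH heard as T).
    heard = ["JH", "AA", "N", "S", "M", "IH", "T"]
    print(closest_contact(heard, LEXICON))
```

Because every contact is compared in the same phoneme space, no per-user training is needed, which is the property that separates this style of lookup from the older trained voice tags.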

A second factor is a new look at engineering speech systems. Several companies now design speech systems to run on small microprocessors. They no longer attempt to port larger "fat" recognizers to these limited resources. By engineering these systems with the knowledge that they will be embedded, designers can attain superior performance with very limited computation. This approach has led to substantially enhanced capabilities.

Finally, successful systems take advantage of the multi-modal interfaces that exist on mobile devices. In modern user interfaces, speech input and output are combined with buttons, buzzers, a touchscreen, and text displays. In embedded systems, this multi-functionality is easy to implement. Its full potential is only now becoming evident.

What will voice recognition enable in tomorrow's handheld devices? Name and number recognizers, together with their command-and-control glue, began shipping in consumer cell phones, PDAs, and other platforms in mid-2003. Speech-to-text algorithms are around the corner. With them comes the promise of XML-based enterprise interactions from wireless devices, which will support everywhere/anytime computing. Support is also due to arrive for intuitive, easy-to-use messaging from handheld devices of all form factors.

The appeal of speech-to-text algorithms on handheld devices is obvious. Enterprise users can gain access to all kinds of corporate data simply by launching a Web browser from their handhelds. Or users can reach their contacts through SMS or Instant Messaging. With the correct middleware, enterprise users will be able to view corporate intelligence (contracts, client sales records, and other networked documents) from their cell phones or PDAs. Business efficiency should then increase dramatically.

Once speech-to-text is a reality, Web interfaces will become as functional on a small device as they are on a desktop. They'll be limited only by the ability of the device to display Web pages. The world can then become a rich information store. Finally, if history is a guide, a natural, fluid interface to messaging will engender the widespread use of SMS and Instant Messaging on wireless devices. The next chapter of embedded speech recognition will then be revealed.