Speech Recognition Giving Customers Their Say

In the commercial history of speech technology, telephony applications have offered the opportunity to deliver significant return on investment (ROI) to companies like British Telecom and German Telecom, and so they have driven the technology's development.

Alongside this, speech recognition for the PC market also developed. Today, the uptake of server-based solutions such as automated query services by both telcos and enterprises is still driven by their ability to deliver a fast ROI.

In the development of speech technology for server-based applications, much of the focus has consistently been on the user interface. Solutions available today are sophisticated enough to recognise natural language responses such as: "I would like to go from London to Paris." They then tacitly test the accuracy of the recognition through a dialogue structure that uses implicit confirmation strategies, asking, for example, "When would you like to go from London to Paris?"
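The implicit confirmation strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `next_prompt`, the slot names, and all prompt wording other than the London to Paris question are hypothetical and do not come from any particular product.

```python
def next_prompt(slots):
    """Return the next system prompt, echoing slots already recognised.

    `slots` is a hypothetical dict with optional keys "origin",
    "destination" and "date"; a missing or empty value means the
    recogniser has not yet filled that slot.
    """
    origin = slots.get("origin")
    destination = slots.get("destination")
    date = slots.get("date")

    if not origin and not destination:
        # Nothing recognised yet, so there is nothing to confirm.
        return "Where would you like to travel?"
    if not destination:
        return f"Where would you like to go from {origin}?"
    if not origin:
        return f"Where will you be travelling to {destination} from?"
    if not date:
        # Implicit confirmation: the recognised cities are repeated inside
        # the next question, so the caller can correct them without a
        # separate "Did you say ...?" turn.
        return f"When would you like to go from {origin} to {destination}?"
    # All slots filled: finish with one explicit confirmation.
    return f"Travelling from {origin} to {destination} on {date}. Is that correct?"


print(next_prompt({"origin": "London", "destination": "Paris"}))
```

If the recogniser misheard a city, the caller hears the error in the follow-up question and can correct it there, which keeps the dialogue shorter than confirming every answer explicitly.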

Embedded speech technology, on the other hand, has had a slightly different development path. Embedding speech in a mobile phone is not primarily driven by ROI; it's more about increasing the usability of the phone. Of course, better usability may ultimately help the phone manufacturer sell more phones and get a return on its investment, in the same way that server-based query solutions help people get more relevant information more quickly. It's not just about ROI.

The more limited resources of embedded systems are another reason for the divergent development of embedded speech technology. In addition, the interfaces required and the tasks performed by embedded speech applications have traditionally been simpler than those in non-embedded speech applications. For example, finding a number in a mobile phone by saying the person's name is a much simpler task and requires a much simpler interface than, say, speaking to a server for location-based services to find out if there is a good Italian restaurant in your immediate area.

Two key technical developments that are enabling the development of distributed solutions are the increasing power of embedded hardware and the greater availability of wireless data connections, such as GPRS and 3G.

In 2006, we will see an increased trend toward off-board navigation systems in both cars and mobile devices. One group of companies involved in the development of these systems consists of Tele Atlas, a specialist in geographic databases; Wayfinder, a provider of mobile navigation solutions on smartphones; and ScanSoft. At the Cannes 3GSM show this year, the group demonstrated a speech-enabled prototype using distributed speech recognition.

Users can interact with the system through a voice-driven user interface utilising automatic speech recognition technology, text-to-speech, and the server-based X|mode® Multimodal System. The latter enables users to interact with their devices using a combination of speech and keypad entry, integrated with map data.

The distributed system combines the processing power of the server and the sophistication of the user interface with the noise robustness of the embedded speech engine's front end. The Multimodal System also reduces data traffic compared with using voice alone as the interface.

The off-board navigation system with distributed speech recognition allows users to enter a destination while connected via a mobile operator. They can also use natural language to request information about traffic or road works, or find any point of interest (POI). The best route corridor (perhaps the actual route ±10 km on either side) can then be sent to the device. While en route, the speech-enabled thin client can still allow users to access new information with respect to the route corridor, such as the nearest petrol station.

While speech recognition has to run on the server in an off-board navigation system, text-to-speech is active during the whole navigation period. It therefore runs more efficiently on the embedded device, which also avoids an expensive server connection.

One advantage of such distributed systems is that a thin client can be downloaded onto the user's device without the need for speech-specific hardware. The low startup cost also makes the application attractive to more users. Another advantage is that much richer, real-time information is available to users, such as up-to-date traffic information, while mobile operators benefit from increased use of their bandwidth.

Where speech-enabled navigation systems started out entirely on-board, SMS reading and SMS dictation started out server-based. At the other end of the scale, entirely on-board, non-continuous, vocabulary-limited SMS dictation was previewed at 3GSM in Cannes earlier this year. ScanSoft also previewed a distributed solution in development, which provides a multimodal user interface that supports both continuous dictation and error correction on the mobile device. This gives users an on-board experience, while the majority of the processing is done on the server.

In the short and medium term, distributed systems will become the way in which embedded, wireless, and server technologies are best utilised to deliver effective services to end users. While some mobile applications will continue to be delivered almost exclusively using embedded technology, others will make use of an increasing number of off-board services. Ultimately, distributed systems enable us to get the best of both worlds using the technology available, and they enable end users to receive the best services and experience possible.

See associated figure