It's Time To Get In Touch With Your Senses

The wireless Web has certainly received a lot of attention, and for good reason. There are approximately 700 million wireless subscribers today. In three years, this number is expected to double. Market-research firms estimate that mobile commerce will generate revenue ranging from tens to hundreds of billions of dollars by 2005.

Yet despite such bullish projections, the wireless Web is struggling. A central reason for this severe mismatch between expectations and reality is the cumbersome user interface of mobile handhelds. Addressing this problem is critical for the success of the wireless Web.

The wireless Web enables mobility and frees the user from being tied down to the desktop. However, the same mobility that makes it convenient to access Web applications on the move also makes data input on small keypads and screens a major challenge. Having to focus on a handheld display's tiny text is stressful and cumbersome. Indeed, working with a keypad and a screen may not even be an option while driving a car. The wireless Web thus creates a significant human-computer interface challenge.

One way to address this issue is to broaden the visual or graphical user interface (GUI) by adding a speech user interface to it. This notion of employing multiple senses or modes of human-computer interaction for input and output is called "multimodality." It is one of the hottest emerging technologies and merits a closer look.

A speech interface has very different characteristics from those of a GUI. This disparity suggests that the design of a multimodal interface will be challenging. But it also implies that these dissimilar interfaces have very different strengths. The idea of a combined multimodal interface thus becomes powerful and compelling.

Take, for example, the task of obtaining driving directions on a mobile handheld. It's usually a lot easier to input the destination address using speech instead of a keypad. But it's much more effective to list the turns through a visual display. That approach is easier than having to remember spoken driving instructions with a bunch of unfamiliar street names. Clearly, using a multimodal interface that allows the flexible use of speech, touch, and vision tends to be a lot less stressful. It also increases task-completion rates while decreasing task-completion time.

Of course, the above example assumes a reasonably accurate speech-recognition system and a high-quality speech synthesizer. Such products have come closer to becoming a reality in recent years, thanks to fundamental improvements in speech-recognition algorithms and speech-system design. Moore's law has also helped bring down the expense of commercially available speech systems.

It will, however, be several years before high-quality speech recognition and synthesis can be embedded in handhelds. Until then, speech interfaces will need network-based speech engines to enable multimodal interaction on handhelds. This fact creates significant architectural and business implications for the mobile operator, application provider, and end user.

It's challenging to build and deploy scalable, multimodal platforms and solutions for the multitude of handhelds, gateways, and networks that exist today or will exist tomorrow. Companies such as OnMobile, Lobby7, and Kirusa have taken an early lead in building multimodal platforms. The core problem that they face is the coupling and integration of multiple modes or interfaces using the handheld display and remote speech resources. Yet many practical real-world issues of wireless access networks need to be addressed as well. Plus, the widely varying capabilities of handhelds impose particular and sometimes peculiar forms of multimodality.

A multimodal wireless Web needs a multimodal markup language. Speech Application Language Tags (SALT) is a speech-interface markup language that combines nicely with HTML. It has been submitted for consideration to the W3C—the keeper of Web standards—as a proposal for the basis of a multimodal markup language.

The W3C itself has chartered a Multimodal Interaction Working Group. That group is in the process of defining requirements for the development of a multimodal markup language. It is being ably assisted by contributions like SALT from its member companies. Other industry players have proposed different approaches for defining multimodal languages, such as reusing VoiceXML.

It appears that the multimodal market will be poised for takeoff next year. Rapid growth is expected in the multimodal-platform and application-development sectors. Early adopters are already beginning to test the integration of multimodality into their mobile services. One of the very first trials is by French mobile operator Bouygues Telecom. While it's difficult to estimate when multimodality will become commonplace, it's clear that multimodal user interfaces will revolutionize the use of mobile handhelds—just as graphical user interfaces revolutionized the use of the PC.