The proliferation of handheld devices means that users increasingly rely on them to perform routine, daily tasks: the cell phone, with its video card and Internet access, is becoming a constant companion. With this ubiquity comes pressure to make these devices smaller and more convenient to carry, and that means smaller displays and keypads.
“The age of bulky 3G handsets is over,” says Hiroshi Nakaizumi, Sony Ericsson’s Head of Design, in the company’s latest press release touting the K600 UMTS handset. Although it weighs no more than an average 2G handset, the K600 still delivers video telephony, a 1.3-megapixel camera, and high-performance download capabilities.
Phones keep getting smaller while their capabilities keep getting more complex. To support increasingly complex interaction with devices that keep shrinking, you’re going to need to learn how to equip mobile applications with input modes besides a keypad or a stylus. You’re going to need to learn about multimodal development.
Multimodal applications allow users to interact with a device through more than one input mode simultaneously. They accept input through speech, a keyboard, a keypad, a mouse, and/or a stylus (even motion sensing) and produce output through synthesized speech, audio, plain text, motion video, and/or graphics.
Kirusa’s Voice SMS (KV.SMS) is a typical example of a multimodal application, integrating voice messaging with text-based SMS and multimedia-based MMS. Using this program, users can dictate and send SMS messages using only their voices, send voice messages to phones without actually ringing the phone, click on an SMS message to hear a voice message, or respond to a voice or SMS message by voice or text. Ultimate convenience.
Another typical example of a multimodal application is the Ford Model U SUV. This car’s multimodal interface uses speech technology to allow drivers to control navigation, make phone calls, operate entertainment features such as the radio or an MP3 player, adjust the climate control and the retractable roof, and set personal preferences.
The Sony Ericsson K600 UMTS handset is representative of future handheld devices.
How Do They Do That?
The widespread adoption of XML and derivative markup languages has, for all intents and purposes, enabled the advent of multimodal development. The existence of an independent translator for stored data frees developers from having to develop for specific devices. XML and, most significantly, VoiceXML make it remarkably easy for developers to create flexible interfaces with which to access varying clients.
The three building-block languages for multimodal development are: SALT (Speech Application Language Tags), X+V (XHTML + Voice), and EMMA (Extensible MultiModal Annotation). All three have been submitted to the W3C for consideration as standards for telephony and/or multimodal applications. Currently, all three are under consideration for the next version of VoiceXML.
SALT: This language is an extension of HTML and other markup languages (cHTML, XHTML, WML). It’s used to add speech interfaces to Web pages, and it’s designed for use with both voice-only browsers and multimodal browsers, meaning cellular phones, tablet PCs, and wireless PDAs.
Microsoft developed SALT specifically to enable speech across a wide range of devices and to allow telephony and multimodal dialogs. Because SALT uses the data models and execution environments of its host environments (HTML forms and scripting), it is more familiar to Web developers. Its event-driven interaction model is useful for multimodal applications.
However, SALT is merely a set of tags for specifying voice interaction that can be embedded into other “containing” environments. Because of this dependency on an external environment, developers using SALT may need to generate differing versions of an application for each device. For instance, an application for use on cell phones will require separate versions for Nokia and Motorola phones.
X+V: This IBM-sponsored language combines XHTML with VoiceXML 2.0, the XML Events module, and an additional module containing a small number of attribute extensions to both XHTML and VoiceXML. This allows VoiceXML (audio) dialogs and XHTML (text) input to share multimodal input data.
The fact that X+V is built from previously standardized languages makes it easy to modularize, that is, to break its code apart into modes, where one mode is for speech recognition, one is for motion recognition, etc.
But using the XML Events standard is what really differentiates X+V from SALT. Whereas events drive the creation of X+V, thus defining the environment, SALT merely attaches its tags to events within a pre-existing environment. Because X+V is self-sufficient in this manner, applications written with it are generally more portable.
EMMA: This language was developed to provide semantic interpretations for speech, natural language text, keyboard, and ink input (a type of stylus input that includes handwriting recognition).
EMMA is a complementary language to SALT and X+V, functioning as a sort of middleman between a multimodal application’s components, that is, between a user’s input and the X+V- or SALT-based interpreter. This frees developers from having to worry about writing code to interpret user input. EMMA simply translates input into a format that the application language can interpret, greatly simplifying the process of adding multiple modes to an application.
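To make the middleman role concrete, here is a minimal Python sketch of an application consuming EMMA-style results. It is an illustration under stated assumptions, not a reference implementation: the namespace URI and the emma:mode, emma:medium, emma:confidence, and emma:tokens attributes follow the W3C EMMA specification as we understand it, while the payload elements (command, ringtone), the sample documents, and the function names are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the W3C EMMA specification (assumed here; check the
# version of the specification your platform actually implements).
EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical result produced by a speech recognizer.
SPEECH_RESULT = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="voice"
                       emma:confidence="0.82" emma:tokens="play ringtone two">
    <command>select_ringtone</command>
    <ringtone>2</ringtone>
  </emma:interpretation>
</emma:emma>
"""

# Hypothetical result produced by a keypad handler for the same request.
KEYPAD_RESULT = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:medium="tactile" emma:mode="keys">
    <command>select_ringtone</command>
    <ringtone>2</ringtone>
  </emma:interpretation>
</emma:emma>
"""


def read_interpretation(emma_xml):
    """Extract the mode, confidence, and payload from one EMMA document."""
    root = ET.fromstring(emma_xml.strip())
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    if interp is None:
        raise ValueError("no emma:interpretation element found")
    return {
        "mode": interp.get(f"{{{EMMA_NS}}}mode"),
        # Treat a missing confidence as certain input (e.g. a key press).
        "confidence": float(interp.get(f"{{{EMMA_NS}}}confidence", "1.0")),
        # The payload is application-defined; here it is flat key/value data.
        "slots": {child.tag: (child.text or "").strip() for child in interp},
    }


def handle(emma_xml, min_confidence=0.5):
    """Application logic that never needs to know how the input arrived."""
    result = read_interpretation(emma_xml)
    if result["confidence"] < min_confidence:
        return "reprompt"
    slots = result["slots"]
    if slots.get("command") == "select_ringtone":
        return f"ringtone {slots['ringtone']} selected via {result['mode']}"
    return "unknown command"


if __name__ == "__main__":
    print(handle(SPEECH_RESULT))  # ringtone 2 selected via voice
    print(handle(KEYPAD_RESULT))  # ringtone 2 selected via keys
```

The point of the sketch is that handle() contains no speech-specific or keypad-specific code. Each recognizer is responsible for producing the annotated interpretation, so adding a new mode means adding a new producer rather than rewriting the application logic.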
More Bells and Whistles: Who Cares?
Back when basic wireless technology began to take off, analysts, journalists, and vendors alike predicted it would change the way everybody did business. It hasn’t really, except for those who were mobile in the first place (salespeople, UPS drivers, etc.). Multimodality is interesting, to be sure, but should we realistically expect it to have a broad-reaching impact on the typical enterprise developer? In other words, you’ll soon be able to tell your iPod to play a certain song while driving your car without taking your hands off the wheel. So what?
To get our arms around that question, we need to look a little closer at the way mobilized application development proliferated. Earlier this year at SpeechTEK, Intel’s Peter Gavalakis outlined three reasons for the lack of wireless adoption in the enterprise: cost, lack of infrastructure, and lack of standardization.
In fact, it is the cost of standardization and the cost of infrastructure that prevented wireless penetration in the enterprise.
To understand why this is so, it’s important to look at wireless adoption outside the United States. While wireless adoption in the enterprise has been slow in the United States, internationally it has not. “Poorer countries have a higher wireless penetration,” says former W3C Multimodal Working Group member and EMMA author Roberto Pieraccini. This is because poorer countries didn’t have the money for traditional wired infrastructures in the first place.
In countries such as China (which is the second largest mobile market in the world), mobility and multimodality have been adopted rather quickly. Obviously, it is more cost effective for a country to develop global satellite systems in order to accommodate a wireless business culture than develop a wired infrastructure at this late date.
However, greater wireless penetration is not limited to the poorer countries. “Last year,” says Pieraccini, “cell phones outnumbered landlines in Europe.” What’s the reason for this? “Europe adopted the GSM standard,” says Pieraccini, whereas in the United States, phone manufacturers use different standards, leaving many phones unable to interoperate. Thus, each company has an interest in seeing a standard adopted only if it’s the standard that company already uses.
Not to worry, theorizes Pieraccini, who invokes Wi-Fi as a potentially “disruptive technology,” capable of eliminating these cost, standardization, and infrastructure issues altogether. If Wi-Fi allows you to use the Internet and VoIP, to talk to anyone you want anywhere for only cents a minute, why do you need your telephone or your cell phone?
The Future of Multimodality
If multimodality can render your phone and your cell phone obsolete, the barriers to total wireless penetration disappear. Will such an event prompt the mobilization phenomenon to finally impact the enterprise with crushing urgency?
“We don’t know, we’re still in the early adoption phase,” says Pieraccini, speaking of multimodal adoption within the framework of Geoffrey Moore’s chasm theory. Essentially, the chasm theory states that the technology adoption life cycle differs from other adoption cycles because of a “chasm” between the early adopters of a product (the technology enthusiasts and visionaries) and the early majority (the pragmatists). Early adopters may embrace a technology, but that technology may never cross the chasm to the early majority. Multimodal applications have yet to demonstrate significant appeal to the early majority.
Perhaps the reason that wireless development, and thus multimodal development, has had such a hard time breaking out of the youth and niche markets is that wireless and multimodal technologies have generated their own, correlative chasm. The chasm, in this instance, is not one of market awareness, but of developer knowledge and confidence. Because this type of technology is so complicated, projects that “start big usually fail,” says Pieraccini. The lesson for developers is to begin with small, uncomplicated applications: a simple voice recognition app that allows you to select a ring tone, for instance.
SALT, X+V, and EMMA are three nascent, sometimes complementary, languages, all looking to be standardized in the multimodal area, and they’re good places to start. Fear of a lack of standardization is not a reason to avoid getting familiar with the other aspects of multimodal development: application design, architecture, and testing are all aspects of programming that won’t change, even if the language you choose becomes obsolete. When it comes time to develop multimodal applications for global deployment, you’ll need to know which languages comply with which standards, where, and on what devices.
It’s important to remember that mobile and multimodal development have taken off in a big way in the global marketplace. And this trend, combined with the capabilities provided by XML abstraction, has made multiple inputs an obvious destination for a wide range of applications.