The merging of computing with our everyday life (through computers, personal digital assistants, cell phones, and a plethora of other gadgets) is driving a trend towards pervasive computing, in which computing becomes a backdrop to our daily activities. IBM, long a leader in the pervasive computing arena, is one of three firms driving a standard to support multimodal interfaces on mobile devices through the XHTML+Voice markup language, an XML application that leverages XHTML and VoiceXML to provide voice-enabled interfaces to Web applications on desktop and handheld computers. Using tools such as the IBM Multimodal Toolkit, you can learn how to extend your application to include voice control and output using a multimodal browser such as the ones available from Access and Opera.
These multimodal interfaces let you differentiate your application from others in two key ways. First, your application can accept and parse voice input in Web forms, meaning that users can operate your application without resorting to a keyboard, mouse, or stylus. Second, your application can deliver key pieces of information (including warnings, prompts, and search results) using voice synthesis, which adds an additional dimension to the presentation of results. Because these technologies build on XML and the traditional Web browsing paradigm, it’s easy to work these features into your Web-based application. There’s very little overhead involved, and you needn’t build a sophisticated client-side application atop a native platform’s support for voice recognition and synthesis (such as that of Microsoft Windows or Mac OS X).
The IBM Multimodal Toolkit requires Windows 2000 or later and a copy of IBM WebSphere Studio Site Developer or IBM WebSphere Studio Application Developer 5.1.1. Installation is through a clickable installer available from IBM’s Pervasive Computing site.
As you install the IBM Multimodal Toolkit, you also have the option to install one or more multimodal browsers from Access and Opera for testing your application. You should definitely install one or the other (or both!) and also consider downloading a handheld version of the same browser from the IBM Pervasive Computing site for your target mobile device (at this point, both Microsoft PocketPC and the Sharp Zaurus are supported).
Behind Multimodal Web Applications
With the advent of VoxML (from Motorola) and VoiceXML (a W3C standard), voice applications were among the first to leverage the ubiquity of XML to build speech-oriented, Web-enabled applications. The XHTML+Voice standard (often called simply X+V, a practice I’ll continue here) uses the modular nature of XML to define a markup language suitable for text and voice, including the following modules:
- XHTML Basic, which provides a grammar for basic text formatting facilities, including typeface selection and common stylistic formatting options such as bulleted, numbered, and definition lists.
- XML Events, which provides a grammar for managing incoming events and how they interact with voice-interaction behaviors.
- VoiceXML modules, which provide a grammar for speech-enabling XHTML.
- An additional, new X+V extension, which integrates the voice and visual features of the other modules.
All X+V applications use XHTML+Voice as their markup language, and must include the following preamble:
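The preamble itself was lost in this copy of the article; a representative version looks like the following sketch (this assumes X+V 1.2, and the exact DTD identifier varies with the X+V version you target):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
  "http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml">
```

Note how the html tag declares a default namespace for XHTML plus prefixed namespaces for XML Events and VoiceXML; the three declarations correspond to the three namespaces discussed next.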
If you’re a seasoned XML developer, this won’t give you pause, but I’d like to run through it anyway, because it showcases a key feature of XML that isn’t used as often as it should be: namespaces. As in other programming environments, XML supports namespaces so that a document can include pieces of other XML vocabularies whose tag names might otherwise collide. As the markup shows, X+V documents draw from three disparate namespaces (look at the html tag that follows the !DOCTYPE declaration):
- The XHTML namespace: XML tags without a namespace prefix are XHTML tags.
- The XML Event namespace: XML tags with a namespace prefix ev: are XML event tags.
- The VoiceXML namespace: XML tags with the namespace prefix vxml: are VoiceXML tags.
It’s often easiest to start with your site’s visual content, and only after it’s complete incrementally add the voice content. Doing this lets you play from your strengths (existing knowledge of XHTML and the problem domain), and after you get the easy stuff out of the way, you can iterate over the voice interface until it’s perfect.
A Sample Application
Let’s take a simple example: a Web application that provides simple weather reports. The baseline XHTML for this application’s location prompt is in Listing 1. It’s a very simple form that prompts for either the city and state or the ZIP code of the desired location, submitting the content to the server-side Java page submit.jsp. You can see how the page will appear in Figure 1.
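A minimal form along these lines might look like the following sketch (the field names city, state, and zip are assumptions for illustration, not necessarily those used in Listing 1):

```xml
<form action="submit.jsp" method="get">
  <p>City: <input type="text" name="city" id="city"/>
     State: <input type="text" name="state" id="state"/></p>
  <p>Or ZIP code: <input type="text" name="zip" id="zip"/></p>
  <p><input type="submit" value="Get Weather"/></p>
</form>
```

Giving each input an id matters here: the voice interface we add later needs a way to address individual fields when it fills in recognized speech.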
|Author’s Note: The server-side code doesn’t interest us in this article, because all of the multimodal interface work is performed on the client side. If you’re interested in seeing a server-side voice application, see my previous article, “Creating Voice Applications Using VoiceXML and the IBM Voice Toolkit.”|
Once you create the XHTML (which you can do by hand or using your favorite Web authoring tools), the next thing to do is start adding the voice interface. This resides in the head element of your document, giving you a primitive way to separate your content from its presentation.
|Figure 1. How’s the Weather? This is the entry page for the sample weather application.|
It’s easiest to add the voice content within WebSphere Studio using the Multimodal Toolkit. To add the voice interface in this example, you must:
- Open the X+V file in the WebSphere Studio Editor.
- Position the cursor where you want the editor to place the X+V content, at the end of the head block (I like to insert a blank line or two to keep things readable around the tags I insert).
- Place the vxml:form tag by pressing Control-Space and choosing it from the Content Assist menu.
- Name the VoiceXML form by giving it an id attribute, so that the opening tag now reads, for example, vxml:form id="weather".
- Use the Multimodal Toolkit’s Reusable Dialog Wizard (right-click the source editor and choose the wizard) to select the usamajorcity item.
- Edit the resulting markup to insert the response in the city field of the form, changing the destination of the first assignment tag to ‘city’ from ‘VARusmajorcityUSMajorCity’.
After this sequence of events, the head element of your document (whose title in this example is “X+V Weather Demonstration”) contains the new voice form.
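The original listing didn’t survive in this copy, but a sketch of what the resulting head element might contain looks like this (the id values, subdialog source URL, and result field names are illustrative assumptions, not the toolkit’s exact output; the Reusable Dialog Wizard generates the real reference for you):

```xml
<head>
  <title>X+V Weather Demonstration</title>
  <!-- Voice form inserted via Content Assist -->
  <vxml:form id="weather">
    <!-- Reference to IBM's pre-built US-major-city reusable dialog -->
    <vxml:subdialog name="usamajorcity" src="usamajorcity.vxml">
      <vxml:filled>
        <!-- Copy the recognized city into the visual form's city field -->
        <vxml:assign name="city" expr="usamajorcity.city"/>
      </vxml:filled>
    </vxml:subdialog>
  </vxml:form>
</head>
```

The key point the sketch illustrates is the division of labor: the visual form stays in the body, while the voice dialog lives in the head and merely assigns its result into the visual form’s fields.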
Note that the voice toolkit has inserted a reference to a pre-built dialog provided by IBM, rather than the dialog itself. It has also inserted some additional code (which you don’t strictly need) to return the utterance, as well as the interpreted speech, to the server through the last