Creating Multimodal Applications Using the IBM Multimodal Toolkit

The merging of computing with our everyday life, through computers, personal digital assistants, cell phones, and a plethora of other gadgets, is driving a trend toward pervasive computing, in which computing becomes a backdrop to our daily activities. IBM, long a leader in the pervasive computing arena, is one of three firms driving a standard to support multimodal interfaces on mobile devices through the XHTML+Voice markup language, an XML application that leverages XHTML and VoiceXML to provide combined visual and voice interfaces to Web applications from desktop and handheld computers. Using tools such as the IBM Multimodal Toolkit, you can extend your application to include voice control and output using a multimodal browser such as the ones available from Access and Opera.

These multimodal interfaces let you differentiate your application from others in two key ways. First, your application can accept and parse voice input in Web forms, meaning that a user can use your application without resorting to a keyboard, mouse, or stylus. Second, your application can provide key pieces of information, including warnings, prompts, and search results, using voice synthesis, which adds another dimension to the presentation of results. Because these technologies leverage XML and the traditional Web browsing paradigm, it’s easy to work these features into your Web-based application. There’s very little overhead involved, and you needn’t build a sophisticated client-side application using a native platform’s (such as Microsoft Windows or Mac OS X) support for voice recognition and synthesis.

The IBM Multimodal Toolkit requires Windows 2000 or later, and a copy of IBM WebSphere Studio Site Developer or IBM WebSphere Studio Application Developer 5.1.1. Installation is through a clickable installer available from IBM’s Pervasive Computing site.

As you install the IBM Multimodal Toolkit, you also have the option to install one or more multimodal browsers from Access and Opera for testing your application. You should definitely install one or the other (or both!) and also consider downloading a handheld version of the same browser from the IBM Pervasive Computing site for your target mobile device (at this point, both Microsoft Pocket PC and Sharp Zaurus are supported).

Behind Multimodal Web Applications
With the advent of VoxML (by Motorola) and VoiceXML (a W3C standard), voice applications were some of the first applications to leverage the ubiquity of XML to build speech-oriented, Web-enabled applications. The XHTML+Voice standard (often called simply X+V, a practice I’ll continue here) uses the modular nature of XML to define a markup language suitable for text and voice, including the following modules:

  • XHTML Basic, which provides a grammar for basic text formatting facilities, including typeface selection and common stylistic formatting options such as bulleted, numbered, and definition lists.
  • XML Events, which provides a grammar for managing incoming events and how they interact with voice-interaction behaviors.
  • VoiceXML modules, which provide a grammar for speech-enabling XHTML.
  • An additional, new X+V extension, which integrates the voice and visual features of the other modules.

All X+V applications use XHTML+Voice as their markup language, and must include the following preamble:
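The preamble consists of the XML declaration, the X+V document type, and an html root element that declares three namespaces. A representative form is shown below; the exact DTD identifier depends on the X+V version your browser supports (version 1.2 is shown here):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
  "http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml">
```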


If you’re a seasoned XML developer, this won’t give you pause, but I’d like to run through it anyway, because it showcases a key feature of XML that’s not used as often as it should be: namespaces. As in other programming environments, namespaces let an XML document combine elements from several vocabularies, even when those vocabularies define elements with the same name. As the markup shows, X+V documents draw from three disparate namespaces (look at the html tag that follows the !DOCTYPE preamble):

  • The XHTML namespace: XML tags without a namespace prefix are XHTML tags.
  • The XML Events namespace: XML tags with the namespace prefix ev: are XML Events tags.
  • The VoiceXML namespace: XML tags with the namespace prefix vxml: are VoiceXML tags.

It’s often easiest to start with your site’s visual content, and only after it’s complete incrementally add the voice content. Doing this lets you play to your strength (existing knowledge of XHTML and the problem domain), and after you get the easy stuff out of the way, you can iterate over the voice interface until it’s perfect.

A Sample Application
Let’s take a simple example: a Web application that provides weather reports. The baseline XHTML for the location prompt for this application is in Listing 1. It’s a very simple form, which prompts for either the city and state or the ZIP code of the desired location, returning the content to the server-side Java page submit.jsp. You can see how the page will appear in Figure 1.
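A minimal sketch along the lines Listing 1 describes might look like the following (the field name city matters later in this article; the state and zip names are assumptions for illustration):

```xml
<form action="submit.jsp" method="post">
  <p>City: <input type="text" name="city" id="city"/>
     State: <input type="text" name="state" id="state"/></p>
  <p>Or ZIP code: <input type="text" name="zip" id="zip"/></p>
  <p><input type="submit" value="Get Weather"/></p>
</form>
```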

Author’s Note: The server-side code doesn’t interest us in this article, because all of the multimodal interface work is being performed on the client side. If you’re interested in seeing a server-side voice application, see my previous article “Creating Voice Applications Using VoiceXML and the IBM Voice Toolkit.”

Once you create the XHTML, which you can do by hand or using your favorite Web authoring tools, the next thing to do is start adding the voice interface. This resides in the <head> element of your document, giving you a primitive way to separate your content from its presentation.

Figure 1. How’s the Weather? This is the entry page for the sample weather application.

It’s easiest to add the voice content within WebSphere Studio using the Multimodal Toolkit. To add the voice interface in this example, you must:

  1. Open the X+V file in the WebSphere Studio editor.
  2. Position the cursor where you want the editor to place the X+V content, at the end of the <head> block (I like to insert a blank line or two to keep things readable around the tags I insert).
  3. Place the VoiceXML form tag by pressing Control-Space and choosing it from the Content Assist menu.
  4. Name the VoiceXML form by giving it an id, so the new tag reads something like <vxml:form id="voice_city">.
  5. Use the Multimodal Toolkit’s Reusable Dialog Wizard (right-click the source editor and choose the wizard) to select the usamajorcity item.
  6. Edit the resulting markup to insert the response into the city field of the form by changing the first field reference from ‘VARusmajorcityUSMajorCity’ to ‘city’.

After this sequence of events, your <head> element contains the page title (“X+V Weather Demonstration”) along with the voice form the wizard generated.
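A minimal sketch of such a head follows; the element structure and names beyond the title are illustrative assumptions, not the wizard’s exact output:

```xml
<head>
  <title>X+V Weather Demonstration</title>
  <!-- Voice form generated by the Reusable Dialog Wizard;
       names and attributes below are illustrative -->
  <vxml:form id="voice_city">
    <!-- Reference to IBM's pre-built U.S. major city dialog,
         rather than an inline copy of the dialog itself -->
    <vxml:subdialog name="usamajorcity" src="usamajorcity.mxml">
      <vxml:filled>
        <!-- Copy the recognized city into the XHTML form's city field -->
        <vxml:assign name="document.getElementById('city').value"
                     expr="usamajorcity.city"/>
      </vxml:filled>
    </vxml:subdialog>
  </vxml:form>
</head>
```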

Note that the voice toolkit has inserted a reference to a pre-built dialog provided by IBM, rather than the dialog itself. It has also inserted some additional code, which you don’t need, that returns the raw utterance as well as the interpreted speech to the server through the last tag. You can comment this out or remove it altogether, unless you’re doing work with a recognizer on the back end (or want to log utterances somewhere in order to investigate complaints about missed recognition events, which is handy during field tests). The wizard will also have added text form elements to the document’s form, which you’ll want to remove; you’ll find those in the <form> block in the document’s body.

You’ve now specified the voice equivalent of an XHTML form element, using the predefined voice form element provided by IBM. The only remaining work is to link the two, so that when the city field has focus, the VoiceXML form element is active. You create this link using XML Events, defined by the W3C XML Events specification. The event your form must watch for is the focus event, which the browser fires when its focus changes from one input to another. Each event must also have a handler, which indicates what should be activated when the client triggers the event. XML event handlers are bound to the HTML element associated with the event’s generation; therefore, you link the text input to the voice form input in the input element itself, like this:
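Assuming the text input is named city and the voice form’s id is voice_city (both names carried over from the steps above), the linkage uses the ev:event and ev:handler attributes from the XML Events namespace:

```xml
<input type="text" name="city" id="city"
       ev:event="focus" ev:handler="#voice_city"/>
```

When the field gains focus, the browser dispatches the focus event and activates the VoiceXML form named by the handler’s fragment identifier.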


You can see the final bits of code in Listing 2.

The Key Benefits of the IBM Multimodal Toolkit
As you’ve just seen, the IBM Multimodal Toolkit provides several advantages over hand-coding your X+V interfaces. First, having multimodal browsers in which to test your code is priceless. Of course, you could download just the appropriate handheld client for your work, but having to switch between your development workstation and your handheld for each bit of testing and debugging is a real chore, and the only other option is going entirely without. Driving at night in the fog without headlights isn’t my idea of fun, and neither is debugging a Web application without a client on my development machine!

Aside from giving you an excellent test tool, the IBM Multimodal Toolkit also provides an excellent collection of wizards to speed the coding of common tasks, such as specifying particular kinds of form input and often-used grammars. Key among these are the prebuilt snippets of X+V code, such as tested dialog components for entering addresses, credit card numbers, URLs, email addresses, and so on. Each of us, as developers, builds a library of snippets for such things; why not leverage the library built by a leader in the field?

Because it continues to lead and leverage global standards, the IBM Multimodal Toolkit is an excellent way to get your feet wet writing multimodal applications, whether you’re about to deploy a mobile Web solution or just keeping current with the latest trends.

