Creating Voice Applications Using VoiceXML and the IBM Voice Toolkit : Page 2

Are you looking to create state-of-the-art, voice-driven applications? Look no further than IBM: the latest iteration of the IBM Voice Toolkit integrates with the Rational Software Development Platform, giving you a turnkey development environment based on industry standards, including VoiceXML and Java.

Introducing VoiceXML
While you're downloading and processing the installer images, it's a good time to brush up on (or learn) VoiceXML, the XML application at the heart of applications developed with the toolkit. VoiceXML defines the user interface for voice applications, much as the widget library of choice does for a traditional GUI. Because it is written in XML, the syntax should be familiar to you; all you need to learn are the tags that VoiceXML defines. Here's a simple example:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE vxml PUBLIC "-//W3C//DTD VOICEXML 2.0//EN" "vxml20-1115.dtd">
<vxml version="2.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/vxml">
  <meta name="GENERATOR" content="IBM Voice Toolkit for WebSphere Studio"/>
  <!-- A single form containing one block of spoken output -->
  <form>
    <block>Hello, world!</block>
  </form>
</vxml>

The structure of this document should be familiar. It begins with the XML preamble, which specifies the XML version, the encoding scheme, and the document type (a VoiceXML 2.0 document conforming to the 2.0 DTD); the meta tag records that the IBM Voice Toolkit generated the file. The document body consists of a single form, a top-level entity in VoiceXML. This form has a single block, a spoken segment that does not require user input.

In the VoiceXML paradigm, your user interface is modeled as a finite state machine; inputs and outputs are individual states, expressed as forms. Some forms, like the one in the previous example, are output-only, directing the application to speak to the user. Others are input/output forms, with fields that the user populates through speech (called utterances in VoiceXML documentation). Forms can be named, and the execution through a path of forms can be driven by the VoiceXML content itself, as you can see from Listing 1.

Listing 1 is admittedly an artificial example (short of a talking vending machine or the matter generator in Star Trek, there's little use for a speech interface to serve coffee drinks), but it points to many of the key aspects of VoiceXML programming.

Starting at the top of the listing, you see how to declare variables of global scope using the <var/> tag. Note that the value you assign is evaluated as an expression, so a literal string must be enclosed in single quotes. Thus, the skipintro variable is set to the string 'play'; without the single quotes, skipintro would instead be set to the value of the variable named play.
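The quoting rule is easier to see side by side. The following is a minimal sketch, not the code from Listing 1: the variables play and copyofplay and the logged text are assumptions made up for illustration.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical variable, declared so the unquoted reference below resolves -->
  <var name="play" expr="'some other value'"/>

  <!-- Quoted: skipintro holds the literal string play -->
  <var name="skipintro" expr="'play'"/>

  <!-- Unquoted: copyofplay holds the current value of the variable play -->
  <var name="copyofplay" expr="play"/>

  <form>
    <block>
      <!-- Write both values to the platform log for comparison -->
      <log expr="'skipintro=' + skipintro + ', copyofplay=' + copyofplay"/>
    </block>
  </form>
</vxml>
```

Where the log output ends up depends on the platform, so check your own environment's trace or log view.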

The next two elements are top-level links, which handle utterances that you can speak at any point in the application. The first indicates that if you say either "Menu" or "Start over," the application restarts. The second indicates that if you say "Goodbye" or "Exit," the application exits.
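A pair of document-level links along these lines might look like the sketch below. It is not the article's Listing 1: the form id intro, the rule names, and the yes/no question are assumptions, and the field relies on the built-in boolean type, which a given platform may or may not provide.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/vxml">
  <!-- Saying "menu" or "start over" jumps back to the first form -->
  <link next="#intro">
    <grammar type="application/srgs+xml" version="1.0" root="restart" mode="voice">
      <rule id="restart" scope="public">
        <one-of>
          <item>menu</item>
          <item>start over</item>
        </one-of>
      </rule>
    </grammar>
  </link>

  <!-- Saying "goodbye" or "exit" throws the predefined exit event -->
  <link event="exit">
    <grammar type="application/srgs+xml" version="1.0" root="leave" mode="voice">
      <rule id="leave" scope="public">
        <one-of>
          <item>goodbye</item>
          <item>exit</item>
        </one-of>
      </rule>
    </grammar>
  </link>

  <form id="intro">
    <!-- Links are only matched while the application is listening for input -->
    <field name="ready" type="boolean">
      <prompt>Are you ready to order?</prompt>
    </field>
  </form>
</vxml>
```

Because the links sit directly under the vxml element, their grammars are active in every form of the document, which is what makes the commands available at any point.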

The application itself consists of a series of forms. Navigation between the forms occurs using the goto tag, which simply references the inline name of the destination form. The goto tag can also reference an entirely separate document if you specify the URL of the destination document. You can use this mechanism to chain to other VoiceXML documents or to trigger server-side scripts that return new dynamic VoiceXML content. Note that the first form uses an if-then construct to skip the introductory text if needed, such as when the top-level "Start Over" action is taken.
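The navigation pattern described above can be sketched as follows. This is an illustration rather than the code from Listing 1: the form ids, the prompts, the 'skip' value, and the example.com URL are all hypothetical.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/vxml">
  <var name="skipintro" expr="'play'"/>

  <form id="intro">
    <block>
      <!-- Speak the welcome only on the first pass through this form -->
      <if cond="skipintro == 'play'">
        <prompt>Welcome to the coffee counter.</prompt>
        <assign name="skipintro" expr="'skip'"/>
      </if>
      <!-- Jump to another form in this document by its id -->
      <goto next="#menu"/>
    </block>
  </form>

  <form id="menu">
    <block>
      <prompt>Here is the menu.</prompt>
      <!-- Hypothetical URL: jump to a separate document or server-side script -->
      <goto next="http://www.example.com/order.vxml"/>
    </block>
  </form>
</vxml>
```

A goto that names only a fragment, such as #menu, stays within the current document, so skipintro keeps its value when the application later returns to the intro form.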

The second form takes a single input, the kind of beverage you want to order. This is an example of the use of a prompt tag, which causes the application to prompt the user for input and pause until it's received. You must accompany a prompt tag with a grammar tag that indicates valid responses to the query; the application server uses these to tune the recognizer and determine the appropriate course of action. The grammar in Listing 1 is simple but representative; it outlines a series of responses and uses the tag tag to indicate the class to which a specific response belongs (for example, chai is a type of tea). This use of tags can be very helpful in applications where responses select types of things, or where you want to map a group of synonyms to a single response.

Prompt tags would be useless without the ability to respond to user input. The filled tag, in conjunction with an if tag, provides you with a way to act on the user's response to the prompt. Within it you can set variables or document properties to the class of the response, or to the actual value of the utterance made by the user. You can also execute conditionals based on these values, selecting the next form to be played based on the content of the utterance or the class of the response.
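Putting the prompt, grammar, tag, and filled tags together, an input form might look like the sketch below. It is not the article's Listing 1. The field name drink, the vocabulary, and the form ids are assumptions, and the tag-format value is the string-literal format from the W3C semantic interpretation specification; the syntax your toolkit version expects inside tag elements may differ.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/vxml">
  <form id="menu">
    <field name="drink">
      <prompt>Would you like coffee or tea?</prompt>
      <!-- Each tag maps what was said to a class: chai counts as tea -->
      <grammar type="application/srgs+xml" version="1.0" root="drink"
               mode="voice" tag-format="semantics/1.0-literals">
        <rule id="drink" scope="public">
          <one-of>
            <item>coffee <tag>coffee</tag></item>
            <item>espresso <tag>coffee</tag></item>
            <item>tea <tag>tea</tag></item>
            <item>chai <tag>tea</tag></item>
          </one-of>
        </rule>
      </grammar>
      <!-- Runs once the recognizer has filled the field -->
      <filled>
        <if cond="drink == 'tea'">
          <goto next="#tea"/>
        <else/>
          <goto next="#coffee"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="tea">
    <block>Steeping your tea.</block>
  </form>

  <form id="coffee">
    <block>Brewing your coffee.</block>
  </form>
</vxml>
```

In this sketch the field variable drink holds the class of the response, while the shadow variable drink$.utterance holds the words the recognizer actually matched.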

The final forms process the selection you made from the menu prompt. They also show you how to include the value of a VoiceXML variable in a block or prompt: use the value tag and specify the variable to evaluate.
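As a minimal sketch of the value tag (again, not the code from Listing 1; the variable drink and the wording of the prompt are assumptions), the following document speaks a sentence that includes the content of a variable:

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical: in a real application a field would set this value -->
  <var name="drink" expr="'tea'"/>

  <form id="confirm">
    <block>
      <!-- The value tag is replaced by the result of its expression -->
      <prompt>One <value expr="drink"/>, coming right up.</prompt>
    </block>
  </form>
</vxml>
```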

VoiceXML has many other facets beyond what can be covered here. For example, you can embed references to specific recorded sound samples, such as alert tones or prerecorded speech, to be played during specific states of your application. You can also embed pieces of Speech Synthesis Markup Language (SSML) within your VoiceXML application, letting you fine-tune the pronunciation and emphasis of specific voice prompts. And, of course, VoiceXML is fully internationalized; its implementation in the IBM Voice Toolkit supports most of the major languages spoken in industrialized nations.
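To give a flavor of these features, here is a small sketch that mixes a recorded sound with SSML markup in one prompt. The file name chime.wav and the spoken text are hypothetical, and how much of SSML is honored depends on the speech synthesizer behind your platform.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- Plays the recording; the enclosed text is spoken if the file cannot be played -->
        <audio src="chime.wav">Ding.</audio>
        Your order is <emphasis>almost</emphasis> ready.
        <!-- Pause for half a second before the next sentence -->
        <break time="500ms"/>
        <prosody rate="slow">Please wait at the counter.</prosody>
      </prompt>
    </block>
  </form>
</vxml>
```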
