
Creating Voice Applications Using VoiceXML and the IBM Voice Toolkit


I’ll be honest. I’ve always been a sucker for voice-navigated applications. At the first startup I worked at, two of us were given Macintosh Quadras as development workstations, and we spent more office time the first week playing with the speech recognition engine than actually cutting code. A good voice application (which admittedly the Finder running on the Quadra wasn’t!) is a dream to use, freeing users from the traditional ball-and-chain of a keyboard and monitor. Unfortunately, until recently, the tools to build such an application were out of reach of all but a few. The IBM alphaWorks Voice Toolkit preview puts professional tools for developing voice applications in the hands of every developer, through evaluation or professional licenses of the Rational Software Development Platform and WebSphere Studio.

Installation
Installation is straightforward, albeit slow, unless you’re already running the Rational and WebSphere Studio suites. Your development workstation must be running Windows 2000 or Windows XP, and in addition to downloading the IBM Voice Toolkit, you will need to download evaluation or professional versions of both the Rational and WebSphere suites; a minimal install can span almost two gigabytes. Installing the tool chain is a multi-step process: the download consists of images that an extraction program combines into CD images, from which you then install the necessary tool chain and finally the IBM Voice Toolkit. Mercifully, you do not have to burn the CD images to disc first!

What You Can Do with the IBM Voice Toolkit
The toolkit provides everything you need to get a prototype of a voice application up and running, from a front-end call simulator that lets you emulate incoming calls from a call center, to tools for editing grammars and application flow, all packaged as plug-ins for the Rational integrated development environment. Once you finish prototyping, you can transition to a production-ready platform using WebSphere Studio and the WebSphere Voice Response system, which answers incoming calls from users and can optionally originate outgoing calls to them.

The actual application development process varies, depending on the kind of voice application you’re creating, but relies heavily on your knowledge of VoiceXML and integration with your back-end Web infrastructure using Java and HTTP.

Introducing VoiceXML
While you’re downloading and processing the installer images, it’s a good time to brush up on (or learn) VoiceXML, the XML application at the heart of applications developed with the toolkit. VoiceXML is at the heart of the user interface for voice applications, much as the widget library of choice is at the heart of a traditional GUI. Because VoiceXML is written in XML, the syntax should be familiar; all you need to learn are the tags VoiceXML defines. Here’s a simple example:

 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE vxml PUBLIC "-//W3C//DTD VOICEXML 2.0//EN"
  "http://www.w3.org/TR/voicexml20/vxml.dtd">
<vxml version="2.0">
  <form>
    <block>Hello, world!</block>
  </form>
</vxml>

The structure of this document should be familiar. After the XML preamble, which specifies the XML version, the encoding scheme, and the document type (which is a VoiceXML 2.0 document meeting the 2.0 DTD generated by the IBM Voice Toolkit), the document itself follows. This document consists of a single form?the top-level entity in VoiceXML. This form has a single block, a spoken segment that does not require user input.

In the VoiceXML paradigm, your user interface is modeled as a finite state machine; inputs and outputs are individual states, expressed as forms. Some forms, like the one in the previous example, are output-only, directing the application to speak to the user. Others are input/output forms, with fields that the user populates through speech (called utterances in VoiceXML documentation). Forms can be named, and the execution through a path of forms can be driven by the VoiceXML content itself, as you can see from Listing 1.
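As a rough sketch (the form name, field name, and prompt text here are illustrative, not taken from the toolkit’s samples), an input/output form with a single speech-populated field might look like this:

```xml
<!-- An input/output form: one field the user fills by speaking -->
<form id="getsize">
  <field name="size">
    <prompt>What size drink would you like?</prompt>
    <grammar type="application/srgs+xml" root="sizes" version="1.0">
      <rule id="sizes">
        <one-of>
          <item>small</item>
          <item>medium</item>
          <item>large</item>
        </one-of>
      </rule>
    </grammar>
  </field>
</form>
```

When the recognizer matches one of the grammar items, the utterance fills the size field and the dialog moves on to the next state.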

Listing 1 is admittedly an artificial example (short of a talking vending machine or the matter generator in Star Trek, there’s little use for a speech interface that serves coffee drinks), but it points to many of the key aspects of VoiceXML programming.

Starting at the top of the listing, you see how to declare variables of global scope using the var tag. Note that when setting a variable to a literal string value, you must enclose it in single quotes. Thus, the skipintro variable is set to the string ‘play’; if you omit the single quotes, it is instead set to the value of a variable named play.
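As a brief illustration (using the skipintro variable from the listing; the reassigned value is made up), declaration and the quoting rule look like this:

```xml
<!-- Declares skipintro at document scope, initialized to the string 'play' -->
<var name="skipintro" expr="'play'"/>

<!-- Later, reassign it; without the inner single quotes, expr would be
     interpreted as a reference to a variable named skip -->
<assign name="skipintro" expr="'skip'"/>
```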

The next two blocks consist of top-level links, which manage utterances that you can use at any point in the application. The first indicates that if you say either “Menu” or “Start over,” the application will restart. The second indicates that if you say “Goodbye” or “Exit,” the application will exit.
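A sketch of such top-level links might look like the following (the form names are illustrative; #goodbye is assumed to be a form whose block contains an exit tag):

```xml
<!-- Active anywhere in the document: "menu" or "start over" restarts -->
<link next="#mainmenu">
  <grammar type="application/srgs+xml" root="restart" version="1.0">
    <rule id="restart">
      <one-of>
        <item>menu</item>
        <item>start over</item>
      </one-of>
    </rule>
  </grammar>
</link>

<!-- "goodbye" or "exit" jumps to a form that ends the session -->
<link next="#goodbye">
  <grammar type="application/srgs+xml" root="quit" version="1.0">
    <rule id="quit">
      <one-of>
        <item>goodbye</item>
        <item>exit</item>
      </one-of>
    </rule>
  </grammar>
</link>
```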

The application itself consists of a series of forms. Navigation between the forms occurs using the goto tag, which simply references the name of the destination form. The goto tag can also reference an entirely separate document by specifying the destination document’s URL; you can use this mechanism to chain to other VoiceXML documents or to trigger server-side scripts that return new, dynamically generated VoiceXML content. Note that the first form uses an if-then construct to skip the introductory text when needed, such as when the top-level “Start Over” action is taken.
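For example (the form names and URL below are placeholders), navigation within and across documents looks like this:

```xml
<!-- Conditionally play the introduction, then move to the menu form -->
<form id="intro">
  <block>
    <if cond="skipintro == 'play'">
      Welcome to the coffee service.
    </if>
    <goto next="#mainmenu"/>
  </block>
</form>

<!-- A goto can also chain to a separate document or server-side script -->
<form id="handoff">
  <block>
    <goto next="http://example.com/voice/order.vxml"/>
  </block>
</form>
```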

The second form takes a single input: the kind of beverage you want to order. This is an example of the prompt tag, which causes the application to prompt the user for input and pause until it’s received. You must accompany a prompt tag with a grammar tag that indicates valid responses to the query; the application server uses the grammar to tune the recognizer and determine the appropriate course of action. The grammar in Listing 1 is simple but representative; it lists a series of responses and uses the tag element to indicate the class to which a specific response belongs (i.e., chai is a type of tea). This use of tags can be very helpful in applications where responses really select types of things, or where a group of synonyms should map to a single response.
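A grammar along those lines, mapping specific drinks to classes with the tag element, might be sketched as follows (this uses the semantics/1.0 tag format; the exact items are illustrative):

```xml
<field name="drink">
  <prompt>What would you like to drink?</prompt>
  <grammar type="application/srgs+xml" root="drinks" version="1.0"
           tag-format="semantics/1.0">
    <rule id="drinks">
      <one-of>
        <item>espresso <tag>out = "coffee";</tag></item>
        <item>latte <tag>out = "coffee";</tag></item>
        <item>chai <tag>out = "tea";</tag></item>
        <item>earl grey <tag>out = "tea";</tag></item>
      </one-of>
    </rule>
  </grammar>
</field>
```

Saying either “espresso” or “latte” yields the single semantic result “coffee,” so downstream logic only has to branch on the class, not on every synonym.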

Prompt tags would be useless without the ability to respond to user input. The filled tag, in conjunction with an if-then tag, provides you with a way to act on the user’s response to the prompt. This tag lets you set variables or document properties to the class of the response, or the actual value of the utterance made by the user. You can also execute conditionals based on these values, selecting the next form to be played based on the content of the utterance or the class of the response.
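Inside the field that collects the beverage choice, a filled handler along these lines routes the dialog based on the recognized value (the destination form names are illustrative):

```xml
<filled>
  <if cond="drink == 'coffee'">
    <goto next="#coffeeform"/>
  <elseif cond="drink == 'tea'"/>
    <goto next="#teaform"/>
  <else/>
    <goto next="#mainmenu"/>
  </if>
</filled>
```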

The final forms process the selection you made from the menu prompt, and show how to include the value of a VoiceXML variable in the body of a block or prompt using the value tag, which specifies the variable to evaluate.
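For instance, a confirmation form might read the user’s selection back with the value tag (the wording and form names are illustrative):

```xml
<form id="confirm">
  <block>
    You ordered a <value expr="drink"/>. Coming right up!
    <goto next="#mainmenu"/>
  </block>
</form>
```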

VoiceXML has many other facets beyond what can be covered here. For example, you can embed references to specific recorded sound samples, such as alert tones or prerecorded speech, to be played during specific states of your application. You can also embed pieces of Speech Synthesis Markup Language (SSML) within your VoiceXML application, letting you fine-tune the pronunciation and emphasis of specific voice prompts. And, of course, VoiceXML is fully internationalized; its implementation in the IBM Voice Toolkit supports most of the world’s major languages.
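As a brief sketch (the file name alert.wav and the prompt text are placeholders), embedding a recorded sample and SSML markup looks like this:

```xml
<block>
  <!-- Play a prerecorded tone; the enclosed text is spoken as a fallback
       if the audio file cannot be fetched -->
  <audio src="alert.wav">Attention!</audio>
  <prompt>
    Your order is <emphasis level="strong">ready</emphasis>.
    <break time="500ms"/>
    Please pick it up at the counter.
  </prompt>
</block>
```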

Integrate Your VoiceXML and Your Existing Services with Java
The combination of VoiceXML and an application server like IBM’s is interesting, but it’s not the whole story. With VoiceXML and server-side Java, you can do a lot: building exactly the same sorts of applications you build today using XHTML and server-side scripting. The WebSphere Voice Response API takes things further, letting you initiate calls and perform programmatic actions that are hard to do with VoiceXML and server-side scripting alone.

The Voice Response API is based around the notion of a voice application, encapsulated in a WVRApplication class. This class has its own entry point, voiceMain, from which you can determine the characteristics of the current connection with an end user through a Call object. You also have access to a WVR object (presumably short for WebSphere Voice Response), which lets you make and receive calls and handle individual voice segments. It’s entirely possible to code an entire application using just the Voice Response API and the WebSphere Voice Response System, but you really shouldn’t; encapsulating as much of your user interface as possible in VoiceXML makes localization and extension much easier, just as separating a Web site’s style directives from its data does. In fact, as you look at the WebSphere Voice Response API, it becomes pretty clear that the API is either a wrapper around the internals of the Voice Response System or a significant part of its foundation, depending on your point of view.

Using the Voice Response API is fairly simple and clearly documented; the voice toolkit includes some excellent tutorials that walk you through the gamut of available interfaces. The Voice Response API shines when you must integrate a Web application with outgoing calls, such as calls driven by database triggers. For example, an outside-plant management application might use a database trigger and the Voice Response API to call the cell phone of a maintenance worker when a failure is detected in an automated system, and then actually describe the failure over the call.

A Soup-to-nuts Environment
The IBM Voice Toolkit preview is interesting not just in what it offers through its support of the latest standards in voice applications, but in its integration with a world-class development and deployment platform. It provides a soup-to-nuts environment for building voice applications, with plenty of help along the way (for example, there are graphical editors to ease the writing of the grammar segments of your VoiceXML, and a way to execute a VoiceXML file and interact with it right from the Rational IDE). It’s an excellent way to develop and deploy a voice application, or, if you’re an independent developer curious to find out what really happens when you call your local credit card company, to build a voice application prototype of your own.

