SALT or VoiceXML For Speech Applications?

SALT and VoiceXML are both markup languages for writing applications that use voice input and/or output. Both languages were developed by industry consortia (the SALT Forum and the VoiceXML Forum, respectively), and both were contributed to the W3C as part of its ongoing work on speech standards.

So why two specifications? Mainly because they were designed to address different needs, and they were designed at different stages in the life cycle of the Web. VoiceXML arose out of a need to define a markup language for over-the-telephone dialogs (Interactive Voice Response, or IVR, applications), and at a time (1999) when many pieces of the Web infrastructure as we know it today had not matured. SALT arose out of the need to enable speech across a wider range of devices, from telephones to PDAs to desktop PCs, and to allow both telephony (voice-only) and multimodal (combined voice and visual) dialogs. SALT was also designed at a time (2002) when many key Web technologies had become well established (XML, DOM, XPath, etc.).

I will declare my interest here: I represent Microsoft in the SALT Forum’s Technical Working Group. However, I have studied SALT and VoiceXML in depth, and will use this forum to take an objective look at the two specifications and point out the main technical differences between them. You can decide for yourself which specification is more suitable for your applications. (See Sidebar: Developer Communities)

How Do They Work?
SALT focuses on the speech interface, defining a small set of XML elements which are used inside a “host” page of markup, such as XHTML, HTML + SMIL, WML, etc. SALT elements expose a DOM interface, which places them at the disposal of the execution environment of the host markup. Speech input and output are therefore controlled by developer code in whatever environment the host page supports, e.g. the scripting module in HTML pages, SMIL 2.0, and so on. Web functionality is also handled by the host page, so page navigation and form submission are written as usual in HTML. SALT also contains built-in declarative mechanisms intended for use in less rich device profiles. SALT’s feature set is kept low-level, to allow flexibility of interactional logic and fine-grained control of the speech interface.

VoiceXML provides a larger set of XML elements, since it is intended as a complete, standalone markup. Hence, VoiceXML includes tags for data (forms and fields), control flow, and Web functionality. Speech input and output are controlled by VoiceXML’s dedicated execution environment, the Form Interpretation Algorithm (FIA), and ECMAScript can be used at certain points within the page to direct flow. Again, simple dialogs can also be written in a declarative manner. VoiceXML’s feature set is at a higher level, encompassing Web functionality and dialog flow. This allows VoiceXML pages to be used alone, and elementary dialogs to be built rapidly by the novice developer.

SALT Dialog Flow Example
Since SALT elements are DOM objects, they expose an interface of properties, events and methods, and can be manipulated accordingly inside the page. Activation will typically follow the event wiring model familiar to many HTML Web developers. A <prompt> element in SALT, for example, exposes an interface which includes the following features:

id: property to identify the object;
Start(): method to begin playback;
oncomplete: event thrown when playback is complete.

Similarly, the <listen> element is a basic building block of speech recognition. It also has an id and a Start() method, as well as the following:

<grammar>: a grammar to recognize speech input;
<bind>: directive to bind the user's response into a control on the page;
onreco: event thrown on a successful recognition.

This allows code such as the following HTML and SALT fragment:

[Listing: element markup lost in extraction. Surviving prompt text: "Welcome to my speech recognition application." and "Please say your password."]

This sample plays a simple welcome prompt (sayWelcome), then asks for a password (askPassword) and simultaneously activates the <listen> element named recoPassword. When recognition is successful, the <bind> copies the response into the iptPIN textbox, and the onreco event handler submits the HTML form to the Web server.
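
As a rough illustration of the wiring just described, the fragment might look something like the sketch below. The ids sayWelcome, askPassword, recoPassword and iptPIN come from the description above. Everything else is an assumption made for illustration: the salt: prefix and namespace URI, the pinForm form and its action URL, the password.grxml grammar file, the //pin XPath into the recognition result, and starting the first prompt from the body's onload handler.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:salt="http://www.saltforum.org/2002/SALT">
  <!-- Assumption: the dialog is kicked off when the page loads. -->
  <body onload="sayWelcome.Start()">

    <!-- Ordinary HTML form; the action URL is hypothetical. -->
    <form id="pinForm" action="http://example.com/checkpin" method="post">
      <input id="iptPIN" name="iptPIN" type="text" />
    </form>

    <!-- When the welcome finishes, ask for the password and start listening. -->
    <salt:prompt id="sayWelcome"
                 oncomplete="askPassword.Start(); recoPassword.Start()">
      Welcome to my speech recognition application.
    </salt:prompt>

    <salt:prompt id="askPassword">
      Please say your password.
    </salt:prompt>

    <!-- On a successful recognition, bind the result and submit the form. -->
    <salt:listen id="recoPassword" onreco="pinForm.submit()">
      <!-- Hypothetical grammar file and result path. -->
      <salt:grammar src="password.grxml" />
      <salt:bind targetelement="iptPIN" value="//pin" />
    </salt:listen>

  </body>
</html>
```

Note that the flow lives entirely in the oncomplete and onreco attributes: nothing in the markup order itself determines what is spoken or recognized next.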

The example shows simple event wiring for interactional flow. For more complex SALT dialogs, you would probably use script functions and reusable blocks of code across SALT pages and applications. But script isn’t always necessary: another way to activate prompts and listen elements would be to use SMIL 2.0 (see the SALT specification for an example), or, on small devices, the declarative mechanisms available through data and event binding.

VoiceXML Dialog Flow Example
In contrast, VoiceXML applies its own page interpretation mechanism (the Form Interpretation Algorithm, or FIA) and programmatic elements to conduct dialog flow. A <form> is composed of <field> or <block> items, and fields contain <prompt> and/or <grammar> elements. Navigation from <form> to <form> and from page to page is coded by <goto> elements. Navigation within the form is provided by the FIA, which ‘visits’ the fields individually until they contain values. Processing blocks are available at certain points in execution. For example, the <filled> element inside a <form> or <field> says what to do when the form is complete or the field has a value. Navigation can also be steered by snippets of ECMAScript inside cond (condition) attributes on certain elements, or by the conditional elements <if>, <else>, etc. in the processing blocks.
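
To make these navigation mechanisms concrete, here is a minimal sketch in VoiceXML 2.0 syntax. It is not taken from the original article: the form and field names, the signup.vxml page, and the eight-digit rule are all hypothetical, and the built-in boolean and digits field types are assumed to be available on the platform (support for built-in grammars varies).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <form id="login">
    <!-- The FIA visits this field first, because it has no value yet. -->
    <field name="has_account" type="boolean">
      <prompt>Do you already have an account?</prompt>
    </field>

    <!-- cond: this field is only visited while the expression is true. -->
    <field name="account" type="digits" cond="has_account">
      <prompt>Please say your eight digit account number.</prompt>
      <!-- Field-level processing block: runs when this field gets a value. -->
      <filled>
        <if cond="account.length != 8">
          <prompt>That did not sound like eight digits.</prompt>
          <!-- Clearing the field makes the FIA visit it again. -->
          <clear namelist="account"/>
        </if>
      </filled>
    </field>

    <!-- Reached once no earlier item still needs visiting. -->
    <block>
      <if cond="has_account">
        <goto next="#menu"/>
      <else/>
        <goto next="signup.vxml"/>
      </if>
    </block>
  </form>

  <form id="menu">
    <block>You are logged in.</block>
  </form>

</vxml>
```

The order of visits is never spelled out: the FIA repeatedly picks the first item in document order that has no value and whose cond holds, which is why clearing account is enough to cause a re-prompt.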

The same interaction is accomplished in VoiceXML using the following code (all the other functionality is the same as in the SALT example):

[Listing: element markup lost in extraction. Surviving prompt text: "Welcome to my speech recognition application." and "Please say your password."]

As you can see, in contrast to the explicit event-driven model of SALT, VoiceXML uses an implicit page execution model. When forms contain more than a single field, VoiceXML’s FIA allows the writing of elementary dialogs in a largely declarative manner. VoiceXML also contains a <subdialog> mechanism, which can be useful for embedding a form-filling dialog from one page inside another.
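
For comparison with the SALT example, a VoiceXML page for the same welcome-and-password interaction could be sketched roughly as follows (VoiceXML 2.0 syntax). The two prompt strings and the iptPIN name are carried over from the SALT example; the form id, grammar file and submission URL are hypothetical stand-ins.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <form id="pinForm">
    <!-- Executed once, on the FIA's first pass through the form. -->
    <block>Welcome to my speech recognition application.</block>

    <!-- Visited next: the prompt is played and the grammar activated. -->
    <field name="iptPIN">
      <prompt>Please say your password.</prompt>
      <!-- Hypothetical grammar file. -->
      <grammar src="password.grxml" type="application/srgs+xml"/>
      <!-- Runs when the field has a value; the URL is hypothetical. -->
      <filled>
        <submit next="http://example.com/checkpin" namelist="iptPIN"/>
      </filled>
    </field>
  </form>

</vxml>
```

There are no event handlers naming the next step here; the sequence falls out of document order and the FIA's rule of visiting items that do not yet have a value.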

Analysis
The differences shown in these examples result largely from the goals of each markup.

In VoiceXML, the <form> provides a unit which contains both the data model (the fields) and a built-in way to navigate the model (the Form Interpretation Algorithm, or FIA). This allows you to build form-filling dialogs which follow a ‘system-initiative’ control model, that is, dialogs for which the system prompts the user for every piece of information. The FIA also allows a degree of simple ‘mixed initiative’ control, where the user is a little freer to provide extra information when the form is first visited. This is useful in the initial design stages of IVR-style telephony dialogs.
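
The simple mixed-initiative case can be sketched in VoiceXML 2.0 with a form-level grammar and an <initial> item. The travel scenario, the field names, the grammar files and the submission URL below are all hypothetical; the sketch also assumes that from_to.grxml returns results whose slots are named after the two fields.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <form id="trip">
    <!-- Form-level grammar: one utterance may fill either field or both. -->
    <grammar src="from_to.grxml" type="application/srgs+xml"/>

    <!-- Open question asked when the form is first visited. -->
    <initial name="start">
      <prompt>Where do you want to travel from and to?</prompt>
      <nomatch count="2">
        Sorry, let's take this one step at a time.
        <!-- Giving the item a value stops the FIA from visiting it again. -->
        <assign name="start" expr="true"/>
      </nomatch>
    </initial>

    <!-- System-initiative fallback: asked only if still unfilled. -->
    <field name="from_city">
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <prompt>Which city are you leaving from?</prompt>
    </field>

    <field name="to_city">
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <prompt>Which city are you travelling to?</prompt>
    </field>

    <!-- Runs once both fields have values, however they were filled. -->
    <filled mode="all" namelist="from_city to_city">
      <submit next="http://example.com/trip" namelist="from_city to_city"/>
    </filled>
  </form>

</vxml>
```

If the caller answers the opening question with both cities, neither field prompt is ever played; if only one city is recognized, the FIA simply moves on to the field that is still empty.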

In SALT, you use the data model and execution environment of the host environment (e.g. HTML forms and scripting). This is typically more familiar to today’s Web developer. It also provides a flexible way to write and tune dialogs, so that complex dialogs, including mixed-initiative and user-initiative dialogs, are firmly under the developer’s control. An event-driven interaction model is also generally more useful for multimodal applications.

Charlie Frank

Charlie has over a decade of experience in website administration and technology management. As the site admin, he oversees all technical aspects of running a high-traffic online platform, ensuring optimal performance, security, and user experience.
