Defining the Application
|Figure 2. Types of Speech-based Applications: You select the type of speech-based application during the creation of a project.|
Both the SASDK and Microsoft Speech Server (MSS) are designed to develop and support two distinct types of speech-based applications: voice-only and multimodal. The developer selects the application type when creating a new project, as shown in Figure 2. The role of the SASDK is to provide a developer-oriented infrastructure that supports both the development and debugging of either type on a local machine. MSS, on the other hand, is designed to provide a production-level, server-side environment for deployed speech-based applications. Figure 3 shows a sample schematic of a production environment that includes Microsoft Speech Server.
Voice-only applications are designed to never expose a visible Web interface to end users. This type of speech application includes both traditional voice-only applications and touchtone, or Dual Tone Multi-Frequency (DTMF), applications. In either case, all interaction with the application takes place through voice or keypad presses, and the result is a menu option or selection based on the user's response. Once deployed, Microsoft Speech Server includes two major components designed to support these types of applications in a production environment. The Telephony Application Service (TAS) provides a voice-only browser, or SALT interpreter, which processes the SALT markup generated by the ASP.NET speech-enabled Web application. The Speech Engine Services (SES) provides the speech recognition engine and also handles the retrieval of the output generated by the application. Finally, the Telephony Interface Manager (TIM) component provides the bridge between the TAS and the telephony board hardware, which is connected to the network.
|Figure 3. Sample Schematic: The figure shows an example of what your production speech environment may contain.|
Multimodal applications, on the other hand, are designed to combine speech input and output with a Web-based graphical user interface. In a traditional Web-based GUI, the user directs the system through a combination of selections and commands. Each action is translated into a simple sentence that the system can execute. Fundamentally, each sentence contains a verb that acts on a direct object. The mouse selection defines the direct object of a command, while the menu selection describes the action to perform. For example, by selecting a document and choosing print, the user is telling the computer to "Print this document." In multimodal systems, speech and mouse input are combined to form more complex commands.
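To make the verb-plus-direct-object idea concrete, here is a minimal sketch in client-side JavaScript. It is illustrative only: the function names and the document name are hypothetical and are not part of the SASDK. It shows the same command being formed once from a menu choice and once from a spoken verb combined with a mouse selection.

```javascript
// Hypothetical sketch: names and shapes here are illustrative, not SASDK APIs.

// Build a command "sentence" from an action (verb) and a selection (direct object).
function buildCommand(verb, directObject) {
  if (!verb || !directObject) {
    throw new Error("A command needs both a verb and a direct object");
  }
  return { verb: verb, directObject: directObject, sentence: verb + " " + directObject };
}

// GUI only: the mouse selects the object and a menu supplies the verb.
const guiCommand = buildCommand("print", "report.doc");

// Multimodal: the mouse still selects the object, but the verb is spoken.
function onSpokenVerb(recognizedText, currentSelection) {
  return buildCommand(recognizedText.trim().toLowerCase(), currentSelection);
}

const multimodalCommand = onSpokenVerb(" Print ", "report.doc");

console.log(guiCommand.sentence);        // print report.doc
console.log(multimodalCommand.sentence); // print report.doc
```

Either input path produces the same command, which is why speech can be layered onto an existing GUI without changing what the application ultimately executes.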
Building Speech Applications
Like any Web-based application, speech applications have two major components: a Web browser component and a server component. In practice, the device that consumes the application ultimately determines the physical location of the Speech Services engine. For example, a telephone or DTMF application natively takes advantage of the server-side features of Microsoft Speech Server, whereas a desktop Web application uses the markup returned by MSS in conjunction with desktop recognition software and the speech add-ins for Microsoft Internet Explorer.
In addition to the default project template, the SASDK also installs a set of speech-enabled ASP.NET controls. By default these controls are added to the Visual Studio toolbox as shown in Figure 4.
|Figure 4. Installed Controls: Here's the set of speech controls installed by the SASDK into Visual Studio 2003.|
elements specified by the markup. Any additional client-side elements are invoked by calling the client-side start() method. Once started, the Speech Services engine listens for input from the user when a <listen> element is invoked. Once it receives the response audio, or utterance, it compares its analysis of the audio stream to what is stored in the grammar file, looking for a matching pattern. If the recognizer finds a match, it returns a special type of XML document. This document contains markup called Semantic Markup Language (SML), which the client uses as the interpretation of what the user said. The client then uses this document to determine what to do next, for example, executing a <prompt> element. The cycle repeats until the application is done or the session ends.
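The listen, match, and interpret cycle described above can be sketched in a few lines of JavaScript. This is a simplified simulation under stated assumptions: the grammar is a plain object rather than a real grammar file, the result is an SML-like object rather than an actual SML document, and the phrases and the recognize and nextStep functions are hypothetical.

```javascript
// Hypothetical sketch of the listen/match/interpret cycle.
// The grammar and result shapes are simplified stand-ins, not the real
// grammar file format or SML schema.

// A toy "grammar": each accepted phrase maps to a semantic value.
const grammar = {
  "turkey": { SubType: "Turkey" },
  "roast beef": { SubType: "RoastBeef" }
};

// Compare an utterance to the grammar; return an SML-like result or null.
function recognize(utterance) {
  const key = utterance.trim().toLowerCase();
  const matched = Object.prototype.hasOwnProperty.call(grammar, key);
  return matched ? { text: key, semantics: grammar[key] } : null;
}

// The client decides what to do next based on the result.
function nextStep(result) {
  if (result === null) {
    return "prompt: Sorry, I did not understand.";
  }
  return "prompt: You chose " + result.semantics.SubType + ".";
}

console.log(nextStep(recognize("Roast Beef"))); // prompt: You chose RoastBeef.
console.log(nextStep(recognize("pizza")));      // prompt: Sorry, I did not understand.
```

In a real application the recognizer, not the page script, performs the matching; the sketch only shows how a structured result drives the client's next action.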
All ASP.NET speech controls are implemented in the framework namespace Microsoft.Speech.Web.UI. Within the namespace, these controls are categorized by function: basic, dialog, application, and call management controls. Call management controls are an abstraction of the Computer Supported Telecommunications Applications (CSTA) messages you'll use in your application.
Like any other ASP.NET Web control, the speech controls are designed to provide a high-level abstraction on top of the lower-level XML and script emitted at run time. To make these controls easier to implement, each control also provides a set of property builders, as shown in Figure 5.
The Basic Speech Controls
|Figure 5. Property Builders: Property builders are used to simplify the design of a speech application.|
The basic speech controls, which include Prompt and Listen, are designed to create and manipulate the SALT hierarchy of elements. These controls provide server-side functionality that is identical to the elements invoked during run time on the client. The Prompt control specifies the content of the audio output. The Listen control performs recognition, post-processing, recording, and configuration of the speech recognizer. The basic controls are primarily designed for tap-and-talk client devices and for applications that confirm responses and manage application flow through a GUI.
The basic speech controls are designed exclusively to be called by client-side script. Examining the "Hello World" example in Listing 1, you will notice that when the user presses the button on the Web page, the OnClick client-side event fires. This event invokes the Start method of the underlying prompt or exposed SALT element. Event processing for the basic speech controls is identical to that of SALT and is based on the system's ability to recognize user input. SALT uses the concept of recognition, or "reco," to describe the speech input resources and to provide event management in cases where valid recognition isn't returned. For example, you write procedures to handle specific events such as "reco" and "noreco" and then assign the names of these procedures to control properties such as OnClientReco. When the browser detects one of these events, it calls the assigned procedure. The procedure is then able to extract information about the event directly from the event object.
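The following sketch shows the general shape of this event pattern in JavaScript. It is a stand-in rather than actual SASDK code: in a real page the browser raises the events and supplies the event object, whereas here a hypothetical dispatch function and plain objects with made-up text and reason fields play those roles.

```javascript
// Hypothetical sketch: in a real page these procedures would be named in
// control properties such as OnClientReco, and the browser would supply the
// event object. Here a plain object stands in for it.

function handleReco(evt) {
  // Extract information about the event from the event object.
  return "Heard: " + evt.text;
}

function handleNoReco(evt) {
  return "No valid recognition (" + evt.reason + "); please try again.";
}

// Stand-in dispatcher that calls the procedure assigned to an event type.
const handlers = { reco: handleReco, noreco: handleNoReco };

function dispatch(type, evt) {
  const handler = handlers[type];
  if (typeof handler !== "function") {
    throw new Error("No handler assigned for event: " + type);
  }
  return handler(evt);
}

console.log(dispatch("reco", { text: "hello world" }));  // Heard: hello world
console.log(dispatch("noreco", { reason: "timeout" }));  // No valid recognition (timeout); please try again.
```

Keeping the success and failure paths in separate procedures mirrors the way the controls expose one property per event.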
The Listen control is a server-side representation of the SALT <listen> element. The <listen> element specifies possible speech inputs and provides control of the speech recognition process. By default, only one <listen> element can be active at a time. However, a Web page can have more than one Listen control, and each control can be used more than once.
The following code represents the HTML markup when a Listen control is added to a Web page.
As you can see, the main elements of the Listen control are grammars. Grammars are used to direct speech input to a particular recognition engine. Once the audio is recognized, the resulting text is converted and placed into an HTML output.
Dialog Speech Controls
The dialog speech controls, which include the QA, Command, and SemanticItem controls, are designed to build the questions, answers, statements, and digressions for an application. Programmatically, these controls are called through the script element RunSpeech, which manages both the execution and state of these controls. RunSpeech activates a Dialog speech control using the following steps:
- RunSpeech establishes the Speech Order of each control based on the control's source order or SpeechIndex property.
- RunSpeech examines the Dialog speech controls on the page in Speech Order, locates the first dialog control in that list, and initializes it.
- RunSpeech submits the page.
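The ordering step above can be sketched as follows. This is not the real RunSpeech script, and it rests on an assumption the article does not spell out: controls with a positive SpeechIndex are taken first in ascending order, and the remaining controls keep their source order. The control ids and the speechOrder and activateFirst functions are hypothetical.

```javascript
// Hypothetical sketch of establishing Speech Order; this is not the real
// RunSpeech script. Assumption: controls with a positive speechIndex come
// first in ascending order, and the rest keep their source order.

function speechOrder(controls) {
  // Remember each control's position in the page source.
  const withSource = controls.map((c, i) => ({ ...c, sourceOrder: i }));
  const indexed = withSource
    .filter(c => c.speechIndex > 0)
    .sort((a, b) => a.speechIndex - b.speechIndex || a.sourceOrder - b.sourceOrder);
  const unindexed = withSource.filter(c => !(c.speechIndex > 0));
  return indexed.concat(unindexed);
}

// Locate the first dialog control in Speech Order (the one to initialize).
function activateFirst(controls) {
  const ordered = speechOrder(controls);
  return ordered.length > 0 ? ordered[0].id : null;
}

const page = [
  { id: "Welcome", speechIndex: 0 },
  { id: "SubType", speechIndex: 2 },
  { id: "Size", speechIndex: 1 }
];

console.log(speechOrder(page).map(c => c.id).join(", ")); // Size, SubType, Welcome
console.log(activateFirst(page));                          // Size
```

If every control leaves SpeechIndex at its default, the sketch falls back to plain source order, which matches the first step in the list.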
The QA control (Microsoft.Speech.Web.UI.QA) is used to ask questions and obtain responses from application users. It can serve as a standalone prompt statement, or it can supply answers for multiple questions without having to ask them. Here's an example of how you can mark up this control.
<speech:QA id="SubType" runat="server">
"What type of submarine sandwich would
<Reco InitialTimeout="2000" BabbleTimeout="15000"
The Command control (Microsoft.Speech.Web.UI.Command) enables developers to add out-of-context phrases or dialogue digressions. These are the statements that occur during conversations that don't seem to make sense for the given dialog. A common example is allowing an application user to say "help" at any point. The following is an example of how you can apply this globally to a speech application.
<speech:command id="HelpCmd" runat="server"
<grammar id="GlobalHelpCmd" runat="server"
The SemanticMap and SemanticItem controls track the answers and the overall state of the dialogue. You use SemanticItem controls to store elements of contextual information gathered from a user, and you use the SemanticMap to group the SemanticItem controls together. While the SemanticMap simply provides a container for multiple semantic items, each SemanticItem maintains its own state, such as empty, confirmed, or awaiting confirmation. Keep in mind that while the QA control manages the overall semantics of invoking recognition, the storage of the recognized value is decoupled from the control. This simplifies state management by enabling centralized application state storage. It also makes it very easy to implement mixed-initiative dialog in your application. In a mixed-initiative dialog, both the user and the system direct the dialog. For example, the markup for these controls would look like the following.