Design and Implement a Voice-only Web Application in ASP.NET

If you think a voice-enabled commerce solution is still too sci-fi for you to take seriously, then you haven’t been paying attention. Voice recognition technology has arrived; it’s stable, it’s standardized, it works, and there’s demand for it. In this article, we will take you step-by-step through the process of building a demo application called “Commerce Voice.”

“Commerce Voice” is a voice-only version of the “IBuySpy” Commerce Sample in the ASP.NET Starter Kit. The application was built using Microsoft’s .NET Speech SDK.

Commerce Voice will show how to create a voice-only service from an existing Web application, by leveraging the existing business and data layers of the IBuySpy sample it is based on. To this end, the Web-based version is included with this sample to illustrate how the two presentation layers work together simultaneously on the same data. It is possible to place an order in the voice-only application and see that order immediately reflected in the Web version.

In the process we will demonstrate best-practice programming and design techniques for using the Speech SDK. The tools in the SDK allow the programmer a great deal of flexibility in making design decisions. We have made it a priority to show a consistent set of best practices for developing voice-only applications.

The finished application allows end users to do all of the following:

  • Order products quickly and easily by product number.
  • Browse and learn about products through a voice-based catalog.
  • Use a shopping cart, including the ability to add, review, modify, and remove products.
  • Review previous orders, including totals, dates, and products ordered.
  • Log in to an account securely using Windows Authentication.

Finally, this white paper includes lessons learned from the testing, design, and development stages, as well as thoughts about the differences between building visual applications for the Web and speech applications for telephony.

Using the Web-Based Application as a Development Blueprint
Commerce Voice shares its business- and data-layers with the Web-based ASP.NET Starter Kit sample. More information on that application can be found here:

ASP.NET Starter Kit Commerce Sample Whitepaper

The sample is a fictional commerce Website for selling spy-related products called “IBuySpy.” The purpose of the Web version is to showcase best-practice techniques for building a Web-based application in ASP.NET.

Code-Reuse
In essence, a voice-only version of an existing application is in fact a new presentation layer. The user interface becomes auditory rather than graphical. This means that the business logic and the data layer should essentially remain unchanged.

With a few exceptions, we have followed this concept as a development guideline. In the sample VS.NET solution, note that the CommerceVoice project file (see Figure 1) includes a reference to the Components folder in CommerceWeb (see Figure 2), the included Web-based version. Since the two applications share this same code, orders that occur in one interface are immediately reflected in the other.

Figure 1: The CommerceVoice project file includes a reference to the Components folder in the CommerceWeb application shown in Figure 2.


During the course of application development, we found it necessary to make minor additions to the business and data layers. These changes are outlined in the Lessons Learned section at the end of this document.

Suggested Enhancements
While the CommerceVoice sample application provides the implementation for the core features originally implemented in the Commerce Starter Kit, the following are ideas for extending the functionality of the CommerceVoice application:

Figure 2:This CommerceVoice project file includes a reference to the Components folder in the CommerceWeb application shown in Figure 1.

  • Add a Search Feature: Use the product names from the database to construct a grammar that allows users to find a product quickly. The grammar might be a long list of items, or more of a broad tree of items.
  • Add the ‘Most Popular Item List’ feature: Use the built-in capabilities of the Commerce Web components to prompt the user with the most popular products that week. Determine where and how to prompt the user without interfering with the overall usability of the application.
  • Support New User Registration: In the current CommerceVoice implementation the user must already have an account with IBuySpy.com in order to use the voice-only application. Extend the CommerceVoice application to allow the user to create an account. The challenge here would be how to handle the entry of the user’s personal information (e.g., name, username, password, and PIN). One solution would be to limit the amount of information required to set up an account.
  • Add Product Reviews: Add a way for users to review products using CommerceVoice. One way to implement this is to allow the user to record a WAV file that would be attached to a product.

Designing the Application
We designed the system for a group of target users ranging from novices with little or no experience using voice-only systems to technophiles with a lot of experience. With this in mind, we tried to include advanced features to enable an experienced user to navigate the system quickly, while keeping the system simple and well-explained enough so that a novice would not feel lost.

Target User and Voice Personality
For the personality of the speaking voice, we had two goals:

  • Speed: The recorded sentences should be spoken at a pace that lets a new user understand them easily and still have time to commit several commands to memory. An appropriate speaking pace helps usability by striking a balance between speaking so fast that users miss options and speaking so slowly that they begin to lose attention.
  • Mood: The system’s voice should be friendly, patient, and may use a bit of accentuation. Any voice-based system should make a user feel good using the system, both for the sake of usability and of providing a good user experience with the company.

Navigation Design
Designing a voice-only system is much different from designing traditional GUI-based applications. Whereas a Web page is a two-dimensional interface, the voice medium is one-dimensional. For example, a table of data on a Web page needs to be read item by item over the phone. As one designer put it, the challenge becomes, “How do you chop things up to establish a coherent flow of information? How do you express content in a way that the user can digest, understand, and then act upon?”

Start With a User-Centered Design Approach
We started our design process by following our standard methodology of user-centered design. The 80/20 rule is a good guide: 80 percent of the users use 20 percent of the application. We focused on ideal scenarios and common user paths rather than considering exceptional cases in the preliminary stages. We acted out sample dialogues that helped us get a better sense of how a typical conversation might go.

From these sample dialogues, we began creating high-level flow charts for each major component of the system (see Figure 3).

Figure 3: This diagram illustrates the high-level flow of the application.

In addition to the flow diagram above, several global commands are available to the user throughout the application:

  • Main Menu: Returns the user to the main menu.
  • Help: Provides the user with context-sensitive help text at any prompt.
  • Instructions: Provides instructions on the basic usage of the system and global commands available to them at any point.
  • Repeat: Repeats the most relevant last prompt. If the last prompt informed the user that his/her input was invalid, the repeat text will provide the user with the previous question prompt instead of repeating the error message.
  • Representative: Transfers the user to a customer service representative.
  • Goodbye: Ends the call.

Special Case: Implicit Confirmation
One of the more interesting navigational scenarios in the Commerce Application occurs when the user enters a product ID after saying “Start Shopping” from the main menu. We wanted to take advantage of the Speech SDK’s “implicit confirmation” feature here: if the product ID is recognized with high confidence and the recognized ID exists in the system, we want to bypass the explicit confirmation of that prompt. A typical scenario might look like this, illustrated by the flow diagram (see Figure 4).

System: If you know the three-digit product number of the item you want, say it now. If not, say browse.

User: 3 5 5 (Mumbled, recognized with low confidence).

System: I understood, 3, 5, 5, Rain Racer 2000. Is this correct?

User: No, I said 3 5 9 (Clearer, recognized with high confidence).

System: You selected product 3 5 9, Escape Vehicle (Water). How many would you like?

Figure 4: This high-level flow diagram for a typical scenario shows how the system would interact with a user to obtain and confirm a product ID number.

This scenario combines three of the Speech SDK’s user input types: answers, extra answers, and confirmations. Together they make complicated flow-control situations possible.
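
The decision at the heart of this scenario can be sketched in plain script. This is only an illustration of the branching logic, not the Speech SDK's own mechanism: the threshold value, the function name NextStepForProductId, the in-memory product table, and the "not found" wording are all assumptions made for the example. Only the two product names and the confirmation wording come from the dialogue above.

```javascript
// Hypothetical confidence threshold; the article does not state the value used.
var IMPLICIT_CONFIRM_THRESHOLD = 0.8;

// Stand-in for the catalog lookup (the two products named in the sample dialogue).
var products = {
   "355": "Rain Racer 2000",
   "359": "Escape Vehicle (Water)"
};

// Decides what the system should do with a recognized product ID.
function NextStepForProductId(productId, confidence)
{
   // Unknown IDs are never confirmed, whatever the confidence.
   if (!Object.prototype.hasOwnProperty.call(products, productId))
      return {
         action: "reprompt",
         prompt: "Sorry, I could not find that product."  // hypothetical wording
      };

   var name = products[productId];
   var digits = productId.split("");

   // High confidence and a valid ID: skip the explicit confirmation.
   if (confidence >= IMPLICIT_CONFIRM_THRESHOLD)
      return {
         action: "askQuantity",
         prompt: "You selected product " + digits.join(" ") + ", " +
            name + ". How many would you like?"
      };

   // Low confidence: ask the user to confirm what was heard.
   return {
      action: "confirm",
      prompt: "I understood, " + digits.join(", ") + ", " +
         name + ". Is this correct?"
   };
}
```

In the sample dialogue, the mumbled "3 5 5" would take the "confirm" branch and the clearer "3 5 9" would go straight to "askQuantity".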

Prompt Design
The design team found creation of a prompt specification document to be a challenge in itself. The number of paths available to the user at any one prompt leads to a complicated flowchart diagram that, while technically accurate, loses a sense of the conversation flow that the designers had worked to achieve. The design team arrived at a compromise specification that allowed them to illustrate an ideal scenario while also handling exceptions. The following example illustrates the beginning of the “Start Shopping” scenario from the main menu:

Prompt: Main Menu

Expected User Input: “Start Shopping”

Recognition and System Response:

  • Recognized Expected Input: Remember, you can start over by saying main menu. If you know the three-digit product number of the item you want, say it now. If not, say browse.
  • Recognized Alternate Input (“Help”): You have reached the IBuySpy store. Our store is pretty simple. If you want to shop, say start shopping. To review your previous orders say review previous orders.

Prompt: Start Shopping

Expected User Input: “3 5 5”

Recognition and System Response:

  • Recognized Expected Input: You selected product 3 5 5, Rain Racer 2000. How many would you like?
  • Recognized Alternate Input (“Help”): You can place orders quickly by saying the three-digit product number. Say each digit with a clear pause between each number or enter it on your Touch-Tone phone. If you don’t know the product number say browse.

This format of specifying functionality makes it very easy to conduct “Wizard of Oz”-style testing. In this scenario, the test subject calls a tester who has the functional documents in front of him/her. The tester acts as the system, prompting the test subject as the system would and responding to the subject’s input in the same way. Trouble spots are easily identified and fixed using this style of testing.

How It Works
The following section is devoted to the architecture of the system. We start with an explanation of common user controls and common script files. Then we will go into detail on the browse feature, which provides a good encapsulation of many of the programming techniques used throughout the application. Finally, we’ll review some of the coding conventions and practices we used as best-practice techniques for development.

Common Files: User Controls
Two ASP.NET user controls are included on almost every page in our application. Together they encapsulate much of the functionality of the site, and each deserves discussion. As in Web application design, user controls in the ASP.NET Speech SDK can be used to provide a consistent user experience while saving a great deal of code.

GlobalSpeechElements.ascx
The GlobalSpeechElements user control is required on every page of the application (except for Goodbye.aspx and RepresentativeXfer.aspx, which do little more than read a prompt and transfer the user away). It contains the main stylesheet that defines common properties of the controls used throughout the application, as well as global command controls and common script files that provide client-side functional components.

  • MainStyleSheet: The Speech SDK style control is a powerful way of defining global application settings and assigning globally scoped functionality. In the Commerce sample we have four different styles:
    • BaseCommandStyle: This style is applied to all command controls. Its one attribute sets the AcceptCommandThreshold at .6, meaning that any command must be recognized with at least a 60 percent confidence rating to be accepted.
    • GlobalCommandStyle: This style is applied only to the six global commands contained in GlobalSpeechElements. This style inherits the attributes of BaseCommandStyle and adds a dynamically set scope attribute. We want global commands to apply to all controls on any page they are included in, so we set the scope to be the parent page’s ID at runtime.
    • BaseQAStyle: This style is applied to all QA controls that accept user input (QA controls which do not accept user input are called “Statements” and use the StatementQA style below). In addition to setting timeout and confidence thresholds, this style also defines the OnClientActive event handler for all QA controls. HandleNoRecoAndSilence is a JScript event handler that monitors a user’s unsuccessful attempts to say a valid response and transfers the user to customer service after enough unsuccessful events. It is described in the section on Common Script files below.
    • StatementQA: For QA controls that do not accept user input, we want to disable BargeIn (the act of interrupting a prompt with a response before it ends) and turn on PlayOnce, which ensures the prompt is not repeated. Normal QA controls are activated when their semantic item is empty; since Statement QA controls have no semantic item, the control would be played over and over again if PlayOnce were turned off.
  • Global Commands: The global commands in GlobalSpeechElements (described in the Navigation Design section) each have associated with them a command grammar file that defines how the command is activated.
    Figure 5: One category of global commands affects the current prompt.

    Commands fall into two categories: those that affect the current prompt, such as HelpCmd, InstructionsCmd, and RepeatCmd (see Figure 5), and those that trigger an event (RepresentativeCmd, GoodbyeCmd, MainMenuCmd). For the former, the prompt function looks for a particular Type value in its lastCommandOrException parameter and creates an appropriate command. For the latter, the command’s associated OnClientCommand event handler is executed.

                  
  • Common Script File Includes: GlobalSpeechElements is an ideal place to include references to all global script files. These files constitute all global client-side event handlers and prompt generation/formatting routines for the application. Since they are included in the control, individual pages can rely on their availability without explicitly including them.
    

SelectableNavigator.ascx
The SelectableNavigator user control is used on any page that needs a dynamically generated list of items from which the user may select. While the Navigator application control included in the Speech SDK can read a list of items, it does not allow the user to select one of the items in the list. The SingleItemChooser application control does allow the user to select an item, but it is unwieldy for large lists. The SelectableNavigator contains a Navigator application control, as well as a QA control and a Command control (see Figure 6).

Figure 6: The SelectableNavigator control contains a Navigator application control, a QA control, and a Command control.
  • InitialStatement: The prompt of the InitialStatement QA is used to tell the user something about the list. Originally, we had this initial statement as part of what the navigator says for its first item. However, if the user mumbles, this initial statement and the first item are lost. Since we wanted to ensure the user heard the first item, we separated the initial statement from the first item. This way, even if the user mumbles during the initial statement, they will still hear the first item after the system recovers.
  • TheNavigator: The Navigator takes care of the tasks associated with reading and navigating through the list of items associated with the control.
  • SelectCmd: This command, scoped to TheNavigator, allows the user to select an item. The grammar for this command may be specified dynamically by setting the IsDynamic property of the SelectableNavigator to true (the default). The grammar always contains at least “Select” and “That one” to select the current (most-recently read) item but if this flag is set, the grammar also contains the items found in the first field specified in the DataHeaderFields property. Thus, if the user is on the first item and says the name of the fifth item, the fifth item is selected.
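
The effect of the IsDynamic flag on the select grammar can be illustrated with a small stand-alone sketch. It models the grammar as a plain list of phrases and does simple string comparison, which is an assumption made for illustration; the real control builds a speech grammar from the DataHeaderFields data. The function names and the item names in the usage note are hypothetical.

```javascript
// Builds the list of phrases the select command would accept.
// "Select" and "That one" are always present; item names are
// added only when isDynamic is true.
function BuildSelectPhrases(itemNames, isDynamic)
{
   var phrases = ["Select", "That one"];

   if (isDynamic)
      for (var i = 0; i < itemNames.length; i++)
         phrases.push(itemNames[i]);

   return phrases;
}

// Maps what the user said to an item index, or -1 if nothing matches.
// currentIndex is the most recently read item.
function ResolveSelection(spoken, itemNames, currentIndex, isDynamic)
{
   var normalized = spoken.toLowerCase();

   if (normalized == "select" || normalized == "that one")
      return currentIndex;

   if (isDynamic)
      for (var i = 0; i < itemNames.length; i++)
         if (itemNames[i].toLowerCase() == normalized)
            return i;

   return -1;
}
```

With a hypothetical list such as ["Alpha", "Bravo", "Charlie", "Delta", "Echo"], a user listening to the first item who says "Echo" would select index 4 when isDynamic is true, and would not be recognized when it is false.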

Since user controls have no designer support, all properties are set programmatically in the code-behind file. In addition, this means dragging the user control onto a page does not add additional speech mark-up to the page as does dragging one of the Speech SDK’s controls, such as a QA, onto the page. This isn’t normally a problem since you will usually want other speech controls on the page, such as a semantic map. However, something else you don’t automatically get is a prompt function file to keep the prompt function for the SelectableNavigator. Again, this isn’t normally a problem so long as there are other controls on the page that will want a prompt function. For instance, using the Property Builder for a QA to add a prompt function will automatically add a prompt function file if one does not already exist for the page. After doing so, you may add a prompt function to that file yourself for a SelectableNavigator on the same page.

Because the Speech SDK is very client-side-heavy, there are two client-side JScript functions to write to handle two events fired by the SelectableNavigator, OnCancel and OnSelect.

  • OnCancel: The SelectableNavigator fires this event if the user says “cancel” while in the SelectableNavigator. Since the Navigator’s built-in cancel command deactivates the Navigator, RunSpeech will skip the SelectableNavigator during subsequent iterations.
  • OnSelect: The SelectableNavigator fires this event if the user selects an item, either by saying “select” or, if IsDynamic is true, by saying the name of an item. Return true from this handler to deactivate the SelectableNavigator.

The client- and server-side properties and methods exposed by the SelectableNavigator are documented in SelectableNavigator.html.

Common Files: Client-Side Scripting
The globally scoped client-side script files for the application are:

  • Speech.js: NoReco/Silence event handler and object accessors
  • Routines.js: String-formatting routines
  • Debug.js: Client-side debugging utilities
  • CommerceV.js: Global Navigation Event Handlers
  • PromptGenerator.js: Prompt Generation Utility

A few of the more interesting functions of these scripts are outlined below:

HandleNoRecoAndSilence (Speech.js)
HandleNoRecoAndSilence takes care of handling cases where the user repeatedly responds to a prompt with silence or with an unrecognizable input. To avoid frustration, we don’t want to repeat the same prompt over and over again. This function, executed each time a QA is made active, counts the number of consecutive times the input is invalid. It increments a counter that the prompt generation utility (see below) uses to generate an appropriate prompt. If the count exceeds a maximum (in this application, three), we redirect the user to a Customer Service Representative.

This function is defined as the OnClientActive event handler for the BaseQAStyle in the GlobalSpeechElement’s MainStyleSheet. Each QA that accepts user input must use this style in order for the function to be called correctly.

   function HandleNoRecoAndSilence(
      eventSource, lastCommandOrException,
      count, semanticItemList)
   {
      // Reset the counter on the first pass (count == 1).
      if (count == 1)
         PromptGenerator.noRecoOrSilenceCount = 0;

      if (lastCommandOrException == "Silence" ||
          lastCommandOrException == "NoReco")
      {
         PromptGenerator.noRecoOrSilenceCount++;

         // Too many consecutive failures: transfer to a representative.
         if (PromptGenerator.noRecoOrSilenceCount >=
            representativeXferCount)
            Goto(representativeXferPage);
      }
      else
      {
         // Any other outcome resets the counter.
         PromptGenerator.noRecoOrSilenceCount = 0;
      }
   }
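
To see how the counter behaves, the handler can be exercised outside the Speech SDK with a few stubs. In this sketch the Goto stub, the value 3 for representativeXferCount, the page name assigned to representativeXferPage, and the sequence of simulated events are all assumptions made for illustration; the handler body is the one shown above.

```javascript
// Stubs standing in for objects the real pages provide (assumptions).
var PromptGenerator = { noRecoOrSilenceCount: 0 };
var representativeXferCount = 3;                       // assumed maximum
var representativeXferPage = "RepresentativeXfer.aspx"; // assumed target page
var visited = [];
function Goto(page) { visited.push(page); }            // records instead of navigating

function HandleNoRecoAndSilence(
   eventSource, lastCommandOrException,
   count, semanticItemList)
{
   if (count == 1)
      PromptGenerator.noRecoOrSilenceCount = 0;

   if (lastCommandOrException == "Silence" ||
       lastCommandOrException == "NoReco")
   {
      PromptGenerator.noRecoOrSilenceCount++;

      if (PromptGenerator.noRecoOrSilenceCount >=
         representativeXferCount)
         Goto(representativeXferPage);
   }
   else
   {
      PromptGenerator.noRecoOrSilenceCount = 0;
   }
}

// Simulated call: the QA becomes active, then the user fails three times.
HandleNoRecoAndSilence(null, "", 1, null);        // first activation, counter = 0
HandleNoRecoAndSilence(null, "Silence", 2, null); // counter = 1
HandleNoRecoAndSilence(null, "NoReco", 3, null);  // counter = 2
HandleNoRecoAndSilence(null, "Silence", 4, null); // counter = 3, transfer requested
```

Under these assumptions the third consecutive failure triggers the transfer, and any valid command in between (for example "Help") would have reset the counter to zero.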

Navigator Functions (Speech.js)
The Navigator functions make working with the Navigator application control easier:

  • ActivateNavigator(navigatorName, active):
    In the Speech SDK, speech-controls are activated and deactivated by modifying the semantic state of the control’s associated Semantic Item. The same is true for Navigator application controls, though the semantic item is hidden from the user. In order to make activation and deactivation of Navigators simpler, we created a function that sets the Navigator’s “ExitSemanticItem” to some dummy value. If the value is empty, the Navigator is activated. If not, the Navigator is inactive.
   function ActivateNavigator(navigatorName, active)
   {
      var si = eval(navigatorName + "_ExitSemanticItem");

      if (active || arguments.length == 1)
         si.Clear();
      else
         si.SetText("x", true); // value can be anything

      return active;
   }
  • GetNavigator(navigatorName): Returns a Navigator object reference given its name as a string.
  • GetNavigatorCount(navigatorName): Returns the count of items in the given navigator.
  • GetNavigatorData(navigatorName, columnName): Returns the data contained in the currently-selected row of the specified navigator’s specified column.
  • GetNavigatorQA(navigatorName): Returns a reference to a Navigator’s internal QA control.

Prompt Generation (PromptGenerator.js)
Prompt Generation is perhaps the most central element when creating a successful voice-only application. Providing a consistent voice interface is essential to creating a successful user experience. PromptGenerator.js does just this by encapsulating all common prompt-generation functionality in one place.

A prompt function in a typical page will always return the result of a call to PromptGenerator.Generate() as its prompt:

   return PromptGenerator.Generate(
      lastCommandOrException,
      count,
      "Prompt Text Here",
      "Help Text Here"
   );

Notice that the prompt function passes both its main prompt and its help prompt into the function every time. PromptGenerator.Generate() decides the appropriate prompt to play given the current lastCommandOrException, the NoReco/Silence state (see the topic HandleNoRecoAndSilence), and other factors:

   function PromptGenerator.Generate(
      lastCommandOrException, count, text, help)
   {
      help += " You can always say Instructions " +
         "for more options.";

      switch (lastCommandOrException)
      {
         case "NoReco":
            if (PromptGenerator.noRecoOrSilenceCount > 1)
               return "Sorry, I still don't understand " +
                  "you. " + help;
            else
               return "Sorry, I am having trouble " +
                  "understanding you. " +
                  "If you need help, say help. " + text;
         case "Silence":
            if (PromptGenerator.noRecoOrSilenceCount > 1)
               return "Sorry, I still don't hear you.  " +
                  help;
            else
               return "Sorry, I am having trouble " +
                  "hearing you. " +
                  "If you need help, say help. " + text;
         case "Help":
            PromptGenerator.RepeatPrompt = help;
            return help;
         case "Instructions":
            var instructionsPrompt =
               "Okay, here are a few instructions...";
            PromptGenerator.RepeatPrompt = instructionsPrompt
               + text;
            return instructionsPrompt;
         case "Repeat":
            return "I repeat: " +
               PromptGenerator.RepeatPrompt;
         default:
            PromptGenerator.RepeatPrompt = text;
            return text;
      }
   }
Author’s Note: Some of the longer strings have been shortened here to save space.

A note on “Repeat”: The PromptGenerator.RepeatPrompt variable stores the current text that will be read if the user says “Repeat.” The first time the function is executed for any prompt, the RepeatPrompt will be set to the standard text. The RepeatPrompt is then only reset when the user says “Help” or “Instructions.”
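
The interaction between “Repeat” and the other cases is easier to follow in a trace. The sketch below is a reduced model of the switch statement above, keeping only the cases that matter for RepeatPrompt; the names RepeatState and GenerateSketch and the sample prompt strings in the usage note are hypothetical.

```javascript
// Reduced stand-in for PromptGenerator, keeping only the repeat state.
var RepeatState = { repeatPrompt: "" };

// Simplified version of the Generate() switch: only the cases that
// read or write the repeat text are modelled.
function GenerateSketch(lastCommandOrException, text, help)
{
   switch (lastCommandOrException)
   {
      case "NoReco":
         // Error prompts do not overwrite the repeat text.
         return "Sorry, I am having trouble understanding you. " + text;
      case "Help":
         RepeatState.repeatPrompt = help;
         return help;
      case "Repeat":
         return "I repeat: " + RepeatState.repeatPrompt;
      default:
         RepeatState.repeatPrompt = text;
         return text;
   }
}
```

Calling GenerateSketch with "", "NoReco", "Repeat", "Help", and "Repeat" in that order, with a question text of "How many would you like?" and a help text of "Say a number.", yields the question, the error message, "I repeat: How many would you like?", the help text, and "I repeat: Say a number." This matches the behaviour described for the Repeat global command: after an error message, Repeat returns the question rather than the error.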

Other PromptGenerator functions: PromptGenerator also includes a number of other functions for generating prompts in the application. They include:

  • GenerateNavigator(lastCommandOrException, count, text, help): This function adds to the functionality of Generate() by including standard prompts commonly needed while in a Navigator control. These prompts include additional help text and messages for when the user tries to navigate beyond the boundaries of the navigator.
  • ConvertNumberToWords(number, isMoney): To play recorded prompts for any possible number value, we convert numbers (e.g., 123,456) to a readable string (e.g., “one hundred twenty three thousand four hundred fifty six”). This reduces the number of unique words that must be recorded to a manageable amount.
  • ConvertDateToWords(dateString): Like ConvertNumberToWords, this function converts dates to a prompt-ready format (e.g., “12/2/02” becomes “December Second Two Thousand Two”).
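
The number conversion described above can be approximated with a short routine that works in groups of three digits. This is a minimal sketch, not the sample's ConvertNumberToWords: the function and table names are hypothetical, the isMoney option is not modelled, and the input is assumed to be a non-negative integer below one trillion.

```javascript
var ONES = ["zero", "one", "two", "three", "four", "five", "six",
   "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
   "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"];
var TENS = ["", "", "twenty", "thirty", "forty", "fifty",
   "sixty", "seventy", "eighty", "ninety"];
var SCALES = ["", "thousand", "million", "billion"];

// Converts a value from 1 to 999 into an array of words.
function ThreeDigitsToWords(n)
{
   var words = [];

   if (n >= 100)
   {
      words.push(ONES[Math.floor(n / 100)]);
      words.push("hundred");
      n = n % 100;
   }
   if (n >= 20)
   {
      words.push(TENS[Math.floor(n / 10)]);
      n = n % 10;
   }
   if (n > 0)
      words.push(ONES[n]);

   return words;
}

// Converts a non-negative integer (below one trillion) to words.
function NumberToWordsSketch(number)
{
   if (number == 0)
      return "zero";

   var groups = [];
   var scale = 0;

   // Peel off three digits at a time, least significant group first.
   while (number > 0)
   {
      var chunk = number % 1000;
      if (chunk > 0)
      {
         var words = ThreeDigitsToWords(chunk);
         if (SCALES[scale])
            words.push(SCALES[scale]);
         groups.unshift(words.join(" "));
      }
      number = Math.floor(number / 1000);
      scale++;
   }

   return groups.join(" ");
}
```

Because every result is built from the same few dozen words, each word only has to be recorded once, which is the point made in the list above.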

Designing Your Grammar
Items in your grammar files define what words and phrases are recognized. When the Speech engine matches an item from the grammar file, it returns SML, or Semantic Markup Language, which your application uses to extract definitive values from the text that the user spoke. Too strict a grammar leaves users little flexibility in what they can say; too many unnecessary grammar items, however, can lower recognition accuracy.

Preambles and Postambles
Very often, you will want to allow a generic “preamble,” text said before the main item, and “postamble,” text said after the main item. For instance, if the main command is “Buy Stock,” you would want to allow the user to say “May I Buy Stock please?”

Typically, you can use one grammar (.grxml) file for your preambles and one for your postambles. Within your other grammar rules, you can then reference the pre- and post-ambles by using RuleRefs.

Tip: Make the pre- and post-ambles generic and robust enough that you don’t limit your users’ experience, but keep them reasonable in size so that you don’t risk lowering recognition accuracy for your main elements.
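
The idea of optional preambles and postambles around a main command can be illustrated with ordinary string handling. This is not how the speech engine evaluates a .grxml grammar; it is only a sketch of the "optional prefix, required command, optional suffix" shape. The phrase lists and function names are hypothetical.

```javascript
// Hypothetical phrase lists; a real application would keep these
// in its preamble and postamble grammar files.
var preambles = ["may i", "i would like to", "please"];
var postambles = ["please", "thank you"];

// Removes at most one preamble and one postamble from an utterance.
function StripAmbles(utterance)
{
   var text = utterance.toLowerCase()
      .replace(/[?.!,]/g, "")      // drop punctuation
      .replace(/\s+/g, " ")        // collapse whitespace
      .replace(/^ | $/g, "");      // trim ends

   for (var i = 0; i < preambles.length; i++)
   {
      var prefix = preambles[i] + " ";
      if (text.indexOf(prefix) == 0)
      {
         text = text.substring(prefix.length);
         break;
      }
   }

   for (var j = 0; j < postambles.length; j++)
   {
      var suffix = " " + postambles[j];
      if (text.slice(-suffix.length) == suffix)
      {
         text = text.substring(0, text.length - suffix.length);
         break;
      }
   }

   return text;
}

// True when the utterance is the command, with or without ambles.
function MatchesCommand(utterance, command)
{
   return StripAmbles(utterance) == command.toLowerCase();
}
```

With these lists, both "Buy Stock" and "May I Buy Stock please?" match the command "Buy Stock". Every phrase added to either list widens what is accepted, which mirrors the trade-off described in the tip above.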

Use the Grammar Editor tool to graphically set up grammar files (see Figure 7). The basic task is to set up a text phrase or a list of phrases, and then assign a value that you want your application to use when each phrase is recognized.

Figure 7: Use the Grammar Editor tool to graphically set up grammar files.

We found that the following strategies helped us in grammar development:

Typically, if we only need to recognize that a text phrase has been matched, especially in the case of commands, we fill in the Value field with the empty string rather than a value. For example, if you want to capture when the user says “Help,” you can simply return the following SML:

                     
Figure 8: Use rule references within grammar files to avoid duplicating the same rule across different speech controls.

The control associated with this grammar file recognizes the phrase, and returns the SML element “GoHelp”; the code-behind or client-side script makes a decision based on the SML element being returned, rather than the value.

Use rule references within grammar files to avoid duplicating the same rule across different speech controls. Tip: You must make sure that a rule to be referenced is a public rule, which you can set through the properties pane (see Figure 8).

A common grammars file is included with the Speech SDK, both in an XML file version (cmnrules.grxml) and in a smaller, faster compiled version (cmnrules.cfg). We copied the compiled version into our project and used it for commonly used grammar elements, such as digits and letters in the alphabet.

Coding Conventions
Server-Side Programming
Unlike traditional ASP.NET programming, the Speech SDK is primarily a client-side programming platform. Although its controls are instantiated and their properties manipulated on the server-side, controlling flow from one control to another is primarily a client-side task.

The controls offer opportunities to post back to the server automatically, including the SemanticItem’s AutoPostBack property and an automatic postback when all QAs on a page are satisfied. As a convention, though, we chose to only post back when we needed to access data or business layer functions. Most of our code is written through client-side event handlers, using SpeechCommon.Submit() to post back explicitly when data was needed from the server.

Client-side Scripts
Because JScript lacks many of the scoping restrictions found in C# or VB.NET, it is possible when programming on the client-side to perform a certain task in many different places. The SpeechCommon object is accessible from any client-side script; its Submit() method can be executed from event handlers, prompt functions, or any helper routines as well. For this and other reasons, we have followed a set of guidelines for the usage of these various components:

  • Prompt Functions Are Only For Generating Prompts: Never perform an action inside a prompt function that is not directly related to the generation and formatting of a prompt: no navigation flow, semantic item manipulation, etc. Besides good practice, the other key reason for reserving prompt functions only for generating prompts is validation. If prompt functions contain calls to SpeechCommon or other in-memory objects, those objects must be declared and their references included in the “Validation References” for the prompt function. If these references are not included, validation will fail for the function. As a rule, the only functions referenced by prompt functions are in PromptGenerator.js. One exception to this rule was necessary. Navigator application controls do not expose events that are equivalent to OnClientActive, or which fire each time a prompt function is about to be executed. For QA controls, we use OnClientActive to call HandleNoRecoAndSilence, which monitors consecutive invalid input for a QA. We expect future versions of the SDK to expose this type of event in the Navigator control, but until then, we call HandleNoRecoAndSilence from PromptGenerator.GenerateNavigator.
  • No Inline Prompts: Inline prompts may seem attractive when the prompt text never changes, but they introduce a maintenance issue when being used with recorded prompts. Unlike prompt functions, which reference prompt databases through the values in web.config, inline prompts explicitly copy these values into the prompt tags. Should the location of the prompt database change (as it will most likely do between development, staging, and production) each inline prompt must be modified in addition to the web.config file. Since the cost of using a prompt function is so low, we avoided inline prompts altogether.
  • Control of Flow Handled In Event Handlers: Flow control is the most important function of event handlers and client activation functions. Most applications that have any complexity require a more complicated flow control than the standard question-and-answer format afforded by laying QA controls down in sequence on a page. For the most part, we achieved this control by manipulating the semantic state within event handlers.
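The article names HandleNoRecoAndSilence but does not list it. As a rough illustration of what "monitoring consecutive invalid input for a QA" could involve, the sketch below counts no-recognition and silence results in a row for one control. The tracker object, the function names, the "Silence" value, and the limit of three are all assumptions made for this example; none of them are part of the Speech SDK.

```javascript
// Hypothetical stand-in for the bookkeeping HandleNoRecoAndSilence performs.
// Nothing here is part of the Speech SDK.
var MAX_CONSECUTIVE_INVALID = 3; // assumed limit

function createInvalidInputTracker() {
    return { lastControlId: null, count: 0 };
}

// Returns true once the caller has produced MAX_CONSECUTIVE_INVALID
// NoReco/Silence results in a row on the same control.
function trackInvalidInput(tracker, controlId, lastCommandOrException) {
    var invalid = lastCommandOrException === "NoReco" ||
                  lastCommandOrException === "Silence";

    // A different control, or any valid result, resets the streak.
    if (controlId !== tracker.lastControlId || !invalid) {
        tracker.lastControlId = controlId;
        tracker.count = 0;
    }
    if (invalid) {
        tracker.count += 1;
    }
    return tracker.count >= MAX_CONSECUTIVE_INVALID;
}
```

Keeping this bookkeeping in a helper called from OnClientActive (or, for navigators, from PromptGenerator.GenerateNavigator) leaves the prompt functions themselves free of flow logic, in line with the guidelines above.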

Naming Conventions
We used the following naming conventions throughout our application for consistency:

  • QA Controls: The QA control can be used for a variety of purposes, which we distinguish with a name suffix. Traditional question-and-answer controls fill a semantic item with the result of user input, confirmations confirm a pre-filled semantic item, and statements are output-only; they do not accept user input.
      • Question-And-Answer: QA suffix (e.g. AddToCartQA)
      • Confirm: Confirm suffix (e.g. NumberOfItemsConfirm)
      • Statement: Statement suffix (e.g. RestartBrowseStatement)
  • Navigator Controls: Nav suffix (e.g. CategoryNav)
  • Commands: Command suffix (e.g. BrowseCommand)
  • Semantic Items: si prefix (e.g. siProductID)

JScript and C# server-side code use naming conventions standard in those environments.

In-Depth: Browse Feature
Next, we’ll show how all of these common elements are used to build the Browse feature. In the CommerceVoice application the user can shop for products by browsing the product catalog. First, the user selects a category from the list of categories and then selects a product from the list of products in that category. Once the product is selected, users can find out more about that product and add it to their shopping cart. Figure 9 shows the interaction diagram.

Figure 9: The Browse interaction diagram shows how users shop for products by browsing the product catalog.

In the CommerceVoice application, there are seven categories and an average of six products per category. Keeping the list of categories and products relatively short seemed to help the usability of the application.

Page Setup
Like almost all pages in the CommerceVoice application, we began the Browse page by adding a new C# Web Form to our project. We then placed a GlobalSpeechElements user control onto the page that provides global commands and the stylesheet used for the speech controls. Grouping common elements like this into a user control accelerated development and provided consistency across the application. Nothing else was required to use the GlobalSpeechElements user control on the Browse page.

Semantic Items
A semantic map control is added next that contains the semantic items we use on the page. Semantic items are generally associated with a particular speech control and contain the user’s answer to a question. Text extracted from the SML document returned by the grammar file is placed into the semantic item. For the Browse page, the following semantic items are used:

| Semantic Item Name | Control | Description |
|---|---|---|
| siCategory | CategoryNav | Holds the category selected by the user. |
| siProduct | ProductNav | Holds the product selected by the user. |
| siAddToCart | AddToCartQA | Used to determine if the user said “Add to Cart.” |
| siNumberOfItems | NumberOfItemsQA | Number of items of the selected product to add to the shopping cart. |

In addition, when the semantic map control is dragged onto the page, a reference to the speech controls is added to the HTML.

   <%@ Register TagPrefix="speech"
       Namespace="Microsoft.Web.UI.SpeechControls"
       Assembly="Microsoft.Web.UI.SpeechControls, Version=1.0.3200.0, Culture=neutral" %>

Dragging any speech control onto the Web form will add this important reference to the page.

Semantic Item States
Semantic items play an important role in controlling the flow of execution on a speech enabled Web form. Semantic items can have three states:

  • Empty: Value is not filled (this is the default state).
  • Needs Confirmation: Value is filled in but confidence is below threshold.
  • Confirmed: The value is filled in and is confirmed.

When the page executes, the RunSpeech engine controls the flow of execution for the controls on the page (i.e. it determines which control to execute next). If the state of all semantic items associated with a QA control is Empty, the RunSpeech engine will activate that QA control. Otherwise, that control will be skipped. In this way, programmatically setting the state of semantic items on a page allows us to customize the flow of execution.
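The activation rule just described can be modeled in a few lines. The following is a simplified simulation, not the SDK's RunSpeech code: the SemanticState object, the nextQA function, and the page array are stand-ins invented for this sketch, and real RunSpeech also considers client activation functions, confirmations, and commands.

```javascript
// Simplified model of the activation rule described above; not SDK code.
var SemanticState = { Empty: 0, NeedsConfirmation: 1, Confirmed: 2 };

// Returns the id of the first QA whose semantic items are all Empty,
// or null when every QA on the page would be skipped.
function nextQA(qaControls) {
    for (var i = 0; i < qaControls.length; i++) {
        var items = qaControls[i].semanticItems;
        var allEmpty = true;
        for (var j = 0; j < items.length; j++) {
            if (items[j].state !== SemanticState.Empty) {
                allEmpty = false;
                break;
            }
        }
        if (allEmpty) {
            return qaControls[i].id;
        }
    }
    return null;
}

// Two QAs laid out in page order, each tied to one semantic item.
var siAddToCart = { state: SemanticState.Empty };
var siNumberOfItems = { state: SemanticState.Empty };
var page = [
    { id: "AddToCartQA", semanticItems: [siAddToCart] },
    { id: "NumberOfItemsQA", semanticItems: [siNumberOfItems] }
];
```

With both items empty, nextQA(page) picks AddToCartQA. Setting siAddToCart.state to SemanticState.Confirmed makes the same call return NumberOfItemsQA, which is the sense in which changing semantic state redirects the flow.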

Page Execution
When the Browse page first loads, the list of categories is retrieved from the database and loaded into the CategoryNav SelectableNavigator user control.

   private void LoadCategories()
   {
      ProductsDB products = new ProductsDB();
      SqlDataReader drCategories =
         products.GetProductCategories();

      // The SelectableNavigator cannot bind to a DataReader,
      // so copy the rows into a DataTable first.
      DataTable dt = new DataTable();
      dt.Columns.Add("CategoryID", typeof(int));
      dt.Columns.Add("CategoryName", typeof(string));
      dt.Columns.Add("CategoryDescription", typeof(string));
      dt.Columns.Add("Products", typeof(int));
      while (drCategories.Read())
         dt.Rows.Add(new object[4] { drCategories[0],
            drCategories[1], drCategories[2],
            drCategories[3] });

      // Close the reader once the rows have been copied.
      drCategories.Close();

      // A single call initializes the navigator and loads its data.
      CategoryNav.Initialize(
         "CategoryID,CategoryDescription,Products",
         "CategoryName", dt, "ReturnToMainMenu",
         "SelectCategory", "CategoryNav_prompt",
         "CategoryNavInitialStatement_prompt");
   }

The ProductsDB component that is being reused from the CommerceWeb application returns a DataReader, which the SelectableNavigator control does not support as a data source. As a result, we insert the categories into a DataTable and assign that as the data source of the SelectableNavigator.

Only one call to the SelectableNavigator is required to initialize it and load it with category data. Client-side functions specified for OnSelect, OnCancel, etc. are located in the Browse.js file.

Selecting Categories and Products
Now that we’ve loaded the CategoryNav selectable navigator with categories, we prompt the user to select a category from the list. The CategoryNav control allows for an initial prompt to be read to the user and then proceeds to read each category in the list. The CategoryNav_prompt function is indicative of SelectableNavigator prompt functions used throughout the CommerceVoice application.

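   // Header reconstructed from the names used in the body;
   // the parameter order is an assumption.
   function CategoryNav_prompt(lastCommandOrException, count,
      categoryName, products, description)
   {
      var text;

      switch (lastCommandOrException)
      {
         case "NVG_contents":
            // A request for contents ("Read") is treated as a NoReco.
            text = "";
            lastCommandOrException = "NoReco";
            break;

         case "NVG_headers":
            lastCommandOrException = "";
            // Fall through.

         default:
            text = categoryName;
      }

      return PromptGenerator.GenerateNavigator(
         lastCommandOrException,
         count,
         text,
         categoryName +
            ". " +
            PromptGenerator.ConvertNumberToWords(products)
            + " " + description + " ");
   }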

The prompt function determines what to read back to the user based on the lastCommandOrException parameter. When the user requests the contents of the category (i.e. by saying “Read”), we treat the response as a no-recognition condition. Also, because navigator application controls have their own error handling, which does not apply to normal QA controls, the PromptGenerator.GenerateNavigator function is called instead of PromptGenerator.Generate.

Parameters are used to retrieve the category name used to read the categories to the user and for more detailed information when the user asks for help (see Figure 10).

Figure 10: Setting the parameters used to retrieve category names and detailed help information.

The SelectableNavigator control allows us to store multiple columns of information for each category, such as the number of products in a category and the category description.

The SelectableNavigator treats silence as the “Next” command when reading categories to the user. If the user is silent, the next category is read. This is the default behavior of the navigator application control encapsulated by the SelectableNavigator control.

When the user selects a category from the list, the SelectCategory client-side handler function is called.

   function SelectCategory()
   {
      siCategory.attributes["CategoryID"] =
         CategoryNav.Item("CategoryID");
      siCategory.SetText(CategoryNav.Item(
         "CategoryName"), true);

      return true;
   }

We use the attribute collection associated with the semantic item to store related information. In this case, the category ID is stored along with the category name in the semantic item. The true parameter in the SetText function call changes the state of the semantic item to Confirmed.

Retrieving the List of Products
The AutoPostBack property for the siCategory semantic item is set to true. This means when the state changes to NeedsConfirmation or Confirmed, the page is automatically posted back to the server. In the Page_Load event, we check the state of siCategory to determine if we can load the products for the selected category into the ProductNav user control. The selected category is passed back to the database to retrieve the list of products for that category.

   if (siCategory.State != SemanticState.Empty)
   {
      LoadProducts();
   }

Adding Items to the Cart
The AddToCartQA is used to read the selected product and price to the user and to determine if the user wants to add the product to their shopping cart. First, we assign the QA the BaseQAStyle defined in GlobalSpeechElements1. As described earlier, this provides us with common threshold settings and adds handling for the case where the user mumbles or stays silent three times in a row.

IsProductSelected is a Client Activation function for this QA. RunSpeech calls this function to determine if the QA is available for activation. Returning true allows RunSpeech to activate the control. We only want this QA to be active if the user has selected a product.

   function IsProductSelected(id, lastCommandOrException,
      Count)
   {
      return siProduct.IsConfirmed();
   }

The value of the siAddToCart semantic item is only used to control program flow on the page. If the item is empty and the user has selected a product, RunSpeech will activate the AddToCartQA. Once the user tells the system they wish to add the product to their cart, the siAddToCart semantic item is no longer empty and RunSpeech moves on to the next QA on the page.
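To make the interplay between the client activation function and the semantic item concrete, here is a small simulation. It is a sketch under stated assumptions, not SDK code: makeItem, isCandidate, and the lowercase method names are stand-ins invented for this example, while the real semantic item exposes members such as IsConfirmed().

```javascript
// Stand-in semantic item with just enough state for the example.
function makeItem() {
    var state = "Empty";
    return {
        isEmpty: function () { return state === "Empty"; },
        isConfirmed: function () { return state === "Confirmed"; },
        confirm: function () { state = "Confirmed"; }
    };
}

// Simplified model: a QA is a candidate only when its client activation
// function (if any) returns true and its semantic item is still empty.
function isCandidate(qa) {
    var allowed = qa.clientActivationFunction
        ? qa.clientActivationFunction()
        : true;
    return allowed && qa.semanticItem.isEmpty();
}

var siProduct = makeItem();
var siAddToCart = makeItem();

var addToCartQA = {
    semanticItem: siAddToCart,
    // Mirrors IsProductSelected: active only after a product is chosen.
    clientActivationFunction: function () {
        return siProduct.isConfirmed();
    }
};
```

Before a product is selected, isCandidate(addToCartQA) is false. It becomes true after siProduct.confirm(), and false again once siAddToCart.confirm() records the user's answer, at which point the next QA on the page gets its turn.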

There are two additional commands scoped only to the AddToCartQA: the Description command and the First command. The Description command is used to play back a description of the product. Because the product descriptions are very long, prompts are constructed in a special way described in the “Recording Long Prompts” section of this document.

The other command scoped to this QA is the First command. When the user is in the SelectableNavigator control hearing a list of products, they can say “First” to navigate to the first item in the list. We wanted that same functionality for this QA. When the user says “First,” the OnFirstCmd client handler function is called.

   function OnFirstCmd(smlNode)
   {
      CategoryNav.Activate(false);
      siProduct.Clear();

      ProductNav.Activate(true);
   }

Notice that the category semantic item is not cleared and the CategoryNav SelectableNavigator is not activated since we want to use the same category. Also, since the CategoryNav and ProductNav controls are already loaded with category and product data, we do not need to post back to the server to retrieve this data from the database again.

Specifying the Number of Items
Now that the user wants to add the product to their shopping cart, we ask them how many items of the selected product they wish to add. We use the cardinal_999 rule from the Grammar Library (the cmnrules.cfg file) that ships with the Speech SDK. Refer to the SDK documentation for more information on the Grammar Library. The siNumberOfItems semantic item is filled with the number of items for this product the user wishes to add to their shopping cart.

At this point, we allow the user to say “Cancel” to return to the list of products. Instead of returning them to the first item in the product list, we return them to the item they previously selected. The Cancel command is scoped only to the NumberOfItemsQA. The ReturnToProductList client handler function is called when the user says Cancel.

   function ReturnToProductList(smlNode)
   {
      CategoryNav.Activate(false);
      ProductNav.Activate(true);
      ProductNav.GetNavigator().Index =
         siProduct.attributes["NavIndex"];
      siProduct.Clear();

      siAddToCart.Clear();
   }

When the product is selected, we save the index of the product in the list in the attributes collection of the siProduct semantic item. ReturnToProductList sets the index of the ProductNav navigator to this saved value so that when RunSpeech activates the ProductNav control again, the previously selected product will be read back to the user. The siAddToCart semantic item is also cleared so that when the user selects another product they are prompted to add the item to their cart.
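The save-and-restore pattern can be shown in isolation with plain objects. This is a sketch only: productList, selectProduct, and returnToProductList are stand-ins for the ProductNav navigator and the handlers above, the product names are sample data, and the assumption that clearing the semantic item also drops its attributes is ours, made to show why the index is read before the item is cleared.

```javascript
// Stand-ins for the ProductNav navigator and the siProduct semantic item.
var productList = {
    index: 0,
    items: ["Persuasive Pencil", "Universal Repair System", "Rain Racer 2000"]
};
var siProduct = { text: null, attributes: {} };

// On selection: remember what was chosen and where it sat in the list.
function selectProduct() {
    siProduct.text = productList.items[productList.index];
    siProduct.attributes["NavIndex"] = productList.index;
}

// On "Cancel": restore the list position first, then clear the item.
// Clearing is assumed to discard the attributes as well.
function returnToProductList() {
    productList.index = siProduct.attributes["NavIndex"];
    siProduct.text = null;
    siProduct.attributes = {};
}
```

If the user selects the second product, the list position later moves, and the user then says "Cancel", returnToProductList puts the list back on the second product rather than at the top.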

Confirming the Number of Items
The NumberOfItemsConfirm QA confirms the siNumberOfItems semantic item set in the previous QA. Note that in the Answers tab of the NumberOfItemsQA, the Confirm Threshold for the siNumberOfItems semantic item is set to 1 (see Figure 11).

Figure 11: Setting the Confirm Threshold for the number of items.

By setting the Confirm Threshold to 1, we require the semantic item to always be confirmed. Setting the Confirm Threshold to a lower value would require confirmation of the item based on the confidence level. For instance, setting the Confirm Threshold to .5 would only require a confidence level of .5 or greater to automatically confirm the item. In that case, the NumberOfItemsConfirm QA would be skipped by RunSpeech since the item was already confirmed. Setting the Confirm Threshold to 1 ensures that this QA will never be skipped by RunSpeech.
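The threshold rule can be written as a one-line comparison. The function below is a model of the behavior described in the preceding paragraph, not the SDK's implementation; stateAfterAnswer is a hypothetical name, and how the SDK treats a confidence exactly equal to the threshold is an assumption here.

```javascript
// Model of the Confirm Threshold rule described above. The comparison at
// the boundary (confidence equal to the threshold) is an assumption.
function stateAfterAnswer(confidence, confirmThreshold) {
    return confidence >= confirmThreshold
        ? "Confirmed"
        : "NeedsConfirmation";
}
```

With a threshold of .5, a recognition at .8 confidence is confirmed automatically, while one at .3 still needs confirmation. With a threshold of 1, any confidence short of 1 leaves the item needing confirmation, which is why the NumberOfItemsConfirm QA runs in practice every time.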

NumberOfItemsConfirm QA
The grammar for the NumberOfItemsConfirm QA allows the user to say “Yes,” “No,” or “No, I said three” (or any valid cardinal number). You can see this by looking at the QuantityConfirm grammar diagram (see Figure 12).

Figure 12: Here’s the grammar diagram for the QuantityConfirm interaction.

Saying “Yes” confirms the item. RunSpeech will automatically set the state of the siNumberOfItems semantic item to Confirmed. Saying “No” will set the semantic item’s state to Empty. In this case, since the semantic item’s state is Empty, the previous QA (NumberOfItemsQA) would be activated again and the user would be prompted for the quantity.

Saying “No” and a different number will set the state of the semantic item to empty, but would also fill its value with the new quantity without activating the previous QA. This provides the user familiar with the CommerceVoice application with a way to correct the quantity value quickly. This is easy to set up using the property builder for the NumberOfItemsConfirm QA (see Figure 13).

Figure 13: Use the property builder for the NumberOfItemsConfirm QA to give the user a chance to correct a quantity value by saying “No.”

First, notice that the Confirms tab is used instead of the Answers tab. This tells RunSpeech to confirm the semantic items in the list. We use XPath to tell RunSpeech what to extract from the SML document that is returned from the grammar file. For a simple Yes/No confirm, all that is required is a grammar that returns yes or no. We also allow the user to specify a new quantity by providing an XPath trigger for siNumberOfItems.

Prompt Databases
The standard Text-To-Speech (TTS) engine may work well for development and debugging, but recorded prompts make a voice-only application truly user-friendly. Though recording can be tedious, Microsoft’s prompt validation utilities and recording engine simplify the process considerably.

Validation
Thorough validation is important to make sure that no prompts are being missed. A few general strategies enabled us to make sure that our prompt generation functions were being validated completely and accurately:

  • No object-references within prompt functions: Except for calls to PromptGenerator.js, we never make calls to script objects within the body of our prompt functions. Instead, our prompt function arguments are defined so that all function calls are made before the inner prompt function is executed. This avoids errors on validation that prevent prompts from appearing. Example: In Figure 14, note the call to insertSpaces() in the productID variable declaration. A product ID (e.g. “355”) must be separated into its component digits to be read correctly by recorded prompts. We make the call to the helper function that does this in the variable declaration and provide an already-formatted version of the productID (e.g. “3, 5, 5”) as the validation value.
Figure 14: Create the call to insertSpaces() in the productID variable declaration.
  • Stand-In Validation Values: When it comes to validation values with a large number of potential values (i.e. numbers, dates, product names, etc.) we always provide a stand-in validation value that represents the entire set for the validator. We then make sure that the entire set is recorded in the prompt database. For instance, by using “Rain Racer 2000” for the product name whenever it is passed into a prompt function, we need only record this one product name for debugging purposes. When the product is ready for testing or professional voice-talent, we then go through and add the rest of the product names.
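The article refers to the insertSpaces() helper without listing it. A minimal version that produces the format quoted above might look like this; the body is our guess at an implementation, not the code from PromptGenerator.js.

```javascript
// Hypothetical implementation of the insertSpaces() helper named above.
// Separates a product ID into digits so each one can be matched to a
// recorded prompt, e.g. "355" becomes "3, 5, 5".
function insertSpaces(productID) {
    return String(productID).split("").join(", ");
}
```

Calling the helper in the variable declaration, as the first guideline recommends, means the prompt function body only ever sees the already-formatted string, so the validator needs no reference to the helper.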

Achieving Realistic Inflection
The following techniques allow us to make our prompts play as smoothly as possible when reading strings that involve combining many different recordings (i.e., “[This product costs] [two] [dollars and] [fifteen] [cents]”). (Note: throughout this section, individual prompt extractions are identified with brackets, just as they are in the prompt editor.)

  • Record Extractions in Context: Prompt extractions usually sound more realistic when spoken in context. While it may be tempting to record common single words like, “items,” “dollars,” and, “products,” as individual recordings, they will sound much better when recorded along with the text that will accompany them when they are used in a prompt: “one [item],” “two [items],” etc. In one highly effective example, we recorded all of our large number terms in one recording: “one [million] three [thousand] five [hundred] twenty five dollars.”
  • Recognize and Group Common Word Pairings: When recording singular words like “item,” “dollar,” and “product,” we almost always group them with “one” as they will always be used this way. Our extractions become, “[one item],” “[one dollar],” and “[one product].”
  • Use Prompt Tags: Although we may have recorded “two” and “thousand” in other extractions, these two words together constitute part of any prompt that includes a current date. So, it makes sense to include, “two thousand,” as an individual extraction itself. Further, by using the tags, “YearComplete,” and “YearIncomplete,” we can record it twice and distinguish between its use in “January First, Two Thousand,” where its inflection should drop at the end, and, “January First, Two Thousand Two,” where its inflection should rise at the end. We then insert a tag reference in the prompt generation routine. (The following snippet is taken from ConvertYearToWords() in PromptGenerator.js):
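   if (year >= 2000)
   {
      year -= 2000;
      yearString = "two thousand";
      if (year == 0)
         // Exactly 2000: the YearComplete tag applies
         // (tag markup is missing from this listing).
         return "" +
            yearString + "";
      else
         // 2001 and later: the YearIncomplete tag applies
         // (tag markup is missing from this listing).
         yearString = "" +
            yearString + "";
   }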
  • Use Display Text To Your Advantage: To achieve high-quality extractions when recording sentences, we modified the display text column of our transcriptions to indicate where the extractions were. As an example, the transcription, “[Order number] 5 8 3 [This order has] one item” has the display text, “Order number, 5 8 3, This order has, one item.” Commas are inserted between extractions. During recording, the voice talent can pause at the appropriate places so that the extractions are recorded clearly.
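The display text in the last guideline was edited by hand, but the same transformation is easy to script when there are many transcriptions. The helper below is a sketch of that idea; toDisplayText is a hypothetical name and is not part of the prompt editor or the SDK.

```javascript
// Hypothetical helper: derives comma-separated display text from a
// transcription in which extractions are marked with square brackets.
function toDisplayText(transcription) {
    // Split into bracketed extractions and the text between them.
    var parts = transcription.split(/\[|\]/);
    var pieces = [];
    for (var i = 0; i < parts.length; i++) {
        var piece = parts[i].replace(/^\s+|\s+$/g, ""); // trim whitespace
        if (piece.length > 0) {
            pieces.push(piece);
        }
    }
    return pieces.join(", ");
}
```

Applied to the transcription “[Order number] 5 8 3 [This order has] one item”, the helper yields the display text quoted above, with a comma at every boundary where the voice talent should pause.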

Recording Long Prompts
Although the prompt editor’s automatic alignment feature is a powerful tool, it becomes unwieldy when recording lengthy prompts. In CommerceVoice, we needed to record the descriptions of all the products in the product catalog. Each description averaged more than 60 words. In addition, using the alignment tool would require us to update our prompt database each time a description changed in the database, a costly and time-consuming prospect.

Instead of using the alignment engine, we bypassed the issue by aligning the entire description with a single alignment, named with the PRODUCT_DESCRIPTION_ prefix followed by the product ID (see Figure 15):

Figure 15: Align the entire description with one alignment.

The prompt editor requires that each individual word of the transcription matches an individual alignment within the waveform, so the transcription text must match this alignment (see Figure 16):

Figure 16: The prompt editor requires that each individual word of the transcription matches an individual alignment within the waveform.

Finally, in our prompt functions, we refer to descriptions by outputting this product description keyword, simply the concatenation of the “PRODUCT_DESCRIPTION_” prefix and the product ID:

   if (lastCommandOrException == "Description")
      text = "Here is the description: PRODUCT_DESCRIPTION_"
         + productIDNoSpaces;

Besides the obvious advantage of avoiding large numbers of alignments, we also avoid having to retrieve product descriptions from the database in order to play them. We also separate the voice-only version of the description from the text version, allowing us to make changes to the Website without having to re-record our prompts every time.

Running the Application
Our user tests were designed with two main goals in mind:

  • Verify that the system performed well in real-life scenarios: The main goal is simply to verify that testers can manage the basic tasks that real customers would want to perform.
  • Exercise the full feature-set of the application: In addition to testing standard goals, it was important to make sure that the complete feature set of the application was tested as well. Testers were guided to parts of the system that might not necessarily be on a most-likely-path scenario, in order to make sure that the entirety of the system worked as expected.

To accomplish these goals, we gave our testers scenarios that included both common tasks and special-case scenarios designed to guide the user toward special situations. A sample script might look like this:

TASK ONE (Product-Number-Driven Ordering)

  1. You noticed a product in a magazine that is sold at the IBuySpy store. The product number was listed as 3-6-0. Purchase this product.

TASK TWO (Catalog-Based Ordering and Shopping Cart Review)

  1. Needing more power to persuade, you want to buy the Persuasive Pencil, a product found in the Communications category. Please purchase one unit.
  2. Sustaining damage to your car on your last mission, you want to purchase the Universal Repair System, but only if it is safe for cars. Check the product description and if it is safe, purchase two.
  3. Deciding that it may also make sense to have a Persuasive Pencil for your vacation home, you decide it makes more sense to buy two pencils instead of one. Update your order.
  4. Finish the transaction and make a note of the order ID. End the call.

TASK THREE (Review Previous Orders)

  1. With the order ID from the previous task, check to see if your order shipped.
  2. You’ve become confused about how many Persuasive Pencils you ordered. Check the order details to verify the quantity shipped.

Test subjects were given account numbers and PINs to log into their account, but otherwise were left alone to complete the tasks. Tests were repeated with a number of different test subjects and over a number of successive product revisions.

Lessons Learned
We learned a great deal about building voice-only applications through the process of building these samples. Here we note some of the major points in the areas of user testing, design, and development.

  • Testing: The testing and tuning phase is important in any application, but it is especially important in voice applications because so much of the design only proves itself in use. We found that tuning our prompts, accept thresholds, and timeouts was key to making the application useful. Here are a few suggestions on how to conduct effective testing and tuning for voice-only systems.
  • Properly Configure Testing Equipment First: Many of our early user tests generated numerous usability problems that were due to improper configuration of the microphone. The microphone was too sensitive, picking up background noise, feedback from the speaker output, and slight utterances as user input. Users became increasingly frustrated as they found it difficult to hear a prompt in its entirety. This affected test results significantly.
  • Select Testers Carefully: We found that testing subjects brought a variety of expectations to the testing process. Developers whom we used as subjects often made assumptions about the way the system was working and became confused with ambiguous prompts like, “Would you like to start shopping or review your previous orders?” They preferred more explicit choices: “Say start shopping to start shopping or review orders to review your account history.” Testers with a less technical background preferred less structured prompting; they felt they were speaking with a more friendly system.

    To conduct effective tests, make sure the user group you are testing matches the target user group for your application.

Design
The most important lesson from designing the application was the need to tune the prompt design throughout development. From the first stages of implementation through user testing of the completed system, we made changes to prompts to achieve a more fluid program flow. Our experience speaking with other teams who have attempted similar projects is that this is a fundamental part of voice-only application development.

With that in mind, here are a few points that will make the tuning process much more efficient:

  • Long Prompts Don’t Equal Helpful Prompts: At the outset, our design team approached the goal of a friendly interface by writing friendly text. Testing quickly revealed that verbose prompts were a serious impediment to usability. By keeping prompts short, users understood better what to do.
  • Express Sentiment with Tone/Inflection: We found that helpfulness is best expressed through intonation and inflection, rather than extra words. A prompt like, “I’m sorry. I still didn’t understand you. My fault again,” expresses an apologetic sentiment on paper quite well, but spoken, it becomes excessive. This prompt became, “I’m sorry. I still didn’t understand you,” and we let the inflection of the speaker express the emotion. A good rule of thumb: speak prompts first before writing them down.
  • Build Cases For Invalid (but likely) Responses: Our tests surprised us when a majority of users answered, “Yes,” to the question, “Would you like to start shopping or review your previous orders?” We realized that part of the problem was the way in which the question was asked, but still, we built in a command to accept that response and provide a helpful response.
  • Keep the Number of Options Small: We found that listing more than three or four choices in a prompt dramatically reduced usability. Users would get confused and would not remember their choices. We made every effort to reduce the number of options offered to a user in any given prompt.
  • Maintain a Prompt Style Guide: Design teams are used to maintaining style guides for their designs, and voice-only applications should be no exception. Having a consistent set of prompt styles and standard phrasings is paramount to creating a sense of familiarity for the user. Our team recommends an iterative process: modify the guide liberally in the early stages of a project as new cases arise. Then, toward the later stages, tweak new cases to fit the existing rules. This process should lead to a consistent user experience throughout your system.

Development
We needed to make several changes to our development strategy worth noting here.

  • Necessary Modifications to the Business and Data Layers: The concept of building a voice-only presentation layer as a replacement for a GUI necessitates a few changes to the database and business logic layers we didn’t foresee. These changes both relate to the types of data required by the particular constraints of the voice medium:
  • Pluralized Names: In a GUI context, quantities of items are usually expressed in some sort of table format that very closely resembles a table in a database:

| ProductName | Quantity |
|---|---|
| Counterfeit Creation Wallet | 2 |
| Contact Lenses | 4 |

In a voice-only context, while it is possible to read this information as, “Product Name: Counterfeit Creation Wallet, Quantity: 2, Product Name: Contact Lenses, Quantity: 4” it is preferable to read it as, “Two Counterfeit Creation Wallets, and Four Contact Lenses.” We added a productNamePlural field to our Products table to enable this change.

  • Different Login Information: The Web version of the store accepts an email address and password as its login information. Neither piece of information is easily expressed in a voice context. We replaced these fields with Account Number and PIN fields, which also necessitated database changes.

It becomes evident that these changes are changes to user-interface elements stored in the data layer. In essence, a product name is really a GUI identifier for the product row. In the voice-only context, these identifiers may not always apply and so may require changes to the database layer.
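To show how a productNamePlural field pays off when reading quantities aloud, here is a small sketch that turns cart rows into the spoken form described under Pluralized Names. The describeCart function, the row shape, and the SMALL_NUMBERS table are assumptions for illustration; the application itself uses PromptGenerator.ConvertNumberToWords, which the article does not list.

```javascript
// Covers only 0-10 for brevity; a stand-in for a real number-to-words
// routine such as the application's ConvertNumberToWords.
var SMALL_NUMBERS = ["zero", "one", "two", "three", "four", "five",
                     "six", "seven", "eight", "nine", "ten"];

// Builds a spoken summary from rows that carry both name forms.
function describeCart(rows) {
    var phrases = [];
    for (var i = 0; i < rows.length; i++) {
        var row = rows[i];
        // Pick the singular or plural name based on the quantity.
        var name = row.quantity === 1
            ? row.productName
            : row.productNamePlural;
        phrases.push(SMALL_NUMBERS[row.quantity] + " " + name);
    }
    if (phrases.length <= 1) {
        return phrases.join("");
    }
    var last = phrases.pop();
    return phrases.join(", ") + ", and " + last;
}
```

Given two Counterfeit Creation Wallets and four Contact Lenses, the function returns "two Counterfeit Creation Wallets, and four Contact Lenses". Storing the plural form in the data layer avoids guessing at irregular plurals in script and keeps the text aligned with the recorded prompts.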
