Design and Implement a Voice-only Web Application in ASP.NET : Page 11
This whitepaper demonstrates how to use the Microsoft .NET Speech SDK to build a complete e-commerce starter application. Use these detailed techniques to build your own commerce system that will have your customers browsing, shopping, and making purchases using nothing but the sounds of their voices.
by Paul Osburn
Apr 23, 2003
Page 11 of 13
The standard Text-To-Speech (TTS) engine may work well for development and debugging, but recorded prompts make a voice-only application truly user-friendly. Though recording can be tedious, Microsoft's prompt validation utilities and recording engine simplify the process considerably.
Thorough validation is important to make sure that no prompts are being missed. A few general strategies enabled us to make sure that our prompt generation functions were being validated completely and accurately:
No object references within prompt functions: Except for calls to PromptGenerator.js, we never call script objects within the body of our prompt functions. Instead, we define the prompt function arguments so that all function calls are made before the inner prompt function executes. This avoids validation errors that prevent prompts from appearing. Example: In Figure 14, note the call to insertSpaces() in the productID variable declaration. A product ID (e.g., "355") must be separated into its component digits to be read correctly by recorded prompts. We call the helper function that does this in the variable declaration and provide an already-formatted version of the productID (e.g., "3, 5, 5") as the validation value.
Figure 14: Create the call to insertSpaces() in the productID variable declaration.
Stand-In Validation Values: For arguments that can take a large number of values (e.g., numbers, dates, and product names), we always provide a single stand-in validation value that represents the entire set for the validator. We then make sure that the entire set is eventually recorded in the prompt database. For instance, by using "Rain Racer 2000" as the product name whenever one is passed into a prompt function, we need to record only this one product name for debugging purposes. When the product is ready for testing or for professional voice talent, we go back and add the rest of the product names.
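The two strategies above can be sketched as follows. This is a minimal illustration, not the code from Figure 14: the body of insertSpaces() is a guess at what the helper does based on the "355" to "3, 5, 5" example, and the names productPromptInner, productPrompt, and validationValues, along with the prompt wording, are hypothetical.

```javascript
// Hypothetical re-creation of the insertSpaces() helper described above:
// it separates a product ID into digits so each digit maps to its own recording.
function insertSpaces(id) {
  return String(id).split("").join(", ");
}

// Inner prompt function (illustrative name and wording): it works only on its
// arguments and never touches script objects, so the validator can run it
// directly with stand-in values.
function productPromptInner(productID, productName) {
  return "Product number " + productID + " is the " + productName + ".";
}

// Outer call site: every helper call happens here, before the inner
// prompt function executes.
function productPrompt(rawProductID, productName) {
  return productPromptInner(insertSpaces(rawProductID), productName);
}

// Stand-in validation values: one representative for each large value set.
var validationValues = {
  productID: "3, 5, 5",            // already formatted, as insertSpaces() would return it
  productName: "Rain Racer 2000"   // the single product name recorded for debugging
};
```

Because the inner function sees only pre-formatted strings, validating it with validationValues exercises the same text that productPrompt("355", "Rain Racer 2000") would produce at run time.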
Achieving Realistic Inflection
The following techniques allow us to make our prompts play as smoothly as possible when reading strings that combine many different recordings (e.g., "[This product costs] [two] [dollars and] [fifteen] [cents]"). (Note: throughout this section, individual prompt extractions are identified with brackets, just as they are in the prompt editor.)
Record Extractions in Context: Prompt extractions usually sound more realistic when spoken in context. While it may be tempting to record common single words like "items," "dollars," and "products" as individual recordings, they sound much better when recorded along with the text that will accompany them in a prompt: "one [item]," "two [items]," etc. In one highly effective example, we recorded all of our large number terms in one recording: "one [million] three [thousand] five [hundred] twenty five dollars."
Recognize and Group Common Word Pairings: When recording singular words like "item," "dollar," and "product," we almost always group them with "one" as they will always be used this way. Our extractions become, "[one item]," "[one dollar]," and "[one product]."
Use Prompt Tags: Although we may have recorded "two" and "thousand" in other extractions, these two words together form part of any prompt that includes a current date, so it makes sense to include "two thousand" as an individual extraction. Further, by using the tags "YearComplete" and "YearIncomplete," we can record it twice and distinguish between its use in "January First, Two Thousand," where its inflection should drop at the end, and "January First, Two Thousand Two," where its inflection should rise at the end. We then insert a tag reference in the prompt generation routine. (The following snippet is taken from ConvertYearToWords() in PromptGenerator.js):
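The original listing does not appear in this copy of the article. As a stand-in, here is a minimal sketch of the tag-selection idea only; it is not the code from ConvertYearToWords(). The function name yearParts, the returned object shape, and the restriction to the years 2000 through 2009 are all assumptions made for illustration, and the sketch deliberately does not show how a tag reference is written in the SDK's prompt markup.

```javascript
// Illustrative sketch only; NOT the original ConvertYearToWords().
// Words for the final digit of years 2001-2009 (index 0 is unused).
var ONES = ["", "one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine"];

// Returns the list of extractions for a year, marking which tagged
// recording of "two thousand" should be played.
function yearParts(year) {
  if (year < 2000 || year > 2009) {
    throw new RangeError("This sketch only handles 2000 through 2009");
  }
  var remainder = year - 2000;
  var parts = [{
    text: "two thousand",
    // Falling inflection when the year ends here, rising when more follows.
    tag: remainder === 0 ? "YearComplete" : "YearIncomplete"
  }];
  if (remainder > 0) {
    parts.push({ text: ONES[remainder], tag: null });
  }
  return parts;
}
```

Under these assumptions, yearParts(2000) yields a single "two thousand" extraction tagged YearComplete, while yearParts(2002) yields "two thousand" tagged YearIncomplete followed by "two".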
Use Display Text To Your Advantage: To achieve high-quality extractions when recording sentences, we modified the display text column of our transcriptions to indicate where the extractions were. As an example, the transcription, "[Order number] 5 8 3 [This order has] one item" has the display text, "Order number, 5 8 3, This order has, one item." Commas are inserted between extractions. During recording, the voice talent can pause at the appropriate places so that the extractions are recorded clearly.
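The display-text convention just described is mechanical enough to sketch in code. The helper below is hypothetical (the article does not say the display text was generated by script; it may well have been edited by hand): it drops the extraction brackets from a transcription and joins the resulting segments with commas.

```javascript
// Hypothetical helper: derive comma-separated display text from a
// transcription whose extractions are marked with [brackets].
function transcriptionToDisplayText(transcription) {
  var segments = [];
  // Match either a bracketed extraction or a run of unbracketed text.
  var pattern = /\[([^\]]*)\]|([^\[\]]+)/g;
  var match;
  while ((match = pattern.exec(transcription)) !== null) {
    var segment = (match[1] !== undefined ? match[1] : match[2]).trim();
    if (segment.length > 0) {
      segments.push(segment);
    }
  }
  // A comma between segments cues the voice talent to pause there.
  return segments.join(", ");
}

// Usage with the example from the text:
// transcriptionToDisplayText("[Order number] 5 8 3 [This order has] one item")
// returns "Order number, 5 8 3, This order has, one item"
```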
Recording Long Prompts
Although the prompt editor's automatic alignment feature is a powerful tool, it becomes unwieldy when recording lengthy prompts. In CommerceVoice, we needed to record the descriptions of all the products in the product catalog. Each description averaged more than 60 words. In addition, using the alignment tool would require us to update our prompt database each time a description changed in the database, a costly and time-consuming prospect.
Instead of using the alignment engine, we bypassed the issue by aligning the entire description with one alignment, PRODUCT_DESCRIPTION_<PRODUCTID> (see Figure 15):
Figure 15: Align the entire description with one alignment.
The prompt editor requires that each individual word of the transcription match an individual alignment within the waveform, so the transcription text must consist of this single keyword to match the single alignment (see Figure 16):
Figure 16: The prompt editor requires that each individual word of the transcription matches an individual alignment within the waveform.
Finally, in our prompt functions, we refer to descriptions by outputting this product description keyword, simply the concatenation of the "PRODUCT_DESCRIPTION_" prefix and the product ID:
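if (lastCommandOrException == "Description")
    // Prefix plus product ID forms the keyword; productID is assumed to hold the raw ID (e.g. 355).
    text = "Here is the description: PRODUCT_DESCRIPTION_" + productID;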
Besides the obvious advantage of avoiding large numbers of alignments, we also avoid having to retrieve product descriptions from the database in order to play them. We also separate the voice-only version of the description from the text version, allowing us to make changes to the Website without having to re-record our prompts every time.
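As a self-contained illustration of the keyword approach, the sketch below wraps the concatenation in two small functions. The names DESCRIPTION_PREFIX, productDescriptionKeyword, and descriptionPrompt are hypothetical, and returning an empty string for other commands is an assumption made only to keep the example complete.

```javascript
// Prefix shared by every single-alignment product description recording.
var DESCRIPTION_PREFIX = "PRODUCT_DESCRIPTION_";

// Build the keyword that stands in for a whole recorded description.
function productDescriptionKeyword(productID) {
  return DESCRIPTION_PREFIX + String(productID);
}

// Hypothetical prompt function: emits the keyword rather than the
// description text, so no database lookup is needed to play it.
function descriptionPrompt(lastCommandOrException, productID) {
  if (lastCommandOrException === "Description") {
    return "Here is the description: " + productDescriptionKeyword(productID);
  }
  return ""; // other commands are outside the scope of this sketch
}
```

Because the prompt text carries only the keyword, editing a product's written description on the Web site leaves the prompt function and the recording untouched.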