Speech is an interactive process of prompts and commands. Grammars are the sets of structured rules that identify the words, phrases, and valid selections that can be collected in response to an application prompt. A grammar defines both the exact words and the order in which application users can say them, and can consist of a single word, a list of acceptable words, or complex phrases. Structurally, a grammar is a combination of XML and plain text that the recognizer attempts to match against user responses. Within MSS, this data conforms to the W3C Speech Recognition Grammar Specification (SRGS). An example of a simple grammar rule that allows for the selection of a sandwich is shown below:
<rule id="sandwich" scope="public">
  <one-of>
    <item>ham <tag>$._value = "ham"</tag></item>
    <item>roast beef <tag>$._value = "roast beef"</tag></item>
    <item>italian <tag>$._value = "italian"</tag></item>
  </one-of>
</rule>
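A rule like this sits inside a grammar root element. A minimal sketch of an enclosing .grxml file is shown below; the tag-format value assumes the Microsoft semantic interpretation syntax, so treat the header attributes as illustrative rather than definitive:

```xml
<?xml version="1.0" encoding="utf-8"?>
<grammar version="1.0" xml:lang="en-US" root="sandwich"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics-ms/1.0">
  <!-- the sandwich rule from the example above goes here -->
</grammar>
```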
Grammars form the guidelines that applications use to recognize the possible commands that a user might issue. Unless the words or phrases are defined in the grammar, the application cannot recognize the user's speech commands and returns an error. You can think of a grammar as a vocabulary of what the user can say and what the application can understand. In this sense it is like a lookup table in a database that presents a list of options to the user, rather than accepting free-form text input.
|Figure 12. Grammar Files: Within a Visual Studio speech application, grammar files have a .grxml extension and are added directly to the project.|
A very simple application can limit spoken commands to a single word like "open" or "print." In this case, the grammar is not much more than a list of words. However, most applications require a richer set of commands and sentences. Users interacting with this type of speech application expect to speak in normal, natural language. This raises the expectations for any application and requires additional thought during design. For example, an application must accept, "I would like to buy a roast beef sandwich," as well as, "Gimme a ham sandwich."
A well-defined grammar provides more than just the list of options, of course. It also defines the surrounding phrases, such as the preamble to a sentence. For example, the grammar for the sandwich order above must recognize "I would like to" in addition to the option "roast beef." Viewed this way, a grammar is essentially a sentence or sequence of phrases broken down into its smallest component parts.
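One way to layer a preamble on top of the options is to make the preamble an optional alternative list ahead of a rule reference. The following is a sketch only; the rule names are illustrative, not taken from the SASDK:

```xml
<rule id="order" scope="public">
  <item repeat="0-1">
    <one-of>
      <item>I would like to buy</item>
      <item>gimme</item>
    </one-of>
  </item>
  <item repeat="0-1">a</item>
  <ruleref uri="#sandwich"/>
  <item repeat="0-1">sandwich</item>
</rule>
```

Because each preamble item is marked optional (repeat="0-1"), both "I would like to buy a roast beef sandwich" and the bare option "roast beef" follow a valid path through the rule.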
Another job of the grammar is to map multiple similar phrases to a single semantic meaning. Consider all the ways a user can ask for help: the user may say "help," "huh," or "what are my choices." In all three cases the user is asking for help. The grammar is responsible for defining all three phrases and mapping them to a single semantic value. The benefit is that a developer only has to write the code to deal with the phrase "help."
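In SRGS terms, such a mapping can be sketched as a list of alternatives that all assign the same semantic value; the tag syntax below assumes the Microsoft tag format used in the earlier example:

```xml
<rule id="help" scope="public">
  <one-of>
    <item>help</item>
    <item>huh</item>
    <item>what are my choices</item>
  </one-of>
  <tag>$._value = "help"</tag>
</rule>
```

Whichever phrase the user says, the recognizer returns the single value "help" to the application.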
|Figure 13. Grammar Editing Tool: The Grammar Editing Tool is used to manage and define the various elements of the speech prompts.|
Within a Visual Studio speech application, grammar files have a .grxml extension and are added directly to the project, as shown in Figure 12. Once a file is added, the Grammar Editing Tool, shown in Figure 13, is used to add and update its individual elements. This tool provides a graphical, left-to-right layout of the phrases and rules stored in a particular grammar file. Essentially, it provides a visualization of the underlying SRGS format as a word graph rather than as hierarchical XML.
For developers, the goal of the Grammar Editor is to present a flowchart of the valid grammar paths. A valid phrase is defined by a successful path through this flowchart. Building recognition rules is done by dragging the set of toolbox elements listed in Table 2 onto the design canvas. The design canvas displays the set of valid toolbox shapes and represents the underlying SRGS elements.
Table 2: The elements of the Grammar Editor toolbox.
The phrase element represents a single grammatical entry.
The list element represents a set of alternative phrases, any one of which can be spoken.
The group element binds a series of phrases together in a sequence.
The rule reference element provides the ability to reference an external encapsulated rule.
The script tag element attaches semantic interpretation script that builds the values returned for this grammar.
The wild card element allows any part of a response to be ignored.
The skip element creates an optional group that can be used to insert or format semantic tags at key points in the grammar.
The halt element immediately stops recognition when it is encountered.
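In the underlying SRGS, the wildcard shape corresponds to the special GARBAGE rule reference, which matches and discards arbitrary speech. As a sketch (the rule name here is illustrative), a rule that ignores filler words around the sandwich choice might look like this:

```xml
<rule id="sandwichAnywhere" scope="public">
  <ruleref special="GARBAGE"/>
  <ruleref uri="#sandwich"/>
  <ruleref special="GARBAGE"/>
</rule>
```

With this rule, a response such as "um, let me get a ham sandwich please" can still match on the sandwich option alone.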
During development, the Grammar Editor can show both the path of an utterance through the grammar and the returned SML document, as shown in Figure 14. For example, when the string "I would like to buy a ham sandwich" is entered into the Recognition string text box at the top, the path the recognizer took through the grammar is highlighted. At the bottom of the screen, the build output window displays a copy of the SML document returned by the recognizer. This feature provides an important way to validate and test that both the grammar and the returned SML document are accurate.
|Figure 14. Grammar Editor: During development the grammar editor provides the ability to show both the path of an utterance and the returned SML within the Visual Studio environment.|
Structurally the editor provides the list of rules that identify words or phrases that an application user is able to provide. A rule defines a pattern of speech input that is recognized by the application. At run time the speech engine attempts to find a complete path through the rule using the supplied voice input. If a path is found the recognition is successful and results are returned to the application in the form of an SML document. This is an XML-based document that combines the utterance, semantic items, and a confidence value defined by the grammar as shown below.
<SML confidence="1.000" text="ham">ham</SML>
The confidence value is a score returned by the recognition engine that indicates how certain it is of the recognized audio. Confidence values are often used to drive the confirmation logic within an application. For example, you may want to trigger a confirmation prompt if the confidence value falls below a specific threshold, such as 0.8.
The SASDK also includes the ability to leverage other data types as grammar within an application. The clear benefit is that you don't have to manually author every specific grammar rule. Adding these external grammars can be done through an included Grammar Library or using a process called data-driven grammar.
The Grammar Library is a reusable collection of rules, provided in SRGS format, that covers a variety of basic types; for example, it includes grammars for recognizing numbers and for mapping holidays to their actual calendar dates. Data-driven grammar is a feature provided by three Application Speech controls. The ListSelector and DataTableNavigator controls enable you to take SQL Server data, bind it to the control, and automatically make all the data accessible by voice. This means you don't have to re-create data stored in a database as a grammar file. The third control, the AlphaDigit control, isn't a data-bound control. Rather, it automatically generates a grammar for recognizing a masked sequence. For example, the mask "DDA" would recognize any string following the format: digit, digit, alphabetic character.
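Conceptually, the grammar generated for the mask "DDA" is equivalent to a sequence of rule references, one per mask position. The following is a sketch only; the digit and letter rules are assumed helper rules, not actual SASDK rule names:

```xml
<rule id="maskDDA" scope="public">
  <ruleref uri="#digit"/>
  <ruleref uri="#digit"/>
  <ruleref uri="#letter"/>
</rule>
```

A spoken response such as "four two B" would follow a valid path through this rule, while "B four two" would not.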