Microsoft Speech Application SDK 1.1
In 2004, Microsoft released Microsoft Speech Server along with a free SDK that lets you develop Web-based speech applications that run on Speech Server. You can use the SDK to build telephony, or voice-only, applications in which all interaction between the computer and the user takes place over a telephone. You can also build multimodal applications in which users can choose between speech and traditional Web controls for input.
The Microsoft text-to-speech engine synthesizes speech from text by first breaking down the words into phonemes. Phonemes are the basic elements of human language. They represent a set of "phones," which are the sounds that form words. The text-to-speech engine then analyzes the extracted phonemes and converts them to symbols used to generate the digital audio speech.
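To make the first step more concrete, here is a minimal Python sketch of a dictionary-based word-to-phoneme lookup. It is only an illustration, not the Microsoft engine's implementation: the `LEXICON` table, its ARPAbet-style symbols, and the `to_phonemes` function are all hypothetical names invented for this example. A production engine would rely on a much larger lexicon plus letter-to-sound rules for words it does not know.

```python
import re

# Hypothetical mini-lexicon using ARPAbet-style symbols (an assumption made
# for illustration; this is not the Microsoft engine's phoneme set).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "speech": ["S", "P", "IY", "CH"],
}


def to_phonemes(text):
    """Return (word, phonemes) pairs; words missing from LEXICON map to None."""
    # Lowercase the text and keep only runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [(word, LEXICON.get(word)) for word in words]


if __name__ == "__main__":
    for word, phonemes in to_phonemes("Hello, speech world!"):
        print(word, "->", " ".join(phonemes) if phonemes else "(not in lexicon)")
```

Returning `None` for unknown words keeps the gap visible; it marks the point where a real engine would fall back to pronunciation rules.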
You can use the downloadable sample application that accompanies this article to experiment with configurable aspects of the Microsoft text-to-speech engine. The multimodal application contains one Web page (see Figure 2) into which you enter some text. You can then click a button to hear the text read in one of the following ways:
|Author's Note: When the text to be spoken is not known ahead of time, using a text-to-speech engine is unavoidable; however, you can generally get better quality from recorded audio. When audio quality is critical, you can use the Microsoft Speech Application Software Development Kit (SASDK) to record audio. For example, you may want to use recorded audio to prompt users for information. The recorded audio can be broken into a series of prompts that are concatenated at runtime.|
- Speak Text Normally: Provides a benchmark for testing
- Say As Acronym: The text "ASP" is spoken as "A.S.P."
- Say As Name: "Mr. John Doe" is pronounced as "Mister John Doe"
- Say As Date: The date is formatted as month, day, year
- Say As Web Address: The text is formatted as a Uniform Resource Identifier (URI)
- Say As Digits: A number entered as text is spoken as a series of digits
- High Pitch/Slow Rate: The text is read with a high pitch and a slow rate
- Rate Fast/Volume Loud: The text is read with a fast rate and a loud volume
- Low Pitch/Volume Soft: The text is read with a low pitch and a soft volume
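Options like these are typically expressed as markup wrapped around the text before it reaches the engine. The Python sketch below shows the general idea using W3C SSML 1.0 element names (`say-as` with `interpret-as`, and `prosody` with `pitch`, `rate`, and `volume`). Treat it as an assumption-laden illustration: the helper names `say_as` and `prosody` are hypothetical, the `interpret-as` values are engine-specific placeholders, and the markup dialect accepted by the SASDK's engine may use different element or attribute names.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as):
    """Wrap text in an SSML say-as element (the hint value is engine-specific)."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def prosody(text, **attrs):
    """Wrap text in an SSML prosody element, e.g. pitch='high', rate='slow'."""
    # Sort the attributes so the generated markup is deterministic.
    attr_text = "".join(
        f" {name}={quoteattr(value)}" for name, value in sorted(attrs.items())
    )
    return f"<prosody{attr_text}>{escape(text)}</prosody>"


if __name__ == "__main__":
    text = "ASP"
    print(say_as(text, "characters"))                # placeholder hint value
    print(prosody(text, pitch="high", rate="slow"))  # High Pitch/Slow Rate
    print(prosody(text, rate="fast", volume="loud")) # Rate Fast/Volume Loud
    print(prosody(text, pitch="low", volume="soft")) # Low Pitch/Volume Soft
```

Escaping the user's text before wrapping it matters here because the sample page accepts free-form input, and characters such as `&` or `<` would otherwise break the markup.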
|Figure 2. Sample Application: You can use this multimodal application to hear text spoken in a variety of ways.|
Multimodal applications use a prompt control to specify audio that is played to a user. The prompt control contains an InlineContent property that may contain either a Content or a Value Basic Speech Control. The Content control identifies a prompt file containing stored audio recordings. The Value control specifies elements from an HTML Web page. The sample application uses a Value control that references the input element named txtText (the "Type some text here:" field in Figure 2). Here's the HTML that represents the markup for a prompt:
<speech:prompt id="prmText" runat="server">