Enabling voice seems to be the ultimate challenge, and the ultimate reward, for both technologists and users.
Why is voice so popular? Voice comes more naturally to people than other modes of communication, such as pen and paper or computer-based applications. It also requires no special training.
Voice of www
Considering the huge advantages voice offers, the World Wide Web is moving toward getting itself voice-enabled. And VoiceXML is the language it has adopted. VoiceXML emerged in 2000, through the collaborative efforts of AT&T, IBM, Lucent Technologies and Motorola, who got together and founded the VoiceXML Forum (www.voicexml.org). The Forum defines VoiceXML as a “Web-based markup language for representing human-computer dialogs”.
Existing Web services, previously accessible only through a Web browser, can now be reached over a telephone using VoiceXML. Good candidates for such an implementation include information-intensive websites that categorize bits of information and serve them up over the Internet.
VoiceXML in its present form can handle the following:
- Synthesized speech output (text-to-speech)
- Output of audio files
- Recognition of spoken input
- Recognition of DTMF input
- Recording of spoken input
- Telephony features like call transfer and disconnect
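As an illustration of the first two capabilities, a minimal document might combine synthesized speech with a pre-recorded audio file. This is a sketch only; the file name and prompt wording are invented for the example:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="welcome">
    <block>
      <!-- Synthesized speech output (text-to-speech) -->
      <prompt>Welcome to the voice portal.</prompt>
      <!-- Pre-recorded audio; the enclosed text is spoken as a TTS
           fallback if the file cannot be played -->
      <audio src="welcome.wav">Welcome to the voice portal.</audio>
    </block>
  </form>
</vxml>
```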
The technology
VoiceXML is an extensible markup language whose primary application is in the area of ASR (Automated Speech Recognition) and IVR (Interactive Voice Response) systems. The architecture of VoiceXML consists of the following components.
Document server: The document server (which can be any Web-server) processes requests from a client application and serves up VoiceXML documents.
VoiceXML interpreter context: The interpreter context answers the call and reads the first VoiceXML document. It then monitors caller input and, together with the VoiceXML interpreter, executes events according to the VoiceXML document.
VoiceXML interpreter: This sits on the client machine and processes requests from the document server with the help of the VoiceXML interpreter context. It processes the commands in the VoiceXML document and plays the prompts, listens for responses, matches them against the ‘grammar’ of the VoiceXML document and executes the application’s logic.
Implementation platform: This includes the telephone hardware and related IVR and ASR resources that are controlled by the VoiceXML interpreter and the VoiceXML interpreter context. The implementation platform generates events in response to caller actions (for example, touch tone or spoken commands) and executes system events (for example, timers expiring).
How it works
A caller calls a phone number for a Web service that is VoiceXML-enabled. This call is routed to a VoiceXML interpreter, which works with the interpreter context to retrieve a VoiceXML document from the Web server and plays a pre-recorded or TTS (Text-To-Speech) generated audio prompt to the caller.
The caller can now select a service or option by speaking it out, or by pressing the appropriate keys to generate a DTMF tone. Speech responses are handled by an ASR (Automated Speech Recognition) system, and the interpreter executes commands in the VoiceXML document, based on the grammar it contains, according to what the ASR returns. This continues till the caller hangs up or the application terminates. We’ll look at an example shortly. Before that, let us look at how one writes a VoiceXML application and what its basic components are.
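As a sketch of this interaction, the following hypothetical document offers the caller two services, accepting either a spoken choice or a touch-tone key press. The option names and target document names are invented for illustration:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <menu>
    <prompt>Say news, or press 1. Say weather, or press 2.</prompt>
    <!-- Spoken input is matched against the choice text;
         key presses are matched against the dtmf attributes -->
    <choice dtmf="1" next="news.vxml">news</choice>
    <choice dtmf="2" next="weather.vxml">weather</choice>
  </menu>
</vxml>
```

Whichever choice matches, the interpreter fetches the named document from the server and the dialog continues there.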
The building blocks
Session: This is the entire caller-computer conversation, which starts when a call is put through, and ends with the caller hanging up or the VoiceXML document (or the interpreter context) requesting it to end.
Dialog states: A set of named dialog states makes up a VoiceXML application. The user passes from one dialog state to another, each dialog leading to the next. These are written in plain-text documents with the extension .vxml.
Forms: VoiceXML dialogs include forms and menus. Each form has a name and is responsible for executing some portion of the dialog. It defines an interaction that collects values for each of the fields in the form. Forms are submitted to a server just like HTML forms.
Menus: A menu presents the user with a choice of options and defines the transition to another dialog state depending on the user’s selection.
Fields: A form has fields. These fields specify either the prompt, the expected input or the evaluation rules of the caller’s input.
Application: A set of VoiceXML documents is an application. These documents must share the same application root document.
Grammar: A grammar describes the expected user input, either spoken or touch-tone (DTMF) key presses. Each dialog state has one or more grammars associated with it.
Sub-dialog: A sub-dialog is akin to a sub-routine, which lets the control pass to a new dialog and then return to the original retaining the local state information for that dialog.
Variables: Named variables can be used to hold data. These can be defined at any level (from the session down to a dialog) and their scope follows an inheritance model. Variable expressions can also be used for conditional prompts, grammars or both.
Events: Events are like exceptions during a conversation. These arise out of unclear (to the VoiceXML application) user responses or no responses. Events can be caught by writing event-handlers and follow an inheritance model.
Dynamic VoiceXML (scripting): ECMAScript (the standard on which JavaScript is based) can be used to add more control to a VoiceXML application.
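As a hedged sketch of scripting, the fragment below declares a variable with an ECMAScript expression and uses a condition to vary the prompt. The variable and form names are invented for the example:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A variable initialised with an ECMAScript expression -->
  <var name="calls" expr="0"/>
  <form id="greet">
    <block>
      <!-- Update the variable each time the form runs -->
      <assign name="calls" expr="calls + 1"/>
      <!-- Choose a prompt conditionally on its value -->
      <if cond="calls &gt; 1">
        <prompt>Welcome back.</prompt>
      <else/>
        <prompt>Welcome.</prompt>
      </if>
    </block>
  </form>
</vxml>
```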
Writing VoiceXML
It is recommended to start all VoiceXML applications with the XML version tag, just as in any XML document:
<?xml version="1.0"?>
Next should come the vxml tag, with the version attribute set to the VoiceXML version being used.
<vxml version="1.0">
Forms
The form has to be named, ideally according to the dialog element it is responsible for executing. A form is denoted by the form tag:
<form id="hello">
This form will contain several elements, the “form items”, which can be field items or control items. Field items gather information from the caller to fill variables, and may contain prompts guiding the caller about what to say, the grammar that defines the interpretation of what is said, and any event handlers.
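Continuing the hypothetical “hello” form, a control item and a field item might sit side by side as in the sketch below. The field name, grammar words and prompt text are all invented for illustration:

```xml
<form id="hello">
  <!-- Control item: a block runs without expecting caller input -->
  <block>
    <prompt>Hello. Let us begin.</prompt>
  </block>
  <!-- Field item: prompt, grammar and an event handler together -->
  <field name="answer">
    <prompt>Say yes or no.</prompt>
    <!-- Grammar defining the accepted spoken responses -->
    <grammar>yes | no</grammar>
    <!-- Event handler for input that matches nothing in the grammar -->
    <nomatch>
      <prompt>Please say yes or no.</prompt>
      <reprompt/>
    </nomatch>
  </field>
</form>
```

Once the field is filled, its value is available in the variable answer, which can then be submitted to the document server like an HTML form field.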