April 12, 2002

Have you ever called up an office without knowing the extension number of the person you’re trying to reach? It can be a nightmare, because you have to dial the operator for assistance, and the operator might be busy attending to other calls, leaving you on hold indefinitely. So even though the office has a system capable of automatic call transfer, manual intervention is still required. The same problem exists with other IVR (Interactive Voice Response) based systems, where you have to push a lot of buttons on the telephone keypad to reach the desired service.

Speech recognition in 10 steps

  1. Caller says something
  2. Speech digitized and split into frames, which are analyzed using a Fast Fourier Transform
  3. Frames converted to Feature Vectors
  4. Acoustic Modeling maps feature vectors to sub-word units
  5. Sub-words mapped to Application Lexicon to form words
  6. Words mapped to Language Model to form word graphs
  7. Word graph passed to speech understanding module
  8. Word graph is parsed with related grammar to find matching concepts
  9. Concept graph is created consisting of the most probable concepts and fillers
  10. Sentence alternatives derived from the concept graph and best one drawn

While IVR systems have their advantages and have been deployed worldwide, another technology is gaining ground: ASR (Automatic Speech Recognition). ASR lets customers interact with a computer using their natural voice instead of pushing buttons on the telephone keypad. For instance, in the office example above, you don’t have to remember the extension number of the person you’re calling. You just say the person’s name, and your call is automatically transferred. We’ll look at the technology behind ASR systems.

There are two parts to ASR: one where the speech is converted to words, and another where the system derives meaning from these words and produces a specific action, such as responding to the customer. Each is a separate process in itself.

Speech to word
There are two phases in the speech-recognition process: preprocessing and decoding. The first, also called feature extraction, converts the spoken input to digital form. The recorded speech is divided into frames, each covering a few milliseconds of speech, and a Fast Fourier Transform is applied to obtain each frame’s frequency content. Each frame is then converted into a feature vector, which contains the frame’s frequency and energy information. There are different feature extraction techniques, including Filterbank, Mel Cepstrum and PLP (Perceptual Linear Prediction). The choice among them depends on the application and the environment in which the system will be used. The feature vectors are then fed to the decoding process, which converts them into sentences.
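As a rough illustration of this step, the sketch below cuts a signal into short frames and builds a feature vector for each one from its energy and its frequency magnitudes. It is only a sketch under stated assumptions: it uses a naive discrete Fourier transform instead of a true FFT to stay short, the frames do not overlap, and the sample rate, frame size and function names are made up for the example rather than taken from any particular recognizer.

```python
import cmath
import math

def split_into_frames(samples, frame_size):
    """Cut the signal into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def dft_magnitudes(frame):
    """Naive DFT (an FFT computes the same values, only faster).
    Returns the magnitudes of frequency bins 0 .. N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

def feature_vector(frame):
    """Energy of the frame followed by its frequency magnitudes."""
    energy = sum(s * s for s in frame)
    return [energy] + dft_magnitudes(frame)

def extract_features(samples, frame_size):
    """One feature vector per frame."""
    return [feature_vector(f) for f in split_into_frames(samples, frame_size)]

# Hypothetical input: a 1 kHz tone sampled at 8 kHz, cut into 10 ms frames.
SAMPLE_RATE = 8000
FRAME_SIZE = 80  # 80 samples at 8 kHz = 10 ms
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]
features = extract_features(tone, FRAME_SIZE)

# Each frequency bin is SAMPLE_RATE / FRAME_SIZE = 100 Hz wide,
# so the 1 kHz tone should peak in bin 10.
peak_bin = max(range(FRAME_SIZE // 2 + 1), key=lambda k: features[0][1 + k])
print(len(features), "frames, peak in bin", peak_bin)
```

Techniques such as Mel Cepstrum go further by warping and compressing this raw spectrum, but the overall shape of the step, frames in and one vector per frame out, stays the same.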

ASR Solutions from Philips

Philips has been an active player in implementing commercial ASR solutions. It has two basic products in the field, known as SpeechPearl and SpeechMania. The former is a software solution that integrates with IVR platforms, while the latter is a complete solution offering speech recognition, understanding, dialogue control and speech output. For more, check out or contact

The decoding process starts by passing the feature vectors through an Acoustic Model. An Acoustic Model contains information on how various words and sub-words are pronounced. It consists of phonemes, triphones, syllables and whole words. Phonemes are the smallest units of speech that distinguish one spoken word from another. A triphone is also a phoneme, but it carries information on the phonemes that precede and follow it. An Acoustic Model is built by collecting a wide variety of human speech from people of different ages, sexes and dialects.

The Acoustic Model is used to map the feature vectors to different sub-word units. Once the sub-words are recognized, they’re mapped to words of an application lexicon using what’s called a Hidden Markov Model, or HMM. This is the most widely used statistical model in speech recognition. It uses probability to derive words, and in turn sentences, from the sub-word units. Simply put, it starts from a sub-word (the current state) and knows the probability of each sub-word it could transition to next.
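The idea of a current state with known transition probabilities can be sketched in a few lines. The table below is entirely hypothetical: the sub-word units and the numbers are invented for illustration, and the sketch covers only the transition part of an HMM. A full model would also hold the probabilities that tie each state to the observed feature vectors.

```python
# Hypothetical transition probabilities between sub-word units.
# TRANSITIONS[a][b] is the probability of moving from unit a to unit b.
TRANSITIONS = {
    "<start>": {"d": 0.6, "m": 0.4},
    "d": {"eh": 0.7, "ih": 0.3},
    "eh": {"l": 0.8, "n": 0.2},
    "l": {"iy": 0.9, "<end>": 0.1},
    "iy": {"<end>": 1.0},
}

def next_units(current):
    """All units reachable from the current state, most likely first."""
    options = TRANSITIONS.get(current, {})
    return sorted(options.items(), key=lambda kv: kv[1], reverse=True)

def sequence_probability(units):
    """Probability of a whole path: the product of its transition probabilities."""
    prob = 1.0
    prev = "<start>"
    for unit in units + ["<end>"]:
        prob *= TRANSITIONS.get(prev, {}).get(unit, 0.0)
        prev = unit
    return prob

print(next_units("d"))  # [('eh', 0.7), ('ih', 0.3)]
print(sequence_probability(["d", "eh", "l", "iy"]))  # about 0.3024
```

A path that uses a transition missing from the table gets a probability of zero, which is how the model rules out sub-word sequences it has never seen.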

HMM is used not only to derive words from sub-words by mapping them against an application lexicon, but also whole sentences by mapping them to a language model. The lexicon is structured like a tree to ease the recognition process. So the recognition of a particular word starts from the tree’s root, which would be the beginning of a sub-word, and ends on a leaf, which is the word’s end. There could be various paths from the root to the leaf, and the best one would be the right word. The most likely path to this leaf is determined by what’s called the Viterbi algorithm. The algorithm can use pruning techniques to remove the improbable paths in the tree.
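A minimal sketch of the Viterbi search with optional pruning follows. It works on a small flat model rather than a real lexicon tree, and the states, the acoustic labels A1 to A3 and all probabilities are assumptions made up for the example. Production recognizers usually work with log probabilities to avoid numeric underflow, which this sketch skips for readability.

```python
def viterbi(observations, states, start_p, trans_p, emit_p, beam=None):
    """Return (best_path, probability) for the observation sequence.
    If beam is set, only the `beam` most probable partial paths
    survive after each step (pruning)."""
    # paths maps each state to (probability, best path ending in that state)
    paths = {}
    for s in states:
        p = start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0)
        if p > 0.0:
            paths[s] = (p, [s])

    for obs in observations[1:]:
        if beam is not None:
            kept = sorted(paths.items(), key=lambda kv: kv[1][0], reverse=True)
            paths = dict(kept[:beam])
        new_paths = {}
        for s in states:
            best = None
            for prev, (p_prev, path) in paths.items():
                p = p_prev * trans_p[prev].get(s, 0.0) * emit_p[s].get(obs, 0.0)
                if p > 0.0 and (best is None or p > best[0]):
                    best = (p, path + [s])
            if best is not None:
                new_paths[s] = best
        paths = new_paths

    if not paths:
        return [], 0.0
    prob, path = max(paths.values(), key=lambda v: v[0])
    return path, prob

# Hypothetical model that tells "bin", "pin", "bit" and "pit" apart.
states = ["b", "p", "ih", "n", "t"]
start_p = {"b": 0.5, "p": 0.5}
trans_p = {"b": {"ih": 1.0}, "p": {"ih": 1.0},
           "ih": {"n": 0.6, "t": 0.4}, "n": {}, "t": {}}
emit_p = {"b": {"A1": 0.7}, "p": {"A1": 0.3},
          "ih": {"A2": 0.9}, "n": {"A3": 0.8}, "t": {"A3": 0.1}}

path, prob = viterbi(["A1", "A2", "A3"], states, start_p, trans_p, emit_p)
print(path, prob)  # ['b', 'ih', 'n'] with probability of about 0.1512

# With a beam of 1, the weaker "p" hypothesis is dropped after the first frame.
pruned_path, pruned_prob = viterbi(["A1", "A2", "A3"], states,
                                   start_p, trans_p, emit_p, beam=1)
```

Pruning trades accuracy for speed: a narrow beam can discard a path that would have turned out best later on, so the beam width is a tuning decision.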

Mac recognition

Apple’s speech recognition makes working on the machine very easy. The Mac performs speaker-independent, continuous speech recognition. This means that spoken commands are executed regardless of the speaker’s accent or normal variations in voice, and you need not train the system. You can just talk naturally to it.

The Speech Control Panel lets you set how your Mac listens for commands. It can listen continuously or be set to listen only on a key press, like ‘Esc’. You can also have it listen for a name before any command, which prevents it from accidentally responding to surrounding voices in continuous listening mode. While ‘Speakable Items’ is on, a feedback window on your desktop monitors the commands and shows you the ‘recognized’ sounds along with an animated character (which can be
Speakable Commands can be accessed from the Apple Menu or invoked by saying ‘Show me what to say’. These items are divided into three types: built-in commands that are always available, global commands that work when any application is active, and application-specific commands.

The full list of commands cannot be given here, but it is an extensive collection that makes your work a lot more fun. What’s more, you can even add words to its vocabulary. To make any application or alias speakable, just click on it and tell your Mac to ‘Make this application speakable’. The next time you need to start the application, just say its name!

Once words are formed from the application lexicon, they must be passed through a language model to create sentences. A language model is application-dependent and contains information on which word sequences callers use most often. It recognizes sentences by forming word graphs, which represent the most probable sentences that the caller could have spoken.
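One simple way to capture which word sequences callers use most often is a bigram model, which counts how often each word follows another. The sketch below is an assumption for illustration, not the model any specific product uses: the training sentences are invented, no smoothing is applied, and it scores whole candidate sentences rather than building a word graph.

```python
from collections import defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows another in the training sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def sentence_probability(counts, sentence):
    """Product of the bigram probabilities along the sentence (no smoothing)."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        total = sum(counts[prev].values()) if prev in counts else 0
        if total == 0:
            return 0.0
        prob *= counts[prev].get(word, 0) / total
    return prob

def best_sentence(counts, candidates):
    """Pick the candidate the language model considers most likely."""
    return max(candidates, key=lambda s: sentence_probability(counts, s))

# Hypothetical utterances collected for a flight-reservation application.
model = train_bigram_model([
    "i want to fly to mumbai",
    "i want to go to delhi",
    "i would like to fly to delhi",
])

# Two acoustically similar hypotheses; only the first uses known word pairs.
candidates = ["i want to fly to delhi", "i want two fly too delhi"]
print(best_sentence(model, candidates))  # i want to fly to delhi
```

Because the model is trained on what callers of one application actually say, the same recognizer can rank hypotheses very differently when it is moved to another application.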

Meaning from speech
Once the speech-recognition process forms a word graph from the utterance, it’s passed to the speech understanding module to derive a meaning. This module parses the word graph using its own grammar rules, which are specific to the application it’s being used in, and derives concepts and fillers. Here, concepts are words that match the grammar rules, and fillers are unrecognized words. So if you call up a flight reservation system and say, “I would like to go from Delhi to Mumbai”, the speech understanding module will understand “from Delhi to Mumbai” and tag it as a concept, and the rest would be fillers. The combination of concepts and meaningful fillers is formed into a concept graph, and the program calculates the best sentence alternatives for it. Finally, the optimal sentence alternative is derived using statistical techniques.
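The split into concepts and fillers can be sketched with a tiny grammar. This is a simplification under stated assumptions: the single ‘route’ rule and its field names are hypothetical, the grammar is written as a regular expression, and the code parses one recognized sentence instead of a full word graph.

```python
import re

# Hypothetical grammar for a flight-reservation application: each rule maps
# a concept name to a pattern over the recognized words.
GRAMMAR = {
    "route": re.compile(
        r"\bfrom (?P<origin>\w+) to (?P<destination>\w+)\b", re.IGNORECASE
    ),
}

def understand(sentence):
    """Split a recognized sentence into concepts (matched by the grammar)
    and fillers (every word left over)."""
    concepts = []
    remainder = sentence
    for name, pattern in GRAMMAR.items():
        match = pattern.search(sentence)
        if match:
            concepts.append((name, match.groupdict()))
            # Remove the matched words so that only fillers remain.
            remainder = remainder.replace(match.group(0), " ")
    fillers = remainder.split()
    return concepts, fillers

concepts, fillers = understand("I would like to go from Delhi to Mumbai")
print(concepts)  # [('route', {'origin': 'Delhi', 'destination': 'Mumbai'})]
print(fillers)   # ['I', 'would', 'like', 'to', 'go']
```

The application then acts on the concept alone, here the origin and destination of the trip, and can ignore the fillers.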

Anil Chopra
