Have you ever called up an office without knowing the extension number of the person you’re trying to reach? It can be a nightmare, because you have to dial the operator for assistance, who might be busy attending to other calls, putting your call on eternal hold. So despite the office having a system with the technology for automatic call transfer, manual intervention is required. The same problem exists with other IVR (Interactive Voice Response) based systems, where you have to push a lot of buttons on the telephone keypad to reach the desired service.
While IVR systems have their advantages, and have been deployed worldwide, another technology is gaining ground: ASR (Automatic Speech Recognition). It allows customers to interact with a computer using their natural voice instead of pushing buttons on the telephone keypad. For instance, in the office example above, you don’t have to remember the extension number of the person you’re calling. You just say the person’s name, and your call is automatically transferred. We’ll look at the technology behind ASR systems.
There are two parts to ASR: one where speech is converted to words, and another where the system derives meaning from these words and produces a specific action, such as responding to the customer. Each is a separate process in itself.
Speech to word
There are two phases in the speech-recognition process: preprocessing and decoding. The first, also called feature extraction, converts the spoken speech into digital form. The recorded speech is split into frames, each covering a few milliseconds of speech, and a Fast Fourier Transform is applied to each frame. Every frame is converted into a feature vector, which contains the frequency and energy information of that frame. There are different kinds of feature extraction techniques, including Filterbank, Mel Cepstrum and PLP, which differ based on the application and environment they’ll be used in. The feature vectors are then fed to the decoding process in order to convert them into sentences.
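To get a feel for this, here’s a minimal sketch of the framing and FFT step in Python. The frame length, hop size and plain log power spectrum are illustrative assumptions; real systems would use Mel filterbanks, MFCCs or PLP features at this stage.

```python
import numpy as np

def extract_features(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Split speech into short frames and compute a simple spectral
    feature vector (log power spectrum) for each frame.

    Frame/hop lengths and the plain log spectrum are illustrative;
    real systems typically use Mel filterbanks, MFCCs or PLP here.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # step between frames
    window = np.hamming(frame_len)                   # reduce edge effects

    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2   # FFT -> power spectrum
        features.append(np.log(spectrum + 1e-10))    # log energy per frequency
    return np.array(features)                        # one feature vector per frame

# Example: one second of a synthetic 440 Hz tone sampled at 8 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
print(extract_features(tone).shape)                  # (frames, frequency bins)
```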
The decoding process starts by passing the feature vectors through an Acoustic Model. An Acoustic Model contains information on how various words and sub-words are pronounced. It consists of phonemes, triphones, syllables and whole words. Phonemes, by definition, are the smallest units of speech, resembling distinctive parts of spoken words. A triphone is a phoneme together with information on the phonemes that precede and follow it. An Acoustic Model is made by collecting all varieties of human speech from people of different ages, sexes and dialects.
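As a rough illustration of phonemes and triphones, the snippet below expands a word’s phoneme sequence into context-dependent triphones. The tiny pronunciation dictionary and phoneme symbols are made up for the example.

```python
# Hypothetical mini pronunciation dictionary (phoneme symbols are illustrative)
LEXICON = {
    "delhi":  ["d", "eh", "l", "iy"],
    "mumbai": ["m", "uh", "m", "b", "ay"],
}

def triphones(phonemes):
    """Turn a phoneme sequence into triphones: each phoneme together
    with its left and right context ('sil' marks word boundaries)."""
    padded = ["sil"] + phonemes + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(triphones(LEXICON["delhi"]))
# ['sil-d+eh', 'd-eh+l', 'eh-l+iy', 'l-iy+sil']
```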
The Acoustic Model is used to map the feature vectors to different sub-word units. Once the sub-words are recognized, they’re mapped to words of an application lexicon using what’s called the Hidden Markov Model or HMM. This is the most widely used statistical model in speech recognition. It uses probability to derive words, and eventually sentences, from the sub-word units. Simply put, it starts from a sub-word (the current state) and knows the probability of each sub-word it could transition to next.
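A toy example of this idea: from the current sub-word state, the model knows the probability of each sub-word that could follow, and the probability of a whole sequence is the product of the transition probabilities along the way. The numbers below are invented for illustration.

```python
# Toy transition probabilities between sub-word states (numbers are invented)
TRANSITIONS = {
    "d":  {"eh": 0.7, "ah": 0.3},
    "eh": {"l": 0.8, "n": 0.2},
    "l":  {"iy": 0.9, "ah": 0.1},
}

def sequence_probability(states):
    """Probability of moving through a sequence of sub-word states,
    multiplying the transition probability at each step."""
    prob = 1.0
    for current, nxt in zip(states, states[1:]):
        prob *= TRANSITIONS.get(current, {}).get(nxt, 0.0)
    return prob

print(sequence_probability(["d", "eh", "l", "iy"]))  # 0.7 * 0.8 * 0.9 = 0.504
```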
HMM is used not only to derive words from sub-words by mapping them against an application lexicon, but also whole sentences by mapping them to a language model. The lexicon is structured like a tree to ease the recognition process. So the recognition of a particular word starts from the tree’s root, which is the beginning of a sub-word sequence, and ends on a leaf, which marks the word’s end. There can be various paths from the root to a leaf, and the best one gives the right word. The most likely path to this leaf is determined by what’s called the Viterbi algorithm. The algorithm can use pruning techniques to remove improbable paths from the tree.
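The sketch below gives the flavour of this: a lexicon stored as a tree of phonemes, searched frame by frame while keeping only the most promising partial paths (a crude stand-in for Viterbi decoding with pruning). The dictionary and acoustic scores are invented; a real decoder would combine acoustic and language-model scores in the full Viterbi recursion.

```python
# Lexicon stored as a phoneme tree: root -> phonemes -> ... -> word at the leaf
LEXICON_TREE = {
    "d": {"eh": {"l": {"iy": {"WORD": "delhi"}}}},
    "m": {"uh": {"m": {"b": {"ay": {"WORD": "mumbai"}}}}},
}

def best_word(frame_scores, beam=2):
    """Walk the lexicon tree frame by frame, keeping only the `beam`
    best partial paths at each step (a crude stand-in for Viterbi
    decoding with pruning). `frame_scores` maps each frame to
    per-phoneme acoustic scores."""
    paths = [(1.0, LEXICON_TREE)]                    # (score, position in tree)
    for scores in frame_scores:
        expanded = []
        for score, node in paths:
            for phoneme, child in node.items():
                if phoneme != "WORD" and phoneme in scores:
                    expanded.append((score * scores[phoneme], child))
        # Prune: keep only the highest-scoring partial paths
        paths = sorted(expanded, key=lambda p: p[0], reverse=True)[:beam]
    # Return the word at the best leaf reached, if any
    for score, node in paths:
        if "WORD" in node:
            return node["WORD"], score
    return None, 0.0

# Invented acoustic scores for four frames of speech
frames = [{"d": 0.9, "m": 0.1}, {"eh": 0.8}, {"l": 0.7}, {"iy": 0.9}]
print(best_word(frames))   # ('delhi', 0.4536)
```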
Once words are formed from the application lexicon, they must be passed through a language model to create sentences. A language model is application dependent and contains information on which word sequences are most commonly used by callers. It recognizes sentences by forming word graphs, which represent the most probable sentences that could have been spoken by the caller.
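One common way to build such a language model is with n-gram probabilities, which favour the word sequences callers actually use. The bigram probabilities below are invented for a hypothetical flight-booking application.

```python
# Invented bigram probabilities from a hypothetical flight-booking application
BIGRAMS = {
    ("from", "delhi"): 0.4,
    ("delhi", "to"): 0.5,
    ("to", "mumbai"): 0.3,
    ("from", "to"): 0.01,
}

def sentence_score(words):
    """Score a word sequence by multiplying bigram probabilities;
    unseen bigrams get a small floor probability instead of zero."""
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= BIGRAMS.get(pair, 1e-4)
    return score

# The natural word order scores far higher than a garbled one
print(sentence_score(["from", "delhi", "to", "mumbai"]))  # 0.4 * 0.5 * 0.3 = 0.06
print(sentence_score(["from", "to", "delhi", "mumbai"]))  # much lower
```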
Meaning from speech
Once the speech-recognition process forms a word graph from the utterance, it’s passed to the speech understanding module to derive a meaning. This module then parses the word graph using its own grammar rules, which are specific to the application it’s being used in, and derives concepts and fillers. Here, concepts are words that match the grammar rules, and fillers are unrecognized words. So if you call up a flight reservation system and say, “I would like to go from Delhi to Mumbai”, the speech understanding module will understand “from Delhi to Mumbai” and tag it as a concept, and the rest would be fillers. The combination of concepts and meaningful fillers is formed into a concept graph, and the program calculates the best sentence alternatives for it. Finally, the optimal sentence alternative is derived using statistical techniques.
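Here’s a minimal illustration of tagging concepts versus fillers, using a couple of hand-written grammar patterns for the flight-reservation example. The patterns and city list are assumptions, not part of any real system.

```python
import re

# Hand-written grammar rules for a hypothetical flight-booking application
CITIES = r"(delhi|mumbai|chennai|kolkata)"
RULES = {
    "origin": re.compile(rf"from\s+{CITIES}"),
    "destination": re.compile(rf"to\s+{CITIES}"),
}

def parse(utterance):
    """Tag substrings matching a grammar rule as concepts; everything
    else in the utterance is treated as filler."""
    text = utterance.lower()
    concepts = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            concepts[name] = match.group(1)
            text = text.replace(match.group(0), " ")   # strip the matched concept
    fillers = text.split()
    return concepts, fillers

print(parse("I would like to go from Delhi to Mumbai"))
# ({'origin': 'delhi', 'destination': 'mumbai'}, ['i', 'would', 'like', 'to', 'go'])
```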
Anil Chopra