
Behind the Scenes

PCQ Bureau

Speech recognition is one of the most active areas of research today. In fact, many full-blown speech recognition applications are being implemented in the West to increase work efficiency.


Speech recognition has evolved quite a bit over the past few years. Initially, it worked in discrete dictation mode, where you had to pause after each spoken word. Today, however, it uses continuous dictation. It’s also become smarter, with its own set of grammar rules to make out the meaning of what’s being said.

Speech recognition uses several techniques to "recognize" the human voice. It functions as a pipeline that converts digital audio signals coming from the sound card to recognized speech. These signals pass through several stages, where various mathematical and statistical methods are applied to figure out what is actually being said.

Let’s take a look at how it works.


The voice input to the microphone goes to the sound card. The output from the sound card, digital audio, is processed using FFT (Fast Fourier Transform) and then refined using HMMs (Hidden Markov Models) and other techniques. The built-in database is used for analyzing what’s been spoken. Feedback goes to the database at the final stage for the purpose of adaptation. The final recognized output then goes back to the CPU.


Sounds simple

First, you, the user, give a voice command over the microphone, which is passed to the sound card in your system. This analog signal is sampled 16,000 times a second and converted into digital form using a technique called Pulse Code Modulation or PCM. This digital waveform is a stream of amplitudes that, when plotted, looks like a wavy line. The speech recognition software can’t figure out anything from this stream directly; it first has to translate it into something it can easily recognize. So, it converts this signal into a set of discrete frequency bands using a technique called the Windowed Fast Fourier Transform (FFT). For this, a short slice of the audio signal is taken every 1/100th of a second, and each slice is converted into the strengths of the frequencies it contains. So, the incoming stream is now a set of discrete frequency bands, in a form that can be used by the speech recognizer.
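
The sampling and windowing steps described above can be sketched in a few lines of Python. This is only an illustration, not how any particular recognizer is written: the 20 ms frame length, the Hann window, the 16-bit sample depth and all the function names are assumptions made for the example, and a plain discrete Fourier transform stands in for the faster FFT so that the code needs nothing beyond the standard library.

```python
import cmath
import math

SAMPLE_RATE = 16000          # samples per second, as in the article
HOP = SAMPLE_RATE // 100     # a new frame every 1/100th of a second (160 samples)
FRAME_LEN = 320              # assumed frame length: 20 ms


def pcm_encode(signal):
    """Quantize samples in [-1.0, 1.0] to 16-bit integers (assumed bit depth)."""
    return [max(-32768, min(32767, round(s * 32767))) for s in signal]


def hann(n):
    """Hann window: tapers the frame edges to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Plain DFT magnitudes for bins 0..N/2 (an FFT gives the same values faster)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append(abs(total))
    return spectrum


def windowed_spectra(samples):
    """Slice the PCM stream into overlapping frames and analyze each one."""
    window = hann(FRAME_LEN)
    spectra = []
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP):
        frame = [s * w for s, w in zip(samples[start:start + FRAME_LEN], window)]
        spectra.append(magnitude_spectrum(frame))
    return spectra


# A tenth of a second of a pure 1,000 Hz tone stands in for the microphone input.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(1600)]
spectra = windowed_spectra(pcm_encode(tone))
bin_width = SAMPLE_RATE / FRAME_LEN      # each frequency bin covers 50 Hz
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * bin_width
```

Each entry of `spectra` is one slice of audio expressed as frequency strengths. For the test tone, the strongest bin of the first frame sits at 1,000 Hz, which is the kind of information the later matching stages work from.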


The next stage involves recognizing these bands of frequencies. For this, the speech recognition software has a database containing thousands of frequency patterns for the basic sounds of speech, or "phonemes", as they’re called. A phoneme is the smallest unit of speech in a language or dialect. The utterance of one phoneme is different from another, such that if one phoneme replaces another in a word, the word would have a different meaning. For example, if the "b" in "bat" were replaced by the phoneme "r", the word would become "rat". The phoneme database is used to match the audio frequency bands that were sampled. So, for example, if the incoming frequency pattern sounds like a "t", the software will try to match it to the corresponding phoneme in the database. Each phoneme is tagged with a feature number, which is then assigned to the incoming signal.
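
As a rough sketch of the matching step just described, the snippet below compares one incoming set of frequency bands against a tiny reference database and returns the feature number of the closest entry. Everything in it is an assumption made for illustration: the `PHONEME_DB` contents, the four-band patterns, and the choice of Euclidean distance as the measure of closeness. A real database would hold far more entries and far richer patterns.

```python
import math

# Hypothetical reference database: feature number -> (phoneme label, band energies).
# The four-band patterns are made-up values for illustration only.
PHONEME_DB = {
    1: ("t", [0.1, 0.2, 0.9, 0.7]),
    2: ("s", [0.0, 0.1, 0.6, 1.0]),
    3: ("a", [0.9, 0.7, 0.2, 0.1]),
}


def distance(a, b):
    """Euclidean distance between two band-energy vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_feature_number(bands):
    """Return the feature number of the closest reference pattern."""
    return min(PHONEME_DB, key=lambda num: distance(bands, PHONEME_DB[num][1]))


incoming = [0.15, 0.25, 0.8, 0.75]      # one analyzed slice of audio
feature = assign_feature_number(incoming)
label = PHONEME_DB[feature][0]
```

Here the incoming slice lands closest to the made-up "t" pattern, so it is tagged with feature number 1.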

Figuring out the right sound

If life were simple, each incoming frequency band would find the right phoneme in the database. The software would then collate these to form words, and your PC would understand you. Unfortunately, it isn’t that simple. There can be so many variations in sound due to how words are spoken that it’s almost impossible to exactly match an incoming sound to an entry in the database. For example, the "t" in "table" sounds different from the "t" in, say, "stop". Not only that, but different people pronounce the same word differently. To make matters worse, the environment also adds its own share of noise. Therefore, the software has to use complex techniques to approximate the incoming sound and figure out which phonemes are being used.


Training the software

One way of identifying phonemes is to "train" the speech recognition software. In training, many variations of the same phoneme are given, and the software analyzes each of these through statistical methods. Let’s see how it recognizes one phoneme.

Compared to the sampling interval of 1/100th of a second, the duration of one phoneme is long. During this time, many frequency bands would actually pass the speech recognizer, and each would be assigned a feature number. So, the software uses statistics to figure out the probability of a particular feature number appearing in a phoneme. The phoneme that gives the highest probability to the feature numbers actually observed is taken as the one you’ve spoken. This way, it gathers data on the hundreds of variations of the same phoneme passed to it, and approximates the right one.
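
The training idea above can be illustrated with a toy model that counts how often each feature number turns up in recordings of a phoneme, then picks the phoneme whose counts best explain a new sequence. The training data, the list of feature numbers and the add-one smoothing are all assumptions chosen to keep the sketch small. Treating each feature number as independent of its neighbors is also a simplification.

```python
import math
from collections import Counter

# Hypothetical training data: for each phoneme, the feature numbers observed
# in several recorded variations of it.
TRAINING = {
    "t": [[1, 1, 4, 1], [1, 4, 1, 1], [1, 1, 1, 2]],
    "s": [[2, 2, 2, 1], [2, 3, 2, 2], [2, 2, 3, 2]],
}
FEATURE_NUMBERS = [1, 2, 3, 4]


def train(training):
    """Estimate P(feature number | phoneme) with add-one smoothing."""
    model = {}
    for phoneme, examples in training.items():
        counts = Counter(f for example in examples for f in example)
        total = sum(counts.values()) + len(FEATURE_NUMBERS)
        model[phoneme] = {f: (counts[f] + 1) / total for f in FEATURE_NUMBERS}
    return model


def most_likely_phoneme(model, observed):
    """Pick the phoneme whose statistics best explain the observed features."""
    def log_likelihood(phoneme):
        return sum(math.log(model[phoneme][f]) for f in observed)
    return max(model, key=log_likelihood)


model = train(TRAINING)
guess = most_likely_phoneme(model, [1, 4, 1, 2])
```

The observed sequence never appeared in training exactly as given, yet it is still attributed to "t", because feature numbers 1 and 4 were far more common in the "t" recordings than in the "s" ones.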


Other techniques

This is just the tip of the iceberg, a small sample of how speech recognition software recognizes sounds. There are many other complexities involved in recognizing sound. For example, the software has to be able to judge when one phoneme ends and the next one begins. For this, it uses a technique called Hidden Markov Models (HMMs), another mathematical approach that uses statistics. To figure out when speech starts and stops, a speech recognizer has silence phonemes, which are also assigned feature numbers.
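
To give a feel for how an HMM can locate phoneme boundaries, here is a small sketch that runs the Viterbi algorithm, a standard way of finding the most probable state sequence in an HMM, over a short stream of feature numbers. The three states (including a silence state), every probability in the tables, and the feature numbering are invented for the example. The article does not say which algorithm or parameters a given recognizer uses.

```python
import math

STATES = ["sil", "s", "iy"]

# All probabilities below are invented for the example.
START = {"sil": 0.8, "s": 0.1, "iy": 0.1}
TRANS = {
    "sil": {"sil": 0.6, "s": 0.3, "iy": 0.1},
    "s":   {"sil": 0.1, "s": 0.6, "iy": 0.3},
    "iy":  {"sil": 0.3, "s": 0.1, "iy": 0.6},
}
# P(feature number | state); feature 0 plays the role of the silence feature.
EMIT = {
    "sil": {0: 0.8, 1: 0.1, 2: 0.1},
    "s":   {0: 0.1, 1: 0.8, 2: 0.1},
    "iy":  {0: 0.1, 1: 0.1, 2: 0.8},
}


def viterbi(observations):
    """Return the most probable state for each time slice."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
            for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        step, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            step[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            pointers[s] = prev
        best = step
        backpointers.append(pointers)
    # Trace back from the best final state to recover the whole path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


features = [0, 0, 1, 1, 1, 2, 2, 2, 0]
path = viterbi(features)
boundaries = [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

The positions in `boundaries` mark the slices where one state gives way to the next, that is, where speech starts, where the "s" hands over to the vowel, and where silence returns.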

There are also some phonemes that depend upon what comes before or after them. For example, consider two words, "see" and "saw". Here the vowels "ee" and "aw" intrude into the phoneme "s", so the "s" sounds slightly different in each word. You also hear the vowels for a longer period than the "s". To solve this problem, speech recognition software uses tri-phones, which are phonemes modeled together with the phonemes that surround them.
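
A quick way to picture tri-phones is to relabel each phoneme with its neighbors. The sketch below does this for "see" and "saw". The `left-center+right` label format, the "sil" padding at the word edges, and the phoneme symbols "iy" and "ao" are conventions assumed for the example, not something the article specifies.

```python
def to_triphones(phonemes, boundary="sil"):
    """Label each phoneme with its left and right neighbors: left-center+right."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


see = to_triphones(["s", "iy"])   # "see"
saw = to_triphones(["s", "ao"])   # "saw"
```

The "s" of "see" becomes `sil-s+iy` while the "s" of "saw" becomes `sil-s+ao`, so the recognizer can keep separate statistics for the two slightly different sounds.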


In another technique called pruning, the software generates several hypotheses on what could have been spoken in a particular utterance. It then generates a score for each hypothesis, and the one with the highest score is taken. The ones with the lower scores are "pruned" out.
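
Pruning as described above boils down to ranking hypotheses by score and discarding the weak ones. The sketch below keeps a fixed number of survivors. The hypotheses, their scores and the `beam_width` parameter are made up for illustration, and a real recognizer would typically prune repeatedly as the utterance unfolds rather than once at the end.

```python
def prune(hypotheses, beam_width=2):
    """Keep only the beam_width highest-scoring hypotheses."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    return ranked[:beam_width]


# Hypothetical (text, score) pairs for one utterance; higher is better.
hypotheses = [
    ("recognize speech", -12.4),
    ("wreck a nice beach", -13.1),
    ("recognize peach", -17.8),
    ("wreck an ice peach", -21.5),
]
survivors = prune(hypotheses)
best_text, best_score = survivors[0]
```

With a beam width of 2, the two lowest-scoring guesses are dropped and the top-scoring one is taken as the result.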

This is the essence of how speech recognition works, though there are lots of other complexities involved. The technology holds great scope for the future.

Ankur Saxena
