How Does Speech Recognition Work?

How does a computer convert spoken speech into data that it can then manipulate or execute? Well, from a general perspective, what has to be done? Initially, when we speak, a microphone converts the analog signal of our voice into digital chunks of data that the computer must analyze. It is from this data that the computer must extract enough information to confidently guess the word being spoken.
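To make the "digital chunks" concrete, here is a minimal sketch of sampling and quantization, assuming a signal already scaled to the range -1.0 to 1.0. The function name `digitize`, the 8000 Hz sample rate, the 16-bit depth and the 440 Hz test tone are all assumptions chosen for illustration, not details from any particular recognizer.

```python
import math

def digitize(signal, duration_s, sample_rate=8000, bits=16):
    """Sample a continuous signal (a function of time in seconds that
    returns values in [-1.0, 1.0]) and quantize each sample to a signed
    integer, roughly what a sound card hands to the computer."""
    max_level = 2 ** (bits - 1) - 1              # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                      # time of the n-th sample
        value = max(-1.0, min(1.0, signal(t)))   # clip anything out of range
        samples.append(round(value * max_level))
    return samples

def tone(t):
    """A stand-in for a voice: a pure 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)

samples = digitize(tone, duration_s=0.01)
print(len(samples), samples[:5])
```

Everything the recognizer does afterwards works on lists of integers like `samples`.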

This is no small task! In fact, in the early 1990s, the best recognizers were yielding a 15% error rate on a relatively small 20,000-word dictation task. Now though, that error percentage has dropped to as low as 1-2%, although this can vary greatly between speakers.

So, how is it done?

Step 1: Extract Phonemes

Phonemes are best described as linguistic units. They are the sounds that group together to form our words, although quite how a phoneme converts into sound depends on many factors including the surrounding phonemes, speaker accent and age. Here are a few examples:

aa    father
ae    cat
ah    cut
ao    dog
aw    foul
ng    sing
t     talk
th    thin
uh    book
uw    too
zh    pleasure

English uses about 40 phonemes to convey the 500,000 or so words it contains, making them a relatively good data item for speech engines to work with.
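One way to picture why phonemes are a convenient data item is a pronunciation lexicon: each word is stored as a short sequence of symbols drawn from the small phoneme set. The sketch below is only an illustration. The word list and the transcriptions are my own assumptions, written in the same style of symbols as the examples above, and `LEXICON` and `lookup` are hypothetical names.

```python
# Illustrative transcriptions only; a real lexicon holds many thousands of words.
LEXICON = {
    "cat":  ("k", "ae", "t"),
    "cut":  ("k", "ah", "t"),
    "talk": ("t", "ao", "k"),
    "thin": ("th", "ih", "n"),
    "sing": ("s", "ih", "ng"),
    "book": ("b", "uh", "k"),
    "too":  ("t", "uw"),
    "two":  ("t", "uw"),
}

# Reverse index: phoneme sequence -> every word pronounced that way.
BY_PHONEMES = {}
for word, phonemes in LEXICON.items():
    BY_PHONEMES.setdefault(phonemes, []).append(word)

def lookup(phonemes):
    """Return the words whose pronunciation matches the phoneme sequence."""
    return BY_PHONEMES.get(tuple(phonemes), [])

print(lookup(["k", "ae", "t"]))
print(lookup(["t", "uw"]))
```

Note that "too" and "two" share one phoneme sequence, so a lookup alone cannot always settle on a single word. That ambiguity is part of what the statistical models in Step 2 are for.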

Extracting Phonemes

Phonemes are often extracted by running the waveform through a Fourier Transform, which allows the waveform to be analyzed in the frequency domain. Well, what does this mean? It is probably easiest to understand this principle by looking at a spectrogram. A spectrogram is a 3D plot of a waveform's frequency and amplitude versus time. In many cases though, the amplitude at each frequency is expressed as a colour (either greyscale, or a gradient colour). Below is the spectrogram of me saying "Generation5":

As a comparison, here is another spectrogram of the "ss" bit of assure (this is a phoneme):

Using this, can you see where in "Generation5" the "sh" of Generation5 comes in the spectrogram? Note that the timescales are slightly different on the two spectrograms, so they look a little different.

As you can see, it is relatively easy to match up the amplitudes and frequencies of a template phoneme with the corresponding phoneme in a word. For computers, this task is obviously more complicated but definitely achievable.
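The matching idea can be sketched in a few lines. The code below is a toy under stated assumptions, not how a production engine works: it uses a plain discrete Fourier transform on fixed-size frames, pure sine tones stand in for real phonemes, and the comparison is a simple summed squared difference. The function names, the frame size of 64 and the hop of 32 are all my own choices.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of the discrete Fourier transform of one frame
    (first half only, since the input is real-valued)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]

def spectrogram(samples, frame_size=64, hop=32):
    """Cut the signal into overlapping frames and transform each one.
    Result: a list of time slices, each a list of frequency-bin magnitudes."""
    return [
        magnitude_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]

def best_match(spec, template):
    """Slide the template along the spectrogram and return the frame index
    where the summed squared difference is smallest."""
    def distance(offset):
        return sum(
            (a - b) ** 2
            for frame, tframe in zip(spec[offset:], template)
            for a, b in zip(frame, tframe)
        )
    return min(range(len(spec) - len(template) + 1), key=distance)

def tone(freq, n, rate=8000):
    """n samples of a sine wave at freq Hz, sampled at rate Hz."""
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

# Toy "word": a low tone, then a high tone, then the low tone again.
word = tone(500, 256) + tone(2000, 256) + tone(500, 256)
template = spectrogram(tone(2000, 256))    # toy "phoneme" template

spec = spectrogram(word)
start_frame = best_match(spec, template)
print("template matches best starting at frame", start_frame)
```

Real speech is far messier than two sine tones, which is why practical systems compare more robust features and tolerate differences in timing rather than demanding a frame-for-frame fit.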

Step 2: Markov Models

Now that the computer has generated a list of phonemes, what happens next? Obviously these phonemes have to be converted into words and perhaps even the words into sentences. How this occurs can be very complicated indeed, especially for systems designed for speaker-independent, continuous dictation.

However, the most common method is to use a Hidden Markov Model (HMM). The theory behind HMMs is complicated, but a brief look at simple Markov Models will help you gain an understanding of how they work.

Basically, think of a Markov Model (in a speech recognition context) as a chain of phonemes that represent a word. The chain can branch, and if it does, each branch is assigned a probability. For example:
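A branching chain for "tomato" can be written down as a table of transition probabilities. The sketch below is a plain Markov chain, not a full Hidden Markov Model, and the phoneme symbols, the state numbering and the 0.4/0.6 split at the branch are assumptions picked for illustration.

```python
# Each state is numbered so that the two "t" sounds stay distinct.
PHONEME = {0: "t", 1: "ah", 2: "m", 3: "ey", 4: "aa", 5: "t", 6: "ow"}

# state -> list of (next state, probability of taking that branch)
TRANSITIONS = {
    "START": [(0, 1.0)],
    0: [(1, 1.0)],
    1: [(2, 1.0)],
    2: [(3, 0.4), (4, 0.6)],   # the branch: "ey" (American) or "aa" (English)
    3: [(5, 1.0)],
    4: [(5, 1.0)],
    5: [(6, 1.0)],
    6: [("END", 1.0)],
}

def pronunciations(state="START", prob=1.0, spoken=()):
    """Yield every (phoneme sequence, probability) path from START to END."""
    if state == "END":
        yield spoken, prob
        return
    for nxt, p in TRANSITIONS[state]:
        label = () if nxt == "END" else (PHONEME[nxt],)
        yield from pronunciations(nxt, prob * p, spoken + label)

def sequence_probability(phonemes):
    """Probability that the chain produces exactly this phoneme sequence."""
    target = tuple(phonemes)
    return sum(p for seq, p in pronunciations() if seq == target)

for seq, p in pronunciations():
    print(" ".join(seq), p)
```

A sequence the chain cannot produce gets probability zero, so comparing `sequence_probability` across the chains for many words is one simple way to rank candidate words for a given list of phonemes.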

Note that this Markov Model represents both the American English and the (real) English methods of saying the word "tomato". In this case, the model is slightly biased towards the English pronunciation. This idea can be extended up to the level of sentences, and can greatly improve recognition. For example:
Recognize speech
Wreck a nice beach
These two phrases are surprisingly similar, yet have wildly different meanings. A program using a Markov Model at the sentence level might be able to ascertain which of these two phrases the speaker was actually using through statistical analysis using the phrase that preceded it.
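As a rough sketch of that kind of statistical analysis, the code below counts word pairs (bigrams) in a tiny made-up corpus and uses them to choose between the two phrases given the word that came before. The corpus, the add-one smoothing and the averaging of log-probabilities (so that the longer phrase is not penalized simply for having more words) are all assumptions of this example, not a description of any particular recognizer.

```python
import math
from collections import Counter

# A tiny invented corpus; a real system would learn from vastly more text.
CORPUS = [
    "computers recognize speech",
    "phones recognize speech",
    "storms wreck a nice beach",
    "waves wreck a nice beach",
]

def train_bigrams(sentences):
    """Count single words and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def avg_log_prob(words, unigrams, bigrams):
    """Average log P(word | previous word), with add-one smoothing so
    that unseen pairs get a small but non-zero probability."""
    vocab = len(unigrams)
    pairs = list(zip(words, words[1:]))
    total = sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
        for prev, word in pairs
    )
    return total / len(pairs)

def pick(previous_word, candidates, unigrams, bigrams):
    """Choose the candidate phrase that best follows the previous word."""
    return max(
        candidates,
        key=lambda c: avg_log_prob([previous_word] + c.split(), unigrams, bigrams),
    )

unigrams, bigrams = train_bigrams(CORPUS)
candidates = ["recognize speech", "wreck a nice beach"]
print(pick("computers", candidates, unigrams, bigrams))
print(pick("storms", candidates, unigrams, bigrams))
```

With this corpus, the same pair of similar-sounding candidates resolves differently depending on the preceding word, which is the effect described above.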

For more information on Markov Models, see the Generation5 introductory essay.

Conclusion

This essay hopefully gave you a decent overview of how speech recognition works. The stress is on the word overview - speech technologies are quickly moving forward, and the algorithms and methods described in this essay are being greatly optimized and improved.

With the advent of intelligent, filtering microphones and near-perfect speech-recognition, we will hopefully see a new era of human-computer interaction evolve.

Submitted: 23/10/2002

Article content copyright © James Matthews, 2002.

All content copyright © 1998-2007, Generation5 unless otherwise noted.