How does a computer convert speech into data that it can then manipulate or execute? From a general perspective, what has to be done? First, when we speak, a microphone converts the analog signal of our voice into digital chunks of data that the computer must analyze. From this data the computer must extract enough information to confidently guess the word being spoken. This is no small task! In the early 1990s, the best recognizers were yielding a 15% error rate on a relatively small 20,000-word dictation task. Now, though, that error rate has dropped to as low as 1-2%, although this can vary greatly between speakers. So, how is it done?
Step 1: Extract Phonemes
Phonemes are best described as linguistic units. They are the sounds that group together to form our words, although exactly how a phoneme is realized as sound depends on many factors, including the surrounding phonemes and the speaker's accent and age. Here are a few examples:
[Table: example phonemes]
English uses about 40 phonemes to convey the 500,000 or so words it contains, making them a relatively good data item for speech engines to work with.
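To make the point about a small phoneme inventory concrete, here is a minimal sketch of a pronunciation lexicon. The six words and their ARPAbet-style spellings (stress marks omitted) are chosen purely for illustration, and the names `LEXICON` and `to_phonemes` are hypothetical; a real speech engine would use a far larger dictionary.

```python
# Illustrative mini-lexicon: each word maps to a sequence of ARPAbet-style
# phoneme symbols. The entries are assumptions picked for this example.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "tack": ["T", "AE", "K"],
    "act": ["AE", "K", "T"],
    "peach": ["P", "IY", "CH"],
    "cheap": ["CH", "IY", "P"],
    "speech": ["S", "P", "IY", "CH"],
}


def to_phonemes(phrase):
    """Look up each word of the phrase and concatenate its phonemes."""
    phonemes = []
    for word in phrase.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


# The distinct phonemes needed to spell every word in the lexicon.
inventory = sorted({p for pron in LEXICON.values() for p in pron})

print(len(LEXICON), "words built from", len(inventory), "phonemes:", inventory)
print(to_phonemes("cheap peach"))
```

Even in this toy case the same handful of symbols is reused across every word, which is why a recognizer can work with a short, fixed list of phonemes rather than a separate acoustic model for each word.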
Extracting Phonemes
Phonemes are often extracted by running the waveform through a Fourier Transform, which allows the waveform to be analyzed in the frequency domain. What does this mean? It is probably easier to understand this principle by looking at a spectrogram. A spectrogram is a 3D plot of a waveform's frequency and amplitude versus time. In many cases, though, the amplitude at each frequency is expressed as a colour (either greyscale or a colour gradient). Below is the spectrogram of me saying "Generation5":
[Figure: spectrogram of the author saying "Generation5"]
As you can see, it is relatively easy to match up the amplitudes and frequencies of a template phoneme with the corresponding phoneme in a word. For computers, this task is obviously more complicated but definitely achievable.
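For readers who want to see the frequency-domain idea in code, the following is a rough sketch of how such a plot can be computed: the signal is cut into short frames and each frame is passed through a discrete Fourier transform. The frame length, sample rate, Hann window, and the two-tone test signal are all assumptions made for this example, and the naive transform is written out for clarity rather than speed (real systems use a fast Fourier transform).

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each frequency bin from 0 Hz up to the Nyquist bin.

    Naive O(N^2) discrete Fourier transform, for illustration only.
    """
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        mags.append(abs(total))
    return mags


def spectrogram(samples, frame_len, hop):
    """Return one row per time frame and one column per frequency bin."""
    # A Hann window tapers each frame so its edges do not smear the spectrum.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        rows.append(dft_magnitudes(frame))
    return rows


def peak_frequency(row, sample_rate, frame_len):
    """Frequency in Hz of the strongest bin in one spectrogram row."""
    k = max(range(len(row)), key=row.__getitem__)
    return k * sample_rate / frame_len


SAMPLE_RATE = 8000  # assumed sample rate in Hz
FRAME_LEN = 64      # assumed frame length; each bin is 8000 / 64 = 125 Hz wide

# Toy signal standing in for speech: a 500 Hz tone followed by a 1500 Hz tone.
samples = ([math.sin(2 * math.pi * 500 * i / SAMPLE_RATE) for i in range(512)]
           + [math.sin(2 * math.pi * 1500 * i / SAMPLE_RATE) for i in range(512)])

spec = spectrogram(samples, FRAME_LEN, hop=FRAME_LEN)
peaks = [peak_frequency(row, SAMPLE_RATE, FRAME_LEN) for row in spec]
print(peaks)
```

The strongest bin sits at 500 Hz for the frames in the first half and at 1500 Hz for the second half, which is the kind of pattern that shows up as bright bands in a spectrogram. Matching a phoneme against a template amounts to comparing rows like these.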
Step 2: Markov Models
Now that the computer has generated a list of phonemes, what happens next? These phonemes have to be converted into words, and perhaps the words into sentences. How this occurs can be very complicated indeed, especially for systems designed for speaker-independent, continuous dictation.
However, the most common method is to use a Hidden Markov Model (HMM). The theory behind HMMs is complicated, but a brief look at simple Markov Models will help you gain an understanding of how they work. Basically, think of a Markov Model (in a speech recognition context) as a chain of phonemes that represents a word. The chain can branch, and if it does, each branch is weighted by its probability. For example:
Recognize speech
Wreck a nice beach
These two phrases sound surprisingly similar, yet have wildly different meanings. A program using a Markov Model at the sentence level might be able to ascertain which of these two phrases the speaker actually said through statistical analysis of the phrase that preceded it. For more information on Markov Models, see the Generation5 introductory essay.
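The chain idea can be sketched in a few lines of code. This is a simple (not hidden) Markov Model under stated assumptions: every probability below is invented for illustration rather than learned from data, and the names `chain_log_prob`, `EITHER`, `WORDS`, and `most_likely` are hypothetical. The first table is a word-level chain that branches between two pronunciations of "either"; the second is a sentence-level chain that uses the preceding word to choose between the two phrases.

```python
import math


def chain_log_prob(transitions, sequence, unseen=1e-6):
    """Log-probability of walking the chain through `sequence`.

    Transitions missing from the table get the small `unseen` floor
    instead of zero, so math.log never receives 0.
    """
    return sum(math.log(transitions.get(pair, unseen))
               for pair in zip(sequence, sequence[1:]))


# Word level: a branching chain for "either", which may start with IY or AY.
# The 0.6 / 0.4 split is an invented example value.
EITHER = {
    ("<start>", "IY"): 0.6,
    ("<start>", "AY"): 0.4,
    ("IY", "DH"): 1.0,
    ("AY", "DH"): 1.0,
    ("DH", "ER"): 1.0,
}
p_iy = math.exp(chain_log_prob(EITHER, ["<start>", "IY", "DH", "ER"]))
p_ay = math.exp(chain_log_prob(EITHER, ["<start>", "AY", "DH", "ER"]))

# Sentence level: P(next word | current word), again with invented values.
WORDS = {
    ("to", "recognize"): 0.020,
    ("recognize", "speech"): 0.050,
    ("to", "wreck"): 0.002,
    ("wreck", "a"): 0.100,
    ("a", "nice"): 0.010,
    ("nice", "beach"): 0.005,
}


def most_likely(transitions, context, candidates):
    """Pick the candidate phrase that best continues the preceding context."""
    return max(candidates,
               key=lambda c: chain_log_prob(transitions, context + c))


context = ["to"]  # the word spoken just before the ambiguous phrase
candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
best = most_likely(WORDS, context, candidates)
print(" ".join(best))
```

Log-probabilities are summed rather than multiplying raw probabilities so that long chains do not underflow. Note that a longer candidate accumulates more factors and so tends to score lower; a practical recognizer would also weigh in how well each candidate matches the audio.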
Conclusion
This essay hopefully gave you a decent overview of how speech recognition works. The stress is on the word overview - speech technologies are quickly moving forward, and the algorithms and methods described in this essay are being greatly optimized and improved.
With the advent of intelligent, filtering microphones and near-perfect speech recognition, we will hopefully see a new era of human-computer interaction evolve.
Submitted: 23/10/2002 Article content copyright © James Matthews, 2002.
All content copyright © 1998-2007, Generation5 unless otherwise noted.