Homemade Speech Recognition with .NETcf
By Casey Chesnut
this article will explain the steps i took to do speech recognition using the Compact Framework. this is entirely from scratch ... it is not using any 3rd party speech recognition APIs. i.e. it is not using VoiceCommand, SapiLite, Microsoft Speech Server, Microsoft Speech SDK, SALT, my /freeSpeech speech recognition web service, etc... all it does is use the WavIn API on the device to get a WAV file, and then it does all the processing and recognition from there on. the speech recognition is speaker-dependent, requires training, and can be used for command-and-control scenarios. it happens entirely on the device and does not offload processing to a server.
wrote this because i am comfortable developing speech applications, but did not really know what was happening behind the scenes. e.g. i knew how to use SAPI as a tool, but i only had a high level idea of how it actually worked. so this is called /noReco for multiple reasons. first, it is not using any 3rd party speech recognition lib as explained above. second, the first program i wrote as a proof of concept was able to recognize if i said the word 'yes' or 'no'. thus 'no' Reco.
finally, i wrote this article to break out of the web service typecast. don't expect any WS articles from me anytime soon. also, to remind you that i'm a speech guy, and as a stepping stone to becoming an AI guy.
the first step was to get my spoken voice into digitized form. this is done through the WavIn API of WindowsMobile. it is not exposed to managed code, but Seth Demsey had already wrapped it for me. OpenNetCF also has a wrapper. with this hooked, i can use the API to record my voice to a Stream from the application. the stream ends up being in the WAV format, which is a waveform digital representation of the word spoken.
the next step was to read the WAV stream. this had mostly been done for me already too: WaveControl. had to extend the code a bit, but it was mostly there. also, made sure that it would work in a CF class lib. in general, a WAV has some header info, followed by an array specifying the amplitude of sound over time. since CF is missing some drawing capabilities, not to mention the PPC screen is small, i ended up testing out the library on the full framework.
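to make the 'header info, then an array of amplitudes' layout concrete, here is a minimal sketch in Python (not the C# used for the project, and not the WaveControl code). it assumes a 16-bit mono PCM WAV, and the names read_wav_samples and make_test_wav are made up for illustration.

```python
import io
import struct
import wave

def read_wav_samples(stream):
    """Return (sample_rate, samples) for a 16-bit mono PCM WAV stream."""
    with wave.open(stream, "rb") as wav:
        # assumption: this sketch only handles the simplest WAV flavor
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()            # comes from the header
        raw = wav.readframes(wav.getnframes())
    # after the header, each sample is a little-endian signed 16-bit amplitude
    samples = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
    return rate, samples

def make_test_wav(samples, rate=8000):
    """Build an in-memory WAV so the reader can be tried without a microphone."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    buf.seek(0)
    return buf
```

calling read_wav_samples(make_test_wav([0, 1000, -1000])) hands back the sample rate and the same three amplitudes, which is all the later steps need.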
the pic above is rendering the WAV info. the red lines show some minimal processing. the leftmost red line shows where recording begins. from the left to the center line is background noise before i spoke. between the center red line and the rightmost red line is where i am actually speaking ... so i throw the rest of the data out except for that crucial portion.
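the article does not say how the start and end of speech were found, so the following is only one plausible way to do that trimming, sketched in Python. it assumes a simple loudness threshold; trim_to_speech, window and threshold_ratio are hypothetical names and values, not anything from the original code.

```python
def trim_to_speech(samples, window=80, threshold_ratio=0.1):
    """Keep only the span whose windows are louder than a fraction of the peak."""
    if not samples:
        return []
    # mean absolute amplitude for each fixed-size window
    levels = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        levels.append(sum(abs(s) for s in chunk) / len(chunk))
    # assumption: anything quieter than 10% of the loudest window is background
    threshold = max(levels) * threshold_ratio
    loud = [i for i, level in enumerate(levels) if level > threshold]
    if not loud:
        return []
    first, last = loud[0], loud[-1]
    # everything before the first loud window and after the last one is dropped
    return samples[first * window:(last + 1) * window]
```

a fixed ratio like this is fragile in a noisy room; it is here only to show the shape of the step.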
with the WAV info, i attempted to do speech recognition straight off of that. tried all sorts of processing of the data to find some pattern. the results were that if i said the same word multiple times then the data was significantly different OR if i said entirely different words then ALL of the derived data was exactly the same. ends up that the WAV data in its raw form is not adequate to differentiate between spoken words.
now saying that ... i did manage to get it to work for the simplest case: determining if i said 'yes' or 'no'. the video below shows that in action.
a little searching and i was able to determine that the waveform needs to be transformed into a form that is suitable to determine speech. the common practice is to use a Fourier Transform and create a spectrogram. this is where i admit to getting a C in college-level calculus 2 :( hate math! luckily for me, there was already a C# lib, called Exocortex, which can do fourier transforms. played with that for a while, but ended up finding VB6 code that explicitly created a spectrogram from a WAV. ended up porting that over to C#. in short, the fourier transform shows what frequencies occur at a certain point in time. by viewing the graphical representation of the same word, the graphic ends up looking similar. and if you view the graphic of different words, then they are adequately different. this gave me hope that i could use the transformed data to determine what word was spoken.
by feeding the fourier transform an array of data from the WAV, i would get back an array of transformed data. while the waveform was a linear pattern, the result of concatenating all the fourier transforms was a 2D matrix. each point in the matrix represents the degree of a frequency occurring in the sound at that time. to create the bitmap to view the spectrogram, you just have to normalize that data and plot it. all i did was convert the numerical frequency level into a System.Drawing.Color and then do SetPixel on a Bitmap of that size. it was helpful to have the VB6 program so that i could compare results. the pic below shows the spectrogram graphic below the corresponding waveform. notice how the color blue means that a frequency is not occurring, while red and green show that the frequency level is occurring at that time. the kicker is that if a different WAV was loaded, then its spectrogram would be radically different. and if i spoke that word again, then its spectrogram would look surprisingly similar.
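a rough sketch of that pipeline in Python, under some loud assumptions: it uses a naive DFT instead of Exocortex or the ported VB6 code, non-overlapping frames of 64 samples, and a made-up blue-to-green-to-red color ramp. dft_magnitudes, spectrogram and to_color are illustrative names only.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude of each frequency bin below Nyquist (naive O(n^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

def spectrogram(samples, frame_size=64):
    """One row of bin magnitudes per non-overlapping frame, scaled to 0..1."""
    rows = [dft_magnitudes(samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]
    peak = max((m for row in rows for m in row), default=0.0)
    if peak == 0:
        return rows
    # normalize so the strongest frequency anywhere in the sound is 1.0
    return [[m / peak for m in row] for row in rows]

def to_color(level):
    """Map 0..1 to an (r, g, b) tuple: blue = absent, green = mid, red = strong."""
    level = min(max(level, 0.0), 1.0)
    if level < 0.5:
        g = int(510 * level)
        return (0, g, 255 - g)
    r = int(510 * (level - 0.5))
    return (r, 255 - r, 0)
```

each row of the result is one slice of time and each column is one frequency bin, so plotting to_color(spec[t][f]) at pixel (t, f) gives the kind of picture described above. a real implementation would use an FFT, since the naive transform gets slow quickly.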
at this point, we can suck in the audio and then transform it into its frequency representation. now i needed to generate what i call a 'voice print' for that particular sound. in this case i did matrix sums of the spectrogram. broke the spectrogram up into row and column quadrants. then i would sum all the frequencies in that quadrant to come up with an average. for the entire spectrogram, this would result in an array of averaged frequency values. also used this for training. so that if i spoke the same word 3 times, then it would gen 3 different prints, and then generate a master 'template' averaging all those results. that template would then be used later to compare against.
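one way to read the 'quadrant averages' idea, sketched in Python. the grid size (4 by 4) is an assumption, as are the names voice_print and make_template; the article does not give the real numbers. splitting the spectrogram proportionally is what keeps every print the same length, no matter how long the word took to say.

```python
def voice_print(spec, time_blocks=4, freq_blocks=4):
    """Average the spectrogram inside each cell of a time x frequency grid."""
    n_rows, n_cols = len(spec), len(spec[0])
    if n_rows < time_blocks or n_cols < freq_blocks:
        raise ValueError("spectrogram is smaller than the grid")
    values = []
    for tb in range(time_blocks):
        r0 = tb * n_rows // time_blocks
        r1 = (tb + 1) * n_rows // time_blocks
        for fb in range(freq_blocks):
            c0 = fb * n_cols // freq_blocks
            c1 = (fb + 1) * n_cols // freq_blocks
            # sum everything in this cell, then divide to get its average
            cell = [spec[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            values.append(sum(cell) / len(cell))
    return values

def make_template(prints):
    """Average several prints of the same word, element by element."""
    return [sum(column) / len(column) for column in zip(*prints)]
```

training on three utterances would then be make_template([print1, print2, print3]), with each print coming from voice_print.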
an application scenario would be somebody using a pocket pc phone edition (or smartphone) with speech reco to call somebody. first, they would train it by selecting somebody from their contacts. next, they would speak that person's name a couple times, and the app would come up with the averaged voice template for that contact.
now that i could store templates of spoken words, i had to come up with some way to compare those templates to incoming ones. since my voice prints were all of identical length, all i did was step through the array and compute the difference between the incoming value and the expected value. then, the stored template that had the lowest total difference would be matched as the word that was spoken. this was the simplest alg i could think of, although not the most robust. it actually works great in the speech reco phone dialing app because people's names are of significant spoken length to make the templates vary. it also worked well for differentiating between digits. it did not do so great with recognizing individual letters of the alphabet, but i can certainly think of more appropriate algs for matching templates with greater precision.
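that matching step is small enough to sketch in a few lines of Python. the assumption here is that 'difference' means the absolute difference summed over the whole print; distance and recognize are hypothetical names, and the templates are made-up numbers.

```python
def distance(print_a, print_b):
    """Sum of absolute element-wise differences between two prints."""
    if len(print_a) != len(print_b):
        raise ValueError("prints must be the same length")
    return sum(abs(a - b) for a, b in zip(print_a, print_b))

def recognize(incoming, templates):
    """Return (word, distance) for the stored template closest to incoming."""
    best_word = min(templates,
                    key=lambda word: distance(incoming, templates[word]))
    return best_word, distance(incoming, templates[best_word])

# made-up templates, just to show the call
templates = {"yes": [0.9, 0.1, 0.5], "no": [0.1, 0.8, 0.2]}
word, score = recognize([0.8, 0.2, 0.5], templates)
```

note that this always returns some word, even when nothing stored is a good match, which is part of why it is not the most robust approach.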
i actually think this would be a great point to introduce a neural network. you could feed it the templates for all the words it has been trained to recognize, and then it would weight its net accordingly. i'm going to attempt to implement a simple neural network in a stand-alone app, and then try to substitute it in here instead of my ad hoc logic
here are some videos of it working. the 1st video shows creating some templates and matching numbers. the 2nd video has about 15 voice prints of names, and it just shows it matching against them.
did not implement this, but it would be trivial to turn this into a voice biometric. instead of recognizing speech by finding the closest match, it could make sure that what you just spoke matches within a set range to what was spoken previously. in this way it does speaker verification instead of speech recognition. if somebody else said your password, then their voice would be different enough from yours to not match, and access would be denied. a side benefit, this would probably keep you from using your PocketPC after you've been drinking ... probably a good thing. tie this with /bioSign for signature recognition and you have 2 behavioral biometrics that an attacker would be hard pressed to bypass. not to mention the encryption you can do with /spCrypt. h3ll, when did i become a security guy?
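since this part was never built, the following Python is purely a sketch of the 'within a set range' idea. verify_speaker and max_distance are hypothetical, and a usable threshold would have to be tuned per speaker.

```python
def verify_speaker(incoming, enrolled_template, max_distance):
    """Accept only if the spoken print is within max_distance of the enrolled one."""
    if len(incoming) != len(enrolled_template):
        raise ValueError("prints must be the same length")
    total = sum(abs(a - b) for a, b in zip(incoming, enrolled_template))
    # recognition picks the closest template; verification demands it be close enough
    return total <= max_distance
```

the only change from recognition is the decision rule: a fixed cutoff against one enrolled template instead of 'lowest difference wins'.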
since this project was mainly a learning exercise for me, i'll go ahead and explain what else is out there
from the above you can see that MS has a number of different speech projects going on. the problem is that we don't have access to these bits on the device, where speech would be quite useful. this is just like my arguments over the last years that MS has a bunch of web service stuff that devices need, but that MS does not support for devices. so just read all my past arguments and replace 'web service' with 'speech'. regardless, we need both of them for their 'seamless computing' initiative.
this worked better than i expected. with a little effort i was able to recreate the speech recognition capabilities that are used by many phones to voice dial today. in general, these devices don't go much (if any) further than what this does ... which is pretty lame. if i could do this in a couple weeks (1 week reading, 1 week coding), i would expect more from the big boys. overall, my goal was met for learning more about speech recognition.
did not find any computer books that could help me attempt this. instead, i had to turn to math-oriented books:
sorry, no source code. explained the steps i took above, so you should be able to recreate them. i've decided not to give out source code for the entirety of this year. have been handing out chunks of code for the last 3-some years. the end result is that i've not been happy with the career results. now i am going to try a different model and see if anything different happens. basically i'm going to keep doing cool projects and writing about them, but it will take more effort on your part to keep up. after this year, i'll reevaluate and decide at that time.
will probably revisit the AI portion of this once i get some more of that under my belt. otherwise i have no plans to extend this and would rely on 3rd party products.
still struggling to learn AI. actually have a bunch of ideas of stuff to code right now or in the near future. maybe a spin-off article (in)directly related to this one. later
Article content copyright © Casey Chesnut, 2004.
All content copyright © 1998-2007, Generation5 unless otherwise noted.