SAPI 5.0 Tutorial I: An Introduction to SAPI

SAPI 5.0 is the latest incarnation of Microsoft's Speech API - allowing programmers to add speech recognition and synthesis to their programs. It is based on a COM interface, making it usable in a variety of COM-supporting languages (VC++, VB, Java, Delphi etc). You may download SAPI or order a CD through Microsoft at Microsoft's Speech site. Note that you will also need the latest Platform SDK.

This tutorial will look at setting up Microsoft Visual C++ to allow programming with the SAPI SDK, as well as an introduction to the basics of SAPI and the grammar. The program we will create is a simple dialog-based application (MFC) that can draw 3 different shapes in 7 different colours, as well as a few additional commands. You might want to download the code now:

Setting up MSVC++

Once you have installed SAPI 5.0, you should configure Visual C++ to automatically include the SAPI directories for its header files, libraries and tools. To do this, start up Visual C++ and select "Tools, Options, Directories". Make sure the "Include files" directory setup is showing. Now, add a new directory and point it to the SAPI 5.0 include directory. My directory setup looked like this:

Now, do the same thing for the "Libraries" ('lib\i386') and "Executables" ('bin').

Setting up the Help File

This is not essential, but I find it useful - adding the SAPI 5.0 Help to the help menu. Go to "Tools, Customize" and add a new Tool. Type "SAPI 5.0 Help" as the name, and "hh" (HTMLHelp) as the program. Enter 'C:\Program Files\Microsoft Speech SDK5.0\Docs\Help\sapi5sdk.chm' as the argument (or whatever your directory is). Now, you have a tool to bring up the help file. You can either leave it where it is, add it as a toolbar button, or add it to the Help menu. Now, to create our program. This is a hack, so if people have found a nicer way, please tell me.

Adding SAPI Support

Create a dialog-based application using the MFC AppWizard. Before anything, open up your main application class. Now at the top of InitInstance add:
	AfxOleInit();
This allows the dialog to use COM objects. Now, in your dialog class definition add the following include:
#include <sphelper.h>
Now, add the following three variables to the protected section of your dialog class:
	CComPtr<ISpRecognizer>  g_cpEngine;
	CComPtr<ISpRecoContext> g_cpRecoCtxt;
	CComPtr<ISpRecoGrammar> g_cpCmdGrammar;
These are the engine, recognition context and grammar variables respectively. Confused yet? Not to worry, I am just quickly presenting all the "necessities", and once we look at how to initialize SAPI, we will take a decent look at the individual components.

We will be moving on to the dialog implementation file to add our event handler, but first we'll add our handler declaration to our class header. When SAPI handles speech recognition, it will send a message (normally called WM_RECOEVENT) to a window, just like other Windows controls do. As always for messages that ClassWizard does not support, we must declare our event handler outside of the //{{AFX_MSG ... //}}AFX_MSG block:

	//{{AFX_MSG(CSapiTutorial0Dlg)
	virtual BOOL OnInitDialog();
	afx_msg void OnSysCommand(UINT nID, LPARAM lParam);
	afx_msg void OnPaint();
	afx_msg HCURSOR OnQueryDragIcon();
	afx_msg void OnDestroy();
	//}}AFX_MSG
	afx_msg LRESULT OnRecoEvent(WPARAM, LPARAM);	// SAPI handler here
	DECLARE_MESSAGE_MAP()

Now, open up the dialog implementation file. We need to add a little stuff at the top of the file, just above the CAboutDlg declaration:

#include "sapi0.h"

#define WM_RECOEVENT	(WM_USER+1)
The first include is a local include that doesn't exist yet, but it will do soon! It is automatically generated by the grammar compiler (gc.exe) and contains the IDs we use when handling events. The next line defines WM_RECOEVENT to make our later code more readable; the parentheses keep the expression intact wherever the macro is expanded.

Finally, we need to set up the message handler for our WM_RECOEVENT message, so outside of the AFX_MSG_MAP but before the END_MESSAGE_MAP() macro add:

	ON_MESSAGE(WM_RECOEVENT, OnRecoEvent)
OnRecoEvent is the handler for WM_RECOEVENT and will receive all the messages. Now, get comfortable because we are about to look at SAPI-specific code.

Initializing SAPI

Add a function to the dialog class called InitializeSapi that returns a boolean value. Now the fun stuff:
bool CSapiTutorial0Dlg::InitializeSapi() {
	if (FAILED(CoInitialize(NULL))) {
		AfxMessageBox("Error starting COM");
		return false;
	}
Firstly, we must initialize COM before anything else. If there is a problem, we will return a false value allowing the calling function to handle errors.
	HRESULT hRes = g_cpEngine.CoCreateInstance(CLSID_SpSharedRecognizer);
        
	if (FAILED(hRes)) {
		AfxMessageBox("Error starting SAPI");
		return false;
	}
We now create the engine, using a shared recognizer. This basically means other applications will be able to use the recognizer simultaneously - this is the optimal setting for most applications.
	hRes = g_cpEngine->CreateRecoContext(&g_cpRecoCtxt);

	if (FAILED(hRes)) {
		AfxMessageBox("Error creating context");
		return false;
	}
A recognition context is a 'view' that the speech recognition engine (SRE) will use. An application can have multiple recognition contexts to handle different situations. For example, if an application has a multi-document interface, then each MDI frame could use its own context. An application must have at least one context to utilize the SRE.
	hRes = g_cpRecoCtxt->SetNotifyWindowMessage(m_hWnd, WM_RECOEVENT, 0, 0);
	
	if (FAILED(hRes)) {
		AfxMessageBox("Error creating notification window");
		return false;
	}
We are setting which window will receive the necessary notifications from the SRE (in this case, the main dialog window). Note that we defined WM_RECOEVENT earlier.
	hRes = g_cpRecoCtxt->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));

	if (FAILED(hRes)) {
		AfxMessageBox("Error creating interest...seriously");
		return false;
	}
Here we are setting the "interest" of the application. This sounds strange, but all this means is that we are not interested in the many messages SAPI might send to the application, just recognition events.
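To make the idea of an "interest" a little more concrete, here is a minimal, portable sketch of a bitmask filter. It is not SAPI code: DemoEventId, DemoFlag and IsInterested are hypothetical names, and the numeric values are made up for illustration (the real SPEI_* constants and the SPFEI macro come from the SAPI headers). The only assumption is that each event ID selects one bit of a 64-bit mask.

```cpp
#include <cstdint>

// Hypothetical event IDs for illustration only; these are NOT the real
// SPEI_* values from the SAPI headers.
enum DemoEventId : unsigned {
    DEMO_EVENT_SOUND_START = 3,
    DEMO_EVENT_RECOGNITION = 5,
    DEMO_EVENT_HYPOTHESIS  = 6
};

// Turn an event ID into a single-bit flag inside a 64-bit mask.
constexpr std::uint64_t DemoFlag(DemoEventId id) {
    return std::uint64_t{1} << static_cast<unsigned>(id);
}

// True if the mask says the application wants to hear about this event.
constexpr bool IsInterested(std::uint64_t mask, DemoEventId id) {
    return (mask & DemoFlag(id)) != 0;
}
```

With a mask built from DemoFlag(DEMO_EVENT_RECOGNITION) alone, IsInterested is true for the recognition event and false for everything else, which is the same filtering effect we asked SAPI for above.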
	hRes = g_cpRecoCtxt->CreateGrammar(0, &g_cpCmdGrammar);
	if (FAILED(hRes)) {
		AfxMessageBox("Error creating grammar");
		return false;
	}

	hRes = g_cpCmdGrammar->LoadCmdFromResource(
		NULL,
		MAKEINTRESOURCEW(IDR_SAPI0),
		L"SRGRAMMAR",
		MAKELANGID( LANG_NEUTRAL, SUBLANG_NEUTRAL), SPLO_DYNAMIC);
	
	if (FAILED(hRes)) {
		AfxMessageBox("Error creating grammar rules");
		return false;
	}
We now have an engine and a context that the engine will work in, but we don't have any grammar rules for the engine to work with. Therefore, we create a grammar object for the recognition context. The grammar doesn't have any rules yet, so we need to load them from the resource section of the application. As of yet, we have not added any sort of grammar - but when we do, it will be called "IDR_SAPI0" and have a custom type 'SRGRAMMAR'. Adding grammar will be our next step after finishing this...
	hRes = g_cpCmdGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE );
	if (FAILED(hRes)) {
		AfxMessageBox("Error setting rule state");
		return false;
	}
This effectively turns on all the rules that have been added.
	return true;
}
Summary: Still confused? A brief re-cap might help - we have looked at initializing SAPI. SAPI requires three things (in this case) - an instance of its engine, a recognition context to work within, and some grammar rules. Grammar rules help the SAPI engine recognize what is being said, since it knows in advance what to expect.
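InitializeSapi repeats the same pattern many times: call something, check the result, report an error and bail out. If you find that repetitive, one possible way to express "run these steps in order and stop at the first failure" is sketched below. This is a portable illustration only, not part of the example program: DemoResult stands in for HRESULT (where a negative value means failure), and InitStep and RunInitSteps are hypothetical names.

```cpp
#include <functional>
#include <string>
#include <vector>

using DemoResult = long;               // stand-in for HRESULT
constexpr DemoResult DEMO_OK   = 0;    // stand-in for S_OK
constexpr DemoResult DEMO_FAIL = -1;   // any negative value counts as failure

// Same idea as the FAILED() macro: negative results are failures.
inline bool DemoFailed(DemoResult r) { return r < 0; }

// One initialization step plus the message to show if it fails.
struct InitStep {
    std::function<DemoResult()> run;
    std::string errorText;
};

// Runs the steps in order and stops at the first failure.
// Returns that step's error text, or an empty string if every step succeeded.
std::string RunInitSteps(const std::vector<InitStep>& steps) {
    for (const InitStep& step : steps) {
        if (DemoFailed(step.run())) {
            return step.errorText;
        }
    }
    return std::string();
}
```

In a real dialog each step would wrap one of the SAPI calls shown above, and a non-empty return value would be passed to AfxMessageBox before returning false.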

Adding Grammar

Ingeniously, the grammar rules used by SAPI are defined using XML (eXtensible Markup Language), making it easy for anyone with any HTML (or other markup) experience to write grammar rules. I show a shortened version of the grammar used in our program to aid readability:
<GRAMMAR LANGID="409">
    <DEFINE>
        <ID NAME="VID_Red" VAL="1"/>
        <!-- REMOVED -->
        <ID NAME="VID_Commands" VAL="257"/>
    </DEFINE>

    <RULE ID="VID_MainDraw" TOPLEVEL="ACTIVE">
        <O>Please</O>
        <P>draw</P>
        <O>a</O>
        <P>
            <RULEREF REFID="VID_ColourType" />
            <RULEREF REFID="VID_DrawType" />
        </P>
    </RULE>

    <!-- REMOVED -->

    <RULE ID="VID_ColourType">
        <L PROPID="VID_ColourType">
            <P VAL="VID_Red">red</P> 
            <P VAL="VID_Green">green</P> 
            <!-- REMOVED -->
        </L>
    </RULE>

    <RULE ID="VID_DrawType" >
        <L PROPID="VID_DrawType">
            <P VAL="VID_Square">square</P>
            <P VAL="VID_Circle">circle</P>
            <P VAL="VID_Triangle">triangle</P>
        </L>
    </RULE>
	
    <!-- REMOVED -->

</GRAMMAR>
Ok, as you can see the entire grammar is surrounded by "GRAMMAR" tags. Next comes the "DEFINE" section that assigns values (the VAL attributes) to the various IDs that you will use throughout the grammar. Next are the rules. The first rule is the VID_MainDraw rule, which has TOPLEVEL set to "ACTIVE", meaning this is something the SRE should expect to hear. MainDraw consists of:

Where any word in parentheses is an optional word, and any set of words separated by a pipe ('|') is a list of words that can be said in that position. The ellipsis simply means that there are in fact 7 different colours, but for brevity I listed the first three.

Now, from that diagram, it is easy to see that the <O> tag denotes an optional word, whereas <P> denotes a phrase. We can also see from the grammar rules that we can embed rules within rules to allow reuse and code readability. Let us look at VID_ColourType. We can see that the rule simply consists of a list of possible colours along with their associated values. When the SRE recognizes one of these colours in the given grammar we will be able to tell which colour it was by using the VAL attribute.
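If it helps to think of the rule as code, here is a rough, portable sketch of what VID_MainDraw accepts: an optional "please", the word "draw", an optional "a", one colour from a list and one shape from a list. This is only my illustration of the rule's structure, not how the SRE works internally. MatchMainDraw and DrawMatch are hypothetical names, only the colours shown in the shortened grammar are listed, and the input is assumed to be lower-case words separated by spaces.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct DrawMatch {
    bool matched = false;
    std::string colour;
    std::string shape;
};

// Checks an utterance against: (please) draw (a) <colour> <shape>
DrawMatch MatchMainDraw(const std::string& utterance) {
    // Split the utterance into words.
    std::istringstream in(utterance);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);

    // Only the list entries visible in the shortened grammar above.
    static const std::vector<std::string> colours = {"red", "green"};
    static const std::vector<std::string> shapes  = {"square", "circle", "triangle"};

    std::size_t i = 0;
    auto peekIs = [&](const std::string& s) {
        return i < words.size() && words[i] == s;
    };
    // Like an <L> list: consume the current word if it is one of the entries.
    auto oneOf = [&](const std::vector<std::string>& list, std::string& out) {
        if (i >= words.size()) return false;
        for (const std::string& entry : list) {
            if (words[i] == entry) { out = entry; ++i; return true; }
        }
        return false;
    };

    DrawMatch result;
    if (peekIs("please")) ++i;                 // <O>Please</O> is optional
    if (!peekIs("draw")) return result;        // <P>draw</P> is required
    ++i;
    if (peekIs("a")) ++i;                      // <O>a</O> is optional
    if (!oneOf(colours, result.colour)) return result;
    if (!oneOf(shapes, result.shape)) return result;
    result.matched = (i == words.size());      // nothing may be left over
    return result;
}
```

So "please draw a red square" and "draw green circle" both match, while "draw red" does not because the shape is missing.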

Compiling XML

The XML has to be compiled into a .cfg file, and we must generate the necessary IDs for use within our C++ program. We must set up Microsoft Visual Studio to compile the XML using the grammar compiler. Obviously, our example program has the project file previously set up, but for future reference here is how you do it.

Add your XML file to your project, then select "Project, Settings" and select your XML file. Now type "gc /h [XML filename].h $(InputName)" in the Commands section and "$(ProjDir)\sapi0.cfg $(ProjDir)\sapi0.h" in the Outputs box (the header name must match the "sapi0.h" we included earlier). Your build should look something like this:

Select OK and try to compile the XML file. If it doesn't work, check the Project Settings again and make sure you altered the executable directory (see Setting up MSVC++).

Fitting it Together

Now that you have your compiled grammar file, you must add it to your project. Therefore, go to ResourceView and select "Import". Open "sapi0.cfg", making sure it is opening the file as a custom type (Open As). When it prompts you for a custom type, enter "SRGRAMMAR" (including quotation marks). Now, rename your new resource "IDR_SAPI0" so that our call to LoadCmdFromResource will work.

We now have an engine, a recognition context, a grammar instance and a compiled grammar file! All we need now is a way to handle the speech recognition messages. We added the handler declaration at the beginning of the tutorial, now we have to add the necessary code:

LRESULT CSapiTutorial0Dlg::OnRecoEvent(WPARAM wParam, LPARAM lParam) {
    // Event helper class
    CSpEvent event;  
	
    // Loop processing events while there are any in the queue
    while (event.GetFrom(g_cpRecoCtxt) == S_OK)
    {
        // Look at recognition event only
        switch (event.eEventId)
        {
            case SPEI_RECOGNITION:
            ExecuteCommand(event.RecoResult());
            break;
        }
    }

    return 0;
}
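The important detail in the handler above is the while loop: as far as I can tell, one window message does not have to mean exactly one event, so the handler keeps pulling events until the queue is empty. The portable sketch below shows the same "drain the queue, act only on recognitions" pattern with stand-in types. DemoEvent, DemoEventKind and DrainRecognitions are hypothetical names and a std::queue plays the part of the recognition context.

```cpp
#include <queue>
#include <vector>

// Hypothetical event kinds; the real handler switches on SPEI_* values.
enum DemoEventKind { DEMO_SOUND_START, DEMO_RECOGNITION };

struct DemoEvent {
    DemoEventKind kind;
    int resultId;   // stand-in for the recognition result
};

// Drains the queue completely and returns the result IDs of the
// recognition events, in the order they were queued.
std::vector<int> DrainRecognitions(std::queue<DemoEvent>& pending) {
    std::vector<int> handled;
    while (!pending.empty()) {          // keep going until nothing is left
        const DemoEvent ev = pending.front();
        pending.pop();
        switch (ev.kind) {
            case DEMO_RECOGNITION:
                handled.push_back(ev.resultId);
                break;
            default:
                break;                  // every other event is ignored
        }
    }
    return handled;
}
```

Replacing the loop with a single "get one event" call would leave later events sitting in the queue until the next notification arrives.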
What we do is use a helper class (which handles any repetitive tasks we might have to perform) to get the type of event being sent to the application. If it is a recognition message, then we call ExecuteCommand and pass the recognition result across. Our ExecuteCommand looks like this (edited):
void CSapiTutorial0Dlg::ExecuteCommand(ISpPhrase *pPhrase) {
    SPPHRASE *pElements;

    UINT uType = 0;
    COLORREF crShape = RGB(0,0,0);

    // Get the phrase elements, one of which is the rule id we specified in
    // the grammar.  Switch on it to figure out which command was recognized.
    if (SUCCEEDED(pPhrase->GetPhrase(&pElements)))
    {        
        switch ( pElements->Rule.ulId )
        {
            // Removed...

            case VID_MainDraw:
            {
                const SPPHRASEPROPERTY *pProp = pElements->pProperties;

                while (pProp) 
                {
                    switch(pProp->vValue.ulVal )
                    {
                        case VID_Square:	uType = VID_Square; break;
                        case VID_Circle:	uType = VID_Circle; break;
                        case VID_Triangle:	uType = VID_Triangle; break;
                        case VID_Red:		crShape = RGB(255,0,0); break;
                        case VID_Green:		crShape = RGB(0,255,0); break;

                        // Removed...
                    }

                    pProp = pProp->pNextSibling;
                }
            } DrawCommand(uType, crShape); break;
        }

        // Free the pElements memory which was allocated for us
        ::CoTaskMemFree(pElements);
    }
}
Did you get that?! Firstly, we get the elements from the phrase passed (look at the SAPI 5.0 help for SPPHRASE), then find out which rule was recognized. If it was VID_MainDraw, we cycle through the various properties (words), each time figuring out which shape or colour was mentioned and setting our variables accordingly. The best way to understand this code is to go through it with the debugger. After figuring out what was said, our call to DrawCommand will draw the shape in the colour asked for. See the code for details.
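If you don't have a microphone handy for the debugger, the property walk can also be tried out in isolation. The sketch below mimics the pNextSibling chain with a stand-in struct. DemoProperty, DrawRequest and CollectDrawRequest are hypothetical names, the ID values other than 1 for red are invented, and the colour is stored as a plain 0xRRGGBB number rather than a real COLORREF.

```cpp
// Illustrative IDs. VID_Red is 1 in the grammar above; the rest are made up.
constexpr unsigned long DEMO_RED    = 1;
constexpr unsigned long DEMO_GREEN  = 2;
constexpr unsigned long DEMO_SQUARE = 10;
constexpr unsigned long DEMO_CIRCLE = 11;

// Stand-in for SPPHRASEPROPERTY: a value plus a pointer to the next sibling.
struct DemoProperty {
    unsigned long value;
    const DemoProperty* pNextSibling;
};

struct DrawRequest {
    unsigned long shape = 0;       // 0 means "no shape recognized"
    unsigned long colourRgb = 0;   // 0xRRGGBB, defaults to black
};

// Walks the sibling chain once and records the shape and colour it finds.
DrawRequest CollectDrawRequest(const DemoProperty* first) {
    DrawRequest req;
    for (const DemoProperty* p = first; p != nullptr; p = p->pNextSibling) {
        switch (p->value) {
            case DEMO_SQUARE:
            case DEMO_CIRCLE:
                req.shape = p->value;
                break;
            case DEMO_RED:
                req.colourRgb = 0xFF0000;
                break;
            case DEMO_GREEN:
                req.colourRgb = 0x00FF00;
                break;
            default:
                break;             // unknown properties are skipped
        }
    }
    return req;
}
```

Because shapes and colours use distinct IDs, the order in which the properties appear in the chain does not matter.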

Conclusion

This was a long and complicated tutorial, but SAPI 5.0 is pretty much cutting-edge technology (at the time of this article) and anything cutting-edge is not going to be simple. Nevertheless, once the basics of XML grammar rules and handling SAPI events are understood, it can be easy to add speech recognition to your programs.

Note that the example program also allows you to show the about box, quit and go to Generation5.org, all through the speech interface. Have a good look through the code, since most of the power remains in the XML file, which is easy to understand.

Submitted: 17/01/2001

Article content copyright © James Matthews, 2001.
All content copyright © 1998-2007, Generation5 unless otherwise noted.