At the forefront of Artificial Intelligence
  Home Articles Reviews Interviews JDK Glossary Features Discussion Search

How Does Spell Checking Work?

This article is meant as an overview of basic spell checking, one of the more basic natural language processing operations. This article will not cover Bayes Theorem or conditional probability, but there are plenty of online resources to help.

Noisy Channel Model

Spell checking can be thought of in terms of the noisy channel model (NCM). The NCM is essentially a mathematical idealization of a communication system. The model can be described in terms of input, output, noise, target classes and the estimated input.

Noisy Channel Model
Input I Word sequences
Output O Word sequences (with mistakes)
Noise Spelling Errors
Target Classes All possible English words
Estimated Input Î Corrected words

You can therefore formally define the goal of a spell checker to create a list of the most probable correct words given a certain input, or:

î = argmax(p(i|o))

Or using Bayes:

î = argmax(p(o|i)p(i))

Where p(i) is the prior probability and p(o|i) is the likelihood. In our spelling example, this means p(i) is the probability of a word sequence, whereas p(o|i) is the model of spelling errors. To clarify, rewrite the formula as:

ĉ = argmax(p(t|c)p(c))

Which roughly reads as "the best correction is the correct word that occurs most often with this typo.

What are Spelling Mistakes?

How does this all relate to spell checking though?

Firstly, let us formalize more precisely what a spelling error can be. Spelling errors principally fall into two categories: non-word errors and real-word errors. As implied, non-word errors are spelling errors that yield nonsensical words ("yaht" instead of "yacht"), while real-word errors yield other valid words ("stationary" vs. "stationery").

Spelling errors can also be split into two categories: typographical errors and cognitive errors. Typographical errors stem from mistyping ("hte" vs. "the") whereas cognitive errors are actual mistakes. It is also relevant at this point to note that spelling errors can be single- or many-error misspellings. For example, "mispeling" contains two errors. For the purpose of this article though, we will focus on single-error mistakes.

Spell Checking

Most spelling errors (in human-typed text) are single-error misspellings, and can be be classified as one of the following:

  • Insertion (x becomes xy)
  • Deletion (xy becomes x)
  • Substitution (x becomes y)
  • Transposition (xy becomes yx)

To detect spelling mistakes, the checker will examine every word in the text and compare it to a dictionary. The checker may need to perform basic morphological transformations ("foxes" becomes "fox", "swimming" becomes "swim"). Once the system has a list of the non-words in the text, it follows a simple algorithm:

  1. Generate a list of candidate corrections
  2. Rank the spelling variations
  3. Select the highest ranking as the most likely correction

Generating Candidates

Candidates are created by applying the four types of spelling errors, but backwards. For example, take the spelling error "acress". Candidates would be created like this:

Type Candidates
Insertion cress, aress, acess...acres
Deletion aacress, bacress, cacress...acressy, acressz
Substitution bcress, ccress, dcress...acresy, acresz
Transposition caress, rcaess...acrsse, acress

Candidates are then compared to a dictionary, and valid words are compiled into a list:

Correction Correct Error Position Type
actress t - 2 del
cress - a 0 ins
caress ca ac 0 tran
access c r 2 sub
across o e 3 sub
acres - s 4 ins
acres - s 5 ins

Each of these viable corrections is examined using the formula we looked at above (ĉ = argmax(p(t|c)p(c))), but where do all these probabilities come from? Probabilties comes from corpora analysis; massive collections of text are analyzed and statistics are gathered. Therefore, candidates might be evaluated this like:

c p(c) p(t|c) p(t|c)p(c)
actress 0.0000315 0.000117 3.69 x 10-9
acres 0.000065 0.0000342 2.22 x 10-9

You can see how using this system, you can rank the most likely candidates for a given spelling mistake.

Conclusion

This article has presented a simple overview of spelling checking. You can see though that this method is limited to single-error misspellings. More importantly though, this method only considers non-word errors since real-world spelling error requires context-sensitivity.

References

Jurafsky, D., Martin, J. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall. New Jersey: 2000.

Market, Katja. AI31: Natural Language Processing Lecture Notes. University of Leeds, 2003/4.

Submitted: 19/10/2004

Article content copyright © James Matthews, 2004.
 Article Toolbar
Print
BibTeX entry

Search

Latest News
- Generation5 10-year Anniversary (03/09/2008)
- New Generation5 Design! (09/04/2007)
- Happy New Year 2007 (02/01/2007)
- Where has Generation5 Gone?! (04/11/2005)
- NeuroEvolving Robotic Operatives (NERO) (25/06/2005)

What's New?
- Back-propagation using the Generation5 JDK (07/04/2008)
- Hough Transforms (02/01/2008)
- Kohonen-based Image Analysis using the Generation5 JDK (11/12/2007)
- Modelling Bacterium using the JDK (19/03/2007)
- Modelling Bacterium using the JDK (19/03/2007)


All content copyright © 1998-2007, Generation5 unless otherwise noted.
- Privacy Policy - Legal - Terms of Use -