| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
How Does Spell Checking Work?
This article is meant as an overview of basic spell checking, one of the more basic natural language processing operations. This article will not cover Bayes Theorem or conditional probability, but there are plenty of online resources to help. Noisy Channel ModelSpell checking can be thought of in terms of the noisy channel model (NCM). The NCM is essentially a mathematical idealization of a communication system. The model can be described in terms of input, output, noise, target classes and the estimated input.
You can therefore formally define the goal of a spell checker to create a list of the most probable correct words given a certain input, or: î = argmax(p(i|o)) Or using Bayes: î = argmax(p(o|i)p(i)) Where p(i) is the prior probability and p(o|i) is the likelihood. In our spelling example, this means p(i) is the probability of a word sequence, whereas p(o|i) is the model of spelling errors. To clarify, rewrite the formula as: ĉ = argmax(p(t|c)p(c)) Which roughly reads as "the best correction is the correct word that occurs most often with this typo. What are Spelling Mistakes?How does this all relate to spell checking though? Firstly, let us formalize more precisely what a spelling error can be. Spelling errors principally fall into two categories: non-word errors and real-word errors. As implied, non-word errors are spelling errors that yield nonsensical words ("yaht" instead of "yacht"), while real-word errors yield other valid words ("stationary" vs. "stationery").Spelling errors can also be split into two categories: typographical errors and cognitive errors. Typographical errors stem from mistyping ("hte" vs. "the") whereas cognitive errors are actual mistakes. It is also relevant at this point to note that spelling errors can be single- or many-error misspellings. For example, "mispeling" contains two errors. For the purpose of this article though, we will focus on single-error mistakes. Spell CheckingMost spelling errors (in human-typed text) are single-error misspellings, and can be be classified as one of the following:
To detect spelling mistakes, the checker will examine every word in the text and compare it to a dictionary. The checker may need to perform basic morphological transformations ("foxes" becomes "fox", "swimming" becomes "swim"). Once the system has a list of the non-words in the text, it follows a simple algorithm:
Generating CandidatesCandidates are created by applying the four types of spelling errors, but backwards. For example, take the spelling error "acress". Candidates would be created like this:
Candidates are then compared to a dictionary, and valid words are compiled into a list:
Each of these viable corrections is examined using the formula we looked at above (ĉ = argmax(p(t|c)p(c))), but where do all these probabilities come from? Probabilties comes from corpora analysis; massive collections of text are analyzed and statistics are gathered. Therefore, candidates might be evaluated this like:
You can see how using this system, you can rank the most likely candidates for a given spelling mistake. ConclusionThis article has presented a simple overview of spelling checking. You can see though that this method is limited to single-error misspellings. More importantly though, this method only considers non-word errors since real-world spelling error requires context-sensitivity. ReferencesJurafsky, D., Martin, J. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall. New Jersey: 2000. Market, Katja. AI31: Natural Language Processing Lecture Notes. University of Leeds, 2003/4.
Submitted: 19/10/2004 Article content copyright © James Matthews, 2004.
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
All content copyright © 1998-2007, Generation5 unless otherwise noted.
- Privacy Policy - Legal - Terms of Use -