Intelligent spelling correction

Graeme Hirst

Professor of Computational Linguistics

University of Toronto, Department of Computer Science

Research

Real-word spelling errors (or malapropisms) are words in a text that, although they are correctly spelled words found in the dictionary, are not the words that the writer intended. Such errors may be caused by typing mistakes or by the writer's ignorance of the correct spelling of the intended word. Ironically, such errors are also caused by spelling checkers in the correction of non-word spelling errors: the “autocorrect” feature in popular word-processing software will sometimes silently change a non-word to the wrong real word, and sometimes, when correcting a flagged error, the user will inadvertently make the wrong selection from the list of alternatives that the software offers.

The problem that we address in this research is the automatic detection and correction of real-word errors. This entails detecting words that are semantically anomalous in their context and for which a spelling variant of the word would fit much better.

We have looked at both resource-based methods and machine-learning or statistical methods. Examples of resource-based methods are those of Hirst and St-Onge (1998) and Hirst and Budanitsky (2005), who use semantic distance measures in WordNet to detect words that are potentially anomalous in context, that is, semantically distant from nearby words; if a variation in spelling results in a word that is semantically closer to the context, it is hypothesized that the original word is an error and the closer word is its correction. Wilcox-O'Hearn et al. (2008) reconstructed and improved the trigram-based statistical method of Mays et al. (1991), and showed the result to be superior to the resource-based methods.
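The resource-based idea described above can be illustrated with a minimal sketch. It is not the implementation used in the cited papers: the `TOPICS` table is a hypothetical toy stand-in for a WordNet-based semantic distance measure, and the function names and the single-edit definition of a spelling variant are assumptions made for illustration only.

```python
import string

# Hypothetical toy resource: each known word maps to a set of topic labels.
# A real system would use a semantic distance measure over WordNet instead.
TOPICS = {
    "dessert": {"food"},
    "desert": {"landscape"},
    "ate": {"food"},
    "chocolate": {"food"},
    "cake": {"food"},
    "camel": {"landscape"},
    "crossed": {"landscape"},
}


def related(a, b):
    """Toy relatedness test: two words are related if they share a topic."""
    return bool(TOPICS[a] & TOPICS[b])


def spelling_variants(word, vocabulary):
    """Return known words that are one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    candidates = deletes | transposes | substitutes | inserts
    return sorted((candidates & set(vocabulary)) - {word})


def find_corrections(tokens):
    """Flag words unrelated to their context when a spelling variant fits.

    Returns a list of (position, suspect word, suggested variants).
    """
    suspects = []
    for i, word in enumerate(tokens):
        if word not in TOPICS:
            continue  # words unknown to the toy resource are skipped
        context = [w for j, w in enumerate(tokens) if j != i and w in TOPICS]
        if any(related(word, c) for c in context):
            continue  # the word is semantically close to its context
        better = [v for v in spelling_variants(word, TOPICS)
                  if any(related(v, c) for c in context)]
        if better:
            suspects.append((i, word, better))
    return suspects


print(find_corrections("we crossed the dessert on a camel".split()))
print(find_corrections("she ate chocolate cake for dessert".split()))
```

In the first sentence, “dessert” shares no topic with the surrounding words while its variant “desert” does, so it is flagged as a suspect; in the second sentence the same word fits its context and nothing is flagged.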

This research is related to our work on semantic distance and on lexical chains and threads of meaning in documents.

References

Hirst, Graeme and Budanitsky, Alexander. “Correcting real-word spelling errors by restoring lexical cohesion”. Natural Language Engineering, 11(1), March 2005, 87–111.

Hirst, Graeme and St-Onge, David. “Lexical chains as representations of context for the detection and correction of malapropisms”. In: Christiane Fellbaum (editor), WordNet: An Electronic Lexical Database, Cambridge, MA: The MIT Press, 1998, 305–332.

Wilcox-O'Hearn, Amber; Hirst, Graeme; and Budanitsky, Alexander. “Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model”. In: Gelbukh, Alexander (editor), Proceedings, 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008) [Haifa, February 2008], (Lecture Notes in Computer Science 4919), Berlin: Springer-Verlag, 2008, 605–616. Award for best poster.