Professor of Computational Linguistics

University of Toronto, Department of Computer Science

Research

Semantic distance

Many applications in NLP can usefully exploit the idea of determining the semantic distance between the concepts or words in a text or in two or more texts. For example, the words bus and train are semantically close, but canary and mendacity are not. Real-word spelling errors in a text can often be spotted by noticing that a word that is semantically distant from the rest of the text could be a misspelling of one that is semantically closer; a query and a document that answers the query will have an overall semantic closeness even if they use different terms to describe the same concepts. We have developed measures of semantic distance that are based on graph distance in various lexical resources such as WordNet and thesauri.
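As a minimal sketch of the graph-distance idea, the Python fragment below counts links on a shortest path between two nodes. The graph is a small hand-made toy, not data taken from WordNet or any thesaurus, and the names TOY_LINKS and link_distance are hypothetical.

```python
from collections import deque

# Toy undirected graph of lexical links. Hand-made for illustration only;
# the nodes and links are assumptions, not taken from WordNet or a thesaurus.
TOY_LINKS = {
    "bus": ["public_transport"],
    "train": ["public_transport"],
    "public_transport": ["bus", "train", "vehicle"],
    "vehicle": ["public_transport", "artifact"],
    "artifact": ["vehicle", "entity"],
    "canary": ["bird"],
    "bird": ["canary", "animal"],
    "animal": ["bird", "entity"],
    "entity": ["artifact", "animal", "abstraction"],
    "abstraction": ["entity", "attribute"],
    "attribute": ["abstraction", "mendacity"],
    "mendacity": ["attribute"],
}


def link_distance(graph, source, target):
    """Return the number of links on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    # Breadth-first search visits nodes in order of increasing link count,
    # so the first time the target is reached the count is minimal.
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


close = link_distance(TOY_LINKS, "bus", "train")
distant = link_distance(TOY_LINKS, "canary", "mendacity")
```

In this toy graph, bus and train are separated by fewer links than canary and mendacity. Plain link counting treats every link as equally long, which is the kind of simplification that more sophisticated measures try to avoid.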

The rise of WordNet led many researchers to propose measures of semantic distance in WordNet that were more sophisticated than Morris and Hirst's simple link-counting. Alex Budanitsky experimented with a variety of these measures to compare their properties and suitability for different applications; he showed that there are significant differences between the measures, and that an intelligent spelling corrector for real-word spelling errors that approaches practical usefulness can be based on a good measure (Budanitsky and Hirst 2006). Semantic distance information can also improve the accuracy of word-completion software for disabled users. Budanitsky and Hirst (2006) and Mohammad and Hirst (2006a) explained why measures based on semantic resources are superior to those based solely on lexical distributions. Tong Wang further refined our understanding of WordNet-based measures (Wang and Hirst 2011).

The earlier research considered only distances between words; senses and disambiguation were only implicit, with the shortest distance being chosen when one or both words in a pair were ambiguous. However, Saif Mohammad developed a method that gets more directly at the underlying concepts: it creates distributional profiles of concepts from raw text alone, without the need for sense-annotated data. A published thesaurus is used both as a coarse-grained sense inventory and as a source of (possibly ambiguous) words that together unambiguously represent each sense (Mohammad and Hirst 2006b). Even a simple implementation of this idea rivaled WordNet-based methods on real-word spelling correction.
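The following Python sketch illustrates the general idea under strong simplifying assumptions. It is not a reproduction of the published method. The two-category thesaurus, the four-sentence corpus, the raw co-occurrence counts, the cosine-based distance, and all function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical thesaurus: category (coarse sense) -> words listed under it.
# "bank" is deliberately ambiguous between the two categories.
TOY_THESAURUS = {
    "FINANCE": ["bank", "money", "loan"],
    "RIVER": ["bank", "shore", "stream"],
}

# Invented raw text with no sense annotation.
TOY_CORPUS = [
    "the bank approved the loan and lent money".split(),
    "she repaid the loan with money from savings".split(),
    "the stream eroded the muddy bank near the shore".split(),
    "reeds grow on the shore of the stream".split(),
]


def concept_profiles(thesaurus, sentences, window=3):
    """Count context words within `window` tokens of any word of each category.

    An ambiguous word contributes to every category that lists it.
    """
    profiles = {category: Counter() for category in thesaurus}
    for tokens in sentences:
        for i, token in enumerate(tokens):
            for category, words in thesaurus.items():
                if token in words:
                    lo = max(0, i - window)
                    hi = min(len(tokens), i + window + 1)
                    for j in range(lo, hi):
                        if j != i:
                            profiles[category][tokens[j]] += 1
    return profiles


def cosine(p, q):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def word_distance(w1, w2, thesaurus, profiles):
    """Smallest concept distance (1 - cosine) over the words' category pairs."""
    cats1 = [c for c, ws in thesaurus.items() if w1 in ws]
    cats2 = [c for c, ws in thesaurus.items() if w2 in ws]
    distances = [
        1.0 - cosine(profiles[a], profiles[b]) for a in cats1 for b in cats2
    ]
    return min(distances) if distances else None


PROFILES = concept_profiles(TOY_THESAURUS, TOY_CORPUS)
d_bank_shore = word_distance("bank", "shore", TOY_THESAURUS, PROFILES)
d_money_shore = word_distance("money", "shore", TOY_THESAURUS, PROFILES)
```

Because bank is listed under both toy categories, its distance to shore is taken over the closest pair of senses, so it comes out smaller than the distance between money and shore. The distance is computed between concepts rather than between word forms.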

We are now looking at cross-lingual generalizations of these measures. A cross-lingual method using distributional profiles of concepts was first developed by Mohammad et al. (2007). Alistair Kennedy subsequently developed a method that does not use a parallel corpus but rather is seeded with a set of known translations (Kennedy and Hirst 2012). We have found that this measure correlates more closely with averaged human scores than unilingual baselines do.
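Agreement with human judgments is commonly quantified with a correlation coefficient. The text above does not say which coefficient was used, so the Python sketch below assumes Spearman rank correlation as one common choice. The scores in the usage lines are invented for illustration and are not results from any of the cited papers.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: averaged human relatedness scores for five word pairs,
# and the scores a hypothetical measure assigns to the same pairs.
human_scores = [3.9, 3.1, 2.4, 1.0, 0.2]
measure_scores = [0.81, 0.77, 0.35, 0.40, 0.05]
rho = spearman(human_scores, measure_scores)
```

A higher coefficient for one measure than for another, computed on the same word pairs, is what "correlates more closely" means in practice.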

This research is a continuation of our project on lexical chains and threads of meaning in documents.

References

Budanitsky, Alexander and Hirst, Graeme. “Evaluating WordNet-based measures of lexical semantic relatedness.” Computational Linguistics, 32(1), March 2006, 13–47.

Kennedy, Alistair and Hirst, Graeme. “Measuring semantic relatedness across languages.” Proceedings, xLiTe: Cross-Lingual Technologies Workshop at the Neural Information Processing Systems Conference, Lake Tahoe, NV, December 2012.

Mohammad, Saif and Hirst, Graeme. “Distributional measures of concept-distance: A task-oriented evaluation.” Proceedings, 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia, July 2006, 35–43.

Mohammad, Saif and Hirst, Graeme. “Determining word sense dominance using a thesaurus.” Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), Trento, Italy, April 2006, 121–128.

Mohammad, Saif; Gurevych, Iryna; Hirst, Graeme; and Zesch, Torsten. “Cross-lingual distributional profiles of concepts for measuring semantic distance.” 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, June 2007, 571–580.

Wang, Tong and Hirst, Graeme. “Refining the notions of depth and density in WordNet-based semantic similarity measures.” 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, July 2011, 1003–1011.