The central objective of my research is to develop a computational lexical-choice process that can choose, from a set of near-synonyms, the one that best achieves the desired effects in the current context. The process is intended to be broadly applicable in machine translation and natural language generation systems. To achieve this, I argue that an explicit representation of the differences between near-synonyms is required, but not necessarily in a knowledge-based formalism. Thus, I investigate two complementary approaches.
In the first, differences between near-synonyms are represented as differences in the statistical co-occurrence of the near-synonyms with other words in large text corpora. Then, given a novel context and a set of near-synonyms to choose from, one can determine which near-synonym is the most typical choice.
The second approach is a traditional knowledge-based approach. As a result of studying the form and content of usage notes in synonym-discrimination dictionaries (e.g., Webster's New Dictionary of Synonyms), I identify several components of fine-grained word meaning, including denotation, style, attitude, and collocations. I then propose a clustered model of lexical knowledge that has two levels of representation: a core concept and peripheral concepts. All of the near-synonyms in a cluster share the core denotation, which is represented as a configuration of concepts defined in an ontology, and which serves as a necessary applicability condition for each of the words. All differences between the words are represented explicitly as differences in the denotation, suggestion, or emphasis of peripheral concepts, or as differences in style or attitude. The best word is chosen by finding the word that most closely matches (according to structural similarity and fuzzy similarity) a set of preferences for expressing certain ideas or using certain styles. The system is implemented in approximately 2000 lines of Lisp code.
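The clustered model and preference matching can be sketched as follows. The cluster contents, the numeric strengths for peripheral concepts and style, and the simple distance-based scoring are all illustrative assumptions; the actual system uses a richer ontological representation and a more elaborate similarity computation, in Lisp.

```python
from dataclasses import dataclass

# A minimal sketch of the clustered model: names and weights are
# illustrative, not the thesis's actual representation.
@dataclass
class NearSynonym:
    word: str
    peripheral: dict  # concept -> strength (1.0 = denoted, ~0.5 = suggested)
    style: dict       # stylistic dimensions, e.g. formality in [0, 1]

@dataclass
class Cluster:
    core: str         # shared core denotation; applicability condition
    members: list

def satisfaction(ns, prefs):
    """Fuzzy match: how well a word satisfies concept and style preferences."""
    score = 0.0
    for concept, desired in prefs.get("concepts", {}).items():
        score += 1.0 - abs(ns.peripheral.get(concept, 0.0) - desired)
    for dim, desired in prefs.get("style", {}).items():
        score += 1.0 - abs(ns.style.get(dim, 0.5) - desired)
    return score

def choose(cluster, prefs):
    """Pick the cluster member that best matches the preferences."""
    return max(cluster.members, key=lambda ns: satisfaction(ns, prefs)).word

cluster = Cluster(
    core="generic-error",  # every member denotes an error of some kind
    members=[
        NearSynonym("error",   {"criticism": 0.3, "stupidity": 0.0},
                    {"formality": 0.7}),
        NearSynonym("blunder", {"criticism": 0.8, "stupidity": 0.7},
                    {"formality": 0.4}),
    ],
)
# Prefer a word that strongly suggests criticism, in an informal register.
print(choose(cluster, {"concepts": {"criticism": 0.9},
                       "style": {"formality": 0.3}}))
```

The design point the sketch captures is the two-level split: the core denotation licenses every member of the cluster, while the choice among members is driven entirely by graded matches against the peripheral and stylistic differences.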