Current word-prediction utilities rely on little more than word unigram and bigram frequencies. Can part-of-speech information help? To answer this question, we first built a testbench for word prediction and then introduced several new prediction algorithms that exploit part-of-speech tag information. We trained the prediction algorithms on a very large corpus of English and evaluated them in several experiments against several performance measures. All the algorithms were compared with WordQ, a commercial word-prediction program. Our results confirm that strong word unigram and bigram models, collected from a very large corpus, give accurate predictions. All predictors, including the one based on word unigram statistics, outperform the WordQ prediction algorithm. The predictor based on word bigrams works surprisingly well compared to the syntactic predictors. Although two of the syntactic predictors work slightly better than the bigram predictor, an ANOVA test shows that the difference is not statistically significant.
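As a rough illustration of the baseline idea mentioned above, the following minimal sketch ranks candidate words by bigram frequency and falls back to unigram frequency when the bigram evidence runs out. It is not the implementation evaluated in the paper, and it uses no part-of-speech information; the function names `train` and `predict`, the `prefix` and `n` parameters, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Collect unigram and bigram counts from a list of word tokens."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1
    return unigrams, bigrams


def predict(unigrams, bigrams, prev_word, prefix="", n=5):
    """Return up to n suggestions for the word following prev_word.

    Candidates must start with the letters typed so far (prefix).
    Bigram candidates come first; unigram candidates fill what is left.
    """
    def ranked(counter):
        # Most frequent first; ties broken alphabetically (an assumption).
        items = [(w, c) for w, c in counter.items() if w.startswith(prefix)]
        items.sort(key=lambda wc: (-wc[1], wc[0]))
        return [w for w, _ in items]

    suggestions = ranked(bigrams.get(prev_word, Counter()))
    for word in ranked(unigrams):
        if len(suggestions) >= n:
            break
        if word not in suggestions:
            suggestions.append(word)
    return suggestions[:n]


tokens = "the cat sat on the mat and the cat ran".split()
unigrams, bigrams = train(tokens)
print(predict(unigrams, bigrams, "the", n=3))
print(predict(unigrams, bigrams, "the", prefix="m"))
```

In this toy corpus, "cat" follows "the" more often than "mat" does, so it is suggested first; once the user types "m", only "mat" remains. A syntactic predictor of the kind the abstract describes would additionally condition on part-of-speech tags, which this sketch does not attempt.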
Download: gzipped PostScript file (448 Kb); PostScript file (2420 Kb); PDF file (532 Kb).
Request paper copy: Send request with postal address to gh@cs.toronto.edu.