Department of Computer Science / University of Toronto
csc148F - Introduction to Computer Science, Fall 1995

Assignment 5: A bilingual spelling checker


Due 6 p.m., Thursday 7 December 1995

Introduction



In this assignment, you will write a spelling checker that knows two languages, English and Ruritanian, and that can therefore check the spelling of bilingual documents.

For each of the two languages, your program will have a lexicon of legal words. It will read in a text, and look each word up in the lexicons. Any word in the input that isn't in one of the lexicons will be flagged as a possible spelling error. The program will also know a bit about how words are formed in each language. For example, if it sees the English word getting, it will know that the word is correctly spelled if it can find get in its English lexicon.

The program will have four options: Check words for spelling in both languages, that is, accept words that are either English or Ruritanian; check spelling in English only; check spelling in Ruritanian only; or just count the number of words in the file. (The program also counts words if it's checking spelling.)

An important part of this assignment will be the use of classes and modules to deal neatly with the parts of the program that deal with English, those that deal with Ruritanian, and those that are the same for both languages.


How to check spelling



Lexicons

Any spelling checker must have a lexicon: a list of words that are correctly spelled. A bilingual spelling checker will need two lexicons, one for each language. And, of course, for each lexicon, we need a way of quickly checking whether a particular word in the input is actually in the lexicon.

Implementation: A simple method for storing a lexicon is as an array of strings. If the strings are stored in alphabetical order, then a binary search can quickly determine whether or not a given word is present.

Since you will need two lexicons, you should define a class for general lexicons that deals with all the aspects of lexicons that are independent of any particular language. Each object in the class will contain an array (the list of words) and procedures to read that list in and to perform binary search on it.

We have provided English and Ruritanian lexicons for you. Files with the base forms of the most common words of each language, sorted alphabetically (900-1000 base forms for each language), are available as pickup files 148eng.lex and 148rur.lex. (Note that you don't need to sort the lexicons; they come pre-sorted, and never change. However, your program should check each lexicon as it reads it in to make sure that it is indeed sorted.)
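The lexicon class described above might be sketched as follows. This is an illustrative sketch in Python (the course's actual implementation language may differ); the class and method names are not part of the handout. It assumes the pickup files contain one word per line, and it performs the required sortedness check when the list is read in.

```python
class Lexicon:
    """A sorted list of base-form words, with binary-search lookup."""

    def __init__(self, words):
        self.words = list(words)
        # The handout requires checking that the word list really is sorted.
        for prev, cur in zip(self.words, self.words[1:]):
            if prev >= cur:
                raise ValueError(f"lexicon not sorted: {prev!r} >= {cur!r}")

    @classmethod
    def from_file(cls, filename):
        """Read a lexicon from a file with one word per line."""
        with open(filename) as f:
            return cls(line.strip() for line in f if line.strip())

    def contains(self, word):
        """Binary search: O(log n) comparisons for n words."""
        lo, hi = 0, len(self.words) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if self.words[mid] == word:
                return True
            elif self.words[mid] < word:
                lo = mid + 1
            else:
                hi = mid - 1
        return False
```

The English and Ruritanian modules would each hold one instance of this class, loaded from 148eng.lex and 148rur.lex respectively.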

Words

The input to a spelling checker is a stream of tokens, each of which is to be checked to find out whether it is really a word. A token is a string of characters delimited by white space. Usually, we think of a word as nothing more than a token. However, this definition is not quite good enough for a spelling checker, because in a document words may be spelled in lower case, upper case, or some mixture; and they may have punctuation attached. Before we can look a token up in the lexicon, we need to convert it to a canonical form by removing numbers and punctuation and converting all characters to lower case. For example, if the input were this:
"WHAT??!" screamed Lisa 23 times.
then the canonical forms for the tokens would be what, screamed, lisa, and times.
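The conversion to canonical form can be sketched in a few lines of Python (the function name is illustrative): keep only the alphabetic characters of the token and lower-case them.

```python
def canonical_form(token):
    """Strip digits and punctuation and lower-case what remains.

    Returns the empty string for tokens (such as '23') that contain
    no letters at all; such tokens are not words and need no lookup.
    """
    return ''.join(c.lower() for c in token if c.isalpha())
```

On the example input above, the four tokens with letters map to what, screamed, lisa, and times, while 23 maps to the empty string.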

Rules of English word formation

English is an inflected language, which means that you can add various endings to base-form words in the language to get new words. For example, sneeze is a base-form word, and so sneezes, sneezed, and sneezing are words too; they are called inflected forms of the base word. We could simply store all these different words in the lexicon along with the base form, but this would make the lexicon require several times as much memory. So instead, when checking a token, we look to see whether it might be an inflected form; if so, then we work out what the base form is, and check for that in the lexicon.

Table 1 lists the most common inflections in English, which your spelling checker should know about. Notice that recognizing each inflection requires taking some letters off the end of the token, and, in some cases, adding new letters. Notice that in English, an inflected word must be at least four letters long; if a token is only three letters or fewer, there is no point in checking it for inflections.

  
Table 1: Common inflections in English.

Two inflections, ing and ed, have the special property that they allow consonant doubling; for example, slam becomes slamming and slammed, not slaming and slamed. So after you've taken off the inflection, if the last two letters are the same consonant (e.g., if you've converted slamming to slamm), you should check the base form with and without the double consonant (e.g., check for both slamm and slam in the English lexicon).

So the method for checking whether a token is an English word is as follows:

  1. Put the token into canonical form.
  2. Check whether that canonical form is in the English lexicon; if so, you are finished.
  3. If the token is at least four letters long, then for each inflection in Table 1, see whether the token could be a base form with that inflection added. If so, check whether that base form is in the lexicon, and if so, you are finished.
  4. Otherwise, the token is not a recognized English word.

(Notice that this approach is not perfect, because it will not detect illegally inflected words, such as aring and thinged; it will think that they are correctly spelled because are and thing are base-form words.)
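The four steps above can be sketched as follows. Since Table 1 itself is not reproduced in this text, the (suffix, replacement) pairs below are only a plausible guess at its contents; substitute the table's actual entries. The word is assumed to be in canonical form already, and the lexicon can be any container supporting membership tests.

```python
# Guessed stand-in for Table 1: strip the suffix, append the replacement.
INFLECTIONS = [("s", ""), ("es", ""), ("ed", ""), ("ed", "e"),
               ("ing", ""), ("ing", "e")]

def is_english_word(word, lexicon):
    """word: a canonical-form token; lexicon: base forms, supports `in`."""
    if word in lexicon:                       # step 2: direct lookup
        return True
    if len(word) < 4:                         # inflected words have >= 4 letters
        return False
    for suffix, replacement in INFLECTIONS:   # step 3: try each inflection
        if not word.endswith(suffix):
            continue
        base = word[:-len(suffix)] + replacement
        candidates = [base]
        # ing and ed allow consonant doubling: slamming -> slamm -> slam,
        # so check the base both with and without the doubled consonant.
        if (suffix in ("ing", "ed") and len(base) >= 2
                and base[-1] == base[-2] and base[-1] not in "aeiou"):
            candidates.append(base[:-1])
        if any(c in lexicon for c in candidates):
            return True
    return False                              # step 4: not recognized
```

As the parenthetical note warns, this accepts illegally inflected words such as aring, since stripping ing and restoring e yields the base form are.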

Rules of Ruritanian word formation

Ruritanian is a language spoken in several regions of eastern Europe. It is an agglutinative, uninflected, prefix-free language, which means that it works like this:

To decide whether or not a token is a correctly spelled word in Ruritanian, you have to see whether or not it can be separated into base forms that can be found in the lexicon. The easiest way to do this is to work down the token, trying to find the first base form in it. If this succeeds, we remove the base form and (recursively) see whether what remains is a legal word. For example, with yowdtealbirsz, we would check whether y, yo, and yow were base forms, but we'd fail. Then, when we find that yowd is a base form, we would check whether tealbirsz was a word; again, we'd try removing prefixes, eventually find that teal was a word, and start over again on birsz.

Notice that this algorithm works only because of the prefix-free property of Ruritanian: no base-form word is a prefix of another. If both yow and yowd were words, then when checking yowdtealbirsz, we would remove yow, check dtealbirsz (which isn't a word), and get the wrong answer.
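The recursive procedure described above can be sketched like this (an illustrative Python sketch; the function name is not from the handout). Because at most one prefix of the token can be a base form, the algorithm can safely commit to the first base form it finds.

```python
def is_ruritanian_word(word, lexicon):
    """word: a canonical-form token; lexicon: a set of base forms."""
    if word == "":
        return False
    # Work down the token, trying prefixes of increasing length.
    for i in range(1, len(word) + 1):
        if word[:i] in lexicon:
            rest = word[i:]
            # Prefix-freeness means no other prefix can be a base form,
            # so the answer is decided by the remainder alone.
            return rest == "" or is_ruritanian_word(rest, lexicon)
    return False
```

With the lexicon containing yowd, teal, and birsz, checking yowdtealbirsz finds yowd, recurses on tealbirsz, finds teal, and finally accepts birsz.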

Implementation: The rules for checking tokens for Ruritanian and English should be implemented in separate modules, one for each language. Each module will need to have within it an instantiation of the lexicon class for that language. (An important part of this assignment is for you to learn about the similarities and differences between modules and classes, and how to work with both of them.)


The user interface



When your program starts, it will load both the English and Ruritanian lexicons from files, and then ask the user for the name of a file whose spelling is to be checked. The user will then be offered four options: check the file for English; check the file for Ruritanian; check it for both languages; or just count the words.

The program then reads the file, token by token, and prints a warning message whenever it finds a word that it can't recognize in the relevant language. See Figure 1 for an example. (In the example, many of the possible errors are actually correctly spelled words that just weren't in the lexicon.) When the program finishes, it prints a count of the number of possible spelling errors and of the number of words in the file.

Note that your program will not be as sophisticated as commercial spelling checkers, which allow the user to correct their mistakes and to add new words to the lexicon while running the program.


  
Figure 1: Example of spelling checker in operation.

Questions to answer



  1. Derive a recurrence relation to describe the number of times that your Ruritanian lexicon must be searched, in the worst case, when checking whether a token of length n is a Ruritanian word.

    Hint: You can come up with a better (tighter) bound if you remember that no base-form word in Ruritanian is a prefix of another. For instance, if we are checking to see if fjord is Ruritanian, at most one of its four proper prefixes (f, fj, fjo, and fjor) can be a base-form word. That means that at most one of them has to have its remainder checked to be sure that it is a Ruritanian word. Part of your mark will be for the tightness of your bound.

  2. Solve the recurrence relation (but do not prove your answer to be correct). Show how you arrived at your answer.
  3. Let m be the number of words in your Ruritanian lexicon. Use big-oh notation to express the worst-case time complexity of checking whether a token of length n is in the Ruritanian lexicon.
  4. Again, let m be the number of words in your Ruritanian lexicon. Use big-oh notation to express the worst-case time complexity of checking whether a token of length n is a Ruritanian word. Explain how you arrived at your answer.


What to hand in



Hand in: The cover sheet provided here must be filled out, signed, and taped to the envelope containing your assignment.





Department of Computer Science / University of Toronto
csc148F - Introduction to Computer Science, Fall 1995

Cover sheet for Assignment 5



Complete this page and include it in the envelope with your assignment.








Diane Horton
Thu Nov 23 10:52:13 EST 1995