University of Toronto
CSC148S--Introduction to Computer Science, Spring 1997

Assignment 0: String Matching


Due electronically:  Thursday 23 January, 6:00 pm
This assignment is worth 1 mark towards your grade in this course. Part marks will count.

Introduction

This is a warm-up assignment to get everyone going, particularly students who haven't taken CSC108 and are therefore unfamiliar with our computer systems and possibly also with the programming language (the C language for the 1:00 class, and OOT for all other sections).

Coming up with the algorithms required for this assignment should be a straightforward task for all CSC148 students. If you don't find it so, then you may need to get more programming experience before taking 148.  

The problem

The Human Genome Project is an international effort to map out and analyze the complete set of human genes. One kind of analysis that scientists wish to do is to take the sequence of nucleotides that constitute a gene whose function is not known and find another gene with similar structure whose function is known. The function of this matching gene may provide clues to the function of the gene under analysis. The matching of gene sequences can also provide insight into the evolution of organisms. For instance, the similarities and differences in the structure of the hemoglobin gene in related organisms can provide evidence for how recently the organisms diverged.  

Since the sequence of nucleotides that makes up a particular gene can be represented by a string of characters, the gene-matching problem is an instance of a general string-matching problem.  

In this assignment, we will tackle a simple string-matching problem: Given two strings of any length, find the longest substring (of length 1 or greater) common to both. For instance, if the strings are: xypdzxabcdelmndeghij   and   abcdefghijk

then the longest common substring, abcde, is 5 characters long and occurs at position 7 in the first string and position 1 in the second string: xypdzxlmndeghij   and   fghijk

Your task is to write a program that reads in two strings and finds the longest substring common to both. The input strings will be on two separate lines of the input file. The output should be:

       There were no common substrings
if there were no substrings common to both input strings (in which case, there must not have been even any characters common to both), or, if there was at least one substring common to both:
       The longest common substring has length: l
       It occurs in string 1 starting at character: i
       It occurs in string 2 starting at character: j
where i, j, and l are appropriate integers. In the case of a tie, the substring closest to the beginning of the first string is chosen. If this also yields a tie (because that substring occurs more than once in the second string), the occurence closest to the beginning of the second string is chosen.  

An approach to solving the problem

One way to solve this problem involves using an incidence matrix. The incidence matrix IM for two strings s1 and s2 is a boolean matrix of size length(s1) tex2html_wrap_inline138 length(s2), such that for all i, j, where tex2html_wrap_inline144 length(s1) and tex2html_wrap_inline148 length(s2), IM(i, j) = true iff (s1(i) = s2(j)).

The incidence matrix for the sample strings used above is shown below. To make it easier to read, we have printed the two strings down the left and across the top.

tabular61

Notice several important things about an incidence matrix:

In this assignment, we will only be concerned with alignments that are forwards and have no gaps.  

An obvious and simple algorithm for matching two strings, given their incidence matrix, is to look for the longest sequence of trues along one of the main diagonals.  

How to write your program

We have already written parts of the program for you. There are two versions: one for students in the 1:00 class, who must work in C; and one for all other students, who must work in OOT. You must use our code as your starting point.  

To save typing, our starter code is available online through the course web page located at http://www.cs.utoronto.ca/ tex2html_wrap_inline160 dianeh/148.html

(You will also find there an example of an input file and the output that it should produce.) Make your own copy of the starter code using your web browser. (With Netscape, you can use the ``Save As'' command under the ``File'' pull-down menu.) Then edit the program and insert the missing pieces in the places marked by >> and <<.  

You must use our starter code, and you must not make any alterations to it. In particular, do not change the subprogram headers and do not change the input or output statements. Do not add prompts for the input.  

You will probably find it convenient to break some of the subprograms down further into smaller subtasks. Feel free to add additional subprograms.  

How to hand in your assignment

For this assignment, your solution will be marked by a computer program, which will put it through a thorough set of test cases. We will not tell you what these tests are. Because the assignment is simple, the marking scheme will be tough. You should therefore test your code very thoroughly before handing it in.  

Because this assignment is to be marked by a program, you will hand it in electronically rather than on paper. Name the file containing your completed program a0.t if it is written in OOT or a0.c if it is written in C. When you are convinced of the correctness of your code, hand it in electronically using the ``Hand in (submit)'' command, found under the ``File Utilities'' icon.

If you hand in a solution more than once, we will have only a copy of your most recent version; our copy of the older ones will be erased.

The program that will mark your assignment is not terribly clever, and can only compare your program's output character for character with what we tell it is the correct output. If your output is different in any way from the expected output, it will be considered wrong, and your mark will be zero. For this reason, we are providing, among other things, the statements that perform input and output. If you modify these in any way, or add other input or output statements (for example, if you add prompts) your mark will be zero. In addition, you must use the file name a0.t or a0.c (as appropriate) when you hand in your assignment, or your mark will be zero.  

Some useful advice

Students who are already familiar with the programming language used in their section of the course should need only a couple of hours to complete this assignment. Students who are working in OOT but are not familiar with the language will need extra time to learn the details of OOT and its programming environment. Students who are working in C but are not familiar with the language should speak to Diane Horton for advice; it is harder to teach yourself C than to teach yourself OOT.  

The most effective way to get this assignment done quickly is to write your program out on paper and simulate it by hand before you go near a computer, in order to get rid of as many errors as possible.  

Even if you plan to develop your program at home, you will have to use one of our PCs to hand in your final program. Make sure you log in and get comfortable with our computers well before the due date.  

Try to get your assignment finished early. The computers are likely to be very crowded near the due date.  

About future assignments

This assignment is unusual in comparison to the rest of the assignments in CSC148. First, it is far less difficult than the assignments to come. Second, you will hand it in electronically, whereas future assignments will generally be handed in on paper. Finally, only the correctness of your code will be evaluated. Future programs will be evaluated also for programming style, documentation, and testing, and will require a report. These aspects may be worth as much as 75% of future programming assignments.  

Check the 148 web site

Remember that it is your responsibility to regularly check the course web site for announcements. We often post there answers to common questions concerning assignments, and sometimes corrections to assignments. Because this is the first time we are teaching 148 concurrently in both OOT and C, there may be a few glitches.

The course web site is located at http://www.cs.utoronto.ca/ tex2html_wrap_inline160 dianeh/148.html


Diane Horton
Tue Dec 31 15:26:16 EST 1996