University of Toronto
Department of Computer Science
CSC148H, Summer 2007


Assignment 4:   (Due Wed August 8th @ 9am)


Introduction:

Computational Biology has become a very hot research area nowadays. Comparing DNA or amino acid sequences to measure their similarity is a fundamental task that helps biologists to classify organisms. Sequences which are similar may either have descended from a common evolutionary ancestor, or may have evolved to have a similar function. Measuring similarity is still a challenging task as the biological sequences are composed of letters from a fixed alphabet but similarity scores are numerical values. How can we measure the similarity between two DNA sequences, say, "CGCCTC" and "CGTCTGC"? One approach is to first convert each sequence (string) to an Euclidean vector (called feature vector), and then take their dot product as the similarity score. In this assignment, we will see two ways to do this called "string kernels" and "mismatch string kernels". Don't be frightened by these terms! We will explain them and you will find that it's not hard to do the calculations.

Goals:

The goals of this assignment are a) to expose you to solve a real problem using object-oriented concepts, b) to exercise using advanced data structures like Trees in order to have efficient implementation, and c) to gain experience in reading and understanding existing code and integrating/adapting it with the part that you will develop.

The Problem:

In this assignment, you are asked to solve a real problem in Computational Biology: computing the similarity matrix (called Kernel Matrix) for a set of sequences. A sequence is a finite series of letter from an Alphabet (for example, AATGCCGACCA is a DNA sequence with alphabet {A,T,G,C}). Given a set of N sequences, the Kernel Matrix K is defined as an N*N matrix where the (i,j)-th element, i.e. K(i,j), is the dot product between the numerical feature vectors of sequence i and sequence j. For this, we need the definition of the feature vector of a sequence and the dot product. As we will see, the naïve way of representing the feature vectors by real vectors is impractical due to the huge size of vectors, so will show you how to use a special Tree structure (called Trie) to store feature vectors and compute their dot products efficiently.

1. Dot Product of Two Vectors

The dot product of two Euclidean vectors is the sum of the corresponding vector elements product. For example, the dot product of 10-dimensional Euclidean vectors x=(3,0,5,0,2,0,3,2,0,2) and y=(2,4,4,1,0,3,2,5,3,0) is 3*2 + 0*4 + 5*4 + 0*1 + 2*0 + 3*2 + 2*5 + 0*3 + 2*0 = 42. For very large vectors with mostly zero elements, this calculation can be made more efficient if we somehow store ONLY nonzero elements of vectors and have a way of identifying the coordinate in the vector that correspond to each such element. More on this later.

2. Feature Vectors of Sequences

4. Building the Kernel Matrix

Recall that if we have N sequences, we will need to compute the dot products between every possible pair including the pair of each sequence with itself, and store them in a matrix denoted by K (this matrix is called often 'Kernel Matrix' in Machine Learning literature). If the input sequence set is in a file with one line corresponding to one sequence as follows:

seq 0
seq 1
seq 2
seq 3
...
seq N-1

then K(i, j) should be the dot product between seq i and seq j. For example, K(1, 1) is the dot product between seq 1 and itself. We have provided you with KernelMatirx.java. You need to complete the calcKernelMatrix method (simply by using methods in KmerTrie).

What you should do

  1. read the starter code and make sure you understand what's going on there, especially the relations between different classes. We have provided you with complete TrieNode, LeafNode, InternalNode, TrieException, and Alphabet classes (DO NOT CHANGE THESE).
  2. complete the KMerTrie class (a lot of code and wrappers are given, you just complete the following methods (using proper helpers if necessary):
    • KmerTrie constructor,
    • addKmerWithoutMismatch,
    • addKmerWithMismatch,
    • dotProduct,
    • getKmerCount,
    • getListofKmers
  3. complete the KernelMatrix class (again most of the code is given, just implement calcKernelMatrix).
Note that efficiency is important so avoid redundancy. You should also complete the documentations (including JavaDoc if necessary). Make sure NOT TO CHANGE any method already completely given (some of these like the toString methods will help you to debug your trie or matrix).

Submission

Files to submit: Files not to be changed and not to be submitted:

Marking Scheme