"CGCCTC"
and "CGTCTGC"
? One approach is to first convert each sequence (string) to a Euclidean vector
(called feature vector), and then take their dot product as the similarity score. In this
assignment, we will see two ways to do this called "string kernels" and "mismatch string
kernels". Don't be frightened by these terms! We will explain them and you will find that
it's not hard to do the calculations.
AATGCCGACCA
is a DNA sequence with alphabet {A,T,G,C}
). Given a set of N
sequences,
the Kernel Matrix K
is defined as an N*N
matrix where
the (i,j)-th
element, i.e. K(i,j)
, is the dot product between
the numerical feature vectors of sequence i
and sequence j
.
For this, we need the definition of the feature vector of a sequence and the dot product.
As we will see, the naïve way of representing the feature vectors by real vectors is impractical
due to the huge size of the vectors, so we will show you how to use a special Tree structure
(called Trie) to store feature vectors and compute their dot products efficiently.
x=(3,0,5,0,2,0,3,2,0,2)
and
y=(2,4,4,1,0,3,2,5,3,0)
is
3*2 + 0*4 + 5*4 + 0*1 + 2*0 + 0*3 + 3*2 + 2*5 + 0*3 + 2*0 = 42
.
For very large vectors with mostly zero elements, this calculation can be made more
efficient if we somehow store ONLY nonzero elements of vectors and have a way of
identifying the coordinate in the vector that corresponds to each such element. More on this later.
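As a sketch of this idea (the class and method names here are made up for illustration, not part of the assignment's code), we can store each vector as a map from coordinate index to nonzero value and iterate only over the smaller map:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseDot {
    // Dot product of two sparse vectors stored as index -> nonzero-value maps.
    // Iterating over the smaller map keeps the cost proportional to its size,
    // not to the (possibly huge) dimension of the full vectors.
    static long dot(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        Map<Integer, Integer> small = a.size() <= b.size() ? a : b;
        Map<Integer, Integer> large = (small == a) ? b : a;
        long sum = 0;
        for (Map.Entry<Integer, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) sum += (long) e.getValue() * other;
        }
        return sum;
    }

    public static void main(String[] args) {
        // x = (3,0,5,0,2,0,3,2,0,2) and y = (2,4,4,1,0,3,2,5,3,0), stored sparsely
        Map<Integer, Integer> x = new HashMap<>();
        x.put(0, 3); x.put(2, 5); x.put(4, 2); x.put(6, 3); x.put(7, 2); x.put(9, 2);
        Map<Integer, Integer> y = new HashMap<>();
        y.put(0, 2); y.put(1, 4); y.put(2, 4); y.put(3, 1); y.put(5, 3);
        y.put(6, 2); y.put(7, 5); y.put(8, 3);
        System.out.println(dot(x, y)); // 42, matching the example above
    }
}
```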
String kernel features for sequences are defined by counting the occurrences of
all possible contiguous subsequences of length k, each called a k-mer (also written kmer). For example,
if the alphabet has two letters {C, G}
and k = 3
,
all 2^3 = 8 possible kmers are:
[CCC, CCG, CGC, CGG, GCC, GCG, GGC, GGG]
.
We can use these kmers (here we have 8 kmers in total)
as coordinates, and the value at each coordinate is the number
of times that particular kmer occurs in the sequence. For example, for the sequence
"CGGCCGCC"
, the 8-dimensional feature vector of this sequence under
the above coordinate
system is [0, 1, 1, 1, 2, 0, 1, 0]
(because CCC
does not occur in the sequence, CCG
occurs once, etc).
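As a small sketch (the `KmerCount` class and `countKmers` method are hypothetical names, not part of the provided code), counting the kmers that actually occur in a sequence might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KmerCount {
    // Count every length-k substring (kmer) of seq.
    // Only kmers that actually occur are stored, so the map stays small
    // even when the full coordinate space is enormous.
    static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum); // increment, inserting 1 if absent
        }
        return counts;
    }

    public static void main(String[] args) {
        // "CGGCCGCC" with k = 3 has kmers CGG, GGC, GCC, CCG, CGC, GCC,
        // consistent with the feature vector [0, 1, 1, 1, 2, 0, 1, 0] above
        System.out.println(countKmers("CGGCCGCC", 3));
    }
}
```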
In the above simple example, the length of the kmer is 3 and the size of the
alphabet is 2.
Actually,
for real sequences such as protein sequences, the size of the alphabet is 20
(protein sequences are composed of 20 Amino Acids); if we want to consider
kmers with length 7, we will have in total 20^7 possible kmers, hence the feature
vector has 1,280,000,000 elements. If we want to compute the dot product of two
protein sequences with length 20 and 27, respectively, we must count the occurrence
times of all 20^7 kmers and then compute their dot product using the brute-force approach.
This is prohibitively expensive! Note that sequences are often short and almost all the values
in the vectors will be zero (why?). Instead, we count ONLY the occurrences of the kmers
that actually appear in each sequence. But how do we
store just the kmers in a sequence compactly and how do we compute the dot
product between two sequences? Good question! We can create a list of Kmers with their count
in each sequence and traverse both lists to find identical ones and do the multiplication.
While this is much faster than the naïve way, it is still slow considering we need to do it
many times to calculate the kernel matrix. In this assignment we use a Tree data
structure called Trie (see this link for a definition of
Trie; note that for us,
since we store kmers with equal length, all leaves will be at the same depth).
We will build a special Trie for each sequence to store its kmers and their number of occurrences.
Then we will be able to traverse each Trie using all kmers in the other sequence to calculate
their dot product. Here is an example:
If the alphabet is {C, G}, the full Trie schema for all kmers with length k = 2 is:

        root
       /    \
      C      G
     / \    / \
    C   G  C   G

There are 4 leaf nodes and 3 internal nodes in this Trie. Each leaf corresponds to one kmer;
for example, the left-most leaf is the kmer CC. Note that we do not store letters in the
internal nodes; instead, since the alphabet is fixed, every internal node has that many children,
and the position of each child implicitly indicates the letter.
The Trie for any sequence is a sub-tree of this Trie (only containing kmers that
appear in the sequence) where each leaf contains the kmer and its number of occurrences in the sequence.
For example, the Trie for sequence "CCCG" is:

        root
       /    \
      C     null
     / \
    C   G
  CC,2  CG,1

Now if you have another sequence, say "GCCCCCG", its list of nonzero
kmers is {(GC,1), (CC,4), (CG,1)}
. If we traverse the first trie
using each of these kmers and multiply the counts and sum all of these values
we get the dot product: GC
is not in the trie so it has 0 contribution to the calculations;
CC
is in the trie with count 2, multiplying this by 4 we get 8;
CG
has count 1 in the trie, multiplied by 1 we get 1.
So, in total we have 0+8+1=9 as the dot product.
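The build-and-traverse procedure above can be sketched as follows. This is a minimal illustration only: `MiniTrie` is a hypothetical class, not the provided KmerTrie (which uses the TrieNode/LeafNode/InternalNode classes), but the traversal logic is the same idea.

```java
import java.util.HashMap;
import java.util.Map;

public class MiniTrie {
    // Each node maps a letter to a child; since all kmers have equal length,
    // every path of length k ends at a leaf carrying an occurrence count.
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count; // nonzero only at depth-k leaves
    }

    final Node root = new Node();
    final int k;

    MiniTrie(String seq, int k) {
        this.k = k;
        for (int i = 0; i + k <= seq.length(); i++) {
            Node cur = root;
            for (int j = i; j < i + k; j++)
                cur = cur.children.computeIfAbsent(seq.charAt(j), c -> new Node());
            cur.count++; // one more occurrence of this kmer
        }
    }

    // Look up one kmer; a missing branch means the count is 0.
    int countOf(String kmer) {
        Node cur = root;
        for (int j = 0; j < kmer.length(); j++) {
            cur = cur.children.get(kmer.charAt(j));
            if (cur == null) return 0;
        }
        return cur.count;
    }

    // Dot product: walk the other sequence's kmers through this trie,
    // multiplying matched counts and summing.
    long dot(String other) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= other.length(); i++)
            counts.merge(other.substring(i, i + k), 1, Integer::sum);
        long sum = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            sum += (long) countOf(e.getKey()) * e.getValue();
        return sum;
    }

    public static void main(String[] args) {
        MiniTrie trie = new MiniTrie("CCCG", 2);   // leaves: CC -> 2, CG -> 1
        System.out.println(trie.dot("GCCCCCG"));   // 0*1 + 2*4 + 1*1 = 9, as above
    }
}
```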
In the above method, we talked about string kernels by counting all possible
kmers with length k
to calculate feature vectors.
However, the exact counting may not effectively capture similarity.
For example, assume the alphabet is {C, G} and the kmer length k is 3,
and consider the two sequences "CCGC" and "GCCC". Their dot product is 0 using the string
kernel calculation (why?). But these two sequences look very similar!
They have a common "CC"
and both have the letter 'G'
.
How can we modify the string kernel to take into account such similarities?
The answer is simple: allow mismatches when we count the coordinate kmers in the sequence.
We use m
to denote the maximum number of mismatches allowed.
If the number of mismatches of corresponding letters
between a kmer in a sequence and a coordinate kmer is smaller than or equal to
m
, we still say that kmer
hits (matches) that coordinate. For alphabet {C, G}
and k=3
,
the coordinate kmers are [CCC, CCG, CGC, CGG, GCC, GCG, GGC, GGG]
,
if we don't consider mismatches, the feature vectors of "CCGC"
and "GCCC"
are
[0, 1, 1, 0, 0, 0, 0, 0]
and [1, 0, 0, 0, 1, 0, 0, 0]
,
and their dot product is 0. If we consider
mismatches with m = 1
when we compare a kmer in the sequence
to a coordinate kmer, the answer is very
different. The first coordinate kmer "CCC"
has two hits in sequence
"CCGC"
since the number of mismatches between
"CCG"
and "CCC"
is 1 (which is <=m
) and
we have one hit for CCG
, and the number of mismatches between
"CGC"
and "CCC"
is also 1
(again <=m
) and we have just one hit for CGC
.
So the coordinate value for sequence "CCGC"
at coordinate
"CCC" is 2. Following the above procedure, we can calculate the feature vector
of "CCGC"
to be:
[1+1, 1, 1, 1+1, 0, 1, 1, 0] = [2, 1, 1, 2, 0, 1, 1, 0]
.
Likewise, the feature vector of "GCCC"
when considering mismatches
with m=1
is:
[1+1, 1, 1, 0, 1+1, 1, 1, 0]= [2, 1, 1, 0, 2, 1, 1, 0].
Then the dot products between "CCGC"
and "GCCC"
when considering mismatches with m=1
is:
2*2 + 1*1 + 1*1 + 2*0 + 0*2 + 1*1 + 1*1 + 0*0 = 8.
This result reflects our intuition about the similarity better.
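The worked example above can be checked with a brute-force sketch (the `MismatchBrute` class and its methods are hypothetical, not the trie-based assignment code): for each coordinate kmer, count how many kmers of the sequence lie within m mismatches of it.

```java
public class MismatchBrute {
    // Number of positions where two equal-length strings differ (Hamming distance).
    static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    // Feature value of seq at coordinate `coord`: how many kmers of seq
    // are within m mismatches of coord.
    static int hits(String seq, String coord, int m) {
        int k = coord.length(), n = 0;
        for (int i = 0; i + k <= seq.length(); i++)
            if (hamming(seq.substring(i, i + k), coord) <= m) n++;
        return n;
    }

    public static void main(String[] args) {
        String[] coords = {"CCC", "CCG", "CGC", "CGG", "GCC", "GCG", "GGC", "GGG"};
        long dot = 0;
        for (String c : coords)
            dot += (long) hits("CCGC", c, 1) * hits("GCCC", c, 1);
        System.out.println(dot); // 8, the mismatch dot product from the example
    }
}
```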
How do we implement Mismatch String Kernels using a Trie? It turns out
it's very straightforward. All we need to do is the following:
when we are adding a kmer to the trie we also insert all other kmers with mismatch
less than or equal to what's allowed. The dot product calculation remains the same.
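The insertion step above amounts to generating every kmer within m mismatches of the one being added. One possible sketch of that generation (the `Neighbors` class and `neighbors` helper are illustrative names, not part of the provided KmerTrie):

```java
import java.util.ArrayList;
import java.util.List;

public class Neighbors {
    // All strings over `alphabet` within `m` mismatches of `kmer`.
    // When a kmer is added to the trie with mismatches allowed, each of
    // these variants would be inserted (each incrementing its leaf count).
    static List<String> neighbors(String kmer, int m, char[] alphabet) {
        List<String> out = new ArrayList<>();
        collect(kmer.toCharArray(), 0, m, alphabet, out);
        return out;
    }

    private static void collect(char[] cur, int pos, int left,
                                char[] alphabet, List<String> out) {
        if (pos == cur.length) { out.add(new String(cur)); return; }
        char original = cur[pos];
        collect(cur, pos + 1, left, alphabet, out); // keep this letter unchanged
        if (left > 0) {
            for (char c : alphabet) {
                if (c == original) continue;
                cur[pos] = c; // substitute one letter, spending a mismatch
                collect(cur, pos + 1, left - 1, alphabet, out);
            }
            cur[pos] = original; // restore before backtracking
        }
    }

    public static void main(String[] args) {
        // For alphabet {C,G}, k = 3, m = 1: the kmer itself plus its
        // 3 one-letter substitutions, 4 strings in total
        System.out.println(neighbors("CCG", 1, new char[]{'C', 'G'}));
    }
}
```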
Given N
sequences, we will need to compute the
dot products between every possible pair including the pair of each sequence
with itself, and store them in a matrix denoted by K
(this matrix is often called
the 'Kernel Matrix' in the Machine Learning literature). If the input sequence
set is in a file with one line corresponding to one sequence
as follows:
seq 0
seq 1
seq 2
seq 3
...
seq N-1
then K(i, j)
should be the dot product between
seq i
and seq j
. For example, K(1, 1)
is the dot product between seq 1
and itself.
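As a rough sketch of the kernel-matrix loop under these definitions (the real calcKernelMatrix should use KmerTrie; the brute-force `dot` below is only a stand-in, and `KernelSketch` is a made-up name), one might write:

```java
import java.util.HashMap;
import java.util.Map;

public class KernelSketch {
    // K is symmetric (K(i,j) = K(j,i)), so each pair is computed once
    // and mirrored; the diagonal K(i,i) is included.
    static long[][] kernelMatrix(String[] seqs, int k) {
        int n = seqs.length;
        long[][] K = new long[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                K[i][j] = dot(seqs[i], seqs[j], k);
                K[j][i] = K[i][j];
            }
        }
        return K;
    }

    // Brute-force stand-in for a trie-based dot product.
    static long dot(String a, String b, int k) {
        Map<String, Integer> ca = counts(a, k), cb = counts(b, k);
        long sum = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            Integer other = cb.get(e.getKey());
            if (other != null) sum += (long) e.getValue() * other;
        }
        return sum;
    }

    static Map<String, Integer> counts(String s, int k) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i + k <= s.length(); i++)
            m.merge(s.substring(i, i + k), 1, Integer::sum);
        return m;
    }

    public static void main(String[] args) {
        long[][] K = kernelMatrix(new String[]{"CCCG", "GCCCCCG"}, 2);
        System.out.println(K[0][1]); // 9, the earlier CCCG / GCCCCCG dot product
    }
}
```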
We have provided you with KernelMatrix.java
. You need to
complete the calcKernelMatrix
method (simply by using methods in KmerTrie).
TrieNode, LeafNode, InternalNode, TrieException, and Alphabet
classes (DO NOT CHANGE THESE).
KmerTrie
class (a lot of code and wrappers are given,
you just complete the following methods (using proper helpers if necessary):
- KmerTrie constructor,
- addKmerWithoutMismatch,
- addKmerWithMismatch,
- dotProduct,
- getKmerCount,
- getListofKmers
KernelMatrix
class (again most of the code is given,
just implement calcKernelMatrix
).