"CGCCTC"
and "CGTCTGC"
? One approach is to first convert each sequence (string) to a Euclidean vector
(called feature vector), and then take their dot product as the similarity score. In this
assignment, we will see two ways to do this called "string kernels" and "mismatch string
kernels". Don't be frightened by these terms! We will explain them and you will find that
it's not hard to do the calculations.
AATGCCGACCA
is a DNA sequence with alphabet {A,T,G,C}
). Given a set of N
sequences,
the Kernel Matrix K
is defined as an N*N
matrix where
the (i,j)-th
element, i.e. K(i,j)
, is the dot product between
the numerical feature vectors of sequence i
and sequence j
.
For this, we need the definition of the feature vector of a sequence and the dot product.
As we will see, the naïve way of representing the feature vectors by real vectors is impractical
due to the huge size of the vectors, so we will show you how to use a special Tree structure
(called Trie) to store feature vectors and compute their dot products efficiently.
x=(3,0,5,0,2,0,3,2,0,2)
and
y=(2,4,4,1,0,3,2,5,3,0)
is
3*2 + 0*4 + 5*4 + 0*1 + 2*0 + 0*3 + 3*2 + 2*5 + 0*3 + 2*0 = 42
.
For very large vectors with mostly zero elements, this calculation can be made more
efficient if we somehow store ONLY nonzero elements of vectors and have a way of
identifying the coordinate in the vector that corresponds to each such element. More on this later.
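As a sketch of this idea (the class and method names here are made up for illustration, not part of the assignment's code), we can store each vector as a map from coordinate index to nonzero value and iterate only over the smaller map:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseDot {
    // Dot product of two sparse vectors stored as index -> nonzero-value maps.
    // Iterating over the smaller map keeps the cost proportional to its size,
    // not to the (possibly huge) dimension of the full vectors.
    static long dot(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        Map<Integer, Integer> small = a.size() <= b.size() ? a : b;
        Map<Integer, Integer> large = (small == a) ? b : a;
        long sum = 0;
        for (Map.Entry<Integer, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) sum += (long) e.getValue() * other;
        }
        return sum;
    }

    public static void main(String[] args) {
        // x = (3,0,5,0,2,0,3,2,0,2) and y = (2,4,4,1,0,3,2,5,3,0), stored sparsely
        Map<Integer, Integer> x = new HashMap<>();
        x.put(0, 3); x.put(2, 5); x.put(4, 2); x.put(6, 3); x.put(7, 2); x.put(9, 2);
        Map<Integer, Integer> y = new HashMap<>();
        y.put(0, 2); y.put(1, 4); y.put(2, 4); y.put(3, 1); y.put(5, 3);
        y.put(6, 2); y.put(7, 5); y.put(8, 3);
        System.out.println(dot(x, y)); // 42, matching the example above
    }
}
```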
String kernel features for sequences are defined by counting the occurrences of
all possible contiguous subsequences of length k, each called a k-mer (also written kmer). For example,
if the alphabet has two letters {C, G}
and k = 3
,
all 2^3 = 8 possible kmers are:
[CCC, CCG, CGC, CGG, GCC, GCG, GGC, GGG]
.
We can use these kmers (here we have 8 kmers in total)
as coordinates, and the value at each coordinate is the number
of times that particular kmer occurs in the sequence. For example, for the sequence
"CGGCCGCC"
, the 8-dimensional feature vector of this sequence under
the above coordinate
system is [0, 1, 1, 1, 2, 0, 1, 0]
(because CCC
does not occur in the sequence, CCG
occurs once, etc).
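As a small sketch (the `KmerCount` class and `countKmers` method are hypothetical names, not part of the provided code), counting the kmers that actually occur in a sequence might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KmerCount {
    // Count every length-k substring (kmer) of seq.
    // Only kmers that actually occur are stored, so the map stays small
    // even when the full coordinate space is enormous.
    static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum); // increment, inserting 1 if absent
        }
        return counts;
    }

    public static void main(String[] args) {
        // "CGGCCGCC" with k = 3 has kmers CGG, GGC, GCC, CCG, CGC, GCC,
        // consistent with the feature vector [0, 1, 1, 1, 2, 0, 1, 0] above
        System.out.println(countKmers("CGGCCGCC", 3));
    }
}
```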
In the above simple example, the length of the kmer is 3 and the size of the
alphabet is 2.
Actually,
for real sequences such as protein sequences, the size of the alphabet is 20
(protein sequences are composed of 20 Amino Acids); if we want to consider
kmers with length 7, we will have in total 20^7 possible kmers, hence the feature
vector has 1,280,000,000 elements. If we want to compute the dot product of two
protein sequences with length 20 and 27, respectively, we must count the occurrence
times of all 20^7 kmers and then compute their dot product using the brute-force approach.
This is prohibitively expensive! Note that sequences are often short and almost all the values
in the vectors will be zero (why?). Instead, we count ONLY the occurrences of the kmers
that actually appear in each sequence. But how do we
store just the kmers in a sequence compactly and how do we compute the dot
product between two sequences? Good question! We can create a list of Kmers with their count
in each sequence and traverse both lists to find identical ones and do the multiplication.
While this is much faster than the naïve way, it is still slow considering we need to do it
many times to calculate the kernel matrix. In this assignment we use a Tree data
structure called Trie (see this link for a definition of
Trie; note that for us,
since we store kmers with equal length, all leaves will be at the same depth).
We will build a special Trie for each sequence to store its kmers and their number of occurrences.
Then we will be able to traverse each Trie using all kmers in the other sequence to calculate
their dot product. Here is an example:
If the alphabet is {C, G}, the full Trie schema for all kmers with length k = 2 is:

        root
       /    \
      C      G
     / \    / \
    C   G  C   G

There are 4 leaf nodes and 3 internal nodes in this Trie. Each leaf corresponds to one kmer;
for example, the left-most leaf is the kmer CC. Note that we do not store letters in the
internal nodes; instead, since the alphabet is fixed, every internal node has that many children,
and the position of each child implicitly indicates the letter.
The Trie for any sequence is a sub-tree of this Trie (only containing kmers that
appear in the sequence) where each leaf contains the kmer and its number of occurrences in the sequence.
For example, the Trie for sequence "CCCG" is:

        root
       /    \
      C     null
     / \
    C   G
  CC,2  CG,1

Now if you have another sequence, say "GCCCCCG", its list of nonzero
kmers is {(GC,1), (CC,4), (CG,1)}
. If we traverse the first trie
using each of these kmers and multiply the counts and sum all of these values
we get the dot product: GC
is not in the trie so it has 0 contribution to the calculations;
CC
is in the trie with count 2, multiplying this by 4 we get 8;
CG
has count 1 in the trie, multiplied by 1 we get 1.
So, in total we have 0+8+1=9 as the dot product.
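The build-and-traverse procedure above can be sketched as follows. This is a minimal illustration only: `MiniTrie` is a hypothetical class, not the provided KmerTrie (which uses the TrieNode/LeafNode/InternalNode classes), but the traversal logic is the same idea.

```java
import java.util.HashMap;
import java.util.Map;

public class MiniTrie {
    // Each node maps a letter to a child; since all kmers have equal length,
    // every path of length k ends at a leaf carrying an occurrence count.
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count; // nonzero only at depth-k leaves
    }

    final Node root = new Node();
    final int k;

    MiniTrie(String seq, int k) {
        this.k = k;
        for (int i = 0; i + k <= seq.length(); i++) {
            Node cur = root;
            for (int j = i; j < i + k; j++)
                cur = cur.children.computeIfAbsent(seq.charAt(j), c -> new Node());
            cur.count++; // one more occurrence of this kmer
        }
    }

    // Look up one kmer; a missing branch means the count is 0.
    int countOf(String kmer) {
        Node cur = root;
        for (int j = 0; j < kmer.length(); j++) {
            cur = cur.children.get(kmer.charAt(j));
            if (cur == null) return 0;
        }
        return cur.count;
    }

    // Dot product: walk the other sequence's kmers through this trie,
    // multiplying matched counts and summing.
    long dot(String other) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= other.length(); i++)
            counts.merge(other.substring(i, i + k), 1, Integer::sum);
        long sum = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            sum += (long) countOf(e.getKey()) * e.getValue();
        return sum;
    }

    public static void main(String[] args) {
        MiniTrie trie = new MiniTrie("CCCG", 2);   // leaves: CC -> 2, CG -> 1
        System.out.println(trie.dot("GCCCCCG"));   // 0*1 + 2*4 + 1*1 = 9, as above
    }
}
```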
In the above method, we talked about string kernels by counting all possible
kmers with length k
to calculate feature vectors.
However, the exact counting may not effectively capture similarity.
For example, assume the alphabet is {C, G} and the kmer length k is 3,
and consider the two sequences "CCGC" and "GCCC". Their dot product is 0 using the string
kernel calculation (why?). But these two sequences look very similar!
They have a common "CC"
and both have the letter 'G'
.
How can we modify the string kernel to take into account such similarities?
The answer is simple: allow mismatches when we count the coordinate kmers in the sequence.
We use m
to denote the maximum number of mismatches allowed.
If the number of mismatches of corresponding letters
between a kmer in a sequence and a coordinate kmer is smaller than or equal to
m
, we still say that kmer
hits (matches) that coordinate. For alphabet {C, G}
and k=3
,
the coordinate kmers are [CCC, CCG, CGC, CGG, GCC, GCG, GGC, GGG]
,
if we don't consider mismatches, the feature vectors of "CCGC"
and "GCCC"
are
[0, 1, 1, 0, 0, 0, 0, 0]
and [1, 0, 0, 0, 1, 0, 0, 0]
,
and their dot product is 0. If we consider
mismatches with m = 1
when we compare a kmer in the sequence
to a coordinate kmer, the answer is very
different. The first coordinate kmer "CCC"
has two hits in sequence
"CCGC"
since the number of mismatches between
"CCG"
and "CCC"
is 1 (which is <=m
) and
we have one hit for CCG
, and the number of mismatches between
"CGC"
and "CCC"
is also 1
(again <=m
) and we have just one hit for CGC
.
So the coordinate value for sequence "CCGC"
at coordinate
"CCC" is 2. Following the above procedure, we can calculate the feature vector
of "CCGC"
to be:
[1+1, 1, 1, 1+1, 0, 1, 1, 0] = [2, 1, 1, 2, 0, 1, 1, 0]
.
Likewise, the feature vector of "GCCC"
when considering mismatches
with m=1
is:
[1+1, 1, 1, 0, 1+1, 1, 1, 0]= [2, 1, 1, 0, 2, 1, 1, 0].
Then the dot products between "CCGC"
and "GCCC"
when considering mismatches with m=1
is:
2*2 + 1*1 + 1*1 + 2*0 + 0*2 + 1*1 + 1*1 + 0*0 = 8.
This result reflects our intuition about the similarity better.
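The worked example above can be checked with a brute-force sketch (the `MismatchBrute` class and its methods are hypothetical, not the trie-based assignment code): for each coordinate kmer, count how many kmers of the sequence lie within m mismatches of it.

```java
public class MismatchBrute {
    // Number of positions where two equal-length strings differ (Hamming distance).
    static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    // Feature value of seq at coordinate `coord`: how many kmers of seq
    // are within m mismatches of coord.
    static int hits(String seq, String coord, int m) {
        int k = coord.length(), n = 0;
        for (int i = 0; i + k <= seq.length(); i++)
            if (hamming(seq.substring(i, i + k), coord) <= m) n++;
        return n;
    }

    public static void main(String[] args) {
        String[] coords = {"CCC", "CCG", "CGC", "CGG", "GCC", "GCG", "GGC", "GGG"};
        long dot = 0;
        for (String c : coords)
            dot += (long) hits("CCGC", c, 1) * hits("GCCC", c, 1);
        System.out.println(dot); // 8, the mismatch dot product from the example
    }
}
```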
How do we implement Mismatch String Kernels using a Trie? It turns out
it's very straightforward. All we need to do is the following:
when we are adding a kmer to the trie we also insert all other kmers with mismatch
less than or equal to what's allowed. The dot product calculation remains the same.
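The insertion step above amounts to generating every kmer within m mismatches of the one being added. One possible sketch of that generation (the `Neighbors` class and `neighbors` helper are illustrative names, not part of the provided KmerTrie):

```java
import java.util.ArrayList;
import java.util.List;

public class Neighbors {
    // All strings over `alphabet` within `m` mismatches of `kmer`.
    // When a kmer is added to the trie with mismatches allowed, each of
    // these variants would be inserted (each incrementing its leaf count).
    static List<String> neighbors(String kmer, int m, char[] alphabet) {
        List<String> out = new ArrayList<>();
        collect(kmer.toCharArray(), 0, m, alphabet, out);
        return out;
    }

    private static void collect(char[] cur, int pos, int left,
                                char[] alphabet, List<String> out) {
        if (pos == cur.length) { out.add(new String(cur)); return; }
        char original = cur[pos];
        collect(cur, pos + 1, left, alphabet, out); // keep this letter unchanged
        if (left > 0) {
            for (char c : alphabet) {
                if (c == original) continue;
                cur[pos] = c; // substitute one letter, spending a mismatch
                collect(cur, pos + 1, left - 1, alphabet, out);
            }
            cur[pos] = original; // restore before backtracking
        }
    }

    public static void main(String[] args) {
        // For alphabet {C,G}, k = 3, m = 1: the kmer itself plus its
        // 3 one-letter substitutions, 4 strings in total
        System.out.println(neighbors("CCG", 1, new char[]{'C', 'G'}));
    }
}
```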
Given N
sequences, we will need to compute the
dot products between every possible pair including the pair of each sequence
with itself, and store them in a matrix denoted by K
(this matrix is often called
the 'Kernel Matrix' in the Machine Learning literature). If the input sequence
set is in a file with one line corresponding to one sequence
as follows:
seq 0
seq 1
seq 2
seq 3
...
seq N-1
then K(i, j)
should be the dot product between
seq i
and seq j
. For example, K(1, 1)
is the dot product between seq 1
and itself.
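As a rough sketch of the kernel-matrix loop under these definitions (the real calcKernelMatrix should use KmerTrie; the brute-force `dot` below is only a stand-in, and `KernelSketch` is a made-up name), one might write:

```java
import java.util.HashMap;
import java.util.Map;

public class KernelSketch {
    // K is symmetric (K(i,j) = K(j,i)), so each pair is computed once
    // and mirrored; the diagonal K(i,i) is included.
    static long[][] kernelMatrix(String[] seqs, int k) {
        int n = seqs.length;
        long[][] K = new long[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                K[i][j] = dot(seqs[i], seqs[j], k);
                K[j][i] = K[i][j];
            }
        }
        return K;
    }

    // Brute-force stand-in for a trie-based dot product.
    static long dot(String a, String b, int k) {
        Map<String, Integer> ca = counts(a, k), cb = counts(b, k);
        long sum = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            Integer other = cb.get(e.getKey());
            if (other != null) sum += (long) e.getValue() * other;
        }
        return sum;
    }

    static Map<String, Integer> counts(String s, int k) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i + k <= s.length(); i++)
            m.merge(s.substring(i, i + k), 1, Integer::sum);
        return m;
    }

    public static void main(String[] args) {
        long[][] K = kernelMatrix(new String[]{"CCCG", "GCCCCCG"}, 2);
        System.out.println(K[0][1]); // 9, the earlier CCCG / GCCCCCG dot product
    }
}
```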
We have provided you with KernelMatrix.java
. You need to
complete the calcKernelMatrix
method (simply by using methods in KmerTrie).
TrieNode, LeafNode, InternalNode, TrieException, and Alphabet
classes (DO NOT CHANGE THESE).
KmerTrie
class (a lot of code and wrappers are given,
you just complete the following methods (using proper helpers if necessary):
- KmerTrie constructor,
- addKmerWithoutMismatch,
- addKmerWithMismatch,
- dotProduct,
- getKmerCount,
- getListofKmers
KernelMatrix
class (again most of the code is given,
just implement calcKernelMatrix
).