next up previous
Next: November 27 Up: November lecture summary Previous: Optimal binary search trees


November 22

Current genetics and biochemistry have given prominence to the problem of quickly identifying patterns in sequences of amino acids and other biological objects. Clearly you'd want algorithms for rapidly matching strings and substrings, but other subtler patterns are important. Consider the problem of sequences, subsequences, and supersequences (this example if from Algorithms by Cormen, Leiserson, and Rivest).

We can consider strings to be sequences, for example "ACCTGAA" can be thought of as a sequence of characters with the first character 'A', the second character 'C', the fifth character 'G', and so on (I'm using a 1-origin instead of 0-origin for a reason that will become clear in a moment).

Suppose string X = "ACCTGAA" and string Y = "ATA". Then we say that Y is a subsequence of X (or X is a supersequence of Y), since we can produce Y by deleting characters 2,3, 5, and 7 from X (there's another deletion that also works). Clearly, by this rule, the empty sequence "" is a subsequence of any other sequence.

Given two sequences, $ X$ = $ \langle x_1, x_2, \ldots, x_m\rangle$, and $ Y$ = $ \langle y_1, y_2, \ldots, y_n\rangle$, what is a longest common subsequence of $ X$ and $ Y$? Brute force would mean we would list every subsequence of $ X$ -- each element is either in or out of each subsequence, leading to $ 2^m$ possibilities -- and then check whether each is a subsequence of $ Y$. Exponential algorithms get old really fast.

However, the problem has an optimal-substructure property. Define $ X_i$, the $ i$th prefix of $ X$, to be the sequence $ \langle x_1,
\ldots, x_i$ (the special case $ X_0$ is the empty sequence -- that's the reason for the 1-origin). Now we can make the following observation about any Longest Common Subsequence (LCS) $ Z$ = $ \langle
z_1, \ldots, z_k\rangle$ of $ X$ = $ \langle x_1, \ldots, x_m\rangle$ and $ Y$ = $ \langle y_1, \ldots, y_n\rangle$. If $ x_m = y_n$, then $ z_k = x_m = y_n$ and $ Z_{k-1}$ is an LCS of $ X_{m-1}$ and $ Y_{n-1}$. If $ x_m \not= y_n$ and $ z_k \not= x_m$, then $ Z$ is an LCS of $ X_{m-1}$ and $ Y_n$. Symmetrically, if $ x_m \not= y_n$ and $ z_k \not=
y_n$, then $ Z$ is an LCS of $ X_m$ and $ Y_{m-1}$.

Define a two-dimensional array $ c[i][j]$ == the length of an LCS of $ X_i$ and $ Y_j$. You have the following recursive formula:

$\displaystyle c[i][j] =
\begin{cases}
0 & i == 0 \text{ OR } j == 0 \\
c[i-1]...
...x(c[i][j-1], c[i-1][j] & i, j > 0 \text{ AND } x_i \not = y_j
\\
\end{cases}.
$

The recursive definition immediately gives us the row $ c[0][j]$, and column $ c[i][0]$, for $ 0\leq i \leq m$ and $ 0\leq j \leq n$. We then proceed from top-to-bottom and left-to-right to fill in $ c[i][j]$. You can write (exercise) a doubly-nested loop that iterates over $ i$ and $ j$ and finds, simultaneously, the length of an LCS of $ X_i$ and $ Y_j$ and the last character of such an LCS. This has complexity $ O(mn)$ -- much nicer than exponential.

You can also pose (and solve) the problem of a shortest common supersequence of $ X$ and $ Y$, in an analogous way.

A weirder (to me) result comes from asking: in how many distinct ways is $ Y$ a subsequence of $ X$. In other words, how many distinct deletions of letters from $ X$ yield $ Y$. In our previous example, X = "ACCTGAA", and Y = "ATA", we'd find that $ Y$ occurs as a subsequence of $ Y$ twice. The empty sequence occurs exactly once as a subsequence of any other sequence.

Counting these subsequence is clearly not possible. Suppose $ X$ is a sequence of 64 'A's and $ Y$ is a sequence of 32 'A's. Then $ Y$ occurs as a subsequence of $ X$ in as many ways as you can choose 32 'A's to delete: 64-choose-32 -- greater than $ 2^32$.

To make this concrete, suppose you apply to legally change your name to a sequence of characters taken from the set $ \{A, C, T,G\}$. You quietly obtain DNA samples from friends and acquaintances. Now you count the number of times your new name (let's call it TACCAG) occurs in other people's DNA, and you bill them a dollar for each use of your intellectual property. You might be fabulously rich, but you're faced with the prospect of an exponential algorithm to count the number of licenses to charge for.

But, optimal substructure comes to the rescue. If we denote your new name as $ X$, your friend's DNA sequence as $ Y$, and $ C[i][j]$ as the number of times $ X_i$ occurs as a subsequence of $ Y_j$, we get the following cases

  1. If $ i==0$, then $ C[i][j]$ is 1, since the empty sequence occurs exactly once as a subsequence of any sequence.
  2. If $ i>j$, then $ C[i][j]$ is 0, since a sequence can't be a subsequence of a shorter sequence.
  3. If $ j\geq i > 0$ AND $ X_i \not= Y_j$, then $ C[i][j]$ is the same as $ C[i][j-1]$.
  4. If $ j\geq i > 0$ AND $ X_i = Y_j$, then $ C[i][j]$ = $ C[i][j-1]$ + $ C[i-1][j-1]$, since you count the number times $ X_i$ occurs as a subsequence of $ Y_{j-1}$ (without using $ y_j$), plus the number of ways $ X_{i-1}$ occurs as a subsequence of $ Y_{j-1}$, and then match $ x_i$ with $ y_j$.
This also leads to a $ O(nm)$ algorithm. Start sending those bills today!


next up previous
Next: November 27 Up: November lecture summary Previous: Optimal binary search trees
Danny Heap 2002-12-16