November 22

Next: November 27 Up: November lecture summary Previous: Optimal binary search trees

November 22

Current genetics and biochemistry have given prominence to the problem of quickly identifying patterns in sequences of amino acids and other biological objects. Clearly you'd want algorithms for rapidly matching strings and substrings, but other subtler patterns are important. Consider the problem of sequences, subsequences, and supersequences (this example if from Algorithms by Cormen, Leiserson, and Rivest).

We can consider strings to be sequences, for example "ACCTGAA" can be thought of as a sequence of characters with the first character 'A', the second character 'C', the fifth character 'G', and so on (I'm using a 1-origin instead of 0-origin for a reason that will become clear in a moment).

Suppose string X = "ACCTGAA" and string Y = "ATA". Then we say that Y is a subsequence of X (or X is a supersequence of Y), since we can produce Y by deleting characters 2,3, 5, and 7 from X (there's another deletion that also works). Clearly, by this rule, the empty sequence "" is a subsequence of any other sequence.

Given two sequences, = $\langle x_1, x_2, \ldots, x_m\rangle$ , and = $\langle y_1, y_2, \ldots, y_n\rangle$ , what is a longest common subsequence of and ? Brute force would mean we would list every subsequence of -- each element is either in or out of each subsequence, leading to possibilities -- and then check whether each is a subsequence of . Exponential algorithms get old really fast.

However, the problem has an optimal-substructure property. Define , the th prefix of , to be the sequence $\langle x_1, \ldots, x_i$ (the special case is the empty sequence -- that's the reason for the 1-origin). Now we can make the following observation about any Longest Common Subsequence (LCS) = $\langle z_1, \ldots, z_k\rangle$ of = $\langle x_1, \ldots, x_m\rangle$ and = $\langle y_1, \ldots, y_n\rangle$ . If , then and $Z_{k-1}$ is an LCS of $X_{m-1}$ and $Y_{n-1}$ . If $x_m \not= y_n$ and $z_k \not= x_m$ , then is an LCS of $X_{m-1}$ and . Symmetrically, if $x_m \not= y_n$ and $z_k \not= y_n$ , then is an LCS of and $Y_{m-1}$ .

Define a two-dimensional array == the length of an LCS of and . You have the following recursive formula:

$\displaystyle c[i][j] = \begin{cases} 0 & i == 0 \text{ OR } j == 0 \\ c[i-1]... ...x(c[i][j-1], c[i-1][j] & i, j > 0 \text{ AND } x_i \not = y_j \\ \end{cases}.$

The recursive definition immediately gives us the row

, and column

, for $0\leq i \leq m$ and $0\leq j \leq n$ . We then proceed from top-to-bottom and left-to-right to fill in

. You can write (exercise) a doubly-nested loop that iterates over

and

and finds, simultaneously, the length of an LCS of

and

and the last character of such an LCS. This has complexity

-- much nicer than exponential.

You can also pose (and solve) the problem of a shortest common supersequence of and , in an analogous way.

A weirder (to me) result comes from asking: in how many distinct ways is a subsequence of . In other words, how many distinct deletions of letters from yield . In our previous example, X = "ACCTGAA", and Y = "ATA", we'd find that occurs as a subsequence of twice. The empty sequence occurs exactly once as a subsequence of any other sequence.

Counting these subsequence is clearly not possible. Suppose is a sequence of 64 'A's and is a sequence of 32 'A's. Then occurs as a subsequence of in as many ways as you can choose 32 'A's to delete: 64-choose-32 -- greater than .

To make this concrete, suppose you apply to legally change your name to a sequence of characters taken from the set $\{A, C, T,G\}$ . You quietly obtain DNA samples from friends and acquaintances. Now you count the number of times your new name (let's call it TACCAG) occurs in other people's DNA, and you bill them a dollar for each use of your intellectual property. You might be fabulously rich, but you're faced with the prospect of an exponential algorithm to count the number of licenses to charge for.

But, optimal substructure comes to the rescue. If we denote your new name as , your friend's DNA sequence as , and as the number of times occurs as a subsequence of , we get the following cases

If , then is 1, since the empty sequence occurs exactly once as a subsequence of any sequence.
If , then is 0, since a sequence can't be a subsequence of a shorter sequence.
If $j\geq i > 0$ AND $X_i \not= Y_j$ , then is the same as .
If $j\geq i > 0$ AND , then = + , since you count the number times occurs as a subsequence of $Y_{j-1}$ (without using ), plus the number of ways $X_{i-1}$ occurs as a subsequence of $Y_{j-1}$ , and then match with .

This also leads to a

algorithm. Start sending those bills today!

Next: November 27 Up: November lecture summary Previous: Optimal binary search trees

Danny Heap 2002-12-16