Current genetics and biochemistry have given prominence to the problem of quickly identifying patterns in sequences of amino acids and other biological objects. Clearly you'd want algorithms for rapidly matching strings and substrings, but other subtler patterns are important. Consider the problem of sequences, subsequences, and supersequences (this example if from Algorithms by Cormen, Leiserson, and Rivest).
We can consider strings to be sequences, for example "ACCTGAA" can be thought of as a sequence of characters with the first character 'A', the second character 'C', the fifth character 'G', and so on (I'm using a 1-origin instead of 0-origin for a reason that will become clear in a moment).
Suppose string X = "ACCTGAA" and string Y = "ATA". Then we say that Y is a subsequence of X (or X is a supersequence of Y), since we can produce Y by deleting characters 2,3, 5, and 7 from X (there's another deletion that also works). Clearly, by this rule, the empty sequence "" is a subsequence of any other sequence.
Given two sequences, =
, and
=
, what is a longest
common subsequence of
and
? Brute force would mean we would
list every subsequence of
-- each element is either in or out of
each subsequence, leading to
possibilities -- and then check
whether each is a subsequence of
. Exponential algorithms get old
really fast.
However, the problem has an optimal-substructure property. Define
, the
th prefix of
, to be the sequence
(the special case
is the empty sequence -- that's
the reason for the 1-origin). Now we can make the following
observation about any Longest Common Subsequence (LCS)
=
of
=
and
=
. If
, then
and
is an LCS of
and
.
If
and
, then
is an LCS of
and
. Symmetrically, if
and
, then
is an LCS of
and
.
Define a two-dimensional array == the length of an LCS of
and
. You have the following recursive formula:
You can also pose (and solve) the problem of a shortest common
supersequence of and
, in an analogous way.
A weirder (to me) result comes from asking: in how many distinct ways
is a subsequence of
. In other words, how many distinct
deletions of letters from
yield
. In our previous example,
X = "ACCTGAA", and Y = "ATA", we'd find that
occurs as a subsequence of
twice. The empty sequence occurs
exactly once as a subsequence of any other sequence.
Counting these subsequence is clearly not possible. Suppose is a
sequence of 64 'A's and
is a sequence of 32 'A's. Then
occurs
as a subsequence of
in as many ways as you can choose 32 'A's to
delete: 64-choose-32 -- greater than
.
To make this concrete, suppose you apply to legally change your name
to a sequence of characters taken from the set
. You
quietly obtain DNA samples from friends and acquaintances. Now you
count the number of times your new name (let's call it
TACCAG) occurs in other people's DNA, and you bill them a
dollar for each use of your intellectual property. You might be
fabulously rich, but you're faced with the prospect of an exponential
algorithm to count the number of licenses to charge for.
But, optimal substructure comes to the rescue. If we denote your new
name as , your friend's DNA sequence as
, and
as the
number of times
occurs as a subsequence of
, we get the
following cases