The following is known as the "substring
searching" problem. You are given two strings, S and S', which are
composed of characters from an alphabet,
and you have to find the index of the first occurrence
of S in S'.
For example: find the index of the first occurrence of fun in programmingisafunactivity.
In this case, the answer is 14.
The substring (S) we are searching for is known as the search string, and the string we are searching (S') is known as the text string. In the above example, the search string is fun and the text string is programmingisafunactivity.
Substring Searching with a DFA
There are a number of ways to solve the substring
searching problem. The naive approach (detailed in section 7.6.2 of
your text) involves checking for the search string starting at each
character in the text string until the search string is found or the
end of the text string is reached. This method, however, is extremely
inefficient. Another approach involves using a special type of graph,
known as a
Deterministic Finite Automaton (DFA). This approach is much more
efficient, as it solves the problem in linear time with respect to the
size of the text string.
(This doesn't include the cost of first building the DFA, however.)
A DFA represents the search string pattern as a graph where each vertex in the graph is a state, which identifies how much of the search string pattern has been recognized, and each edge in the graph identifies a transition for one or more characters in the alphabet.
For an example of a DFA, see Figure 7.40 on page 337 of your text. Keep in mind that this figure illustrates the DFA for the search string ACATA and the alphabet being used is composed of the letters A, C, G and T. In this example DFA, vertex 3 and edge C transitions to vertex 2.
Note that every vertex (with the exception of the terminating vertex) must have a transition edge for each letter in the alphabet.
In order to find a search string in a text string
using a
DFA, initialize the DFA by setting its current state to the start state
(state 0). After the DFA is initialized, scan through the text
string character by character, updating the state of the DFA after each
character is read by following the appropriate transition edge. Each
state of the DFA represents the portion of the search string that has
been matched so far. Initially you start in state 0, meaning that no
part of the search string has been matched yet. After you encounter the
first character of the search string in the text string, you enter
state 1, which means the first character has been matched. If the next
character in the text string is the same as the next character in the
search string, you move to state 2, indicating that you've matched the
first two characters, and so forth. In general, if you are in state i,
this means that you will have matched the first i characters in the
search string. If you enter the DFA's final state before reaching the
end of the text string, then the search string is a substring of the
text string. Otherwise, the search string is not a substring of the text string.
Representing DFAs
In lecture we discussed two different ways for representing graphs:
adjacency matrices and adjacency lists. In an adjacency matrix, each
row and column of the matrix represents a node in the graph. An entry
at position (i,j) represents an edge from node i to node j. If entry
(i,j) is 1, then this means that there's an edge from i to j in the
graph, and if it's 0 then there is no edge from i to j. In an adjacency
list, each node maintains a list of nodes to which it is adjacent. This
representation is more efficient than an adjacency matrix, since space
is used only for the edges that are actually present in a graph.
One way that we could represent the DFA is using a
standard adjacency matrix. The characters used for transitioning from
node i to node j are stored as a set in entry (i,j); if there's no edge
(and hence no way to transition) from node i to node j, then the set is
empty. This representation leads to some efficiency issues in the way
that we use the DFA, however. Given a state in the DFA, and a
transition character, how do we find the next state after a transition?
Using the adjacency matrix just described, we would have to search
through every entry in row i of the matrix until we found an entry that
contained the transition character. This isn't particularly
efficient. Trying to represent a DFA using a standard adjacency list
approach also has problems.
You only need as many columns in your transition matrix as there are characters in your alphabet. For example, given an alphabet { A, B, C, D } and a graph with 3 vertices, the transition matrix would look like this (excluding the matrix entries that represent the vertex you transition to):
|
A |
B |
C |
D |
0 |
|
|
|
|
1 |
|
|
|
|
2 |
|
|
|
|
Constructing the DFA
n =
length(search_string)
construct a transition matrix so that it has n+1 rows and a column for
each character in the alphabet
initialize each entry in the transition matrix to be 0
mark vertex 0 as the start state and vertex n as the final state
for i in range(n+1):
for each c in the alphabet:
k = min (n, i+1)
while
search_string[:k] is not a suffix of search_string[:i]+c:
k = k - 1
transition_matrix[i][c] = k
The DFA class
Your job is to implement substring searching using
the DFA approach. The starter code is in DFA.py. This module contains a
class DFA, which defines the following methods (that you have to
implement):
In this part, you will implement code to find the
shortest path between two vertices in an unweighted graph. Graphs can
be directed or undirected. Furthermore, edges in the graph are labelled
with characters. Consider the example graph below:
By labelling the edges in the graph with
characters, we can associate a word with any path between two vertices.
More precisely, given a path from v1 to v2, the path's associated word
is composed of the letters that label the edges on the path from v1 to
v2.
Your job will be to find the word spelled out by the characters
labelling edges on a shortest path between two vertices. For example,
in the graph above, the shortest path from vertex 2 to 1 goes through
vertex 5 and spells out the word IE. If there is more than one shortest
path between vertices, you have to find the path whose associated word
is lexicographically the smallest. For example, the path from vertex 0
to vertex 2 can go either through vertex 1 or vertex 3 (i.e., the paths
0,1,2 and 0,3,2 are both shortest paths from vertex 0 to vertex 2).
However, the path through vertex 1 spells out AD, which is
lexicographically smaller than the word spelled out by the path through
vertex 3, KJ. (For our purposes, a word w_1 is 'lexicographically
smaller' than w_2 if w_1 would appear in the dictionary before w_2.)
Graphs are represented using adjacency lists. The implementation is
provided for you in the Graph class in graph.py. It is a very minor
variant of the adjacency list implementation we saw in lecture. In each
vertex v's adjacency list, instead of simply storing the adjacent
vertices, we store 2-element tuples, where the first element is an
adjacent vertex v2, and the second element is the character labelling
the edge from v to v2. For example, in the adjacency list of
vertex 0 in the above graph, we store [(1,'A'), (3,'K')].
Your code will go in the file pathfinder.py. You
have to implement the following method, which is defined in the starter
code:
Hints/Tips/Assumptions
Just a friendly reminder about proper BBS usage. We encourage you to
post questions on the bulletin board if you are having problems, but
please do not post questions (or answers) that reveal any part of your
solution (whether or not you think it's correct). This applies to all
parts of the assignment, no matter how small (e.g., testing if a string
is a suffix of another).
Also, to ensure that questions are properly answered by the TA in
advance of the deadline and that no last minute confusion arises,
please post your questions by no later than July 31st, 4pm.
Summary of Deliverables
The following is a list of files you need to submit and the methods that are defined in the starter code. Most of these methods still need to be implemented by you. Be careful to not change any method definitions or attribute names in the starter code. Also be careful to obey the instructions with regards to what you should be returning from these functions, and what kinds of arguments the functions expect. This is important for automarking. You are free to add any methods/functions/classes (or any other files) that you wish.
DFA.py - YOU MUST
SUBMIT THIS FILE
pathfinder.py -
YOU MUST SUBMIT THIS FILE
testDFA.py - YOU
MUST SUBMIT THIS FILE
Be sure to start this assignment early!