===========================================================================
CSC 236              Lecture Summary for Week 10                  Fall 2007
===========================================================================

-------------------
Regular expressions
-------------------

Regular expressions describe sets of strings using a small number of basic
operators. The set of regular expressions (regexps) over alphabet S is
defined as follows (with the usual convention that {} and e are not symbols
of S):
  - {} (empty set symbol) and e (empty string symbol) are regexps;
  - a is a regexp for every symbol a in S;
  - if R and S are regexps (here S names a regexp, not the alphabet), then
    so are:
        R+S  (union)          -- lowest precedence,
        RS   (concatenation),
        R*   (star)           -- highest precedence.

For each regexp R, define the language described by R (L(R)) as follows:
  - L({}) = {}
  - L(e) = {e}
  - L(a) = {a} for every symbol a in S
  - L(R+S) = L(R) U L(S)
  - L(RS) = L(R).L(S)
  - L(R*) = L(R)*

Examples:
  - L(a+b) = {a,b}
  - L(ab) = {ab}
  - L((a+b)a) = {aa,ba} = L(aa+ba)
  - L(a*) = {e,a,aa,aaa,...}
    (zero or more repetitions of "a")
  - L(aa*) = {a,aa,aaa,...} = L(a*a)
    (one or more repetitions of "a")
  - L((ab)*) = {e,ab,abab,ababab,...}
    (zero or more repetitions of "ab")
  - L((a+b)*) = {e,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}
    (zero or more repetitions of a's or b's, i.e., every string of a's
    and b's)
  - L(a*+b*) = {e,a,b,aa,bb,aaa,bbb,aaaa,bbbb,...}
    (every string consisting entirely of a's or entirely of b's)
  - L((a+b)(a+b)*) = {a,b,aa,ab,ba,bb,...}
    (every nonempty string of a's and b's)
  - L(a(ba+c)*) = {a,aba,ac,ababa,abac,acba,acc,...}
  - All strings of a's and b's that have the same first and last symbol:
        e + a + b + a(a+b)*a + b(a+b)*b
  - L = {all strings of a's and b's that contain at least one a}:
        (a+b)*a(a+b)*  or  b*a(a+b)*  or  (a+b)*ab*

Regexps R and S are equivalent (denoted R == S) iff they represent the same
language (L(R) = L(S)), e.g., b*a(a+b)* == (a+b)*ab*.
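
The equivalence b*a(a+b)* == (a+b)*ab* can be spot-checked by brute force.
The following is a minimal sketch, not part of the course material: it
assumes Python's re module as a stand-in for the notation of these notes
(union "+" is written "|" there), and the helper names to_python and
language_up_to are hypothetical. It does not handle the symbols e or {},
and it only compares strings up to a length bound, so agreement is evidence
of equivalence rather than a proof.

```python
import re
from itertools import product


def to_python(regexp):
    """Translate the notes' union operator '+' into Python's '|'.

    Assumption: the regexp uses only symbols, '+', '*', and parentheses
    (no 'e' or '{}' symbols), so a plain textual replacement is enough.
    """
    return regexp.replace("+", "|")


def language_up_to(regexp, alphabet, max_len):
    """Return all strings of length <= max_len over alphabet in L(regexp)."""
    pattern = re.compile(to_python(regexp))
    found = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            # fullmatch requires the whole string to match, as L(R) does.
            if pattern.fullmatch(s):
                found.add(s)
    return found


left = language_up_to("b*a(a+b)*", "ab", 8)
right = language_up_to("(a+b)*ab*", "ab", 8)
# Every string over {a,b} of length <= 8 that contains at least one a.
with_an_a = {
    "".join(chars)
    for n in range(9)
    for chars in product("ab", repeat=n)
    if "a" in chars
}

print(left == right)      # True
print(left == with_an_a)  # True
print(sorted(language_up_to("(a+b)a", "ab", 3)))  # ['aa', 'ba']
```

Python's precedence for "|", concatenation, and "*" matches the precedence
given above for union, concatenation, and star, which is why the textual
translation is sufficient for these examples.
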
General regexp equivalences:
  - R+S == S+R                 [commutativity]
  - (R+S)+T == R+(S+T)         [associativity]
  - (RS)T == R(ST)             [associativity]
  - R(S+T) == RS+RT            [left distributivity]
  - (S+T)R == SR+TR            [right distributivity]
  - R+{} == R                  [identity]
  - eR == R == Re              [identity]
  - {}R == {} == R{}           [annihilator]
  - R** == R*                  [idempotence]

Example: We prove that
    L(b*a(a+b)*) = L = {all strings of a's and b's that contain at least
                        one a},
by showing double inclusion (standard technique for proving set equality).

  . L(b*a(a+b)*) subset of L:
    Let s be an arbitrary string in L(b*a(a+b)*). This means that s=t.u.v
    for some strings t in L(b*), u in L(a), and v in L((a+b)*). Since there
    is only one string "a" in L(a), u=a so s=t.a.v and s is a string that
    contains at least one "a", so s in L.

  . L subset of L(b*a(a+b)*):
    Let s be an arbitrary string in L. This means that s contains at least
    one "a", so it contains a first occurrence of a and can be broken up
    into three substrings: s=r.a.t, where r is some string that contains no
    a (maybe empty), a is the first occurrence of a in s, and t is some
    string of a's and b's. But then, r is in L(b*), a is in L(a), and t is
    in L((a+b)*) so by definition, s=r.a.t is in L(b*a(a+b)*).

---------------------
Finite state automata
---------------------

  - Simple models of computing devices used to analyze strings. A FSA has a
    fixed, finite set of "states", one of which is the "initial state" and
    some of which are "accepting" (or "final") states, as well as
    "transitions" from one state to another for each possible symbol of a
    string. The FSA starts in its initial state and processes a string one
    symbol at a time from left to right: for each symbol processed, the FSA
    switches states based on the latest input symbol and its current state,
    as specified by the transitions. Once the entire string has been
    processed, the FSA will either be in an "accepting" state (in which
    case the string is "accepted") or not (in which case the string is
    "rejected").
  - Example: Simplified control mechanism for a vending machine that
    accepts only nickels (5c), dimes (10c) and quarters (25c), where
    everything costs exactly 30c and no change is ever given.
    Alphabet S = {n,d,q} (for "nickel", "dime", "quarter"), set of states =
    {0,5,10,15,20,25,30} (for amount of money put in so far; no need to
    keep track of excess since no change will be provided), and transitions
    are defined by the following table (state across the top, input symbol
    down the side), with the initial state being "0" and the only accepting
    state being "30":

           |  0   5  10  15  20  25  30
        ---+----------------------------
         n |  5  10  15  20  25  30  30
         d | 10  15  20  25  30  30  30
         q | 25  30  30  30  30  30  30

    (How to read this table: current state specifies column, current input
    symbol specifies row, entry at that row and column is next state; e.g.,
    if current state is 15 and input symbol is d, next state is entry at
    row "d" and column "15", which is 25.)

    Computation of FSA on input such as "dndd" proceeds as follows:
        state 0 -> process 'd' -> state 10 -> process 'n' -> state 15
                -> process 'd' -> state 25 -> process 'd' -> state 30.
    Since the last state is accepting, the string "dndd" is accepted.

  - Formal definition: A FSA is a quintuple (Q,S,q_0,F,d) where
        Q is a finite set of states
        S is a finite alphabet (Q intersect S is empty)
        q_0 in Q is the initial state
        F subset of Q is the set of accepting ("final") states
        d : Q x S -> Q is a transition function (for each q in Q, a in S,
            d(q,a) is the next state of the FSA when processing symbol a
            from state q)

  - Transition function gives new state for each state and single input
    symbol. Extended transition function d*(q,w) gives new state for FSA
    after processing string w starting from state q. It can be defined
    recursively, as follows:

                  { q                if w = e (empty),
        d*(q,w) = {
                  { d(d*(q,w'),a)    if w = w'a for some w' in S* and
                                     a in S.
    Example (from before):
        d*(5,ndn) = d(d*(5,nd),n)
                  = d(d(d*(5,n),d),n)
                  = d(d(d(d*(5,e),n),d),n)
                  = d(d(d(5,n),d),n)
                  = d(d(10,d),n)
                  = d(20,n)
                  = 25

  - A string w is "accepted" by a FSA A iff d*(q_0,w) in F; otherwise, w is
    "rejected". The language accepted by a FSA A is defined as
        L(A) = { w in S* : A accepts w (i.e., d*(q_0,w) in F) }.

  - Example: Come up with FSA that accepts
        L = { w in {a,b}* : w contains an even number of a's }.
    Use states that represent information about string processed so far. In
    this case, only need to remember if number of a's seen so far is even
    or odd, so only need two states "even" and "odd". Initial state should
    be "even" (since before reading any symbol, number of a's processed so
    far = 0 is even) and set of accepting states is simply {"even"}.

    To represent transition function, transition diagrams are a useful
    notation. Each state represented by a node (labelled with state),
    transitions represented by directed edges labelled with input symbol
    (i.e., d(q,a) = q' represented by edge from q to q' labelled with a).
    Initial state has "dangling" in-edge, accepting states have double
    circles for nodes (in ASCII picture below, accepting states will be
    surrounded with parentheses).

             b                      b
           +---+        a         +---+
           |   v     ------->     |   v
       --> (even)              odd
                     <-------
                        a
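
The formal definition, the extended transition function, and both example
automata above can be sketched in Python. This is a minimal illustration,
not part of the course material: representing d as a dictionary keyed by
(state, symbol) pairs is an assumption of the sketch, and the names
delta_star, accepts, COIN_VALUE, vending_delta, and parity_delta are
hypothetical.

```python
def delta_star(delta, q, w):
    """Extended transition function d*(q, w).

    Mirrors the recursive definition:
        d*(q, e)   = q
        d*(q, w'a) = d(d*(q, w'), a)
    delta is a dict mapping (state, symbol) -> next state.
    """
    if w == "":
        return q
    # Split w into w' (all but the last symbol) and a (the last symbol).
    return delta[(delta_star(delta, q, w[:-1]), w[-1])]


def accepts(delta, q0, final, w):
    """A string w is accepted iff d*(q0, w) is an accepting state."""
    return delta_star(delta, q0, w) in final


# Vending machine: states are the amount inserted so far, capped at 30.
# Computing min(q + value, 30) reproduces every entry of the table above.
COIN_VALUE = {"n": 5, "d": 10, "q": 25}
vending_delta = {
    (q, coin): min(q + value, 30)
    for q in range(0, 35, 5)
    for coin, value in COIN_VALUE.items()
}

# Even number of a's: reading a flips the state, reading b keeps it.
parity_delta = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}

print(delta_star(vending_delta, 0, "dndd"))              # 30
print(delta_star(vending_delta, 5, "ndn"))               # 25
print(accepts(vending_delta, 0, {30}, "dndd"))           # True
print(accepts(parity_delta, "even", {"even"}, "abab"))   # True
print(accepts(parity_delta, "even", {"even"}, "ab"))     # False
```

The recursion peels symbols off the right end of w, exactly as in the
definition; an iterative loop over w from left to right would compute the
same state and avoids deep recursion on long inputs.
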