===========================================================================
CSC 236                Lecture Summary for Week 12                Fall 2007
===========================================================================

(Proof that RE equivalent to NFA equivalent to DFA, continued.)

RE -> NFA:

    Idea: start with FSA "-> q_0 --R-->(q)" then break up R into component
    pieces, adding states and transitions as necessary, until each
    transition is labelled by a single symbol.

    Will need two results about RE's:
     1. For each RE R, either R == {} or R == R' for some R' that does not
        contain symbol {}.
     2. For each RE R, either R == R' or R == R' + e for some R' that does
        not contain symbol e.
    Both proved by induction on number of operators in R (exercise).

    Let R be any RE.
      - If R == {}, A = "-> q_0" is equivalent.
      - If R == e, A = "->(q_0)" is equivalent.
      - Else, let R == R' or R == R' + e where R' does not contain symbol
        {} or symbol e.
      - If R == R', start with A = "-> q_0 -R'->(q)";
        if R == R' + e, start with A = "->(q_0)-R'->(q)".
      - Repeat until each transition is labelled with one symbol:
          . transitions q -S+T-> q' become q -S,T-> q' (with convention
            this represents two transitions, one labelled S the other T)
            -- for loops, q' = q;
          . transitions q -ST-> q' become q -S-> n -T-> q', where n is new
            non-accepting state (for loops, q' = q);
          . loops q -S*-> q become q -S-> q;
          . transitions q -S*-> q' (for q' != q) become

                     ------      -------     -T_1->
                    /           /           /
               --> q -S-> n S            q'   ...
                    \           \           \
                     ------      -------     -T_k->

            where T_1,...,T_k are all transitions out of q' (including any
            loops), n is new state, q, n are accepting if q' is accepting,
            and q' is removed if it has no more incoming transitions.
            (That is: q -S-> n, n has a loop labelled S, and each of q and
            n gets its own copy of every transition T_1,...,T_k.)
            If q -S*-> q' was only transition out of q (including loops),
            this can be simplified to

                     -------     -T_1->
                    /           /
               --> q S       q'   ...
                    \           \
                     -------     -T_k->

      - Example: R = (0+1)* (00+11) (0+1)*.
        Does not contain {} or e, so start with:

                     (0+1)* (00+11) (0+1)*
            --> q_0 ---------------------->(q_1)

        Deal with top-level operators (concatenation) to get:

                     (0+1)*      00+11        (0+1)*
            --> q_0 -------> q_2 ------> q_3 ------->(q_1)

        No transition out of q_1 so deal with last transition next:

                     (0+1)*      00+11
            --> q_0 -------> q_2 ------>(q_3) 0+1      (q_1)

        Remove q_1 (no incoming transition) and deal with first transition:

                                 00+11
            --> q_0 0+1      q_2 ------>(q_3) 0+1
                   \___________________/|
                           00+11

        Remove q_2 (no incoming transition), replace '+' with multiple
        transitions:

                     00
               0,1  ---->
            --> q_0       (q_3) 0,1
                    ---->
                     11

        Deal with both concatenations:

                     0       0
               0,1  ---> q_1 --->
            --> q_0               (q_3) 0,1
                    ---> q_2 --->
                     1       1

      - More detailed example (showing more features of dealing with *):
        R = 1 (00 + (1+00)*) (1+00) 1

                    1 (00 + (1 + 00)*) (1 + 00) 1
            --> q_0 --------------------------->(q_1)

        Deal with concatenations first:

                     1      (00 + (1+00)*)      (1+00)       1
            --> q_0 ---> q_2 ----------------> q_3 --------> q_4 --->(q_1)

        Deal with top-level unions, breaking up concatenations at same time:

                     1         (1 + 00)*          1          1
            --> q_0 ---> q_2 ------------> q_3 -------> q_4 --->(q_1)
                            \_        _/|     \       /|
                              0\|    /0       0\|   /0
                                 q_5             q_6

        Now, apply construction to (1+00)*:

                              ___________1___________
                             /                       \
                     1      / 0        0          1   \|     1
            --> q_0 ---> q_2 ---> q_5 ---> q_3 -------> q_4 --->(q_1)
                          | \________________ \0      /|  |\
                      1+00|/         0       \|\|    /0     |
                         q_7 ------------------> q_6        |
                       1+00 \______________1_______________/

        The last step (not shown -- too hard to draw in ASCII :) would be
        to eliminate the unions and concatenation in the two transitions
        labelled "1+00" (this would introduce two new states).

------------------
Closure Properties
------------------

Given regular languages L_1, L_2 over alphabet S, what language operations
yield a regular language?

L_1 and L_2 regular means there are RE's R_1, R_2 such that L(R_i) = L_i,
and also DFA A_1, A_2 such that L(A_i) = L_i.
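
To make "a DFA A such that L(A) = L" concrete, here is a minimal sketch of a
DFA for the language of the earlier example R = (0+1)* (00+11) (0+1)*, i.e.,
binary strings containing "00" or "11". The dictionary encoding and the state
names ("start", "saw0", "saw1", "done") are assumptions made for this
illustration; they do not come from the lecture.

```python
# Hypothetical encoding of a DFA A = (Q, S, q_0, F, d) as plain Python data.
# The transition function d is a dict keyed by (state, symbol).
DELTA = {
    ("start", "0"): "saw0", ("start", "1"): "saw1",
    ("saw0", "0"): "done",  ("saw0", "1"): "saw1",   # "00" seen -> done
    ("saw1", "0"): "saw0",  ("saw1", "1"): "done",   # "11" seen -> done
    ("done", "0"): "done",  ("done", "1"): "done",   # stay accepting forever
}
START = "start"
ACCEPTING = {"done"}


def accepts(s, delta=DELTA, start=START, accepting=ACCEPTING):
    """Run the DFA on string s; True iff it ends in an accepting state."""
    state = start
    for symbol in s:
        state = delta[(state, symbol)]
    return state in accepting


print(accepts("0110"))  # contains "11"
print(accepts("0101"))  # contains neither "00" nor "11"
```

The machine only ever needs to remember the last symbol read (or that a
repeated pair has already been seen), which is why four states suffice.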

Union, Concatenation, Star:

    Obvious from RE's:
        L(R_1 + R_2) = L(R_1) U L(R_2) = L_1 U L_2
        L(R_1 R_2) = L(R_1).L(R_2) = L_1.L_2
        L(R_1*) = L(R_1)* = L_1*

Complement:

    Complement of L_1: ~L_1 = S* - L_1.
    Consider DFA A'_1 = A_1 except the status of each state is changed
    (non-accepting states become accepting; accepting states become
    non-accepting) -- formally, if A_1 = (Q,S,q_0,F,d), then
    A'_1 = (Q,S,q_0,Q-F,d).
    Then, A'_1 accepts exactly the strings not accepted by A_1, i.e.,
    L(A'_1) = ~L_1.

Intersection:

    L_1 intersect L_2 = ~(~L_1 U ~L_2); since union and complement preserve
    regularity, so does intersection.

Others:

  - rev(L_1) = { s^R : s in L_1 }, where s^R is the reversal of string s
    (s written "backward"), is regular.
  - prefix(L_1) = { s in S* : s.t in L_1 for some t in S* } is regular.
  - suffix(L_1) = { s in S* : t.s in L_1 for some t in S* } is regular.

---------------------
Non-regular languages
---------------------

Intuition: FSA have fixed, finite memory (states), so cannot "remember"
unlimited information about strings.

Example: The language L = { 0^n 1^n : n in N } = {e,01,0011,000111,...} is
not regular.

Proof:
    For a contradiction, suppose that L is regular.  Then, there is a DFA A
    such that L(A) = L.  Let k be the number of states in A (so A's states
    are {q_0,q_1,...,q_{k-1}}), and consider the behaviour of A on input
    string 0^{k+1} 1^{k+1}:

        -> q_i_0 -0-> q_i_1 -0-> q_i_2 -0-> ... -0-> q_i_{k+1}
                 -1-> q_i_{k+2} -1-> ... -1-> q_i_{2k+2}

    where q_i_0,q_i_1,...,q_i_{2k+2} are states of A and q_i_{2k+2} is
    accepting.  Since A contains only k states, some state of A must be
    repeated among q_i_0,...,q_i_{k+1}, i.e., there must be some
    a < b <= k+1 such that q_i_a = q_i_b -- schematically, the sequence of
    states that A goes through on string 0^{k+1} looks like this:

                    q_i_{b-1} <-...- q_i_{a+1}
                         \_           0/|
                           0\|       /
        -> q_i_0 -0-> ... -0-> q_i_a -0-> q_i_{b+1} -0-> ...
                                                    -0-> q_i_{k+1}

    But then, the behaviour of A on input string 0^{k+1+(b-a)} 1^{k+1}
    (going around the loop one extra time) ends in the same accepting
    state as on input 0^{k+1} 1^{k+1}.  Since b-a > 0, this string has more
    0's than 1's, i.e., A accepts some string that is not in L!  This
    contradicts our assumption that L(A) = L, so there can be no such DFA,
    i.e., L is not regular.

The idea behind this example can be used to prove what is known as the
"pumping lemma" (for regular languages):

    For all regular languages L, there is some k in N such that for every
    string s in L whose length is at least k, there exist strings u,v,w
    such that s = u.v.w, v is not empty, |uv| <= k, and u.v^i.w in L for
    all i in N.

----------------------
Context-free languages
----------------------

We study two formalisms: context-free grammars (generalization of regular
expressions) and pushdown automata (generalization of FSA).

---------------------
Context-free grammars
---------------------

A more powerful way to describe sets of strings, based on the idea of
"productions" (replacing variables by strings of symbols).  Used to
represent many natural-language constructs, and to describe the syntax of
most modern programming languages.

Example:
        S -> 0S1
        S -> e

  - "S -> 0S1" and "S -> e" are productions, of the form
    "variable" -> "string", where strings contain an arbitrary mixture of
    variables and input symbols (called "terminals", because they cannot be
    further substituted for).
  - "S" is the special "start variable".
  - A "derivation" consists of applying productions repeatedly to one
    variable at a time, starting from the start variable, until no more
    productions can be applied, e.g.,
        S => 0S1 => 00S11 => 000S111 => 000e111 = 000111
  - Language generated = set of strings consisting only of terminals that
    can be derived from the start variable.
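
The derivation process above can be sketched in a few lines of Python.  This
is only an illustration under stated assumptions: the grammar is stored in a
dictionary named PRODUCTIONS, the empty string "" stands for e, and the
helper names derive and generated are hypothetical, not from the lecture.

```python
# Example grammar: S -> 0S1 and S -> e ("" plays the role of e).
PRODUCTIONS = {"S": ["0S1", ""]}
START = "S"


def derive(n):
    """Derivation that applies S -> 0S1 exactly n times, then S -> e."""
    current = START
    steps = [current]
    for _ in range(n):
        current = current.replace("S", PRODUCTIONS["S"][0], 1)
        steps.append(current)
    current = current.replace("S", PRODUCTIONS["S"][1], 1)
    steps.append(current)
    return steps


def generated(max_len):
    """All terminal-only strings of length <= max_len derivable from START."""
    results, frontier = set(), [START]
    while frontier:
        form = frontier.pop()
        # Position of the leftmost variable, if any remains.
        i = next((j for j, c in enumerate(form) if c in PRODUCTIONS), None)
        if i is None:
            results.add(form)          # only terminals left: in the language
            continue
        for rhs in PRODUCTIONS[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # Terminals are never removed, so overly long forms can be pruned.
            if sum(c not in PRODUCTIONS for c in new) <= max_len:
                frontier.append(new)
    return results


print(" => ".join(derive(3)))  # S => 0S1 => 00S11 => 000S111 => 000111
print(sorted(generated(6), key=len))
```

Every string produced this way has the form 0^n 1^n, the language shown
above not to be regular.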