===========================================================================
CSC 236             Lecture Summary for Week 13               Winter 2008
===========================================================================

Note:
  - CFG's are inherently non-deterministic (more than one production
    can apply to one variable). For example, with CFG
    S -> 0S1 | \epsilon, each step of a derivation can choose between
    productions S -> 0S1 and S -> \epsilon -- different choices yield
    different strings.

Example 2: G = S -> SS | (S) | \epsilon

  - Example derivation:
        S => SS => (S).S => (.\epsilon.)S => ().(S) => ()(.(S).)
          => ()((.\epsilon.)) = ()(())

  - Claim: L(G) = { all balanced strings of parentheses }
    . All strings generated are "balanced" (contain the same number of
      '(' and ')' and no prefix contains more ')' than '(') -- proved by
      induction on the number of productions used to derive strings.
    . All such strings can be generated -- proved by induction on the
      length of the string.

  - Detailed proof that L(G) = L_B (where L_B = { balanced parentheses }).

    . L(G) (_ L_B: By induction on the number of productions used, we
      prove that every string generated by G is balanced.
      BC: The only string that can be generated using only 1 production
          is \epsilon, and it is balanced.
      IH: For some k > 1, suppose all strings generated using fewer than
          k productions are balanced.
      IS: Consider a string generated using k productions. The first
          production is either S -> SS or S -> (S).
          In the first case, each S generates a string using fewer than
          k productions, which is balanced by the IH, so the final
          string (the concatenation of two balanced strings) is also
          balanced.
          In the second case, S generates a string using k-1
          productions, which is balanced by the IH, so the final string
          (a balanced string enclosed in one matching pair) is also
          balanced.
          In every case, the string produced is balanced.

    . L_B (_ L(G): By induction on the length of the string, we prove
      that every balanced string can be generated by G.
      BC: The only balanced string of length 0 is \epsilon, which can be
          generated by G.
      IH: Suppose k > 0 and every balanced string of length less than k
          can be generated by G.
      IS: Suppose s is a balanced string of length k. Since k > 0, s is
          not empty so it must start with '(', which matches some ')'
          later in the string (since s is balanced). Thus, s = (u)v,
          where u and v are both balanced strings of parentheses
          shorter than s (possibly empty). By the IH, both u and v can
          be generated by G.
          If v is empty, then s can be generated by S => (S) followed by
          the derivation of u.
          If v is not empty, then s can be generated by
          S => SS => (S)S followed by the derivations of u and v.
          In every case, s can be generated.

Example 3: CFG for { s (- {a,b}* : s contains as many a's as b's }

    G = S -> aSbS | bSaS | \epsilon
    (S -> SaSbS | SbSaS | \epsilon would also work;
     S -> SS | aSb | bSa | \epsilon would also work)

  - Example derivation:
        S => a.S.bS => aa.S.bSbS => aabSb.S => aabSbb.S.aS
          => aab.S.bbaSbSaS => aabaSbSbbaSbSaS =>* aababbbaba

  - L(G) = { all strings with same number of a's and b's }
    . All strings generated contain the same number of a's and b's:
      productions introduce them only in matching pairs.
    . All strings with the same number of a's and b's can be generated:
      Given such a string s, say s starts with an 'a' (a similar
      argument works if s starts with 'b'). Consider the quantity
      "#a's - #b's" over the prefix of s ending at each symbol -- e.g.,
      if s = aabbab, the quantity corresponding to each symbol is
      1, 2, 1, 0, 1, 0. Since s contains as many b's as a's, there must
      be a first symbol where #a's = #b's, and that symbol is a 'b'
      (the quantity drops from 1 to 0 there), i.e., it is possible to
      write s = a.u.b.v where u and v are strings that contain as many
      a's as b's. Then, s can be generated using the production
      S -> aSbS followed by productions to generate the shorter
      strings u, v.

The previous examples show that CFG's can represent languages that are
not regular! Can CFG's represent all regular languages?
Yes, by structural induction on regular expressions:
    { S -> S } represents {},
    { S -> \epsilon } represents {\epsilon},
    { S -> a } represents {a},
    if G_1 (with start variable S_1) represents R_1 and
       G_2 (with start variable S_2) represents R_2, then
        { S -> S_1, S -> S_2, G_1, G_2 } represents R_1 + R_2,
        { S -> S_1 S_2, G_1, G_2 } represents R_1 R_2,
        { S -> \epsilon, S -> S_1 S, G_1 } represents R_1*.
So CFG's are strictly more expressive than RE's.

Ambiguity: If a grammar G can generate a string w in more than one way
(formally, with more than one parse tree), then G is "ambiguous", e.g.,
for G = { S -> S+S, S -> S*S, S -> (S), S -> x }, the string "x+x*x"
can be derived in many ways:
    S => S+S => S+S*S => x+S*S => x+x*S => x+x*x
    S => S*S => S+S*S => S+x*S => S+x*x => x+x*x
    S => S*S => S*x => S+S*x => x+S*x => x+x*x
    ...
(The first derivation groups the string as x+(x*x) and the other two
group it as (x+x)*x, so there are at least two parse trees.)
Some languages are intrinsically ambiguous, i.e., every CFG for the
language is ambiguous, e.g.,
    { a^n b^n c^m d^m : n >= 1, m >= 1 } U
    { a^n b^m c^m d^n : n >= 1, m >= 1 }

-----------------
Pushdown Automata
-----------------

Informal idea: Augment FSA with a stack: in one transition, PDA can push
one symbol or pop one symbol from top of stack (or both, or neither).

Notation for transitions: "a : X > Y" means "when reading input symbol a
with symbol X on top of the stack, pop X and push symbol Y onto the
stack". Allow a = \epsilon for transitions that manipulate the stack
without reading an input symbol; allow X = \epsilon for transitions that
do not pop the stack; allow Y = \epsilon for transitions that do not
push a new symbol onto the stack.

Convention: PDA "accepts" if it is possible to start from the initial
state with an empty stack and end in an accepting state with an empty
stack.

Example: { 0^n 1^n : n (- N }

                       0:\epsilon>0       1:0>\epsilon
                           /  \               /  \
              0:\epsilon>0 \| /  1:0>\epsilon \| /
    -->(q_0) -------------> q_1 -------------> (q_2)

  - If string contains "10", reject from q_0 or q_2.
  - If string contains more 0's than 1's, reject at q_2 because string
    finished but stack not empty.
  - If string contains more 1's than 0's, will attempt to pop empty
    stack while in state q_2.
    This is a "missing transition" (for 1:\epsilon), with the same
    convention as before (meaning reject).
  - Note PDA are non-deterministic -- the deterministic version is not
    as natural or expressive. For example,
    { w in {a,b}* : w is a palindrome } cannot be accepted by a
    deterministic PDA but can be accepted by a non-deterministic PDA.

Example 2: { balanced parentheses }

        (:\epsilon>(
        ):(>\epsilon
           /  \
           \| /
       -->(q_0)

  => If this accepts, then each ')' matched some '(' and matches were
     exact (nothing left over), so the string was balanced.
  <= If the string is balanced, then each '(' is eventually matched with
     a ')', which will remove the corresponding '(' from the stack, so
     the PDA will accept. (Equivalently, the only way for the PDA to
     reject a string with nothing left on the stack is to read ')' when
     the stack is empty, meaning there was no matching '(' and the
     string is not balanced.)

------------
Equivalences
------------

Theorem: For all CFG's G, there is a PDA A such that L(A) = L(G).
Main idea: A has a first transition to push start variable S onto the
    stack. After that, transitions are of two types: pop a terminal from
    the top of the stack while reading the matching symbol from the
    string, or pop a variable from the top of the stack while reading no
    input symbol and push the result of one production for the popped
    variable (which may require multiple states). This way, A accepts
    the input iff there is some way to produce it.

Theorem: For all PDA's A, there is a CFG G such that L(G) = L(A).
Main idea: Have variables A_{p,q} to represent strings processed by the
    PDA when moving from state p to state q. Use productions (generating
    strings) to represent transitions (processing strings).

Proofs are much more complicated than for FSA; see textbook for details.

--------------------------
Non-context-free languages
--------------------------

  - Pumping Lemma for CFLs: For every CFL L, there is a constant k (- N
    such that for all s (- L with |s| >= k, it is possible to write
    s = v.w.x.y.z with |wy| > 0, |wxy| <= k, and v.w^m.x.y^m.z (- L for
    all m (- N.
  - { 0^n 1^n 2^n : n (- N } is not context-free (if it were, by the
    pumping lemma, we could pick string 0^n 1^n 2^n for n > k. Since
    |wxy| <= k < n, the substring w.x.y cannot contain both 0's and
    2's, so no matter how w, x, y are picked in the pumping lemma,
    pumping changes the count of at least one symbol but not of all
    three, and strings v.w^m.x.y^m.z with m != 1 will contain different
    numbers of 0's, 1's and 2's).
  - { w.w : w (- \Sigma* } is not context-free (argument more
    complicated but still possible).

---------
Beyond...
---------

Context "free" because productions are applied to variables without
regard to the string surrounding the variable. The generalization is
"context sensitive grammars" where productions have the form
"string" -> "string", where both strings contain an arbitrary mix of
variables and terminals.

Example: context sensitive grammar for { 0^n 1^n 2^n : n (- N }
    S -> 0XS2 | \epsilon
    X0 -> 0X, X1 -> 11, X2 -> 12
  example derivation:
    S => 0XS2 => 0X0XS22 => 0X0X0XS222 => 0X0X0X222 => 0X00XX222
      => 00X0XX222 => 000XXX222 => 000XX1222 => 000X11222 => 000111222

Challenge: context sensitive grammar for { w.w : w (- \Sigma* }

What about PDA's? They can be generalized to a PDA with two stacks, or
an FSA with a queue (instead of a stack), etc. Many variations are
possible, some making a difference, others not.

You can learn more about all of this in CSC363 and CSC448.

------
Review
------

  - induction: simple, complete, well ordering
  - recurrences: setting up, repeated substitution, proofs
  - recursive correctness
  - iterative correctness: loop invariants, partial correctness,
    termination
  - regular languages: REs, DFAs, NFAs, equivalences
  - non-regular languages
  - context-free languages: CFGs, PDAs
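
A minimal Python sketch (not part of the original notes) that checks the
context sensitive grammar for { 0^n 1^n 2^n } given above by brute force.
It treats the five productions as plain string rewrites, explores all
sentential forms breadth-first, and collects every all-terminal string up
to a length bound. The names RULES and generate are illustrative
assumptions, and a bounded search only supports the claim for short
strings; it is not a proof.

```python
from collections import deque

# Productions of the grammar from the notes, as (left side, right side) pairs.
# S -> \epsilon is written with an empty right side.
RULES = [("S", "0XS2"), ("S", ""), ("X0", "0X"), ("X1", "11"), ("X2", "12")]


def generate(max_len):
    """Return the set of terminal strings of length <= max_len derivable from S."""
    # Only S -> \epsilon shrinks a form (by one symbol) and S occurs at most
    # once, so no useful sentential form is longer than max_len + 1.
    limit = max_len + 1
    seen = {"S"}
    queue = deque(["S"])
    results = set()
    while queue:
        form = queue.popleft()
        if all(c in "012" for c in form):
            # No variables left: this is a generated string.
            if len(form) <= max_len:
                results.add(form)
            continue
        for lhs, rhs in RULES:
            # Try the production at every position where its left side occurs.
            start = form.find(lhs)
            while start != -1:
                new = form[:start] + rhs + form[start + len(lhs):]
                if len(new) <= limit and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return results


if __name__ == "__main__":
    for s in sorted(generate(9), key=len):
        print(repr(s))
```

Under these assumptions, generate(9) returns exactly the four strings
0^n 1^n 2^n for n = 0, 1, 2, 3. The search terminates because the length
bound leaves only finitely many sentential forms to visit.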