===========================================================================
CSC 236               Lecture Summary for Week 13                 Fall 2007
===========================================================================

---------------------
Context-free grammars
---------------------

A more powerful way to describe sets of strings, based on the idea of
"productions" (replacing variables by strings of symbols). Used to represent
many natural-language constructs, and to describe the syntax of modern
programming languages.

Example:
    S -> 0S1
    S -> e

  - "S -> 0S1" and "S -> e" are productions, of the form
    "variable" -> "string", where strings contain an arbitrary mixture of
    variables and "terminals" (i.e., input symbols) -- so called because
    they cannot be further substituted for. Here "e" denotes the empty
    string.

  - "S" is the special "start variable".

  - A "derivation" consists of applying productions repeatedly to one
    variable at a time, starting from some string s made up of variables
    and/or terminals, e.g.,
        S => 0S1 => 0.0S1.1 => 00.0S1.11 => 000.e.111 = 000111
    In general, each step of a derivation replaces one variable in the
    current string according to one production for that variable.
    If t can be derived from s, we write s =>* t.

  - Language generated = set of strings consisting only of terminals that
    can be derived from the start variable
        = { w in Sigma* : S =>* w }, where Sigma = { terminals }.
    For the example, language = { 0^n 1^n : n in N }.

Note:
  - CFG's are inherently non-deterministic (more than one possible
    production can apply to one variable).

Example 2:
    G = { S -> SS, S -> (S), S -> e }

  - Example derivation:
        S => SS => (S).S => (.e.)S => ().(S) => ()(.(S).) => ()((.e.))
          = ()(())

  - L(G) = { all balanced strings of parentheses }
      . All strings generated are "balanced" (contain the same number of
        '(' and ')', and no prefix contains more ')' than '(') -- proved by
        induction on the number of productions used to derive the string.
      . All such strings can be generated -- proved by induction on the
        length of the string.
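
The two example grammars above can be explored mechanically. The following is
a minimal Python sketch, not part of the original notes: the names generate,
is_balanced, ZERO_ONE and PARENS are hypothetical, variables are assumed to be
single characters, and "" stands for e. It enumerates leftmost derivations up
to a length bound; the cap on the length of intermediate strings is a
heuristic assumption that happens to suffice for these two grammars, not a
general algorithm.

```python
from collections import deque

# Assumed encoding: each variable maps to its right-hand sides; "" plays the role of e.
ZERO_ONE = {"S": ["0S1", ""]}
PARENS = {"S": ["SS", "(S)", ""]}


def generate(productions, start="S", max_len=6):
    """Return the terminal strings of length <= max_len found by a bounded search.

    The search always rewrites the leftmost variable. Intermediate strings
    longer than 2 * max_len + 1 are discarded; this cap is a heuristic that is
    large enough for the two example grammars but is not valid in general.
    """
    seen = {start}
    queue = deque([start])
    result = set()
    while queue:
        form = queue.popleft()
        # Locate the leftmost variable, if any.
        idx = next((i for i, c in enumerate(form) if c in productions), None)
        if idx is None:
            result.add(form)  # only terminals left: a string of the language
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(1 for c in new if c not in productions)
            if terminals > max_len or len(new) > 2 * max_len + 1:
                continue  # prune: too long to yield a string within the bound
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return result


def is_balanced(w):
    """Check the definition from the notes: equal counts, no prefix with more ')'."""
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0


if __name__ == "__main__":
    print(sorted(generate(ZERO_ONE, max_len=6), key=len))
    print(sorted(generate(PARENS, max_len=4), key=len))
    print(all(is_balanced(w) for w in generate(PARENS, max_len=6)))
```

Checking every generated string against is_balanced only illustrates the
first half of the claim about L(G) on small cases; it does not replace the
induction proofs.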

Previous examples show that CFG's can represent languages that are not
regular!

Can CFG's represent all regular languages? Yes, by structural induction on
regular expressions:
    {} (or { S -> S }) represents {},
    { S -> e } represents {e},
    { S -> a } represents {a},
    if G_1 represents R_1 and G_2 represents R_2 (with start variables S_1
    and S_2, and no variables in common), then
        { S -> S_1, S -> S_2, G_1, G_2 } represents R_1 + R_2,
        { S -> S_1 S_2, G_1, G_2 } represents R_1 R_2,
        { S -> e, S -> S_1 S, G_1 } represents R_1*.

So CFG's are strictly more expressive than RE's.

Ambiguity: If a grammar G can generate a string w in more than one way (more
precisely, with more than one parse tree), then G is "ambiguous", e.g., for
    G = { S -> S+S, S -> S*S, S -> (S), S -> x },
the string "x+x*x" can be derived in many ways:
    S => S+S => S+S*S => x+S*S => x+x*S => x+x*x
    S => S*S => S+S*S => S+x*S => S+x*x => x+x*x
    S => S*S => S*x => S+S*x => x+S*x => x+x*x
    ...

Some languages are intrinsically ambiguous, i.e., every CFG for the language
is ambiguous, e.g.,
    { a^n b^n c^m d^m : n >= 1, m >= 1 } U { a^n b^m c^m d^n : n >= 1, m >= 1 }

-----------------
Pushdown Automata
-----------------

Informal idea: Augment FSA with a stack: in one transition, a PDA can push
one symbol onto the stack or pop one symbol from the top of the stack (or
both, or neither).

Notation for transitions: "a : X > Y" means "when reading input symbol a with
symbol X on top of the stack, pop X and push symbol Y onto the stack".
Allow a = e for transitions that manipulate the stack without reading an
input symbol; allow X = e for transitions that do not pop the stack; allow
Y = e for transitions that do not push a new symbol onto the stack.

Example: { 0^n 1^n : n in N }

                     0:e>0         1:0>e
                      /  \         /  \
              0:e>0   \| /  1:0>e  \| /
    -->(q_0)--------> q_1 -------->(q_2)

  - Convention: PDA accepts if it is possible to start from the initial
    state with an empty stack and end in an accepting state with an empty
    stack.

  - What happens if there are more 1's than 0's? The PDA will attempt to pop
    an empty stack while in state q_2. This is a "missing transition" (for
    1:e), with the same convention as before (meaning reject).

  - Note PDA are non-deterministic -- the deterministic version is not as
    expressive.
    For example, { w in {a,b}* : w is a palindrome } cannot be accepted by
    any deterministic PDA but can be accepted by a non-deterministic PDA.

Example 2: { balanced parentheses }

        (:e>(   ):(>e
            /  \
            \| /
        -->(q_0)

    => If this accepts, then each ')' matched some '(' and the matches were
       exact (nothing left over), so the string was balanced.
    <= If the string is balanced, then each '(' is eventually matched with a
       ')', which will remove the corresponding '(' from the stack, so the
       PDA will accept.

------------
Equivalences
------------

Theorem: For all CFG's G, there is a PDA A such that L(A) = L(G).

Theorem: For all PDA's A, there is a CFG G such that L(G) = L(A).

Proofs are much more complicated than for FSA; see textbook.

--------------------------
Non-context-free languages
--------------------------

  - Pumping Lemma for CFLs: For every CFL L, there is a constant n such that
    for all s in L with |s| >= n, it is possible to write s = v.w.x.y.z with
    |wy| > 0, |wxy| <= n, and v.w^m.x.y^m.z in L for all m in N.

  - { 0^n 1^n 2^n : n in N } is not context-free

  - { w.w : w in Sigma* } is not context-free

------
Beyond
------

Context "free" because productions are applied to variables without regard
to the string surrounding the variable. The generalization is "context
sensitive grammars", where productions have the form "string" -> "string",
and both strings contain an arbitrary mix of variables and terminals.

Example: context sensitive grammar for { 0^n 1^n 2^n : n in N }
    S -> e, S -> 0XS2, X0 -> 0X, X1 -> 11, X2 -> 12
  example derivation:
    S => 0XS2 => 0X0XS22 => 0X0X0XS222 => 0X0X0X222 => 0X00XX222
      => 00X0XX222 => 000XXX222 => 000XX1222 => 000X11222 => 000111222

What about PDA's? They can be generalized to a PDA with two stacks, or an FSA
with a queue (instead of a stack), etc. Many variations are possible, some
making a difference, others not. You'll learn more about all of this in
CSC363!
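
As a point of comparison for these variations, the single-stack convention
used in these notes ("a : X > Y", accept in an accepting state with an empty
stack) can be simulated directly. The sketch below is an illustration only:
the names pda_accepts, ZERO_ONE_PDA and PARENS_PDA are hypothetical, "" stands
for e, and it assumes that the states drawn in parentheses (q_0 and q_2 in the
first example, q_0 in the second) are the accepting ones. Capping the stack at
the input length is a simplifying assumption that is safe for the two example
machines, where every transition reads an input symbol.

```python
def pda_accepts(transitions, start, accepting, word):
    """Search all runs of a non-deterministic PDA on word.

    transitions is a list of (state, a, X, Y, next_state) tuples encoding
    "a : X > Y"; the empty string "" stands for e in any of a, X, Y.
    Acceptance requires all input read, an accepting state, and an empty stack.
    """
    start_cfg = (start, 0, "")  # (state, input position, stack with top at the end)
    seen = {start_cfg}
    todo = [start_cfg]
    while todo:
        state, pos, stack = todo.pop()
        if pos == len(word) and stack == "" and state in accepting:
            return True
        for q, a, x, y, r in transitions:
            if q != state:
                continue
            if a and (pos >= len(word) or word[pos] != a):
                continue  # transition needs an input symbol that is not there
            if x and not stack.endswith(x):
                continue  # transition needs X on top of the stack
            new_stack = (stack[:-1] if x else stack) + y
            if len(new_stack) > len(word):
                continue  # assumed cap, see the note above the code
            cfg = (r, pos + len(a), new_stack)
            if cfg not in seen:
                seen.add(cfg)
                todo.append(cfg)
    return False  # every run got stuck or ended in a non-accepting configuration


# The PDA for { 0^n 1^n : n in N } from the notes.
ZERO_ONE_PDA = [
    ("q0", "0", "", "0", "q1"),
    ("q1", "0", "", "0", "q1"),
    ("q1", "1", "0", "", "q2"),
    ("q2", "1", "0", "", "q2"),
]

# The one-state PDA for balanced parentheses from the notes.
PARENS_PDA = [
    ("q0", "(", "", "(", "q0"),
    ("q0", ")", "(", "", "q0"),
]

if __name__ == "__main__":
    for w in ["", "01", "0011", "001", "011"]:
        print(repr(w), pda_accepts(ZERO_ONE_PDA, "q0", {"q0", "q2"}, w))
    for w in ["()(())", "())(", "(()"]:
        print(repr(w), pda_accepts(PARENS_PDA, "q0", {"q0"}, w))
```

The input "011" shows the "missing transition" case: after reading "01" the
stack is empty, so no transition applies to the last 1 and the run is
rejected.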