===========================================================================
CSC 324              Lecture Notes for Week 2               Winter 2009
===========================================================================

Compilation generally involves:
  - lexical analysis: break up input into "tokens" (meaningful groups),
    e.g., "temp := x + 1" yields tokens "temp", ":=", "x", "+", "1"
  - syntax analysis: group tokens into meaningful units (expressions,
    statements, declarations, etc.), e.g., "temp := x + 1" is an
    assignment statement -- the "parsing" phase
  - semantic analysis: check types and declarations ("semantic" here
    refers to the context of expressions and statements, not their
    meaning)
  - intermediate code generation
  - optimization, e.g., eliminate common subexpressions, eliminate
    unreachable statements, etc.
  - final code generation

------
Syntax
------

Informal specification may lead to incompatible dialects.  For example,
"everything between /* and */ is a comment":

    /* Explanation of what to do... */ x = 3; */ y = 17 * x;

Modern programming language syntax usually specified formally.

Chomsky's Hierarchy: regular, context-free, context-sensitive,
phrase-structure grammars.  Named after linguist Noam Chomsky, used for
natural languages.

Regular expressions and context-free grammars known from CSC236 (and basics
reviewed in tutorial).  In this course, not concerned with proving formal
properties of grammars, but with understanding how to use them.

Regular Grammars: like CFGs, but each production has at most one
non-terminal on its right-hand side, and either all productions are
left-recursive (non-terminal first) or all productions are right-recursive
(non-terminal last).
  - left-recursive:   A -> B a | b
  - right-recursive:  A -> a B | b
Regular grammars generate exactly the regular languages.

  - Regular grammars used to describe linear structures (e.g., form of
    floating-point literals).
  - CFGs used to describe nested constructs (e.g., matching parentheses).

Notation for CFGs called "Backus-Naur Form" (BNF).

"Extended" BNF allows additional notation (for convenience, no change to
expressive power):
  - { blah } denotes zero or more repetitions of blah
  - [ blah ] denotes 0 or 1 repetition of blah (blah is optional)
  - blah+ denotes one or more repetitions of blah
  - numeric superscript denotes maximum number of repetitions
  - () used for grouping

Derivations and parse trees

  - "Derivation" of s = sequence of applications of productions to generate
    string s, e.g., for grammar

        <real>  ::= <int> . <int>
        <int>   ::= <digit> | <int> <digit>
        <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

    this is a derivation for string "3.141":

        <real> -> <int> . <int>
               -> <digit> . <int>
               -> 3 . <int>
               -> 3 . <int> <digit>
               -> 3 . <int> <digit> <digit>
               -> 3 . <digit> <digit> <digit>
               -> 3 . 1 <digit> <digit>
               -> 3 . 1 4 <digit>
               -> 3 . 1 4 1

    String s belongs to language iff there is a derivation of s.

  - "Parse tree" of s = structure within string s in language, e.g., for
    string s = "3.141" above.  In general,
      . root = start symbol,
      . all leaves are terminals,
      . all internal nodes are non-terminals whose children make up one
        production.
    "Parsing" = process of producing parse tree.
    Note: string s belongs to language iff there is a parse tree for s.

  - Derivation != parse tree, e.g., string s = "3.141" has unique parse tree
    but multiple derivations (each corresponds to one traversal of parse
    tree).

Ambiguity: If grammar G has more than one parse tree for string s, then G is
"ambiguous" (and s is "ambiguous string").  Examples:
  - G = { S ::= S+S | S*S | (S) | x } and string "x+x*x"
    (groups as (x+x)*x or as x+(x*x)).
  - G = { <stmt> ::= if <expr> then <stmt>
                   | if <expr> then <stmt> else <stmt>, ...}

Some languages intrinsically ambiguous, i.e., every CFG for the language is
ambiguous -- example?

Goal: grammar for language is unambiguous.
Q: Is this always possible?  (A: Later...)

Work-around: try to avoid ambiguity from the start.  Two strategies:
  - change language to include delimiters
  - change grammar to impose precedence and associativity

Example: arithmetic expressions

    <expr>    ::= <expr> + <expr> | <expr> - <expr> | <expr> * <expr>
                | <expr> / <expr> | <expr> ^ <expr> | <id> | <literal>
    <id>      ::= ... (alphanumeric identifiers)
    <literal> ::= ... (integer literals)

Adding delimiters:

    <expr> ::= ( <expr> + <expr> ) | ( <expr> - <expr> )
             | ( <expr> * <expr> ) | ( <expr> / <expr> )
             | ( <expr> ^ <expr> ) | <id> | <literal>

  - "8 - 3 * 2" not a valid string anymore;
    "(8 - (3 * 2))" and "((8 - 3) * 2)" are both valid (and unambiguous)
  - disadvantage: changes language, not as natural

Impose precedence and associativity:

  - Usual mathematical conventions:
      . lowest precedence: + and -
      . next precedence: * and /
      . next precedence: ^
      . highest precedence: ()

  - Introduce new non-terminal for each precedence level, low to high
    (intuition: parse low-precedence first, high-precedence last):

        <expr>   ::= <expr> + <expr> | <expr> - <expr> | <term>
        <term>   ::= <term> * <term> | <term> / <term> | <factor>
        <factor> ::= <factor> ^ <factor> | <base>
        <base>   ::= ( <expr> ) | <id> | <literal>

  - Parse tree grouping now follows precedence, e.g., "8 - 3 * 2" groups as
    8 - (3 * 2).

  - Problem: "3 - 2 - 1", "3 ^ 2 ^ 4" still ambiguous.  Worse, values differ
    depending on parse tree (operators not associative), so only one parse
    tree correct.

  - Solution: determine associativity -- usual mathematical conventions:
      . left-associative: +, -, *, /  (i.e., a * b * c = (a * b) * c)
      . right-associative: ^  (i.e., a^b^c = a^(b^c))

  - Rewrite to impose associativity -- recursive term on the left for
    left-associative, on the right for right-associative:

        <expr>   ::= <expr> + <term> | <expr> - <term> | <term>
        <term>   ::= <term> * <factor> | <term> / <factor> | <factor>
        <factor> ::= <base> ^ <factor> | <base>
        <base>   ::= ( <expr> ) | <id> | <literal>

  - Parse tree grouping now implicitly follows associativity.

Dealing with ambiguity

  - Determining if a grammar is ambiguous is *undecidable*, i.e., no
    algorithm can give correct answer for all grammars.

  - Some languages "inherently" ambiguous, i.e., every grammar for language
    is ambiguous.  Example: { a^i b^j c^k : i = j or j = k }.

Limitations of CFGs

  - CFGs cannot describe every language, e.g., { a^i b^i c^i : i >= 0 }.

  - Consequence for programming languages: cannot capture constructs such
    as:
      . must declare identifier before using it
      . cannot declare same identifier twice in same block
      . A[i,j] valid only if A is two-dimensional
      . number of actual parameters must equal number of formal parameters
      . etc.

Nevertheless, CFGs capture most essential elements of syntactic analysis.
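
Illustration (not part of the original notes): a minimal recursive-descent
parser, sketched in Python, for the expression grammar with precedence and
associativity imposed. The non-terminal names (expr, term, factor, base),
the tuple representation of parse trees, and the helper names tokenize,
Parser, and parse are all assumptions made for this sketch. It handles
integer literals only and leaves out identifiers. Since a recursive-descent
parser cannot call itself on a left-recursive production, the left-recursive
rules are written as EBNF repetitions and implemented as loops, which still
group to the left.

```python
import re

# One token = an integer literal or a single operator/parenthesis character.
TOKEN = re.compile(r"\s*(\d+|[-+*/^()])")


def tokenize(text):
    """Lexical analysis: split the input string into a list of tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Syntax analysis: one method per (assumed) non-terminal."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr ::= term { (+ | -) term }      loop => left-associative
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        # term ::= factor { (* | /) factor }  loop => left-associative
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor ::= base ^ factor | base     recursion on the right
        #                                     => right-associative
        node = self.base()
        if self.peek() == "^":
            self.advance()
            node = ("^", node, self.factor())
        return node

    def base(self):
        # base ::= ( expr ) | literal         (identifiers omitted here)
        token = self.advance()
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    """Return a nested tuple (operator, left, right) for the whole input."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With this sketch, parse("3 - 2 - 1") groups as ((3 - 2) - 1) and
parse("3 ^ 2 ^ 4") groups as (3 ^ (2 ^ 4)), matching the associativity
conventions listed above.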