===========================================================================
CSC 236             Lecture Summary for Week 9                 Winter 2008
===========================================================================

-----------------
Back to MergeSort
-----------------

From previous lectures:

    MergeSort(A,b,e):
    1.  if b == e: return
    2.  m = (b + e) / 2  # integer division
    3.  MergeSort(A,b,m)
    4.  MergeSort(A,m+1,e)
    5.  for i in [b..e]: B[i] = A[i]
    6.  c = b
    7.  d = m+1

At this point, we know B[b..m] contains A[b..m] sorted; B[m+1..e] contains
A[m+1..e] sorted (taking for granted that the loop on line 5 does what it is
supposed to, which is obvious enough).

Want loop with postcondition A'[b..e] contains A[b..e] sorted (where A'
represents values after loop). Rather than justify loop from before, let's
construct it using loop invariants as a guide.

What we know at start:

    B: |_________ sorted __________|__________ sorted _________|
        c                         m d                         e
    A: |__________________________ ? __________________________|
        b                                                     e

What we want at end:

    B: |_________ sorted __________|__________ sorted _________|
        b                         m c                         e d
    A: |________________________ sorted _______________________|
        b                                                     e
    and A[b..e] contains elements from B[b..m] and B[m+1..e]

In between: Maintain index i of next element to assign to A. This means
A[b..i-1] contains elements from B[b..c-1] and B[m+1..d-1] sorted, and also
A[b..i-1] <= B[c..m], B[d..e].

    B: |___________|_______________|_______|___________________|
        b           c             m m+1     d                 e
    A: |______ sorted _____|___________________________________|
        b                   i                                 e
    A[b..i-1] contains elements from B[b..c-1] and B[m+1..d-1]
    and A[b..i-1] <= B[c..m], B[d..e]

This is our loop invariant!

    LI = "A[b..i-1] contains elements from B[b..c-1] and B[m+1..d-1],
          sorted (i.e., A[b] <= ... <= A[i-1]), and
          A[b..i-1] <= B[c..m], A[b..i-1] <= B[d..e]."

Sanity check:
  - Does this give us what we want at end?
    Yes, because i = e+1, c = m+1, d = e+1, so A[b..e] contains B[b..m] and
    B[m+1..e], sorted; also, B[m+1..m], B[e+1..e] both empty so second part
    of LI vacuously true.
  - Is this true at beginning?
    Yes: A[b..b-1], B[b..b-1] and B[m+1..m] are all empty, so LI vacuously
    true.

Now, write loop to ensure this is invariant. Within body of loop, we know
either A[i] = B[c] or A[i] = B[d], but which one?
  - Pick B[c] when d > e (no element remains in second half), or when
    c <= m (there are elements in first half) and B[c] < B[d] (could be
    B[c] <= B[d]).
  - Pick B[d] in all other cases (i.e., d <= e and (c > m or
    B[c] >= B[d])).

    # LI: A[b..i-1] contains B[b..c-1] and B[m+1..d-1], sorted,
    #     and A[b..i-1] <= B[c..m], B[d..e].
    8.  for i in [b..e]:
    9.      if d > e or (c <= m and B[c] < B[d]):
    10.         A[i] = B[c]
    11.         c += 1
            else:  # d <= e and (c > m or B[c] >= B[d])
    12.         A[i] = B[d]
    13.         d += 1

Does this preserve LI? Yes: in first case, A[b..i] at end of iteration
contains B[b..c] and B[m+1..d-1], sorted (since A[i-1] <= B[c..m], B[d..e]
by LI, and B[c] < B[d] whenever d <= e), where c and d denote values at the
start of the iteration; second case similar.

Termination? Simple counting loop: E = e+1-i decreases by 1 at each
iteration (and remains >= 0 throughout).

Partial correctness? Already verified from loop invariant, and loop
invariant correct by design.

----------------------
Formal Language Theory
----------------------

Basic definitions and conventions:

  - "Alphabet": any finite, non-empty set of atomic symbols ("compound"
    symbols like "ab" not allowed, to prevent ambiguity),
    e.g., {a,b,c}, {0,1}, {+}.
  - "String": any *finite* sequence of symbols; empty sequence is allowed
    and denoted \epsilon (called "empty string"),
    e.g., a, ab, cccc are strings over {a,b,c};
          \epsilon, ++++ are strings over {+};
          a+00 is _not_ a string over {a,b,c} but it is over {0,1,+,a,b,c}.
    Convention: \epsilon not allowed as symbol in alphabet (to avoid
    confusion with empty string).
  - "Language": any set of strings (can be empty, finite, or infinite),
    e.g., {bab, bbabb, bbbabbb, ...} is a language over {a,b,c},
          {+,++} is a language over {+},
          {\epsilon} is a language over any alphabet,
          {} is a language over any alphabet.
    NOTE: {} is different from {\epsilon}: {} contains NO string,
    {\epsilon} contains ONE string (the empty string).
  - "Length" of string s, denoted |s|, is number of symbols in s,
    e.g., |bba| = 3, |+| = 1, |\epsilon| = 0.
  - Strings s, t are "equal" iff |s| = |t| and i-th symbol of s = i-th
    symbol of t for 1 <= i <= |s|.
  - "Reversal" of string s (denoted s^R) is string obtained by reversing
    symbols of s,
    e.g., (1011)^R = 1101, (+++)^R = +++, \epsilon^R = \epsilon.
  - "Concatenation" of strings s and t (denoted st or s.t) consists of
    every symbol of s followed by every symbol of t,
    e.g., bba.bb = bbabb, \epsilon.+++ = +++.
    Convention: For string s, natural number k, s^k denotes s concatenated
    with itself k times,
    e.g., (aba)^2 = abaaba, (++)^0 = \epsilon, \epsilon^3 = \epsilon.
  - Convention: For alphabet \Sigma, \Sigma^n denotes set of all strings of
    length n over \Sigma, and \Sigma* denotes set of all strings over
    \Sigma.
    E.g., {a,b,c}^0 = {\epsilon},
          {0,1}^3 = {000,001,010,011,100,101,110,111},
          {+}* = {\epsilon,+,++,+++,++++,...} = { +^k : k (- N }.

Motivation:

  - Languages are a powerful abstraction: everything from logical formulas
    to compilation of programs can be studied using languages. The study of
    properties of languages is an important aspect of theoretical computer
    science and some of its applications, particularly the abstract problem
    of language "recognition":
        Given language L and string s, does s belong to L?
    This comes up in applications such as compiler design, where source
    code must go through "lexical analysis" (to break up source code into
    "tokens" that represent identifiers, function names, operators, etc.)
    and "parsing" (to determine whether source code has correct syntax),
    something you can study in CSC488.
    It is also central to the study of computational complexity and
    computability, which you will study in CSC363.
  - We will look at two ways to express languages: descriptive ways
    (regular expressions, context-free grammars), and procedural ways
    (finite state automata, pushdown automata).

Operations on languages: for all languages L, L' over alphabet \Sigma
(often expressed as: L, L' (_ \Sigma*),

  - complement: ~L = \Sigma* - L
  - union: L u L'
  - intersection: L n L'
  - reversal: rev(L) = { s^R : s (- L }
    e.g., rev({a,ab,abb}) = {a,ba,bba}
  - concatenation: L.L' = { s (- \Sigma* : s = rt for r (- L, t (- L' }
    e.g., {a,bc}.{bb,c} = {abb,ac,bcbb,bcc},
          for all L, L.{\epsilon} = L = {\epsilon}.L,
                     L.{} = {} = {}.L,
          {a,aa,aaa,aaaa,...}.{b,bb,bbb,bbbb,...}
            = {ab,abb,abbb,...,aab,aabb,aabbb,...,aaab,aaabb,aaabbb,...}
            = { s (- {a,b}* : s contains some number of a's followed by
                some number of b's, with at least one of each }
            = { a^m b^n : m >= 1, n >= 1 }
  - exponentiation: L^k = L.L...L (k times)
    (s (- L^k iff s = t_1.t_2...t_k for some t_1 (- L, ..., t_k (- L)
    e.g., {b,ac}^3 = {bbb,bbac,bacb,bacac,acbb,acbac,acacb,acacac},
          {+,+++,+++++}^0 = {\epsilon},
          {\epsilon}^5 = {\epsilon},
          {}^5 = {},
          {}^0 = {\epsilon}
  - Kleene star: L* = L^0 u L^1 u L^2 u ...
    (s (- L* iff s (- L^k for some k, equivalent to s = \epsilon or
    s = t_1.t_2...t_k for some k and some t_1,t_2,...,t_k (- L)
    e.g., {ab}* = {\epsilon,ab,abab,ababab,abababab,...},
          {\epsilon}* = {\epsilon},
          {}* = {\epsilon},
          {b,ac}* = {\epsilon} u {b,ac} u {bb,bac,acb,acac}
                      u {bbb,bbac,...} ...
                  = {\epsilon,b,ac,bb,bac,acb,acac,bbb,bbac,...}

-------------------
Regular expressions
-------------------

Regular expressions describe sets of strings using a small number of basic
operators.
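
The basic operators in question are language operations from the previous
section (union, concatenation, star). As a rough illustration only, they can
be tried out on small finite languages represented as Python sets of strings.
The function names below (concat, power, star_up_to, reverse) are made up for
this sketch and are not part of the course material; since L* is usually
infinite, star_up_to only lists the strings of L* up to a chosen length.

```python
from itertools import product

EPSILON = ""  # the empty Python string stands in for \epsilon


def concat(L1, L2):
    """L1.L2 = { rt : r in L1, t in L2 }."""
    return {r + t for r, t in product(L1, L2)}


def power(L, k):
    """L^k = L.L...L (k times); by definition L^0 = {\\epsilon}."""
    result = {EPSILON}
    for _ in range(k):
        result = concat(result, L)
    return result


def star_up_to(L, max_len):
    """All strings of L* whose length is at most max_len."""
    result = {EPSILON}      # L^0 is always part of L*
    frontier = {EPSILON}    # strings found in the previous round
    while frontier:
        longer = {s for s in concat(frontier, L) if len(s) <= max_len}
        frontier = longer - result   # keep only strings not seen before
        result |= frontier
    return result


def reverse(L):
    """rev(L) = { s^R : s in L }."""
    return {s[::-1] for s in L}


print(sorted(concat({"a", "bc"}, {"bb", "c"})))
print(sorted(power({"b", "ac"}, 3)))
print(sorted(star_up_to({"ab"}, 6)))
```

Union, intersection and set difference need no helper, since Python sets
already provide them as |, & and -.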

The set of regular expressions (regexps) over alphabet \Sigma is defined as
(with usual convention {} !(- \Sigma, \epsilon !(- \Sigma):

  - {} (empty set symbol), \epsilon (empty string symbol) are regexps;
  - a is a regexp for all symbols a (- \Sigma;
  - if R and S are regexps, then so are:
        R+S (union)         -- lowest precedence,
        RS  (concatenation),
        R*  (star)          -- highest precedence.

For each regexp R, define the language described by R (L(R)) as follows:

  - L({}) = {}
  - L(\epsilon) = {\epsilon}
  - L(a) = {a} for every symbol a (- \Sigma
  - L(R+S) = L(R) u L(S)
  - L(RS) = L(R).L(S)
  - L(R*) = L(R)*
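
The six clauses defining L(R) can be turned almost directly into a recursive
membership test, i.e., an answer to the recognition question "does s belong
to L(R)?". The sketch below is only an illustration: the tuple encoding of
regexps and the name matches are assumptions made here, not notation from
the course, and the procedure tries every way of splitting s, so it is far
slower than the automata-based methods mentioned earlier.

```python
# Assumed (hypothetical) encoding of regexps as nested tuples:
#   ("empty",)       for {}          ("eps",)         for \epsilon
#   ("sym", a)       for symbol a    ("union", R, S)  for R+S
#   ("cat", R, S)    for RS          ("star", R)      for R*


def matches(R, s):
    """Return True iff string s belongs to L(R)."""
    kind = R[0]
    if kind == "empty":              # L({}) = {}
        return False
    if kind == "eps":                # L(\epsilon) = {\epsilon}
        return s == ""
    if kind == "sym":                # L(a) = {a}
        return s == R[1]
    if kind == "union":              # L(R+S) = L(R) u L(S)
        return matches(R[1], s) or matches(R[2], s)
    if kind == "cat":                # L(RS) = L(R).L(S): try every split
        return any(matches(R[1], s[:i]) and matches(R[2], s[i:])
                   for i in range(len(s) + 1))
    if kind == "star":               # L(R*) = L(R)*
        if s == "":
            return True
        # Peel off a non-empty first piece so the recursion terminates.
        return any(matches(R[1], s[:i]) and matches(R, s[i:])
                   for i in range(1, len(s) + 1))
    raise ValueError(f"unknown regexp node: {kind!r}")


# (a+bc)* written in the assumed encoding
a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
r = ("star", ("union", a, ("cat", b, c)))
print(matches(r, "abca"), matches(r, "ab"))
```

Restricting the star case to non-empty first pieces loses nothing, because
empty pieces can always be dropped from a decomposition s = t_1.t_2...t_k.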