===========================================================================
CSC 236              Lecture Summary for Week 9                 Fall 2007
===========================================================================

--------------------------
Binary search, iteratively
--------------------------

    # Pre: A sorted and x comparable with A[0..n-1] (n = len(A))
    # Post: 0 <= p <= n and A[0..p-1] < x <= A[p..n-1]

What we know at start:

    A:  |_____________________?_____________________|
         0                                       n-1

What we want at end:

    A:  |_______ < x ___________|______ >= x _______|
         0                       p               n-1

In between? Maintain range to search [b..e]. Since A[b..e] are elements we
don't know about yet, we must have A[0..b-1] < x <= A[e+1..n-1]. This will be
our loop invariant!

    A:  |___ < x ___|_______?_______|____ >= x _____|
         0           b             e             n-1

Now, we can start writing the loop.

Initialization? Make "in between" picture same as start:

    1. b = 0
    2. e = n - 1

Loop condition? Continue as long as range [b..e] is not empty, i.e., b <= e.
When b > e, "in between" picture looks like end, which is what we want.

    3. while b <= e:

This is binary search, so compare middle element with x. Let's draw pictures
(one for A[m] < x, one for A[m] >= x) to get this right:

    A:  |_ < x _|___________|< x|_________|_ >= x __|
         0       b            m          e       n-1

    A:  |_ < x _|___________|>=x|_________|_ >= x __|
         0       b            m          e       n-1

    4.     m = (b + e) / 2  # integer division
    5.     if A[m] < x: b = m + 1
    6.     else:        e = m - 1

Does this work? Possible to prove that for all b <= e, b <= m <= e. This
guarantees values of b, e at end of iteration satisfy b <= e+1. We add this
to the loop invariant.

Value of p at end? According to picture, p = b = e+1.

All together:

    # Pre: A sorted and x comparable with A[0..n-1] (n = len(A))
    1. b = 0
    2. e = n - 1
    # LI: 0 <= b <= e+1 <= n and A[0..b-1] < x <= A[e+1..n-1]
    3. while b <= e:
    4.     m = (b + e) / 2  # integer division
    5.     if A[m] < x: b = m + 1
    6.     else:        e = m - 1
    7. p = b
    # Post: 0 <= p <= n and A[0..p-1] < x <= A[p..n-1]

Wait! What about termination?

Hopefully, e and b get closer to each other at each iteration.
Actually, this must be e+1 and b from what we know of the loop.

  - Initially, e+1-b = (n-1)+1-0 = n.
  - For any iteration, since b <= m <= e,
        e+1-(m+1) < e+1-m <= e+1-b
    and
        (m-1)+1-b < m+1-b <= e+1-b
    so either way, interval shrinks.
  - By the loop invariant, e+1-b >= 0, so e+1-b is a natural number that
    strictly decreases at each iteration: the loop must terminate.

-----------------
Back to MergeSort
-----------------

    MergeSort(A,b,e):
    1. if b == e: return
    2. m = (b + e) / 2
    3. MergeSort(A,b,m)
    4. MergeSort(A,m+1,e)
    5. for i in [b..e]: B[i] = A[i]
    6. c = b
    7. d = m+1

At this point, we know B[b..m] contains A[b..m] sorted; B[m+1..e] contains
A[m+1..e] sorted. Want loop with postcondition A'[b..e] contains A[b..e]
sorted (where A' represents values after loop).

What we know at start:

    B:  |_________ sorted __________|__________ sorted _________|
         c                         m d                         e

    A:  |__________________________ ? __________________________|
         b                                                     e

What we want at end:

    B:  |_________ sorted __________|__________ sorted _________|
         b                         m c                         e d

    A:  |________________________ sorted _______________________|
         b                                                     e
        A[b..e] contains elements from B[b..m] and B[m+1..e]

In between: Maintain index i of next element to assign to A. This means
A[b..i-1] contains elements from B[b..c-1] and B[m+1..d-1] sorted, and also
A[b..i-1] <= B[c..m], B[d..e].

    B:  |___________|_______________|_______|___________________|
         b           c             m m+1     d                 e

    A:  |______ sorted _____|___________________________________|
         b                   i                                 e
        A[b..i-1] contains elements from B[b..c-1] and B[m+1..d-1]
        and A[b..i-1] <= B[c..m], B[d..e]

This is our loop invariant!

    LI = "A[b..i-1] contains elements from B[b..c-1] and B[m+1..d-1],
          sorted (i.e., A[b] <= ... <= A[i-1]),
          and A[b..i-1] <= B[c..m], A[b..i-1] <= B[d..e]."

Sanity check:

  - Does this give us what we want at end? Yes, because i = e+1, c = m+1,
    d = e+1, so A[b..e] contains B[b..m] and B[m+1..e], sorted; also,
    B[m+1..m], B[e+1..e] are both empty.
  - Is this true at beginning? Yes: A[b..b-1], B[b..b-1] and B[m+1..m] are
    all empty.

Now, write loop to ensure this remains true.

    # LI: A[b..i-1] contains B[b..c-1] and B[m+1..d-1], sorted,
    #     and A[b..i-1] <= B[c..m], B[d..e].
    8. for i in [b..e]:

We know either A[i] = B[c] or A[i] = B[d], but which one?

  - B[c] when d > e (no element remains in second half), or when c <= m and
    B[c] < B[d] (could be B[c] <= B[d]).
  - B[d] in all other cases (i.e., d <= e and (c > m or B[c] >= B[d])).

    9.     if d > e or (c <= m and B[c] < B[d]):
   10.         A[i] = B[c]
   11.         c += 1
           else:  # d <= e and (c > m or B[c] >= B[d])
   12.         A[i] = B[d]
   13.         d += 1

Does this preserve LI? Yes: in first case, A[b..i] contains B[b..c] and
B[m+1..d-1], sorted (since B[c] < B[d] and A[i-1] <= B[c..m], B[d..e]);
second case similar.

----------------------
Formal Language Theory
----------------------

Basic definitions and conventions:

  - "Alphabet": any finite, non-empty set of atomic symbols ("compound"
    symbols like "ab" not allowed),
    e.g., {a,b,c}, {0,1}, {+}.

  - "String": any finite sequence of symbols; empty sequence is allowed and
    denoted "e" (called "empty string"),
    e.g., a, ab, cccc are strings over {a,b,c}; e, ++++ are strings over
    {+}; a+00 is _not_ a string over {a,b,c} but it is over {0,1,+,a,b,c}.
    Convention: "e" not allowed as symbol in alphabet (to avoid confusion
    with empty string).

  - "Language": any set of strings (can be empty, finite, or infinite),
    e.g., {bab, bbabb, bbbabbb, ...} is a language over {a,b,c},
    {+,++} is a language over {+},
    {e} is a language over any alphabet,
    {} is a language over any alphabet.
    NOTE: {} is different from {e}: {} contains NO string, {e} contains ONE
    string (the empty string).

  - "Length" of string s, denoted |s|, is number of symbols in s,
    e.g., |bba| = 3, |+| = 1, |e| = 0.

  - Strings s, t are "equal" iff |s| = |t| and i-th symbol of s = i-th
    symbol of t for 1 <= i <= |s|.

  - "Reversal" of string s (denoted s^R) is string obtained by reversing
    symbols of s,
    e.g., 10110^R = 01101, +++^R = +++, e^R = e.

  - "Concatenation" of strings s and t (denoted st or s.t) consists of every
    symbol of s followed by every symbol of t,
    e.g., bba.bb = bbabb, e.+++ = +++.
    Convention: For string s, natural number k, s^k denotes s concatenated
    with itself k times,
    e.g., aba^2 = abaaba, ++^0 = e, e^3 = e.

  - Convention: For alphabet S, S^n denotes set of all strings of length n
    over S, and S^* denotes the set of all strings over S.
    E.g., {a,b,c}^0 = {e},
    {0,1}^3 = {000,001,010,011,100,101,110,111},
    {+}^* = {e,+,++,+++,++++,...} = { +^k : k in N }.

Motivation:

  - Languages are a powerful abstraction: everything from logical formulas
    to compilation of programs can be studied using languages. The study of
    properties of languages is an important aspect of theoretical computer
    science and some of its applications, particularly the abstract problem
    of language "recognition":

        Given language L and string s, does s belong to L?

    This comes up in applications such as compiler design, where source
    code must go through "lexical analysis" (to break up source code into
    "tokens" that represent identifiers, function names, operators, etc.)
    and "parsing" (to determine whether source code has correct syntax),
    something you can study in CSC488. It is also central to the study of
    computational complexity and computability, which you will study in
    CSC363.

  - We will look at two ways to express languages: descriptive ways (regular
    expressions, context-free grammars), and procedural ways (finite state
    automata, pushdown automata).
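
The string conventions from the basic definitions (s^R, s^k, S^n) can be
checked mechanically. The following is only a minimal Python sketch: it
assumes strings are represented as Python str values with "" playing the
role of e, and the function names (reversal, string_power,
strings_of_length) are made up for illustration.

```python
from itertools import product


def reversal(s):
    """Return s^R, the symbols of s in reverse order."""
    return s[::-1]


def string_power(s, k):
    """Return s^k, i.e., s concatenated with itself k times (s^0 is "")."""
    return s * k


def strings_of_length(alphabet, n):
    """Return S^n, the set of all strings of length n over alphabet S.

    Assumes every symbol in the alphabet is a single character.
    """
    return {"".join(symbols) for symbols in product(sorted(alphabet), repeat=n)}


if __name__ == "__main__":
    print(reversal("10110"))
    print(string_power("aba", 2))
    print(sorted(strings_of_length({"0", "1"}, 3)))
```

S^* is infinite, so it cannot be returned as a finite set; under these
assumptions one would enumerate S^0, S^1, S^2, ... in turn instead.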
Operations on languages: for all languages L, L' over alphabet S,

  - complement: comp(L) = S* - L

  - union: L U L'

  - intersection: L inter L'

  - reversal: rev(L) = { s^R : s in L }
    e.g., rev({a,ab,abb}) = {a,ba,bba}

  - concatenation: L.L' = { s in S* : s = rt for r in L, t in L' }
    e.g., {a,bc}.{bb,c} = {abb,ac,bcbb,bcc},
    for all L, L.{e} = L = {e}.L,
               L.{} = {} = {}.L,
    {a,aa,aaa,aaaa,...}.{b,bb,bbb,bbbb,...}
      = {ab,abb,abbb,...,aab,aabb,aabbb,...,aaab,aaabb,aaabbb,...}
      = { s in {a,b}* : s contains some number of a's followed by some
          number of b's, with at least one of each }
      = { a^m.b^n : m >= 1, n >= 1 }

  - exponentiation: L^k = L.L...L (k times)
    (s in L^k iff s = t_1.t_2...t_k for some t_1 in L, ..., t_k in L)
    e.g., {b,ac}^3 = {bbb,bbac,bacb,bacac,acbb,acbac,acacb,acacac},
    {+,+++,+++++}^0 = {e},
    {e}^5 = {e},
    {}^5 = {},
    {}^0 = {e}

  - Kleene star: L* = L^0 U L^1 U L^2 U ...
    (s in L* iff s in L^k for some k
             iff s = e or s = t_1.t_2...t_k for some k and some
                 t_1,t_2,...,t_k in L)
    e.g., {ab}* = {e,ab,abab,ababab,abababab,...},
    {e}* = {e},
    {}* = {e},
    {b,ac}* = {e} U {b,ac} U {bb,bac,acb,acac} U {bbb,bbac,...} U ...
            = {e,b,ac,bb,bac,acb,acac,bbb,bbac,...}
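
For finite languages, the operations above can be tried out directly. The
following is a rough Python sketch, not part of the lecture: it assumes a
language is a Python set of str values with "" standing for e, and the names
concat, language_power, rev and star_up_to are hypothetical. Since L* is
usually infinite, star_up_to only collects the strings of L* up to a given
length bound.

```python
def concat(L1, L2):
    """Return L1.L2 = { rt : r in L1, t in L2 }."""
    return {r + t for r in L1 for t in L2}


def language_power(L, k):
    """Return L^k; by convention L^0 = {e}, even when L is empty."""
    result = {""}
    for _ in range(k):
        result = concat(result, L)
    return result


def rev(L):
    """Return rev(L) = { s^R : s in L }."""
    return {s[::-1] for s in L}


def star_up_to(L, max_len):
    """Return the strings of L* whose length is at most max_len."""
    result = {""}      # L^0 is always part of L*
    frontier = {""}    # strings found in the previous round
    while frontier:
        # Extend the newest strings by one more string of L, keeping only
        # those that are short enough and not already found.
        frontier = {s for s in concat(frontier, L) if len(s) <= max_len} - result
        result |= frontier
    return result


if __name__ == "__main__":
    print(sorted(concat({"a", "bc"}, {"bb", "c"})))
    print(sorted(language_power({"b", "ac"}, 3)))
    print(sorted(star_up_to({"ab"}, 6)))
```

The loop in star_up_to stops because only finitely many strings of length at
most max_len can be built from L, and each round must find at least one new
string to continue.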