Languages ========= A language L over an alphabet A is a set of (finite length) strings of symbols from A. Regular expressions =================== Describe (some?!) languages. Start with finite alphabet of symbols. e (epsilon): empty string a symbol: that symbol Alternation/union r+s: string described by r or s Concatenation rs: strings described by r concatenated with strings described by s E.g. (x+y)z(x+e): xzx, xz, yzx, yx Repetition/Kleene* r*: strings described by r repeated 0 or more times E.g. (x+y)*z(x+e): zx, xz, yxxyzx, etc. Order of operations: * (as usual: unary before non-unary), rs, r+s (think 'multiply' before 'add') Let's express: floating-point number in Java string of 0s and 1s, with odd number of 1s string of 0s and 1s, with at most two 1s properly nested (balanced) brackets hmm, seems tough, maybe impossible? DFSA ==== Machine with finite amount of memory, processes a string character by character, responds with yes (accept) or no (reject). Can think of as like a Java program of the form: public boolean m(String s) { // Declare variables to track state ... for (int i = 0; i < s.length(); ++i) { char c = s.charAt(i); // update state variables based only on // their current state // c } return ... // some combination of state variables } The int i should not be considered limited by size of ints. But variables used to track state must be considered limited (taking a bounded amount of space). Alphabet: finite character set. State: a memory configuration (note there are a finite number of states: 2 ^ # bits). The start/initial state. The accepting (yes) states. Transition function: maps current state + next character to next state. E.g. machine to accept strings of 0s and 1s with odd number of 1s: Alphabet: {0, 1} States: even number of 1s so far (S1) odd number of 1s so far (S2) Start: S1 Accepting states: S2 Transition function: d(S1,0) = S1 d(S1,1) = S2 d(S2,0) = S2 d(S2,1) = S1 Picture: ... Convention: don't *draw* dead-end states. E.g. at most two 1s ... How many states necessary? Consider e, 1, 11, 111. If entire string is e, 1 or 11 then accepted. If entire string is 111 then rejected. So 111 enters different state than e, 1 and 11. 11 enters different state than e and 1, since 11 followed by 1 enters the non-accepting state for 111, but e or 1 followed by 1 enters accepting state(s) for 1 or 11. 1 enters different state than e, since 1 followed by 11 enters the non-accepting state for 111, but e followed by 11 enters accepting state for 11. So at least four states. NFSA ==== Parallel processing DFSA: May transition into multiple states (including spontaneous e-transition). Transition function: maps one of the current states + next character to *set* of next states Note: DFSA -> NFSA, since don't have to use the extra freedom. E.g. (11+110)*0 by DFSA, NFSA: ... Turning NFSA into DFSA - the Subset Construction. Use possible unions of states in the NFSA as states for DFSA. Equivalence =========== Definition: A language is regular iff it's describable by a RE. We left out one RE in the description of REs above: the empty RE. It matches no string, and is useful technically. Theorem: If a language is accepted by some NFSA then it's also accepted by some DFSA. Proof: by subset construction. Theorem (Kleene): If a language is accepted by a DFSA then it's regular. Proof: text. Theorem (Kleene): If a language is regular then it's accepted by an NFSA. Proof: REs built from empty, e, symbol, concatenation, union, repetition. Structural induction on REs. empty, e, symbol: fairly easy. RS: Make new start state s. If s_R is accepting, make s accepting. Duplicate transitions out of s_R, have them come out of s. Duplicate transitions into accepting states of R, have them go into s_S. Make none of the R states accepting anymore. R+S: Make new start state s. If s_R or s_S is accepting, make s accepting. Duplicate transitions out of s_R and s_S, have them come out of s. (Exercise: why not combine s_R and s_S?) R*: Make new start state s. Make it accepting. Duplicate transitions out of s_R, have them come out of s. Duplicate transitions into accepting states of R, have them go into s_R. If we had used e-transitions then we could have made the constructions a bit simpler: can you see how? In general (as in the proof) e-transitions are not necessary: they can be systematically eliminated. These three theorems form a classic circle of implications (NFSA->DFSA->Regular->NFSA), showing all three conditions for a language are equivalent. Regular languages ================= Let A be the alphabet. {}, A* (set of all strings) and finite subsets of A are all regular. Suppose L1 and L2 are regular languages (over A). Let regular expressions R1 and R2 describe them. Let M1 and M2 be DFSAs recognizing them. L1 U L2 is also regular: Described by a RE: R1 + R2. A*-L1 (complement of L1) is also regular: Accepted by M1 with accept/reject states flipped. Interesting: not so obvious how to convert R1 to RE for A*-L1. L1 intersect L2 also regular: apply deMorgans law, use union and complement. Exercise: build recognizing DFSA M from M1 and M2. L1L2 = {s1s2: s1 in L1 and s2 in L2} also regular. Described by a RE: R1R2. L1* = {s1s2...sn: n in N and all si in L1} also regular. Described by a RE: R1*.