===========================================================================
CSC 263H                Lecture Outline for Week 7              Winter 2004
===========================================================================

[[Q: denotes a question that you should think about and that will be
answered during lecture. ]]

--------------------
Analysis of Chaining
--------------------

Worst-case running time for SEARCH when using chaining to resolve
collisions:

For any fixed hash function, as long as |U| > m(n-1), it's possible to
pick n keys that all hash to the same bucket, so the worst-case running
time of SEARCH is Theta(n). (Note that if |U| <= m(n-1), there are hash
functions that will put no more than n-1 items in each location of the
hash table, while if |U| > m(n-1), then at least one location _must_
contain at least n items.)

Average-case running time for SEARCH when using chaining to resolve
collisions:

- To simplify the analysis, we make the assumption that any key is
  equally likely to hash to any bucket ("simple uniform hashing"), i.e.,

      Pr[h(x)=i] = SUM_{x in U, h(x)=i} Pr[x] = 1/m   for i = 0, 1, ..., m-1.

  As a consequence, the expected number of items that hash to each
  bucket is the same.

- We define the "load factor" 'a' as the expected number of items in
  each bucket.
  [[Q: Under the assumption of simple uniform hashing, what is a? ]]

- Random variables? T(x) = # elements examined when searching for x.

- Let L_i be the number of elements in bucket i, and
  n = SUM_{i=0}^{m-1} L_i.

- Probability space? Pick x uniformly at random from U.

- Under the assumption of simple uniform hashing, Pr[h(x)=i] = 1/m, so

      E[T] =  SUM_{x in U} ( Pr[x] * T(x) )

           =  SUM_{i=0}^{m-1} SUM_{x in U, h(x)=i} ( Pr[x] * T(x) )

           <= SUM_{i=0}^{m-1} ( Pr[h(x)=i] * L_i )

           =  (1/m) SUM_{i=0}^{m-1} L_i  =  n/m  =  a,

  since T(x) <= L_i if h(x) = i, and in the second line, we simply
  partitioned the universe U into subsets according to which bucket the
  keys map to.
- Problem: Since |U| >> m, when we pick x uniformly at random from U,
  Pr[x in T] is very small (here T denotes the hash table itself). So
  the analysis we just carried out essentially tells us the expected
  time to search when the key is most likely _not_ in the hash table.
  This expected time is intuitively correct: if we are searching for a
  key that is not in the hash table, we need to traverse one complete
  linked list whose average size is a.

- Solution: Change probability space to pick x uniformly at random from
  the elements in T. Then, Pr[h(x)=i] = L_i/n, and the probability that
  x is the j-th element in bucket i, given that h(x)=i, is simply 1/L_i
  (it is also uniform).

      E[T] = SUM_{x in T} Pr[x] * T(x)

           = SUM_{i=0}^{m-1} ( Pr[h(x)=i]
                 * SUM_{j=1}^{L_i} ( j * Pr[x is j-th in slot i] ) )

           = SUM_{i=0}^{m-1} ( L_i/n * SUM_{j=1}^{L_i} j/L_i )

           = (1/n) SUM_{i=0}^{m-1} SUM_{j=1}^{L_i} j

           = (1/n) SUM_{i=0}^{m-1} L_i(L_i+1)/2

           = (1/(2n)) SUM_{i=0}^{m-1} (L_i)^2 + (1/(2n)) SUM_{i=0}^{m-1} L_i

           = (1/(2n)) SUM_{i=0}^{m-1} (L_i)^2 + 1/2

  Unfortunately, no further simplification is possible. However, under
  the assumption of simple uniform hashing, we know that the average
  value of L_i is a=n/m, so approximating each L_i by this average gives

      E[T] = (1/(2n)) SUM_{i=0}^{m-1} (n/m)^2 + 1/2
           = n^2/(2nm) + 1/2
           = (a+1)/2.

------------------------------------------
Analysis of Operations for Open Addressing
------------------------------------------

For any well-designed open addressing scheme, the probe sequence for a
key k will be some permutation of the indices [0,1,...,m-1] (for badly
designed formulas, the probe sequence may not include every index).

Assumption: the probe sequence is equally likely to be any one of the
m! permutations of [0,1,...,m-1].

As before, we compute two slightly different expectations: the expected
number of probes performed during an unsuccessful search, and the
expected number of probes performed during a successful search.

Unsuccessful search:

Let T denote the number of probes performed.
Let A_i denote the event that the i-th probe finds an occupied bucket.
Then, T >= i iff A_1 and A_2 and ... and A_{i-1} all occur, so

    Pr[T >= i] = Pr[A_1 intersect A_2 intersect ... intersect A_{i-1}]
               = Pr[A_1] * Pr[A_2 | A_1] * Pr[A_3 | A_1 intersect A_2]
                   * ... * Pr[A_{i-1} | A_1 intersect ... intersect A_{i-2}]

For j >= 1,

    Pr[A_j | A_1 intersect ... intersect A_{j-1}] = (n-j+1)/(m-j+1)

because we have to find one of the remaining n-j+1 elements among one
of the remaining m-j+1 buckets, each with equal probability.
[[Q: What is Pr[A_1]? ]]

Hence,

    Pr[T >= i] = n/m * (n-1)/(m-1) * ... * (n-i+2)/(m-i+2)
               <= (n/m)^{i-1} = a^{i-1}

and

    E[T] = SUM_{i=1}^{m} Pr[T >= i]
         <= SUM_{i=1}^{oo} a^{i-1}
         = 1 + a + a^2 + a^3 + ...
         = 1/(1-a) = m/(m-n)          (if a < 1).

[[Q: What is the expected number of probes if the table is half full?
If it is 90% full? ]]

Note that as long as a is a constant, E[T] will be a constant also.

Successful search:

Consider searching for a key k already in the table.
[[Q: Think about the probe sequence used to insert k into the table,
and the probe sequence used to find k later. What can we say about the
two sequences? ]]
[[Q: If k was the (i+1)st key inserted, then by the previous analysis,
what was the expected number of probes made during insertion? ]]

If we assume that every key currently in the table is equally likely to
be searched for, we can just take the average over all keys to get:

    E[T] = (1/n) SUM_{i=0}^{n-1} m/(m-i)
         = (m/n) SUM_{i=0}^{n-1} 1/(m-i)
         = (m/n) (H_m - H_{m-n})
         approx (m/n) (ln(m) - ln(m-n))
         = (m/n) ln( m/(m-n) )
         = (1/a) ln( 1/(1-a) )

(where H_n is the n-th harmonic number: 1 + 1/2 + 1/3 + ... + 1/n and
H_n is asymptotically equal to ln n).

-----------------
Hashing functions
-----------------

Two issues:

- mapping keys k into integers (k could be a string or other type of
  object that can be ordered);
- mapping the integer values for keys into the range [0..m-1] where m
  is the size of the hash table.

The second issue is a little easier.
Given an arbitrary integer x, we can map it into the range [0..m-1] by
simply using mod: x mod m.
[[Q: What are some of the problems this may cause? Think of
collisions.]]

For this reason, m is usually chosen to be a prime number. Patterns are
still possible but less likely.

For keys that are not integers, we still have to come up with an
integer hash code. For example, if k is a string, one standard method
is to add up the character codes of the individual characters in k. In
other words, if k is made up of r characters x_0 x_1 x_2 ... x_{r-1},
then we would compute the number

    x_0 + x_1 + ... + x_{r-1}

(assuming each character is stored as a numerical code to start with).
[[Q: What is one disadvantage of this method? ]]

To prevent this, a common method is called a "polynomial hash code":
make the position of each component of k count in the computation by
choosing some nonzero constant a and computing

    x_0 + x_1 a + x_2 a^2 + ... + x_{r-1} a^{r-1}.

This can be evaluated more efficiently using Horner's method:

    x_0 + a(x_1 + a(x_2 + ... + a(x_{r-2} + a x_{r-1}) ... )).

---------------
Dynamic Hashing
---------------

Notice that E[T] always grows with the load factor (a).
[[Q: How could we decrease the load factor when E[T] is getting
unacceptably large? ]]

Consider a hashing scheme which has m buckets numbered 0 to m-1. Now
add a new bucket (m) but do not change anything else.
[[Q: Will this help decrease the load factor? Why? What will happen? ]]

The hash function must be changed so some of the keys map to bucket m.
Consider changing the hash function to h(k) = k mod (m+1).
[[Q: What is wrong with doing that? ]]

-------------------
Incremental Hashing
-------------------

Incremental hashing is a dynamic hashing scheme which allows the hash
table to grow by one bucket at a time but doesn't require rehashing of
all elements currently in the table.

Initially the table is constructed with m buckets (0 .. m-1) and the
hash function is h(k) = k mod m.
Use chaining as the collision strategy for now. (We could later extend
this with minor modifications to use open addressing.)

When the performance is unacceptable, the table is extended by adding
1 bucket.
[[Q: What measures could be used to evaluate acceptable performance or
determine that it is time to grow? ]]

The first time the table grows, the new bucket is m. All records in
bucket 0 are rehashed using a modified hash function:

    h_2(k) = k mod 2m

[[Q: Can we be certain that everything that was previously in bucket 0
will have h_2(k) in the range (0 .. m)? ]]

Now when we insert an item, we first apply h(k). If h(k) <= 0 (i.e.,
h(k) is a bucket that has already been split), we apply h_2(k) to
determine the home bucket; otherwise the home bucket is h(k).

When it is time to "grow" again, we "split" bucket 1 by extending the
table to bucket m+1 and rehashing all the elements in bucket 1 using
h_2(k). Note that 'm' remains constant throughout all of this (i.e.,
even though we are adding buckets to the table, the value of m remains
the same; we must have some other means of keeping track of the number
of buckets being used, such as a separate variable 'b').
[[Q: Explain the process to now insert an item with key k. ]]

When we have done m "splits" we will have 2m buckets (0 .. (2m-1)) and
all home buckets will be determined by h_2(k) = k mod 2m.
[[Q: If we had to grow again, which bucket would split and what would
be the new hash function? ]]

------------------
Extensible Hashing
------------------

This is another dynamic hashing scheme which uses a directory to index
the buckets and progressively more bits of the hashed key to index into
the directory.

Essentially, the hash function is used to generate a binary number for
each key, and the first i bits of these binary numbers are used as
indices into a "directory" (the hash table), where i is the smallest
number such that the first i bits of all keys are distinct.
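
As a rough illustration of how the number of bits i is determined, the
following Python sketch computes the smallest prefix length that keeps a
set of hash values distinct. The function name
shortest_distinguishing_prefix and the 4-bit hash values are
hypothetical, chosen only for this example; the sketch also assumes the
hash values are distinct bit strings of equal length.

```python
def shortest_distinguishing_prefix(hashes):
    """Return the smallest i such that the first i bits of all hashes differ.

    Assumption (not from the lecture): hashes are distinct strings of
    '0'/'1' characters, all of the same length.
    """
    longest = max(len(h) for h in hashes)
    for i in range(longest + 1):
        # Collect the i-bit prefixes; they are distinct exactly when
        # the set has as many members as there are hash values.
        prefixes = {h[:i] for h in hashes}
        if len(prefixes) == len(hashes):
            return i
    raise ValueError("hash values are not all distinct")


# Hypothetical 4-bit hash values.
i = shortest_distinguishing_prefix(["0010", "1110", "1001"])
directory_size = 2 ** i   # one directory entry per i-bit pattern
print(i, directory_size)  # prints "2 4"
```

A directory indexed by i bits has 2^i entries, so each extra bit needed
to separate the keys doubles the directory.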
For example, consider the diagram below, where the elements to insert
have had some function applied to their keys to generate binary
representations.

    h(k1) = 0010100101110
    h(k2) = 1110110101100
    h(k3) = 1001001010010

The first two bits of the keys are enough to distinguish between all
three (00 for k1, 11 for k2, 10 for k3) so we get the following
directory:

         directory
         ---------
        |   00    |---\
        |---------|    ----> bucket A (contains k1)
        |   01    |---/
        |---------|
        |   10    |--------> bucket B (contains k3)
        |---------|
        |   11    |--------> bucket C (contains k2)
         ---------

The first bit is used to determine which half of the directory each key
will go into (k1 goes into the top half because it starts with 0, k2
and k3 go into the second half because they start with 1). The second
bit is used to further separate the keys: because k1 is the only key in
the first half, both locations are used to point to the same bucket.

During insertion, if the hash value for the new element has its first i
bits different from all the ones currently in the table, it simply gets
inserted at the right position. Otherwise, use as many more bits as
necessary to make each key have a unique location.
[[Q: Suppose for the sake of this example that each bucket holds only a
single record. Draw a diagram of the situation when
h(k4) = 0111111111111 is inserted. ]]

Now consider the insertion of h(k5) = 110000000000 into the original
table (without k4). The first two bits are not unique (they conflict
with k2) so we try to use the first 3 bits instead.
This will double the size of the directory and result in the situation
below:

         directory
         ---------
        |   000   |--\
        |---------|   \
        |   001   |----\
        |---------|     ---> bucket A (contains k1)
        |   010   |----/
        |---------|   /
        |   011   |--/
        |---------|
        |   100   |---\
        |---------|    ----> bucket B (contains k3)
        |   101   |---/
        |---------|
        |   110   |--------> bucket D (contains k5)
        |---------|
        |   111   |--------> bucket C (contains k2)
         ---------

[[Q: What would have happened if the new key to insert was
h(k6) = 111111111111 ? How many buckets would there be and how large
would the table be? ]]
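
The directory doubling traced above can be sketched in Python. This is
only an illustrative sketch under stated assumptions, not a definitive
implementation: the class names Bucket and ExtensibleTable, the
per-bucket "depth" field, and the default capacity of one record per
bucket are all choices made for this example. Hash values are passed as
strings of '0'/'1' and are assumed to be distinct and long enough to be
told apart.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of leading bits shared by all items here
        self.items = []


class ExtensibleTable:
    def __init__(self, capacity=1):
        self.capacity = capacity       # records per bucket (assumed 1)
        self.global_depth = 0          # the directory is indexed by this many bits
        self.directory = [Bucket(0)]   # 2**global_depth entries

    def _index(self, bits):
        # Read the first global_depth bits as a directory index.
        prefix = bits[:self.global_depth]
        return int(prefix, 2) if prefix else 0

    def insert(self, bits):
        # Keep splitting until the target bucket has room.
        while True:
            bucket = self.directory[self._index(bits)]
            if len(bucket.items) < self.capacity:
                bucket.items.append(bits)
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Use one more bit: every directory entry becomes two adjacent
            # entries that still point to the same bucket.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        # Redistribute the items on the newly significant bit.
        old_items, bucket.items = bucket.items, []
        for item in old_items:
            target = new_bucket if item[bucket.depth - 1] == "1" else bucket
            target.items.append(item)
        # Repoint the directory entries whose newly significant bit is 1.
        shift = self.global_depth - bucket.depth
        for idx, b in enumerate(self.directory):
            if b is bucket and (idx >> shift) & 1:
                self.directory[idx] = new_bucket


k1, k2, k3 = "0010100101110", "1110110101100", "1001001010010"
k5 = "110000000000"

table = ExtensibleTable()
for h in (k1, k2, k3):
    table.insert(h)
depth_before = table.global_depth   # 2 bits, 4 directory entries

table.insert(k5)
depth_after = table.global_depth    # 3 bits, 8 directory entries
```

After the last insertion, entries 000 to 011 of table.directory share
one bucket holding k1, entries 100 and 101 share the bucket holding k3,
and entries 110 and 111 point to separate buckets holding k5 and k2, as
in the diagram.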