===========================================================================
CSC 263H              Lecture Outline for Week 6              Winter 2004
===========================================================================

[[Q: denotes a question that you should think about and that will be
     answered during lecture.]]

-------
Hashing  [ Section 2.5 ]
-------

Problem 1:
    Read a text file, keeping track of the number of occurrences of each
    character (ASCII codes 0 - 127).

Solution?
    A direct-address table: simply keep track of the number of occurrences
    of each character in an array with 128 positions (one position for
    each character).

    [[Q: What is the complexity of SEARCH? INSERT? DELETE? What is the
         memory usage? ]]

Problem 2:
    Read a data file, keeping track of the number of occurrences of each
    integer value (from 0 to 2^{32}-1).

Solution?
    It would be extremely wasteful (maybe even impossible) to keep an
    array with 2^{32} positions, especially when the data files may
    contain no more than 10^5 different values (out of all the 2^{32}
    possibilities). So instead, we will allocate an array with 10,000
    positions (for example), and figure out a way to map each integer we
    encounter to one of those positions. This is called "hashing".

Definition:
    Given a universe of keys U (the set of all keys possible), we will
    allocate a "hash table" T containing m positions (m will usually be
    chosen based on the application). We also need to define a "hash
    function" h : U -> {0,1,...,m-1} (that maps keys to positions in the
    hash table) so that for each possible key k in U, k will be stored in
    T at position h(k).
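
    The two problems above can be sketched in a few lines of Python. This
    is only an illustration: the names count_chars and h are hypothetical,
    and the choice h(k) = k mod m is an assumption (one possible hash
    function among many), not something the notes prescribe.

```python
def count_chars(text):
    """Problem 1: direct-address table with one counter per ASCII code 0..127."""
    counts = [0] * 128
    for ch in text:
        code = ord(ch)
        if code < 128:          # assumption: silently skip non-ASCII characters
            counts[code] += 1   # the key itself is the array index
    return counts


def h(k, m=10_000):
    """Problem 2: map a key from the universe 0..2^32-1 to a position 0..m-1.

    k mod m is an assumed, illustrative hash function.
    """
    return k % m


counts = count_chars("hello")
print(counts[ord("l")])   # occurrences of 'l'
print(h(2**32 - 1))       # table position for the largest possible key
```

    Note that count_chars indexes the array by the key directly, while h
    first squeezes the key into the much smaller range of table positions.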

    [[Q: If the set of all keys was the set of all possible integer values
         (from 0 to 2^{32}-1), give some possible hash functions if
         m = 1,024 (2^{10}).]]

    The intention is that when we want to access key k, instead of looking
    in T[k] (as in the example with characters), we will look in T[h(k)]
    (so the hash function h tells us how to find the position in the table
    that corresponds to k, when the number of positions is smaller than
    the number of keys).

    [[Q: When two keys x =/= y hash to the same location (h(x) = h(y)), we
         say that they are in "collision". Would it be possible to choose
         a hash function so that you could be sure you would have no
         collisions? How or why not?]]

    [[Q: Consider a hash table where each location could hold b keys.
         Suppose we had b items already in a location (bucket) and another
         item (b+1) hashed to the same location. What choices do we have
         about how to store this last item?
         Hint: Think about what you do in your personal phone book when
         you have too many friends whose names begin with the same letter
         (say "W").]]

---------------
Open Addressing
---------------

Open addressing collision-handling strategies use a predetermined rule to
calculate a sequence of buckets A_0,A_1,A_2,... into which the scheme would
attempt to store an item. This list of possible buckets is called a "probe
sequence". The item is stored in the first bucket along its probe sequence
which isn't already full.

  - Linear Probing:
    The easiest open addressing strategy is linear probing. For n buckets,
    key k and hash function h(k), the probe sequence is calculated as:

        A_i = (h(k) + i) mod n    for i = 0,1,2,...

    Note that A_0 (the home bucket for the item) is h(k), since h(k)
    should map to a value between 0 and n-1.

    [[Q: Work through an example where h(k) = k mod 11, n = 11 and each
         bucket holds only one key.
         Insert the keys 26,21,5,36,13,16,15 (in that order).]]

    [[Q: What is the problem with linear probing?]]

    [[Q: How could we change the probing so that two items that hash to
         different home buckets don't end up with nearly identical probe
         sequences? ]]

  - Non-Linear Probing:
    Non-linear probing includes schemes where the probe sequence does not
    involve steps of fixed size. Consider quadratic probing where the
    probe sequence is calculated as:

        A_i = (h(k) + i^2) mod n    for i = 0,1,2,...

    [[Q: Work through an example where h(k) = k mod 11, n = 11 and each
         bucket holds only one key.
         Insert the keys 26,21,5,36,13,16,15 (in that order).]]

    Probe sequences will still be identical for elements that hash to the
    same home bucket.

  - Double Hashing:
    In double hashing (another open addressing scheme) we use a different
    hash function h_2(k) to calculate the step size. The probe sequence
    is:

        A_i = (h(k) + i * h_2(k)) mod n    for i = 0,1,2,...

    Notice that it is important that h_2(k) =/= 0 for any key k.

    [[Q: Why? What other choices for h_2(k) would be poor? ]]

-----------------
Closed Addressing
-----------------

  - Chaining:
    Instead of storing each key k directly at location h(k), we store a
    simple unordered singly-linked list of keys at each location in table
    T (so that we can accommodate collisions by storing each key that
    hashes to the same location in one linked list).

    Pictorially, the situation would look like the diagram below, if we
    had inserted keys k1,k2,k3,k4,k5,k6 such that h(k1) = h(k4) = h(k6) = 2,
    h(k2) = h(k5) = 1, and h(k3) = m-1 (where we use '/' to represent a
    NIL link and we only draw "forward" links).

          T
         ---
      0  |/|
         ---  ------   ------
      1  |*-->|k5|*-->|k2|/|
         ---  ------   ------   ------
      2  |*-->|k6|*-->|k4|*-->|k1|/|
         ---  ------   ------   ------
      :
         ---
     m-2 |/|
         ---  ------
     m-1 |*-->|k3|/|
         ---  ------

    [[Q: What is the worst-case running time of operations on such a hash
         table? ]]

      . INSERT(x)?

      . DELETE(x)?

    What about SEARCH?
    For once, this is a data structure where SEARCH is the "tricky"
    operation, as opposed to INSERT or DELETE! We will look at the
    complexity for SEARCH next week...
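
    As a concrete reference for the chaining scheme pictured above, here
    is a minimal Python sketch. It is an illustration under assumptions:
    the class and method names (Node, ChainedHashTable, chain) are
    hypothetical, keys are integers, the hash function is h(k) = k mod m,
    new keys are placed at the head of their list (matching the order in
    the diagram), and INSERT does not check for duplicates.

```python
class Node:
    """One cell of a singly-linked list; next is None for a NIL link."""

    def __init__(self, key, next=None):
        self.key = key
        self.next = next


class ChainedHashTable:
    def __init__(self, m=11):
        self.m = m
        self.table = [None] * m          # None plays the role of '/'

    def _h(self, key):
        return key % self.m              # assumed hash function

    def insert(self, key):
        # Put the new key at the head of the list for its location.
        i = self._h(key)
        self.table[i] = Node(key, self.table[i])

    def search(self, key):
        # Walk the list at location h(key) until the key is found.
        node = self.table[self._h(key)]
        while node is not None:
            if node.key == key:
                return node
            node = node.next
        return None

    def delete(self, key):
        # Unlink the first node holding key; report whether one was found.
        i = self._h(key)
        prev, node = None, self.table[i]
        while node is not None and node.key != key:
            prev, node = node, node.next
        if node is None:
            return False
        if prev is None:
            self.table[i] = node.next
        else:
            prev.next = node.next
        return True

    def chain(self, i):
        """Return the keys stored at location i, head first."""
        keys, node = [], self.table[i]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


t = ChainedHashTable(m=11)
for k in (13, 24, 35):                   # all three keys hash to location 2
    t.insert(k)
print(t.chain(2))                        # most recently inserted key first
```

    Comparing how much of a list each method has to walk is one way to
    approach the questions about INSERT, DELETE and SEARCH above.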