===========================================================================
CSC 263H               Lecture Outline for Week 5               Winter 2004
===========================================================================

[[Q: denotes a question that you should think about and that will be
answered during lecture.]]

--------------------------
Other Self-Balancing Trees
--------------------------

Last week we introduced red-black trees, a binary variation on 2-3-4 trees
in which each node is coloured red or black and the notion of height depends
on the black nodes of the tree. Black nodes were allowed to have 0, 1 or 2
red children, but red nodes could not have red children. We saw how this
could be related to 2-3-4 trees.

This week we want to briefly look at a generalization of 2-3-4 trees called
B-trees. Similar to 2-3-4 trees, B-trees are multiway trees with all leaves
at the same level and a varying number of children per node. A B-tree node
can hold at most m keys and pointers to m+1 children.

The following properties must hold in a B-tree of order m:

  - The root must hold at least 1 key and at most m keys.
  - Every node other than the root must hold between floor(m/2) and m keys.
  - All leaves must be at the same level.

[[Q: How does a 2-3-4 tree relate to a B-tree?]]

[[Q: Consider keeping a search tree on disk. How many disk accesses would
it take to search a tree of height h? Why is this so important?]]

When the OS reads from disk, it reads a minimum of one block of data. That
means that even if all you want is one 2-node from a 2-3-4 tree (possibly
12 bytes), the OS must still read a full block. It also has to spin the disk
to the correct sector, move the head to the correct track, etc.

[[Q: So how could we design the tree better to take advantage of the fact
that disk access is expensive but once we do the access we read at least one
full block of data?]]

Similar to 2-3-4 trees, B-trees have insertion and deletion algorithms that
SPLIT and MERGE as necessary to maintain the properties.

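As a rough illustration of the disk-access discussion above, the following
Python sketch searches a small hand-built B-tree and counts one block read
per node visited. The names BNode and btree_search are hypothetical, and the
idea that each node occupies exactly one disk block is an assumption made
for the example; it is a sketch, not the algorithm presented in lecture.

```python
from bisect import bisect_left


class BNode:
    """Hypothetical B-tree node: sorted keys, and len(keys)+1 children
    (an empty child list marks a leaf)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def btree_search(node, k):
    """Return (found, reads); reads counts the nodes visited, assuming
    that fetching one node costs one block read."""
    reads = 0
    while node is not None:
        reads += 1                       # one (assumed) disk access per node
        i = bisect_left(node.keys, k)    # first index with keys[i] >= k
        if i < len(node.keys) and node.keys[i] == k:
            return True, reads
        if not node.children:            # reached a leaf without finding k
            return False, reads
        node = node.children[i]          # descend into the i-th subtree
    return False, reads


# Hand-built tree with two levels that satisfies the order-4 properties:
# the root holds 2 keys, every other node holds between 2 and 4 keys.
root = BNode([30, 60], [BNode([10, 20]),
                        BNode([40, 50]),
                        BNode([70, 80, 90])])

print(btree_search(root, 50))   # (True, 2)
print(btree_search(root, 65))   # (False, 2)
```

The number of reads never exceeds the number of levels in the tree, so
packing more keys into each node (a larger m) makes the tree shallower and
reduces the number of block reads per search.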
--------------------------
Augmenting Data Structures   [Not in G&T Text. In CLRS chapter 14.]
--------------------------

General Definition:

  - An "augmented" data structure is simply an existing data structure
    modified to store additional information and/or perform additional
    operations.

  - Generally, a data structure is augmented as follows:

     1. Determine additional information that needs to be maintained.
     2. Check that the additional information can be maintained during
        each of the original operations (and at what additional cost, if
        any).
     3. Implement the new operations.

Example:

  - We want a data structure that will allow us to answer two types of
    "rank" queries on sets of values, as well as having standard operations
    for maintaining the set (INSERT, DELETE, SEARCH):

      . RANK(k): Given a key k, what is its "rank", i.e., its position
        among the elements in the data structure?
      . SELECT(r): Given a rank r, what is the key with that rank?

    For example, if our set of values is {3,15,27,30,56}, then
    RANK(15) = 2 and SELECT(4) = 30.

  - Let's look at 3 different ways we could do this:

     1. Use 2-3-4 trees without modification.

          . Queries: Simply carry out an inorder traversal of the tree,
            keeping track of the number of keys visited, until the desired
            rank or key is reached.

        [[Q: What will be the time for a query? ]]
        [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any
        longer?]]
        [[Q: What is the problem? Could we do better? ]]

     2. Augment 2-3-4 trees so that each key x has an additional field
        'rank[x]' that stores its rank in the tree.

        [[Q: What will be the time for a query? ]]
        [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any
        longer?]]
        [[Q: What is the problem? Could we do better? ]]

     3. Augment the tree in a more sophisticated way. Consider augmenting
        2-3-4 trees so that each node x has an additional field 'size[x]'
        that stores the number of keys in the subtree rooted at x
        (including x itself).
        This may seem unrelated to 'rank' at first, but we'll see that it
        is enough to allow us to do what we want.

  - Queries:

      . Consider first a 2-node with a single element x, and let v_1 be its
        left child. We know that rank[x] = 1 + number of keys that come
        before x in the tree. In particular, if we consider only the keys
        in the subtree rooted at the node holding x, then

            rank[x] = size[v_1] + 1

        (this is NOT necessarily the true rank of x in the whole tree, only
        its "relative" rank among the keys in the subtree rooted at the
        node holding x).

      . Now consider a 4-node with elements x_1, x_2 and x_3 and children
        v_1, ..., v_4. The "relative" rank of x_i is again 1 + number of
        keys that come before x_i in the subtree. So,

            rank[x_i] = SUM_{j=1}^{i} (size[v_j] + 1)

      . RANK(k): Given key k, perform SEARCH on k keeping track of the
        "current rank" r. Each time you go down a level, you must add the
        sizes of the subtrees to the left that you skipped, plus 1 for each
        key that you skipped. Notice that this is the "relative" rank of
        the key immediately to the left of the subtree you are exploring.

        Reminder of the SEARCH algorithm from week 4:

          SEARCH(n,k):
              if n is a leaf:
                  return NIL   (k is not in the tree)
              for i := 1 to d:
                  if k = k_i:
                      return x_i
                  if k < k_i:
                      return SEARCH(v_i,k)

        [[Q: When we recursively call SEARCH(v_i,k) what do we add to r?]]
        [[Q: When we find x how do we determine its true rank?]]

      . Note that we did not deal with degenerate cases (such as when k
        does not belong to the tree), but it is easy to modify the
        algorithm to treat those cases.

      . SELECT(r): Given rank r, start at x = root[T] and work down. Start
        with the left-most subtree that hasn't yet been explored (S), and
        compare size[S] + 1 to r. If they are equal, return the element to
        the immediate right of the pointer to S in S's parent. If
        r < size[S] + 1, we know that the element we are looking for is in
        S, so call the routine recursively on S.
        If r > size[S] + 1, then we know the element we are looking for is
        in one of S's right siblings and that its relative rank among the
        remaining elements (ignoring S and the key to its right) is equal
        to r - (size[S] + 1), so we change r accordingly and move on to the
        next sibling of S to the right.

        Once again, we did not deal with degenerate cases (such as when r
        is a rank that does not correspond to any key in the tree), but
        they are easily accommodated with small changes to the algorithm.

        [[Q: What will be the time for a query? ]]

  - Updates: INSERT and DELETE operations consist of two phases for 2-3-4
    trees: the operation itself, followed by the fix-up process. We look
    at the operation phase first, and deal with the fix-up process
    afterwards.

      . INSERT(x): Simply increment the size of the subtree rooted at
        every node that is examined when finding the position of x (since
        x will be added in that subtree).

      . DELETE(x): Consider the element y that is actually removed by the
        operation (so y = x or y = successor(x)). We know the size of the
        subtree rooted at every node on the path from the root down to y
        decreases by 1, so we simply traverse that path and decrement the
        size of each node on it.

      . SPLITS: Consider splitting a node and promoting a key. To find
        the sizes of the two new nodes we need to check the sizes of each
        of their subtrees. This is O(1). The size of the parent node
        (into which the promoted key is inserted) is not changed.

      . TRANSFERS: Remember that the deletion leaves a 1-node with no key
        and only one child, and this requires us to borrow from a sibling.
        In the picture below, ?? is the key which has been deleted
        (demoted).

                    ___M___                      ___L___
                   /       \                    /       \
            __K_____L__     ??              __K__       __M__
           /     |     \     |             /     \     /     \
          s1     s2    s3    s4           s1     s2   s3     s4

        The only size fields that have changed are those of the nodes
        containing K and ??. The size of the node containing K has
        decreased by size[s3] + 1 and the size of the node containing ??
        has increased by size[s3] + 1.
        Note that the node KL must be at least a 3-node, but it could also
        be a 4-node. This would not affect the size changes.

      . MERGES: It is easy to see that merges only affect the size of the
        resulting merged node and that the size is easily computed from
        the original sizes.

    Update time? We have only added a constant amount of extra work per
    node visited during the first phase of each operation, and a constant
    amount during each SPLIT, TRANSFER, or MERGE, so the total time is
    still Theta(log n).

  - Now, we have finally achieved what we wanted: each operation (old or
    new) takes time Theta(log n) in the worst case.

-------
Hashing   [ Section 2.5 ]
-------

Problem 1:
    Read a text file, keeping track of the number of occurrences of each
    character (ASCII codes 0 - 127).

Solution?
    A direct-address table: simply keep track of the number of occurrences
    of each character in an array with 128 positions (one position for each
    character). All operations are therefore Theta(1) and the memory usage
    is small.

Problem 2:
    Read a data file, keeping track of the number of occurrences of each
    integer value (from 0 to 2^{32}-1).

Solution?
    It would be extremely wasteful (maybe even impossible) to keep an array
    with 2^{32} positions, especially when the data files may contain no
    more than 10^5 different values (out of all the 2^{32} possibilities).
    So instead, we will allocate an array with 10,000 positions (for
    example), and figure out a way to map each integer we encounter to one
    of those positions. This is called "hashing".
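
To make Problem 2 concrete, here is a minimal Python sketch of one possible
way to do the mapping. The hash function (key mod table size) and the use of
chaining to handle two keys that land in the same position are assumptions
chosen for illustration; the text above does not fix either choice. The
names slot, count_occurrences and lookup are hypothetical.

```python
NUM_SLOTS = 10_000   # table size taken from the example in the text


def slot(key):
    """Assumed hash function: map an integer in [0, 2**32) to a position."""
    return key % NUM_SLOTS


def count_occurrences(values):
    """Count how often each integer occurs. Each table position holds a
    list of [key, count] pairs (chaining, an assumed collision strategy)."""
    table = [[] for _ in range(NUM_SLOTS)]
    for v in values:
        bucket = table[slot(v)]
        for entry in bucket:
            if entry[0] == v:        # key already present: bump its count
                entry[1] += 1
                break
        else:                        # key not yet in this bucket
            bucket.append([v, 1])
    return table


def lookup(table, key):
    """Return the recorded count for key, or 0 if it never occurred."""
    for k, count in table[slot(key)]:
        if k == key:
            return count
    return 0


data = [7, 10_007, 7, 4_000_000_000, 7]
table = count_occurrences(data)
print(lookup(table, 7))        # 3
print(lookup(table, 10_007))   # 1 (shares position 7 with the key 7)
print(lookup(table, 42))       # 0
```

Note that 7 and 10,007 map to the same position, so any such scheme needs
some rule for what to do when two different keys collide.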