===========================================================================
CSC 263H         Addendum to Lecture Outline for Week 11        Winter 2004
===========================================================================

This is a revised version of the material from week 10 that was covered
during week 11 on the St. George and UTM campuses.  It contains the same
material, but written a little differently and with a little more detail
in a few places.

-------------
Disjoint Sets
-------------

Remember that in each implementation considered below, we have direct
access to each element from an outside source, and a way to keep track of
any additional information required for each element (e.g., an index or
pointer for each element).

Data Structures for Disjoint Sets:

3. Linked list with pointer to representative and "union-by-weight".

   As before, except that we also keep track of the number of elements in
   each list, by adding a 'size' field to each node and storing the size
   of a list inside the node of the representative.  MAKE-SET and FIND-SET
   are not affected (they still take time O(1)), and when we perform
   UNION, we always append the smaller set to the larger one (so we have
   fewer pointers to change).  This is called "union-by-weight" (the
   "weight" of a set is simply its size).

   Worst-case sequence complexity for m operations: let n be the number
   of MAKE-SET operations in the sequence (so there are never more than n
   elements in total).  For some arbitrary element x, we want to prove an
   upper bound on the number of times that x's back pointer can be
   updated.  Note that this happens only when the set that contains x is
   UNIONed with a set that is no smaller (because we only update back
   pointers for the smaller set).  This means that each time x's back
   pointer is updated, the set that contains x must at least double in
   size.
   Since there are no more than n elements in total in all the sets, this
   means that x's back pointer cannot be updated more than lg(n) times
   ('lg' is log_2).  And since this is true for every element x, the
   total number of pointer updates during the entire sequence of
   operations is O(n log n).  The rest of the work done by each operation
   still takes time O(1), and there are m operations in total, so the
   total time for the entire sequence is O(m + n log n).

4. Trees.

   Represent each set by a tree, where each element points to its parent
   only and the root points back to itself.  The representative of a set
   is the root.  Note that the trees are _not_ necessarily binary trees:
   the number of children of a node can be arbitrarily large (or small).

   . MAKE-SET(x) takes time O(1): just create a new tree with root x.

   . FIND-SET(x) takes time O(depth of x): simply follow "parent"
     pointers back to the root of x's tree.

   . UNION(x,y) takes time O(1) in addition to the time taken to perform
     FIND-SET(x) and FIND-SET(y): just make the root of the tree that
     contains y point to the root of the tree that contains x.

   Worst-case sequence complexity for m operations: just like for the
   linked list with pointers to representative but no size, we can create
   a tree that is one long chain with m/4 elements (using m/4 MAKE-SET
   and m/4 - 1 UNION operations), so that FIND-SET on the deepest element
   takes time Omega(m); if we then perform m/2 such FIND-SET operations,
   we get a sequence whose total time is Omega(m^2).

5. Trees with "union-by-weight".

   As before, except we also keep track of the weight (i.e., size) of
   each tree (in a 'size' field for each node, which stores the size of
   the subtree rooted at that node) and always append the smaller tree to
   the larger one when performing UNION (and update the size field of the
   new root).  The complexity of the individual operations is unchanged.

   Worst-case sequence complexity for m operations: it is possible to
   show that during any sequence of m operations, n of which are
   MAKE-SET, the maximum height of any tree is O(log n).
   (The proof is by induction on the height h of the trees.)  This means
   that the running time of any individual operation is O(log n), which
   gives a total time of O(m log n) for the entire sequence.

6. Trees with path compression.

   When performing FIND-SET(x), keep track of the nodes visited on the
   path from x to the root of the tree (in a stack or queue), and once
   the root is found, update the parent pointer of each of these nodes to
   point directly to the root.  (This can also be done very easily by
   using recursion instead of an auxiliary data structure.)  This at most
   doubles the running time of the FIND-SET operation, but it can speed
   up future operations considerably.  In fact, it is possible to prove
   (but we won't do it) that the worst-case running time of a sequence of
   operations, if there are n MAKE-SET operations (so at most n-1 UNIONs)
   and f FIND-SET operations, is

       Theta( f log n / log(1 + f/n) )    if f >= n
       Theta( n + f log n )               if f < n

   But we can do even better!

7. Trees with "union-by-rank" and path compression.

   With trees, the measure that matters the most for the running time is
   the height of each tree, not its size.  So, instead of using weight to
   decide how to carry out UNION, we use "rank".  The rank is an upper
   bound on the height of the tree: if there were no path compression,
   then the rank would be exactly equal to the height, but it is not
   efficient to try to keep the rank exactly equal to the height during
   path compression, so the actual height of a tree can become smaller
   than its rank during the course of a sequence of operations.

   . MAKE-SET(x): as before, and set the rank of x to 0.

   . FIND-SET(x): use path compression and leave ranks unchanged.

   . UNION(x,y): of the two roots found by FIND-SET(x) and FIND-SET(y),
     the one with higher rank becomes the new root and its rank is
     unchanged; if the two roots have the same rank, pick the first one
     as the new root and increase its rank by 1.

   It is possible to prove that the worst-case time for a sequence of m
   operations, where there are n MAKE-SETs, is O(m log* n), where the
   log* function has the following property:

       log*_2( 2^(2^(...(2^2)...)) ) = i
               \_____ i 2's _____/

   In particular:

       log*_2(2)       =                              1
       log*_2(4)       = log*_2(2^2)                = 2
       log*_2(16)      = log*_2(2^(2^2))            = 3
       log*_2(65536)   = log*_2(2^(2^(2^2)))        = 4
       log*_2(2^65536) = log*_2(2^(2^(2^(2^2))))    = 5

   Note that 2^65536 is approximately equal to 10^20000, so for any "real
   life" input size n, log* n is essentially a constant no larger than 5
   (even though mathematically, log* is a growing function that has no
   upper bound).

   More precisely and generally, log*_b(n) is defined as the smallest
   integer i such that

       log_b(log_b(...log_b(n)...)) <= 1.
       \_____ i times _____/
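
The operations of implementation 7 and the definition of log* can be
sketched in Python as follows.  This is only an illustration under some
assumptions that are not part of the notes: the names DisjointSets,
make_set, find_set, union and log_star are made up for this sketch, the
"parent" pointers and ranks are stored in dictionaries keyed by element,
FIND-SET does its path compression with a second pass over the path
rather than with a stack or recursion, and log_star only handles base 2.

```python
import math


class DisjointSets:
    """Forest of trees with union-by-rank and path compression (sketch)."""

    def __init__(self):
        self.parent = {}  # element -> its parent; a root is its own parent
        self.rank = {}    # element -> rank (an upper bound on its height)

    def make_set(self, x):
        # New one-node tree: x is its own root and has rank 0.
        self.parent[x] = x
        self.rank[x] = 0

    def find_set(self, x):
        # First pass: follow parent pointers up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every node on the path
        # directly at the root.  Ranks are left unchanged.
        while x != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, x, y):
        root_x = self.find_set(x)
        root_y = self.find_set(y)
        if root_x == root_y:
            return root_x  # already in the same set, nothing to do
        # Make root_x the root of higher rank; on a tie the first
        # argument's root stays in that role.
        if self.rank[root_x] < self.rank[root_y]:
            root_x, root_y = root_y, root_x
        self.parent[root_y] = root_x
        # Only a tie changes a rank: the new root's rank goes up by 1.
        if self.rank[root_x] == self.rank[root_y]:
            self.rank[root_x] += 1
        return root_x


def log_star(n):
    """Smallest i >= 0 such that taking log_2 of n, i times, gives <= 1."""
    i = 0
    value = n
    while value > 1:
        value = math.log2(value)
        i += 1
    return i


if __name__ == "__main__":
    sets = DisjointSets()
    for element in "abcdef":
        sets.make_set(element)
    sets.union("a", "b")
    sets.union("c", "d")
    sets.union("a", "c")
    print(sets.find_set("d") == sets.find_set("b"))
    print(sets.find_set("e") == sets.find_set("a"))
    print([log_star(n) for n in (2, 4, 16, 65536)])
```

Storing ranks for every element (not just for roots) keeps the sketch
short; only the ranks of roots are ever read by union.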