CSC 150H - Hashing

Introduction to Hashing

Lecture 1: Blackboard lecture on hashing.

Lecture Summary:

  1. We introduced the notion of a linked hash table (also known as a chained hash table)

  2. Hash Function Contract: If object w1 equals object w2 then w1.hashCode() == w2.hashCode(). This ensures we will get the same index into our table given equal starting objects w1 or w2.
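As a sketch of this contract (Point is a hypothetical class, not part of the maze code): a class that overrides equals() must compute hashCode() from exactly the fields that equals() compares.

```java
// Sketch of the Hash Function Contract, using a hypothetical Point class
// (not part of the maze code): hashCode() is computed from exactly the
// fields that equals() compares, so equal objects get equal hash codes.
public class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return 31 * x + y;  // same fields as equals()
    }

    public static void main(String[] args) {
        Point w1 = new Point(2, 3);
        Point w2 = new Point(2, 3);
        System.out.println(w1.equals(w2) && w1.hashCode() == w2.hashCode());  // true
    }
}
```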

  3. Uniform Hashing Assumption: The hash(w, cap) function maps into the range {0, 1, ... , cap-1}. The basic assumption is that the probability that hash(w, cap) == k is approximately constant, independent of k. That is, p(hash(w, cap) == k) approx= 1/cap. This ensures that different objects are spread out (in a probabilistic sense) across the different indices in the hash table.

  4. Define the loadFactor to be n/cap. The default load factor for java.util.HashSet or java.util.HashMap is 0.75.

  5. Define lk to be the length of the sublist attached to the hash table at index k. Then we showed that it follows from the Uniform Hashing Assumption that the probability that the k-th list has length s is given by the Binomial distribution:
      p(lk == s) = C(n, s) p^s (1-p)^(n-s)
    where
      n = total number of items in the hash table,
      p = 1/cap,
      C(n,s) = n! /(s! (n-s)!) = "n choose s".
    
    The mean of this distribution is np = n/cap = loadFactor. For typical load factors (e.g., 1 or smaller), this formula predicts that the probability that a sublist has length 10 or larger is very small (on the order of 1/10,000,000).
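The 1/10,000,000 figure can be checked directly. Here is a sketch that computes the tail probability P(lk >= 10) for n = cap = 20,000 (load factor 1), building the binomial pmf iteratively from p(lk == 0) = (1-p)^n via the ratio p(lk == s+1)/p(lk == s) = ((n-s)/(s+1)) * (p/(1-p)), which avoids computing huge factorials:

```java
// Sketch: compute P(lk >= 10) under the Uniform Hashing Assumption,
// with n = cap = 20000 (load factor 1). The binomial pmf is built up
// iteratively, avoiding overflow in the factorials of C(n, s).
public class BinomialTail {
    public static void main(String[] args) {
        int n = 20000;
        double p = 1.0 / 20000;           // 1/cap
        double pmf = Math.pow(1 - p, n);  // p(lk == 0)
        double tail = 1.0;
        for (int s = 0; s < 10; s++) {
            tail -= pmf;                  // remove p(lk == s) from the tail
            pmf *= (double) (n - s) / (s + 1) * p / (1 - p);  // -> p(lk == s+1)
        }
        System.out.println("P(length >= 10) = " + tail);  // roughly 1e-7
    }
}
```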

  6. The operations insert, delete, and contains for a hash table run in time bounded by the length of the longest sublist. If the Uniform Hashing Assumption holds, then these operations almost certainly run in O(1) time (ignoring the cost of resizing the hash table).

  7. Resizing the hash table (changing the capacity "cap") requires all the items to be rehashed. This will cost O(n) time, where n is the number of items in the hash table (i.e., the size).
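Points 6 and 7 can be sketched together in a minimal chained table. Everything here (the name ChainedTable, the 0.75 threshold, the doubling policy) is illustrative, not the course implementation:

```java
import java.util.LinkedList;

// Minimal sketch of a chained hash table with resizing; names and
// policies here are illustrative, not the course's actual code.
public class ChainedTable {
    private LinkedList<Integer>[] table;
    private int size = 0;

    @SuppressWarnings("unchecked")
    public ChainedTable(int cap) {
        table = new LinkedList[cap];
        for (int k = 0; k < cap; k++) table[k] = new LinkedList<>();
    }

    private static int hash(Object o, int cap) {
        return (o.hashCode() & 0x7fffffff) % cap;  // non-negative index in 0..cap-1
    }

    public void insert(int item) {
        table[hash(item, table.length)].add(item);
        size++;
        // Keep the load factor n/cap at or below 0.75, as HashMap does by default.
        if ((double) size / table.length > 0.75) resize(2 * table.length);
    }

    public boolean contains(int item) {
        // Cost is bounded by the length of this one sublist: O(1) expected.
        return table[hash(item, table.length)].contains(item);
    }

    // Resizing rehashes every item into the new table: O(n) total work.
    @SuppressWarnings("unchecked")
    private void resize(int newCap) {
        LinkedList<Integer>[] old = table;
        table = new LinkedList[newCap];
        for (int k = 0; k < newCap; k++) table[k] = new LinkedList<>();
        for (LinkedList<Integer> bucket : old)
            for (int item : bucket)
                table[hash(item, newCap)].add(item);
    }
}
```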

Lecture 2: Application of hashing to the Maze problem.

Note that java.util.HashSet is unsuitable for our current application, since we cannot get access to individual objects within the set (other than by iterating over the set, or using toArray()).

We therefore implement the ReachableSet interface using the java.util.HashMap class.

We will use the parameterized type HashMap<Board, MultiNode>. That is, we will index into the hash map using a Board as the key. The data associated with the Board will be the MultiNode with that Board as data.

Note a Board currently requires a two dimensional array to encode the state of the maze, and thus each Board requires a significant amount of memory on its own. At first glance, the idea of using a HashMap seems wasteful in terms of memory, since we have to store all the Boards used as search keys and all the Boards within the MultiNodes themselves. Therefore it seems this approach might take up roughly twice as much memory as a simple ArrayList of MultiNodes.

However this is not the case. The Board being used as a key is stored exactly once, with both the key and the associated MultiNode referring to it. The memory overhead in using a HashMap is therefore not so prohibitive.
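The key-sharing point can be seen in a few lines. The Board and MultiNode stubs below are hypothetical stand-ins for the real maze classes:

```java
import java.util.HashMap;

// Hypothetical stubs standing in for the real maze classes, just to
// illustrate the HashMap<Board, MultiNode> pattern and key sharing.
class Board { /* the maze state would go here */ }

class MultiNode {
    final Board data;
    MultiNode(Board data) { this.data = data; }
}

public class ReachableSketch {
    public static void main(String[] args) {
        HashMap<Board, MultiNode> reachable = new HashMap<>();
        Board b = new Board();
        MultiNode node = new MultiNode(b);
        reachable.put(node.data, node);  // key and node.data are the SAME object
        // Only one Board is stored; the key and the MultiNode both refer to it.
        System.out.println(reachable.get(b).data == b);  // true
    }
}
```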

We need to write:

  1. A method hashCode() in Board.java (overriding Object's hashCode() method).

  2. A new implementation of the interface ReachableSet, say called ReachableHash.

Hashing Maze Boards

The following is a simple hash function for the Board class in the maze example. (Include this in Board.java.) Due to integer overflow, the value returned by hashCode() can be negative.


  @Override
  public int hashCode() {
    int code = 0;
    for (int r = 0; r < matrix.length; r++) {
      for (int c = 0; c < matrix[0].length; c++) {
        code = code * 127 + matrix[r][c];
      }
    }
    return code;
  }

Note the hash function contract is satisfied for the above function hashCode(). (Why?)

To restrict the range of the hash function, we could define (inside the implementation of HashMap, say):

  private static int hash(Object o, int cap) {
    // Mask the sign bit rather than calling Math.abs, since
    // Math.abs(Integer.MIN_VALUE) is itself negative.
    return (o.hashCode() & 0x7fffffff) % cap;
  }

The combined result of these two mappings can be thought of in terms of the Baker's transformation or the Horseshoe map.

The intuition is that we stretch the values of various board configurations along the real axis by repeatedly multiplying by a large prime number (127 in this case), and allowing these values to fold back on themselves via integer overflow in hashCode() and via the remainder operation in the method hash(Object, int).

This is a little like kneading bread, repeatedly flattening the dough out then folding it back on itself. Kneading is useful for getting a roughly uniform distribution of ingredients throughout the dough. We're doing a similar thing here to approximate the uniform hashing assumption.

The previous hash function may not be desirable if the capacity is a multiple of the factor 127. More complex hash functions can avoid this (see choosing a good hash function). But this is beyond the scope of our discussion.
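One common improvement, sketched below, is to mix the bits of the hash code before taking the remainder, so that regularities in the low-order bits (or a capacity sharing factors with 127) do less damage. The fold h ^ (h >>> 16) is the supplemental hash that java.util.HashMap applies internally (Java 8 and later):

```java
// Sketch: bit-mixing before taking the remainder. The fold
// h ^ (h >>> 16) spreads the high bits into the low bits, mirroring
// the supplemental hash used internally by java.util.HashMap.
public class MixedHash {
    static int hash(Object o, int cap) {
        int h = o.hashCode();
        h ^= (h >>> 16);                // fold high bits into low bits
        return (h & 0x7fffffff) % cap;  // mask the sign bit, then fold into 0..cap-1
    }

    public static void main(String[] args) {
        System.out.println(hash("maze", 127));  // always in 0..126
    }
}
```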

We ran the above hash function on the reachable set for the Maze for level 08. (This maze produced a reachable set of size 22,042.) The capacity was chosen to be 20,000. As a result, the load factor was 1.10, somewhat higher than the default of 0.75.

The resulting sublists in the hash table have lengths ranging from 0 to 8. The observed frequencies of each length are shown below:

Note the excellent fit of the binomial model to the observed frequencies, indicating that the uniform hashing assumption is reasonable, at least for these boards and the current capacity. (Math rules!)

The ReachableHash class

Our second job is to write the class definition for ReachableHash, which must implement the interface ReachableSet.

See ReachableHash.java.

Note:

  1. As part of the Java Collections Framework, HashMap provides a standard set of methods, including iterators over its keys and values (via keySet() and values()). This makes it easy to implement ReachableHash using a HashMap.

  2. The method get(int index) is O(size) in ReachableHash, while it was O(log(size)) in ReachableAVLTree. However, we only use get(int index) to step through the reachable set for display purposes (i.e., using the menu item "Show Reachable"), never during the solve stage. You might notice a larger delay while displaying the reachable set, especially for larger indices "index". However, this is not a time-critical portion of the code. We could speed this up by dumping the reachable set to an array at the beginning of "Show Reachable", and then stepping through this array. I didn't do this because of the extra memory required, although that is only an array of references to the reachable MultiNodes (the MultiNodes themselves would not be duplicated).

  3. The runtime of the maze solver using ReachableHash is a little faster than our AVL tree implementation, as might be expected from the big-Oh analysis. The following table reports the runtime in seconds on the same machine:

    Maze             ArrayListReachableSet   ReachableTable   ReachableAVLTree   ReachableHash
    level08.maz      7565 (= 2.1 hrs)        94.5             3.1                2.4
    level09.maz      (*) --                  855              4.8                4.6
    level10.maz      (*) --                  336              4.2                3.9

    Here (*) denotes the machine ran out of memory before solving (in all cases).

  4. Comparing the use of our implementation of an AVL tree to the use of HashMap from the Java Collections Framework, we do not see a big improvement in the runtime for solving the mazes. This is probably due to the time overhead of other parts of the breadthFirstSearch() method. These two implementations of ReachableSet have reduced the runtime complexity of the breadth first search algorithm to O(size log(size)) and O(size), respectively. Here size is "only" 20K - 40K, so log2(size) is around 14 or 15. Note that many other steps in breadthFirstSearch() require O(size) runtimes, and these may require a significant amount of time compared to the single reachableSet.get(MultiNode) method call done repeatedly in breadthFirstSearch(). (See exercise 2 below for a comparison of the runtimes for this "accessor" step without as much extra overhead.)
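The array-dumping speed-up mentioned in note 2 above amounts to a small helper. DumpSketch is a hypothetical name, not part of the course code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper sketching the speed-up from note 2: copy the
// iterator's contents into a list once (O(size)), after which each
// indexed access is O(1) instead of O(size).
public class DumpSketch {
    static <T> List<T> dump(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        return out;  // holds references only; nothing is duplicated
    }
}
```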



Exercises

  1. Modify your Board.java class to include the above hashCode() method. Modify your Reachable.java class to use the above ReachableHash.java. Run the MazeGui program on several of the maze examples. Do you see a similar improvement in runtimes from your previous code as those reported in the table above?

  2. We wish to measure the speed up of the "contains" method (if any) due to using the ReachableHash implementation over the ReachableAVLTree. One way to do this is to add some timed test code at the end of your breadthFirstSearch() implementation which primarily exercises the desired method.

    To do this you first need to ensure that your breadthFirstSearch() method in graph/Reachable.java does not return directly from the loop over elements in the toDo queue, but rather breaks out of that loop once the goal state is found and then executes the busy finder code below. Also, note this code will not be executed if the search runs out of memory (since in that case an OutOfMemoryError is thrown and the breadthFirstSearch() method is abnormally terminated).

    Insert the following at the end of your breadthFirstSearch() method in Reachable.java:

      private void breadthFirstSearch() {
         ....

        } // Over toDo queue

        if (true) {
          System.out.println("Busy finder");
          ElapsedTime et = new ElapsedTime();
          et.start();
          // Exercise the accessor repeatedly so its cost dominates the timing.
          for (int k = 0; k < 100; k++) {
            Iterator<MultiNode> rIt = reachableSet.iterator();
            while (rIt.hasNext()) {
              reachableSet.get(rIt.next());
            }
          }
          System.out.format("Busy finder time: %5.1f\n", et.time() / 1000.0);
        }

      } // end of breadthFirstSearch
    
    You need to import adt.ElapsedTime within Reachable.java.

    Which approach is faster, ReachableAVLTree or ReachableHash? What is the ratio of runtimes reported by this busy finder code when ReachableAVLTree is used to implement reachableSet, versus ReachableHash?

  3. Copy the maze/ReachableHash.java file to a new file maze/ReachableTree.java. Modify the latter to use the java.util.TreeMap class instead of java.util.HashMap. This replaces the HashMap with Java's implementation of red-black trees (an alternative self-balancing tree implementation). Note how easy this is to do given the similar interfaces for TreeMap and HashMap.

    If you already have done question 2 above, your code should report the runtime of the busy finder loops, now using red-black trees. How does this compare to your previous results using HashMap? If you set the condition in the if statement wrapping the busy finder code to false, how does the overall runtime of the breadthFirstSearch() solver (i.e., the time reported by MazeGui) compare to the other methods?
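The swap in exercise 3 is small because TreeMap and HashMap share the Map interface. The sketch below uses String keys as a hypothetical stand-in for Board; note that TreeMap needs ordered keys, so Board would have to implement Comparable<Board> (or a Comparator would have to be passed to the TreeMap constructor):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the exercise 3 swap, with String keys standing in for Board.
// The calls are identical; only the constructor changes. TreeMap requires
// ordered keys (Comparable, or a supplied Comparator).
public class MapSwap {
    public static void main(String[] args) {
        Map<String, Integer> hashed  = new HashMap<>();
        Map<String, Integer> ordered = new TreeMap<>();  // the only change
        hashed.put("board1", 1);
        ordered.put("board1", 1);
        System.out.println(hashed.get("board1").equals(ordered.get("board1")));  // true
    }
}
```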