We wish to apply our understanding of big-Oh analysis to algorithms for searching a sorted list of items.
In particular, we will assume that our list is stored in an array of Comparable objects. Here Comparable is an interface in the java.lang package, which consists of just one method:
/** Compares this object with the specified object for order.
 * Returns a negative integer, zero, or a positive integer as
 * this object is less than, equal to, or greater than the
 * specified object.
 */
int compareTo(Comparable o)

This method defines what it means for objects to be ordered in increasing order. For example, for Integers, compareTo just provides the standard arithmetical order. For Strings it provides the standard alphabetical ordering.
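For instance, here is a quick sketch of these two orderings using the standard library's implementations (the class name CompareDemo is mine):

```java
public class CompareDemo {
    public static void main(String[] args) {
        // Integer's compareTo gives the standard arithmetical order.
        System.out.println(Integer.valueOf(3).compareTo(7) < 0);   // true
        System.out.println(Integer.valueOf(7).compareTo(7) == 0);  // true
        // String's compareTo gives the standard alphabetical order.
        System.out.println("apple".compareTo("banana") < 0);       // true
        System.out.println("pear".compareTo("apple") > 0);         // true
    }
}
```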
Given a sorted list of n items, how quickly can we search it to find whether a given item is in the list?
Consider asking only yes/no questions (aka binary-valued questions). For example, we could begin by asking if the item is in the first half of the list. Suppose it is; then we could ask if it is in the first half of that half, and so on. With each question we can roughly halve the size of the sublist we know the item must be in (if it is there at all).
More generally, suppose that in our search algorithm we maintain a sublist of possible locations for the item we are looking for. If we can choose suitable binary-valued questions to rule out roughly half the current sublist in which we should find the item (if it is there at all), then after each question we would nearly halve the length of the remaining sublist we still need to search. (This is the appropriate strategy for the common game of "20 questions".)
For a concrete example, suppose we start with a sorted list of size n = 100. Then after each suitable binary question, the remaining list must have at most the following lengths.

Questions asked:   0    1    2    3    4    5    6    7
Sublist length:  100   50   25   13    7    4    2    1
From the above table, we see that it should take at most 8 (carefully chosen) binary questions to search a sorted list of length 100. A longer table shows that it only takes 11 questions to search a sorted list of length 1000. With 20 questions, we can search more than half a million items.
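This halving count is easy to check with a short sketch (the class and method names here are mine, not from the text):

```java
public class QuestionCount {
    /** Number of binary questions needed to search a sorted list of
     * length n: halve (rounding up) until one item remains, then use
     * one final comparison against that item. */
    static int questions(int n) {
        int k = 0;
        while (n > 1) {
            n = (n + 1) / 2;  // worst case: the larger half remains
            k++;
        }
        return k + 1;  // plus the final comparison
    }

    public static void main(String[] args) {
        System.out.println(questions(100));     // 8
        System.out.println(questions(1000));    // 11
        System.out.println(questions(524288));  // 20 (over half a million)
    }
}
```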
A little thought shows that we cannot be guaranteed to reduce the size of the sublist faster than this, at least with binary questions. For example, we could try asking if the item would be in the first 1/3 of the list. If it was, then we'd only have a third of the list to search. However, if the answer is no, then we would be left with 2/3 of the list to search. The worst case here, namely being left with 2/3 of the list to search, is larger than the worst case when we only ask which half of the list the item might be in. Therefore dividing by half each time provides the best "worst-case" behaviour.
If we ignore rounding to integer lengths, then after k questions the remaining sublist will have a length
L_k = n / 2^k.

We need to choose k big enough to reduce the sublist to length 1, so we choose k such that

L_k ≈ 1.

That is,

2^k ≈ n,  or  k ≈ log_2(n),

where log_2(n) is the base 2 logarithm of n.
Finally, given a list of length 1, we need one more question to compare the item we are looking for with the item in the sublist of length 1. So the total number of questions is (roughly):
k = ceil(log_2(n)) + 1.

For n = 100, log_2(100) = 6.64..., so k here works out to be 7 + 1 = 8. This agrees with the example above.
We could use mathematical induction to prove that this rough calculation is actually correct. Mathematical induction is introduced in CSC 165H, and we won't go into the details here. However, for the curious, we might attempt to prove that S(k) is true for k = 0, 1, 2, ..., where S(k) is the statement:
S(k): A sorted list of length n, such that 2^(k-1) < n <= 2^k, can be searched for any given item using at most k+1 binary questions.
We wish to rewrite the preceding result in terms of big-Oh terminology. In particular, we argued that binary search requires
k = ceil(log_2(n)) + 1

compareTo operations to search a list of length n. We can simplify this by noting

ceil(log_2(n)) + 1 <= log_2(n) + 2
                   =  log(n)/log(2) + 2
                   <= log(n)/log(2) + log(n)/log(2)   for n >= 4
                   =  (2/log(2)) log(n)

The remaining cases n = 2 and n = 3 are easily checked directly. Therefore, for c = 2/log(2),

ceil(log_2(n)) + 1 <= c log(n) for all n >= 2.

Therefore ceil(log_2(n)) + 1 = O(log(n)). As a consequence, the binary search algorithm requires O(log(n)) compareTo operations.
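A quick numerical check of this bound (illustrative only; the class name is mine):

```java
public class BoundCheck {
    public static void main(String[] args) {
        final double c = 2.0 / Math.log(2);
        boolean holds = true;
        for (int n = 2; n <= 1_000_000; n++) {
            int k = (int) Math.ceil(Math.log(n) / Math.log(2)) + 1;
            // The small tolerance guards against floating-point rounding
            // when n is an exact power of two.
            if (k > c * Math.log(n) + 1e-9) {
                holds = false;
            }
        }
        System.out.println(holds);  // true
    }
}
```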
Moreover, if each comparison runs in time O(1), then we might hope the whole binary search algorithm could run in O(log(n)) time. Let us write the algorithm and check this out.
Suppose we want to write a Java program to search a sorted list. The outline of the code is given below.
Note from the binarySearch method comment that the return value when the item is found in the list is supposed to be the index of the item in the array storing the list. However, when the item is not found, we indicate this by returning a negative integer. Moreover, this negative value can be used to determine where the search item belongs in the sorted list. Unfortunately, since -0 and 0 represent the same integer value, we need to shift all these negative return values over by 1.
For example, if the item we are searching for belongs in index 3 of the list[] array, binarySearch should return -(3 + 1) = -4. Similarly, if the item is not present in the list and is smaller than all the other items in the list, then the method should return -(0 + 1) = -1.
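This is the same convention used by the standard library's Arrays.binarySearch, so we can sketch how the return value is decoded with it:

```java
import java.util.Arrays;

public class ReturnConvention {
    public static void main(String[] args) {
        Integer[] list = {10, 20, 30, 40};
        // Found: the index of the item is returned.
        System.out.println(Arrays.binarySearch(list, 30));  // 2
        // Not found: -(index + 1), where index is the insertion point.
        int r = Arrays.binarySearch(list, 25);
        System.out.println(r);         // -3, i.e. -(2 + 1)
        System.out.println(-(r + 1));  // 25 belongs at index 2
    }
}
```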
Also, note we have left out a few of the details in the method definition below. We do this in order to illustrate the utility of writing loop invariants before completing the code:
public class ListUtil {
    /** Search an array of comparable objects which is sorted
     * in increasing order for a given item. If the item is
     * found in the array, return the index of the item.
     * Otherwise, if the item is not in the array,
     * return -(index+1) where index is the location the item
     * should be placed in the order within the array.
     * Requires: list and item are not null. */
    public static int binarySearch(Comparable[] list, Comparable item) {
        int first = 0;
        int last = list.length;
        if (last == 0) { // Empty list
            return -1;   // -(first + 1)
        }
        int res; // Hold the result of a compareTo evaluation.
        int m;   // Midpoint index.
        // Loop invariant: If the item is in the list then
        // it must have index j s.t. first <= j < last
        while (CONDITION) {
            m = first + (last - first)/2;
            res = list[m].compareTo(item);
            UPDATE first and/or last
        }
        res = list[first].compareTo(item);
        return WHAT
    }
}

To complete the method we must fill in: the loop CONDITION, the UPDATE of first and/or last, and WHAT to return after the loop.
It is very common to introduce off-by-one errors (or worse) while writing this code, and we wish to avoid such errors. Even experienced programmers can easily get caught with bugs in the details of this apparently simple algorithm.
In order to help us write correct code in the first place, we have introduced a loop invariant before filling in the details of the method. The loop invariant in this case is simply a comment in the code. It describes something that is true before the loop body is ever executed, and is also true after each execution of the loop body. That is, it should always be true (and hence, invariant).
When we come to fill in the details of the method, we use the loop invariant to help figure out exactly what we need to do. We illustrate this reasoning next.
For the loop condition, we want to halve the size of the remaining sublist so long as this sublist has length 2 or more. The loop invariant tells us that the remaining sublist consists of the items at indices first, first+1, ..., last-1. Therefore, the length of the sublist is last - first. So the loop condition could be (last - first > 1) or, equivalently, (last > first + 1).
Note that inside the loop body we set
m = first + (last - first)/2; // using integer division.

For the loop to always reduce the size of the remaining sublist it is critical that first < m < last. Notice the loop condition guarantees last - first >= 2, so (last - first)/2 >= 1 and it follows that m > first. In a similar way we can show m < last.
An alternative (and much more common) way to compute m is as follows:
m = (first + last)/2;

This usually gives the same value as the preceding expression, and only requires two arithmetic operations versus three. The downside of this latter formulation is that for really big arrays, where n is bigger than Integer.MAX_VALUE/2, the sum (first + last) might actually be bigger than Integer.MAX_VALUE. In this case we would get integer overflow with the second expression for m above, and the resulting m would be negative! This would eventually cause an ArrayIndexOutOfBoundsException to be thrown.
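The overflow is easy to demonstrate with hypothetical index values (these particular numbers are mine, chosen so that their sum exceeds Integer.MAX_VALUE):

```java
public class MidpointOverflow {
    public static void main(String[] args) {
        int first = 1_500_000_000;
        int last  = 2_000_000_000;
        // first + last = 3,500,000,000 overflows int, so m goes negative:
        System.out.println((first + last) / 2);          // -397483648
        // The safer form stays within the int range:
        System.out.println(first + (last - first) / 2);  // 1750000000
    }
}
```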
Proceeding with the update step within the loop, we wish to update first or last with the value of m. Again we will use the loop invariant to decide what we should do.
if (res <= 0)
    first = m;
else
    last = m;

When res <= 0 we know list[m] <= item, so (since the list is sorted in increasing order) the item cannot appear before index m, and setting first = m keeps the invariant true. Note that the item could still be at index m itself, which is why we set first = m rather than m + 1. When res > 0 we know list[m] > item, so the item must appear before index m, and setting last = m keeps the invariant true.
Finally, after the loop terminates, we see from the loop condition that last = first + 1. (We leave it to the reader to explain why last > first.) This means the sublist we have left to search has length 1. By the loop invariant, if the item is in the list at all it must be equal to the only item in this sublist, i.e., list[first]. Therefore, the returned value should be as follows:
if (res == 0)
    return first;
if (res > 0)            // target item should go in location first
    return -(first + 1);
return -(first + 2);    // target item should go in location first+1
Here is the completed method (see BinarySearch.zip):
public static int binarySearch(Comparable[] list, Comparable item) {
    int first = 0;
    int last = list.length;
    if (last == 0) { // Empty list
        return -1;   // -(first + 1)
    }
    int res; // Hold the result of a compareTo evaluation.
    int m;   // Midpoint index.
    // Loop invariant: If the item is in the list then
    // it must have index j s.t. first <= j < last
    while (last > first + 1) {
        m = first + (last - first)/2;
        res = list[m].compareTo(item);
        if (res <= 0)
            first = m;
        else
            last = m;
    }
    res = list[first].compareTo(item);
    if (res == 0)
        return first;
    if (res > 0)            // target item should go in location first
        return -(first + 1);
    return -(first + 2);    // target item should go in location first+1
}
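A quick brute-force check of the method (the driver class here is mine; it pairs the method above with an exhaustive test on a small list):

```java
public class BinarySearchCheck {
    public static int binarySearch(Comparable[] list, Comparable item) {
        int first = 0;
        int last = list.length;
        if (last == 0) {
            return -1;
        }
        int res;
        int m;
        while (last > first + 1) {
            m = first + (last - first) / 2;
            res = list[m].compareTo(item);
            if (res <= 0)
                first = m;
            else
                last = m;
        }
        res = list[first].compareTo(item);
        if (res == 0)
            return first;
        if (res > 0)
            return -(first + 1);
        return -(first + 2);
    }

    public static void main(String[] args) {
        Integer[] list = {2, 3, 5, 7, 11, 13};
        // Every item in the list should be found at its own index.
        for (int i = 0; i < list.length; i++) {
            System.out.print(binarySearch(list, list[i]) + " ");
        }
        System.out.println();
        // Missing items should map to -(insertion index + 1).
        System.out.println(binarySearch(list, 1));   // -1: belongs at 0
        System.out.println(binarySearch(list, 6));   // -4: belongs at 3
        System.out.println(binarySearch(list, 17));  // -7: belongs at 6
    }
}
```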
Assuming that the calls to compareTo() take O(1) time, the body of the loop in binary search takes O(1) time. We found above that the loop is executed ceil(log_2(n)) times. Therefore the loop itself takes O(ceil(log_2(n))) time. This can be expressed simply as O(log(n)) time. The code before and after the loop in binarySearch() also takes O(1) time, so the overall runtime for the algorithm is O(log(n)).
The completed BinarySearch class, along with a simple JUnit test class, are available from BinarySearch.zip. The code differs from the version above in that it contains extra debug and diagnostic statements. Try it out.
Note when the tests are run, the reported number of comparisons made in each case is less than or equal to the upper bound we computed above, namely:
k = ceil(log_2(n)) + 1.
When we first write and test binarySearch() it is useful to have the extra debug statements present in the above code. But once we become confident in the code, we would like to be able to remove them. Removing all the debugging code with a text editor would be tedious, bug-prone, and difficult to reverse.
The code above wraps all the debugging statements within "if (DEBUG ...)" blocks. Therefore, setting DEBUG to false allows these blocks to be skipped. However, if DEBUG were not declared to be final, then we would still be slowing our code down with these extra "if (DEBUG ...)" checks. When DEBUG is false, the conditions would all evaluate to false, but checking them at all wastes time.
For once we can get the best of both worlds. The trick is to declare the DEBUG flag to be final. Since it is final, the compiler recognizes that it cannot be changed, and simply substitutes the flag's value (e.g., false) into the code wherever it appears. However, now the compiler sees statements like:
if (false) count++;

or

if (false && BINARY_SEARCH_CHATTY) {
    ... a block of output statements ...
}

The compiler then identifies that neither block of code can ever be executed. So it deletes these blocks from the compiled byte code. The end effect is that the code runs as efficiently as if you deleted all these if-blocks by hand.
If later you wish to do further tests, you can edit the file to set DEBUG to true, and recompile. The compiler will now keep all the debugging code blocks, and output additional debugging information. This is exactly the behaviour we want (just don't forget the "final" keyword).
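The mechanism can be sketched in a few lines (the class, field, and method names here are illustrative, not taken from BinarySearch.zip):

```java
public class DebugFlagDemo {
    // Because DEBUG is final, the compiler substitutes its value and
    // removes the unreachable if-blocks from the compiled byte code.
    private static final boolean DEBUG = false;

    static int linearSearch(int[] a, int key) {
        for (int i = 0; i < a.length; i++) {
            if (DEBUG) {
                System.out.println("checking index " + i);  // compiled away
            }
            if (a[i] == key) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(linearSearch(new int[]{4, 8, 15, 16}, 15));  // 2
    }
}
```

Flip DEBUG to true and recompile to get the diagnostic output back.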
The compiled code itself is therefore conditional on various flags in the code. This property is called "conditional compilation". Other programming languages support this in other ways. For example, C and C++ have explicit #ifdef directives for conditionally including or excluding blocks of code during the compilation stage.
/** Return the k-th item in the list. This
 * method requires O(k+1) time to walk through
 * the list, starting at the head and traversing
 * through to the k-th item.
 */
public Object get(int k)

How many compareTo operations does your modified binary search require to locate an item in a linked list (hopefully still O(log(n)))? What is the order of the runtime of your linked binary search algorithm (consider only the worst case)?