Tutorial 5: Cuckoo Hashing - average case analysis of the insertion algorithm

This note concerns the analysis of the insertion algorithm of the Cuckoo hashtable. Our goal is to give a theorem about the expected time of the insertion algorithm.

This note is an unpacking of Rasmus Pagh's note 'Cuckoo Hashing for Undergraduates, aiming to be more explicit about the mathematical formalization (something often left implicit in computer science, which can provide a barrier to the understanding). If you want a different exposition, look at Pagh's notes.

Recall

A cuckoo hashtable is a hash table which

has a single item in each of its $r$ slots
uses two hash functions $h_1, h_2:\bit^* \to [r]$ to map items to slots
lookups
- To check if an item $x$ is in the table, check whether positions $h_1(x)$ or $h_2(x)$ contain the item.
- This is $O(1)$ worst-case
Deletions are $O(1)$ in the worst-case.
Insert($x$)
- Put $x$ in position $h_1(x)$; if that slot is taken, eject the occupying item to its 'other position', recursing until we find an empty slot or failing if we try to relocate any element $x$ more than once.

Preliminaries

We take there to be $n$ items in question. We assume that for any item $x$, $h(x)$ is uniformly distributed across $[r]$. (Hash functions with this property are often described as random oracles). Note that given this stance we are indifferent to the particular items since the distribution of occupied slots is the same regardless.

Our goal

Consider the following experiment: insert $n$ items into a cuckoo hashtable.

Let $t_I(n)$ be the time to insert the last element. We aim to show that \begin{equation} \label{eq:goal} \Exp_h[ t_I(n) ] = O(1) \end{equation} Note here that while the probability is over $h$, this is averaging over all $n$ insertions.

Framing the analysis

The Cuckoo Graph

To assist us in reasoning about displacements during insertion, we define the cuckoo graph $G$ whose vertices correspond to slots in the table, and whose edges correspond to items to be inserted; i.e. $V = [r]$, $E = \{ e_x = {h_1(x), h_2(x)} \mid x \in S \}$ where $S$ is the set of items.

We also need to define the concept of the bucket of $x$, which we denote $B(x)$; this is the set of all positions reachable in the Cuckoo Graph starting from $h_1(x)$ or $h_2(x)$; said differently, $B(x)$ is the component of $G$ with edge $e_x$.

Notes

Though we may think of the set of items $S$ as fixed, the values of hash function are random variables over $[r]$, and so this also makes $G$ a random variable (i.e. a random graph). So to is $B(x)$ for any item $x$.
If $G$ encodes a table, each component of $G$ will never have fewer vertices than edges. Further, if the addition of edge $e_x$ would violate this property, then the addition of $x$ would cause a complete rehash of the table.

The following lemma justifies these definitions

To insert a new item $x$, we must shuffle, in the worst case, the elements in all the slots $B(x)$ (before the insert succeeds or we abort). Thus \begin{equation} \label{eq:lemma_0} t_I(x) \leq |B(x)| \end{equation} Thus it will suffice, for \eqref{eq:goal} to show that \begin{equation} \label{eq:goal_restated} \Exp |B(x)| \leq O(1) \end{equation}

Technical Details

First note that \begin{align*} \Exp |B(x)| &= \sum_{y \in S} \Pr[ e_y \in B(x)] \\ &= n \Pr[ e_y \in B(x) ] \end{align*}

Thus if we can show that $\forall x \neq y$, \begin{equation} \label{eq:lemma_2} \Pr[ e_y \in B(x) ] = O(1/r) \end{equation} then we immediately establish \eqref{eq:goal_restated}. \eqref{eq:lemma_2} is easy consequence of the following lemma:

Lemma

For any $i,j \in [r]$ and any $c\ge1$, if $r \geq 2cn$ then \begin{equation} \label{eq:lemma_1} \Pr[\exists i \rightarrow j \mbox{ path (in $G$) of length } \leq \ell ] \leq c^{-\ell}/r \end{equation} This probability is over realizations of $G$

What remains are technical details. We work them out here for completeness, trying to break down the reasoning in such a way that each line follows obviously from the preceeding. The only fact about $G$ we use is that each edge is drawn independently; the only fact about probabilities we use repeatedly is the union bound

Proof of \eqref{eq:lemma_2}

\begin{align*} e_y \in B(x) &\iff \exists h_1(y) \rightarrow h_2(x) \mbox{ path } \\ &\iff \bigcup_{\ell=1}^n \exists h_1(y) \rightarrow h_2(x) \mbox{ path of length } \ell \end{align*} Thus we have \begin{align*} \Pr[e_y \in B(x)] &\leq \sum_{\ell \geq 1} c^{-\ell}/r \\ &= O(1/r) \end{align*}

Proof of \eqref{eq:lemma_1}

The proof is by induction on $\ell$. For the base case, $\ell = 1$. \begin{align*} \Pr[ \exists i \rightarrow j \mbox{ path of length 1} ] &= \Pr[ \{i,j\} \in E ] \\ &\leq n \Pr[ h_1(x) = i \land h_1(x) = j ]\\ &= n \frac{2}{r^2} \\ &= \frac{2n}{r} \frac{1}{r} \\ &= c^{-1} \frac{1}{r} && \mbox{(since $r \geq 2cn$)} \end{align*}

For the inductive step, assume \eqref{eq:lemma_1} holds for all paths of length $\ell-1$ or less. \begin{align} \Pr[ \exists i \rightarrow j \mbox{ path of length \ell} ] &= \Pr[ \cup_k \exists i \to k \mbox{ path of length } \ell-1 \land \{k,j\} \in E] \notag \\ &\leq \sum_k \Pr[ \exists i \to k \mbox{ path of length } \ell-1 \land \{k,j\} \in E] \label{eq:lemma_1_1} \end{align} Consider the events $E_1 = \exists i \to k \mbox{ path of length } \ell-1$ and $E_2 = \{k,j\} \in E$. We can apply our inductive hypothesis to get $\Pr[E_1] = c^{-(\ell -1)}/r$. We will calculate $\Pr[E_2 \mid E_1]$ directly. \begin{align*} \Pr[E_2 \mid E_1] &= \Pr[\exists x h_1(x) = k \land h_2(x) = j \mid \exists \mbox{ a path of length $\ell-1$ in the graph} ] \\ &\leq (n - \ell) \frac{2}{r^2} \\ &\leq c^{-1}/r \end{align*} Noting that $\Pr[E_1 \land E_2] = \Pr[E_1]\Pr[E_2 \mid E_1]$, \eqref{eq:lemma_1_1} becomes $\sum_k c^{-\ell}/r^2$ which is at most $c^{-\ell}/r$.

CSC 265 Fall 2013