This note concerns the analysis of the insertion algorithm of the Cuckoo hashtable. Our goal is to give a theorem about the expected time of the insertion algorithm.
This note is an unpacking of Rasmus Pagh's note 'Cuckoo Hashing for Undergraduates, aiming to be more explicit about the mathematical formalization (something often left implicit in computer science, which can provide a barrier to the understanding). If you want a different exposition, look at Pagh's notes.
Recall
A cuckoo hashtable is a hash table which
- has a single item in each of its $r$ slots
- uses two hash functions $h_1, h_2:\bit^* \to [r]$ to map items to slots
- lookups
- To check if an item $x$ is in the table, check whether positions $h_1(x)$ or $h_2(x)$ contain the item.
- This is $O(1)$ worst-case
- Deletions are $O(1)$ in the worst-case.
- Insert($x$)
- Put $x$ in position $h_1(x)$; if that slot is taken, eject the occupying item to its 'other position', recursing until we find an empty slot or failing if we try to relocate any element $x$ more than once.
Preliminaries
We take there to be $n$ items in question. We assume that for any item $x$, $h(x)$ is uniformly distributed across $[r]$. (Hash functions with this property are often described as random oracles). Note that given this stance we are indifferent to the particular items since the distribution of occupied slots is the same regardless.
Our goal
Consider the following experiment: insert $n$ items into a cuckoo hashtable.
Let $t_I(n)$ be the time to insert the last element. We aim to show that \begin{equation} \label{eq:goal} \Exp_h[ t_I(n) ] = O(1) \end{equation} Note here that while the probability is over $h$, this is averaging over all $n$ insertions.
Framing the analysis
The Cuckoo Graph
To assist us in reasoning about displacements during insertion, we define the cuckoo graph $G$ whose vertices correspond to slots in the table, and whose edges correspond to items to be inserted; i.e. $V = [r]$, $E = \{ e_x = {h_1(x), h_2(x)} \mid x \in S \}$ where $S$ is the set of items.
We also need to define the concept of the bucket of $x$, which we denote $B(x)$; this is the set of all positions reachable in the Cuckoo Graph starting from $h_1(x)$ or $h_2(x)$; said differently, $B(x)$ is the component of $G$ with edge $e_x$.
Notes
- Though we may think of the set of items $S$ as fixed, the values of hash function are random variables over $[r]$, and so this also makes $G$ a random variable (i.e. a random graph). So to is $B(x)$ for any item $x$.
- If $G$ encodes a table, each component of $G$ will never have fewer vertices than edges. Further, if the addition of edge $e_x$ would violate this property, then the addition of $x$ would cause a complete rehash of the table.
The following lemma justifies these definitions
To insert a new item $x$, we must shuffle, in the worst case, the elements in all the slots $B(x)$ (before the insert succeeds or we abort). Thus \begin{equation} \label{eq:lemma_0} t_I(x) \leq |B(x)| \end{equation} Thus it will suffice, for \eqref{eq:goal} to show that \begin{equation} \label{eq:goal_restated} \Exp |B(x)| \leq O(1) \end{equation}
Technical Details
First note that \begin{align*} \Exp |B(x)| &= \sum_{y \in S} \Pr[ e_y \in B(x)] \\ &= n \Pr[ e_y \in B(x) ] \end{align*}
Thus if we can show that $\forall x \neq y$, \begin{equation} \label{eq:lemma_2} \Pr[ e_y \in B(x) ] = O(1/r) \end{equation} then we immediately establish \eqref{eq:goal_restated}. \eqref{eq:lemma_2} is easy consequence of the following lemma:
Lemma
For any $i,j \in [r]$ and any $c\ge1$, if $r \geq 2cn$ then \begin{equation} \label{eq:lemma_1} \Pr[\exists i \rightarrow j \mbox{ path (in $G$) of length } \leq \ell ] \leq c^{-\ell}/r \end{equation} This probability is over realizations of $G$
What remains are technical details. We work them out here for completeness, trying to break down the reasoning in such a way that each line follows obviously from the preceeding. The only fact about $G$ we use is that each edge is drawn independently; the only fact about probabilities we use repeatedly is the union bound
Proof of \eqref{eq:lemma_2}
\begin{align*} e_y \in B(x) &\iff \exists h_1(y) \rightarrow h_2(x) \mbox{ path } \\ &\iff \bigcup_{\ell=1}^n \exists h_1(y) \rightarrow h_2(x) \mbox{ path of length } \ell \end{align*} Thus we have \begin{align*} \Pr[e_y \in B(x)] &\leq \sum_{\ell \geq 1} c^{-\ell}/r \\ &= O(1/r) \end{align*}
Proof of \eqref{eq:lemma_1}
The proof is by induction on $\ell$. For the base case, $\ell = 1$. \begin{align*} \Pr[ \exists i \rightarrow j \mbox{ path of length 1} ] &= \Pr[ \{i,j\} \in E ] \\ &\leq n \Pr[ h_1(x) = i \land h_1(x) = j ]\\ &= n \frac{2}{r^2} \\ &= \frac{2n}{r} \frac{1}{r} \\ &= c^{-1} \frac{1}{r} && \mbox{(since $r \geq 2cn$)} \end{align*}
For the inductive step, assume \eqref{eq:lemma_1} holds for all paths of length $\ell-1$ or less. \begin{align} \Pr[ \exists i \rightarrow j \mbox{ path of length \ell} ] &= \Pr[ \cup_k \exists i \to k \mbox{ path of length } \ell-1 \land \{k,j\} \in E] \notag \\ &\leq \sum_k \Pr[ \exists i \to k \mbox{ path of length } \ell-1 \land \{k,j\} \in E] \label{eq:lemma_1_1} \end{align} Consider the events $E_1 = \exists i \to k \mbox{ path of length } \ell-1$ and $E_2 = \{k,j\} \in E$. We can apply our inductive hypothesis to get $\Pr[E_1] = c^{-(\ell -1)}/r$. We will calculate $\Pr[E_2 \mid E_1]$ directly. \begin{align*} \Pr[E_2 \mid E_1] &= \Pr[\exists x h_1(x) = k \land h_2(x) = j \mid \exists \mbox{ a path of length $\ell-1$ in the graph} ] \\ &\leq (n - \ell) \frac{2}{r^2} \\ &\leq c^{-1}/r \end{align*} Noting that $\Pr[E_1 \land E_2] = \Pr[E_1]\Pr[E_2 \mid E_1]$, \eqref{eq:lemma_1_1} becomes $\sum_k c^{-\ell}/r^2$ which is at most $c^{-\ell}/r$.