A really fast way to find similar documents
Suppose we could convert each document into a binary
feature vector in such a way that similar documents have
similar feature vectors.
This creates a “semantic” address space that allows
us to use the memory bus for retrieval.
Given a query document, we first use the autoencoder to
compute its binary address.
Then we fetch all the documents from addresses that
are within a small radius in Hamming space.
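This takes constant time: the number of addresses probed
depends only on the code length and the radius, not on
the size of the collection. No comparisons with stored
documents are required for getting the shortlist of
semantically similar documents.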
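
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: a dictionary stands in for the memory addressed by the binary codes, the codes are hard-coded integers rather than outputs of a trained autoencoder, and the names `hamming_ball` and `SemanticHashIndex` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming_ball(code, n_bits, radius):
    """Yield every n_bits-wide address within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            neighbor = code
            for p in positions:
                neighbor ^= 1 << p  # flip bit p
            yield neighbor


class SemanticHashIndex:
    """Toy index: a dict stands in for memory addressed by binary codes."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.buckets = defaultdict(list)  # address -> list of document ids

    def add(self, doc_id, code):
        # `code` is assumed to come from an encoder (e.g. an autoencoder).
        self.buckets[code].append(doc_id)

    def shortlist(self, query_code, radius=1):
        # Probe every address in the Hamming ball; the query is never
        # compared against a stored document.
        hits = []
        for address in hamming_ball(query_code, self.n_bits, radius):
            hits.extend(self.buckets.get(address, []))
        return hits


# Hard-coded 8-bit codes, purely for illustration.
index = SemanticHashIndex(n_bits=8)
index.add("doc_a", 0b10110010)
index.add("doc_b", 0b10110011)  # one bit away from doc_a
index.add("doc_c", 0b01001101)  # all eight bits differ from doc_a
print(index.shortlist(0b10110010, radius=1))
```

With n-bit codes and radius r, the sketch probes C(n,0) + C(n,1) + ... + C(n,r) addresses, a count that is fixed once n and r are chosen. That is the sense in which the lookup cost does not grow with the number of stored documents.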