lec2b

A really fast way to find similar documents

• Suppose we could convert each document into a binary feature vector in such a way that similar documents have similar feature vectors.

– This creates a “semantic” address space that allows us to use the memory bus for retrieval.

• Given a query document, we first use the autoencoder to compute its binary address.

– Then we fetch all the documents from addresses that are within a small radius in Hamming space.

– This takes constant time, independent of the size of the collection. No comparisons against stored documents are required to get the shortlist of semantically similar documents.
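The retrieval step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the binary addresses are taken as given (the autoencoder that would produce them is not shown), a dictionary stands in for memory, and the names `hamming_ball`, `build_table`, `shortlist` and the 8-bit example codes are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming_ball(address, n_bits, radius):
    """Yield every n_bits-wide address within `radius` bit flips of `address`."""
    for r in range(radius + 1):
        # Choose which r bit positions to flip.
        for positions in combinations(range(n_bits), r):
            flipped = address
            for p in positions:
                flipped ^= 1 << p
            yield flipped


def build_table(codes):
    """Map each binary address to the ids of the documents stored there."""
    table = defaultdict(list)
    for doc_id, code in codes.items():
        table[code].append(doc_id)
    return table


def shortlist(table, query_address, n_bits, radius):
    """Collect the documents stored at all addresses near the query address."""
    found = []
    for addr in hamming_ball(query_address, n_bits, radius):
        # Direct lookup by address; the query is never compared to a document.
        found.extend(table.get(addr, []))
    return found


# Assumed example codes; in the lecture these would come from the autoencoder.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,  # 1 bit away from doc_a
    "doc_c": 0b10100011,  # 2 bits away from doc_a
    "doc_d": 0b01001100,  # far away
}
table = build_table(codes)
result = shortlist(table, 0b10110010, n_bits=8, radius=1)
print(result)  # ['doc_a', 'doc_b']
```

The number of addresses probed depends only on the code length and the radius (here 1 + 8 = 9), not on how many documents are stored, which is the sense in which the lookup takes constant time.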