How good is a shortlist found this way?
We have only implemented it for a million
documents with 20-bit codes, but what could
possibly go wrong?
A 20-D hypercube allows us to capture enough
of the similarity structure of our document set.
Running TF-IDF on just the shortlist found using
binary codes actually gives better precision-recall
curves than running TF-IDF on the whole document set.
Locality-sensitive hashing (the fastest other
method) is 50 times slower and has worse
precision-recall curves.
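
To make the shortlist idea concrete, here is a minimal Python sketch. It
assumes each document already has a 20-bit code stored as an integer and
that the shortlist is every document whose code lies within a small
Hamming radius of the query code. The names build_table, hamming_ball and
shortlist, and the default radius of 2, are illustrative assumptions, not
the original implementation.

```python
from collections import defaultdict
from itertools import combinations

N_BITS = 20  # code length mentioned in the text


def build_table(codes):
    """Map each binary code (used as an integer address) to the ids of
    the documents stored at that address."""
    table = defaultdict(list)
    for doc_id, code in enumerate(codes):
        table[code].append(doc_id)
    return table


def hamming_ball(code, radius, n_bits=N_BITS):
    """Yield every address that differs from `code` in at most `radius` bits."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped


def shortlist(table, query_code, radius=2, n_bits=N_BITS):
    """Collect the documents found at all addresses near the query code.
    The radius is an assumed tuning knob, not a value from the text."""
    found = []
    for address in hamming_ball(query_code, radius, n_bits):
        found.extend(table.get(address, []))
    return found
```

With 20 bits and radius 2 this probes 1 + 20 + 190 = 211 addresses no
matter how many documents are stored, which is where the speed comes
from. The shortlist would then presumably be re-ranked with a finer
measure such as TF-IDF.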