How to find documents that are similar to a query document
Convert each document into a “bag of words”.
This is a vector of word counts that ignores word order.
Ignore stop words (like “the” or “over”).
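A minimal sketch of the bag-of-words step described above, in Python. The stop-word list, the tokenizer (lowercase alphabetic runs, no stemming), and the function name `bag_of_words` are assumptions made for illustration only; they are not taken from the source.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "over", "a", "of", "to", "and", "is", "this"}

def bag_of_words(text, vocabulary):
    """Return the word counts of `text`, one entry per word in `vocabulary`.

    Word order is discarded and stop words are dropped before counting.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]

vocabulary = ["fish", "cheese", "school"]
vector = bag_of_words("The fish swam over the school of fish", vocabulary)
print(vector)  # [2, 0, 1]
```

Shuffling the words of the input text leaves the vector unchanged, which is what "ignoring the order" means here.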
We could compare the word counts of the query document with those of millions of other documents, but this is too slow.
So we reduce each query vector to a much smaller vector that still contains most of the information about the content of the document.
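The idea of comparing short vectors instead of full word-count vectors can be sketched as follows. The text above does not say how the reduction is done, so this example swaps in a fixed random projection purely as a stand-in; the function names, the output size of 4, the cosine similarity measure, and the toy documents are all assumptions.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def make_projection(n_in, n_out, seed=0):
    """Fixed random Gaussian matrix with n_out rows and n_in columns (assumed method)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def reduce_vector(counts, projection):
    """Map a long count vector to a short one by a matrix-vector product."""
    return [sum(w * c for w, c in zip(row, counts)) for row in projection]

# Toy count vectors over an 11-word vocabulary (illustrative values only).
query = [0, 0, 2, 2, 0, 2, 1, 1, 0, 0, 2]
documents = {
    "doc_a": [0, 0, 4, 4, 0, 4, 2, 2, 0, 0, 4],  # same word proportions as the query
    "doc_b": [3, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # shares no words with the query
}

projection = make_projection(n_in=len(query), n_out=4)
short_query = reduce_vector(query, projection)

# Compare the 4-number vectors instead of the full 11-number count vectors.
scores = {name: cosine(short_query, reduce_vector(vec, projection))
          for name, vec in documents.items()}
best = max(scores, key=scores.get)
print(best)
```

A random projection preserves similarities only approximately, and with so few output dimensions the approximation is rough. A reduction learned from the documents themselves would normally retain more of the content.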
Example word-count vector:

fish      0
cheese    0
vector    2
count     2
school    0
query     2
reduce    1
bag       1
pulpit    0
iraq      0
word      2