• Convert each document into a “bag of words”.
  – This is a vector of word counts, ignoring the order.
  – Ignore stop words (like “the” or “over”).
• We could compare the word counts of the query document and millions of other documents, but this is too slow.
  – So we reduce each query vector to a much smaller vector that still contains most of the information about the content of the document.
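
The bag-of-words step and the slow brute-force comparison can be sketched as follows. This is a minimal illustration, not the method used in the slides: the stop-word list, the function names, and the choice of cosine similarity as the comparison measure are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "over", "a", "an", "of", "and"}


def bag_of_words(text):
    """Return word counts for text, ignoring word order and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Brute-force search: compares the query against every single document."""
    q = bag_of_words(query)
    scores = [cosine_similarity(q, bag_of_words(d)) for d in documents]
    return max(range(len(documents)), key=scores.__getitem__)
```

Because `most_similar` touches every document and every word count, its cost grows with both the collection size and the vocabulary size, which is the motivation for reducing each count vector to a much smaller one before comparing.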