lecture 24


How to find documents that are similar to a query document

• Convert each document into a “bag of words”.
  – This is a vector of word counts ignoring the order.
  – Ignore stop words (like “the” or “over”).

• We could compare the word counts of the query document and millions of other documents, but this is too slow.
  – So we reduce each query vector to a much smaller vector that still contains most of the information about the content of the document.
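To make the cost of the direct comparison concrete, here is a minimal sketch of a brute-force search over count vectors. The slide does not name a similarity measure, so cosine similarity is an assumption here, and the function names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Linear scan: score the query against every document vector.

    Returns (index, score) of the best match. The work grows with
    (number of documents) x (vocabulary size), which is the slow part.
    """
    scores = [cosine_similarity(query, doc) for doc in documents]
    best = max(range(len(documents)), key=lambda i: scores[i])
    return best, scores[best]


# Toy example with a 3-word vocabulary (made-up counts).
documents = [[0, 3, 0], [4, 0, 2], [1, 1, 1]]
index, score = most_similar([2, 0, 1], documents)
print(index)  # 1, since [4, 0, 2] points in the same direction as the query
```

Every query touches every document and every vocabulary entry, so with millions of documents and a large vocabulary the scan becomes expensive. That is the motivation for working with a much smaller vector per document.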


Word      Count
fish      0
cheese    0
vector    2
count     2
school    0
query     2
reduce    1
bag       1
pulpit    0
iraq      0
word      2
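
A minimal sketch of how such a count vector could be built, assuming a fixed vocabulary (the eleven words listed above), a tiny illustrative stop list, and simple lowercase tokenization with no stemming. The name `bag_of_words`, the stop list, and the tokenizer are assumptions made for illustration and are not taken from the lecture.

```python
import re
from collections import Counter

# Vocabulary taken from the example above; the order fixes the vector layout.
VOCABULARY = ["fish", "cheese", "vector", "count", "school", "query",
              "reduce", "bag", "pulpit", "iraq", "word"]

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "over", "a", "of", "to", "and"}


def bag_of_words(text, vocabulary=VOCABULARY):
    """Return the count of each vocabulary word in text, ignoring word order."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [counts[word] for word in vocabulary]


text = "The query vector is a bag; word order over the bag is ignored"
print(bag_of_words(text))
# [0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1]
```

Because only counts are kept, shuffling the words of a document gives the same vector. Without stemming, "words" and "word" count as different tokens, so a practical version would probably normalize word forms first.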