How to compute a predictive distribution across 17,000 words
• The hidden units predict a point in the 100-dimensional feature space.
• The probability of each word then depends on how close its feature vector is to this predicted point.
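
The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the slide does not say how closeness is measured or how it is turned into a probability, so the sketch assumes squared Euclidean distance and a softmax over the negative distances. The function name `predictive_distribution`, the random feature vectors, and the choice of word index 42 are all hypothetical.

```python
import numpy as np

def predictive_distribution(predicted_point, word_features):
    """Turn distances to a predicted point into a distribution over words.

    Assumption (not from the slide): closeness is squared Euclidean
    distance, and probabilities are a softmax over negative distances.
    """
    # Squared distance from every word's feature vector to the predicted point.
    diffs = word_features - predicted_point          # shape (vocab, dim)
    sq_dists = np.sum(diffs ** 2, axis=1)            # shape (vocab,)

    # Closer words get larger scores; subtract the max for numerical stability.
    logits = -sq_dists
    logits = logits - logits.max()

    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical data: 17,000 words, each with a 100-dimensional feature vector.
rng = np.random.default_rng(0)
vocab_size, dim = 17000, 100
word_features = rng.normal(size=(vocab_size, dim))

# Stand-in for the hidden units' output: a point very close to word 42.
predicted_point = word_features[42] + 0.01 * rng.normal(size=dim)

probs = predictive_distribution(predicted_point, word_features)
```

Here `probs` has one entry per word and sums to 1, and the word whose feature vector lies nearest the predicted point receives the most probability. Any monotonically decreasing function of distance would fit the slide's description equally well.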