The non-linearity used for reconstructing
bags of words
Divide the counts in a bag of words vector by N, where N
is the total number of non-stop words in the document.
The resulting probability vector gives the probability of
getting a particular word if we pick a non-stop word at
random from the document.
At the output of the autoencoder, we use a softmax.
The probability vector defines the desired outputs of
the softmax.
When we train the first RBM in the stack we use the
same trick.
We treat the word counts as probabilities, but we
make the visible to hidden weights N times bigger
than the hidden to visible because we have N
observations from the probability distribution.