Training a deep network
(the main reason RBMs are interesting)
First train a layer of features that receive input directly
from the pixels.
Then treat the activations of the trained features as if
they were pixels and learn features of features in a
second hidden layer.
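A minimal sketch of this greedy layer-wise procedure, assuming binary units and one-step contrastive divergence (CD-1); the layer sizes, learning rate, and toy data below are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one binary RBM on `data` (n_cases x n_visible) with CD-1."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        pos_hid = sigmoid(data @ W + b_hid)
        hid_states = (pos_hid > rng.random(pos_hid.shape)).astype(float)
        # Negative phase: one step of alternating Gibbs sampling.
        recon = sigmoid(hid_states @ W.T + b_vis)
        neg_hid = sigmoid(recon @ W + b_hid)
        # CD-1 update: <v h>_data - <v h>_reconstruction.
        W += lr * (data.T @ pos_hid - recon.T @ neg_hid) / len(data)
        b_vis += lr * (data - recon).mean(axis=0)
        b_hid += lr * (pos_hid - neg_hid).mean(axis=0)
    return W, b_vis, b_hid

# Toy binary "pixel" data.
pixels = (rng.random((100, 20)) > 0.5).astype(float)

# First train a layer of features directly on the pixels.
W1, b_v1, b_h1 = train_rbm(pixels, n_hidden=15)

# Treat the activations of those features as if they were pixels
# and learn features of features in a second hidden layer.
hidden1 = sigmoid(pixels @ W1 + b_h1)
W2, b_v2, b_h2 = train_rbm(hidden1, n_hidden=10)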
It can be proved that each time we add another layer of
features we improve a variational lower bound on the log
probability of the training data.
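A sketch of the bound in question, writing Q(h|v) for the approximate posterior defined by the first RBM's upward (recognition) weights and H for its entropy:

\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr] \;+\; H\bigl(Q(h \mid v)\bigr)

Roughly, freezing the first layer fixes p(v|h) and Q(h|v), so training the next layer to be a better model of the aggregated posterior over h improves the log p(h) term, and hence the bound.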
The proof is slightly complicated.
But it is based on a neat equivalence between an RBM and a deep directed model (described later).