Multilayer contrastive divergence
Start by learning one hidden layer.
Then re-present the data as the activities of the
hidden units.
The same learning algorithm can now be
applied to the re-presented data.
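The greedy procedure above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes binary restricted Boltzmann machines trained with one-step contrastive divergence (CD-1), full-batch updates, and hidden probabilities (rather than sampled states) as the re-presented data. The names `train_rbm_cd1`, `represent`, and `train_stack`, and all hyperparameter values, are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm_cd1(data, n_hidden, epochs=20, lr=0.1, seed=0):
    """Train one binary RBM layer with CD-1 (assumed learning rule)."""
    rng = np.random.default_rng(seed)
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities driven by the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one step of reconstruction.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid


def represent(data, W, b_hid):
    """Re-present the data as the activities of the hidden units."""
    return sigmoid(data @ W + b_hid)


def train_stack(data, layer_sizes, **kwargs):
    """Greedily train one layer at a time, feeding each layer's
    hidden activities to the next layer as its data."""
    layers = []
    x = data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm_cd1(x, n_hidden, **kwargs)
        layers.append((W, b_vis, b_hid))
        x = represent(x, W, b_hid)  # becomes the data for the next layer
    return layers, x
```

For example, `train_stack(data, [8, 4])` on a binary array `data` of shape `(n_samples, n_visible)` would train an 8-unit layer on the data and then a 4-unit layer on the 8-unit activities.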
Can we prove that each step of this greedy
learning improves a bound on the log probability
of the data under the overall model?
What is the overall model?
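One standard form such a bound could take is the variational (Jensen) lower bound below. This is a sketch under assumed notation, none of which appears in the text above: v is a data vector, h is the first hidden layer, Q(h | v) is any distribution over hidden states given the data, p(h) is the model's prior over hidden states, and p(v | h) is the generative conditional.

```latex
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr]
          \;-\; \sum_{h} Q(h \mid v)\,\log Q(h \mid v)
```

Under this reading, holding Q(h | v) and p(v | h) fixed and training the next layer on the re-presented data would change only the p(h) term, which is one way to make "improves a bound" precise.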