Multilayer contrastive divergence
Start by learning one hidden layer.
Then re-present the data as the activities of the hidden units.
The same learning algorithm can now be applied to the re-presented data.
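
A minimal sketch of this greedy layer-wise procedure, assuming binary stochastic units and single-step contrastive divergence (CD-1); the names RBM, cd1_update, and greedy_train are illustrative, not from the source:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        # Positive phase: hidden activities driven by the data.
        h0_prob = self.hidden_probs(v0)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one step of Gibbs sampling (the reconstruction).
        v1_prob = self.visible_probs(h0)
        h1_prob = self.hidden_probs(v1_prob)
        # CD-1 approximation to the gradient: <vh>_data - <vh>_recon.
        batch = v0.shape[0]
        self.W   += self.lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
        self.b_v += self.lr * (v0 - v1_prob).mean(axis=0)
        self.b_h += self.lr * (h0_prob - h1_prob).mean(axis=0)

def greedy_train(data, layer_sizes, epochs=10):
    # Train a stack of RBMs: each layer's hidden activities become
    # the re-presented data for the next layer.
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(x)
        rbms.append(rbm)
        # Re-present the data as hidden-unit activities
        # (probabilities are used here; one could also sample).
        x = rbm.hidden_probs(x)
    return rbms

# Example: stack two RBMs on random binary data.
data = (rng.random((100, 20)) < 0.5).astype(float)
stack = greedy_train(data, layer_sizes=[16, 8])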
Can we prove that each step of this greedy learning improves the log probability of the data under the overall model?
What is the overall model?
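
One standard way to frame the answer, sketched here as background rather than taken from this slide: let the first layer define an approximate posterior Q(h|v) over its hidden units. Then the log probability of the data has a variational lower bound,

$$\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\big[\log p(v \mid h) + \log p(h)\big] \;+\; H\big(Q(h \mid v)\big),$$

where H is the entropy of Q. If the first layer's weights are frozen, Q(h|v) and log p(v|h) are fixed, so training the next layer to better model p(h) on samples from Q can only increase this bound.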