Why does layer-by-layer learning work?
The weights, W,  in the bottom level RBM define p(v|h)
and they also, indirectly, define p(h).
So we can express the RBM model as
conditional
probability
joint
probability
index over all
hidden vectors
If we leave p(v|h) alone and build a better model of p(h),
we will improve p(v).
We need a better model of the posterior hidden vectors
produced by applying W to the data.