Learning multiple layers of features greedily

Why does layer-by-layer learning work?


The weights, W, in the bottom level RBM define p(v\|h)
and they also, indirectly, define p(h).

So we can express the RBM model as


If we leave p(v\|h) alone and build a better model of p(h),
we will improve p(v).

We need a better model of the posterior hidden vectors
produced by applying W to the data.