Why greedy learning works
Each time we learn a new layer, the inference at the
layer below becomes incorrect, but the variational bound
on the log prob of the data improves.
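The claim can be sketched with the standard variational bound (notation is an assumption here: v is a data vector, h the hidden units of the layer being replaced, and Q the recognition distribution used for inference):

```latex
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\big[\log p(h) + \log p(v \mid h)\big] \;+\; H\!\big(Q(h \mid v)\big)
```

Learning a new layer keeps Q and p(v|h) frozen and only replaces the prior term log p(h) with a better model of the aggregated hidden activities, so the bound can only go up, even though the frozen Q is no longer the exact posterior.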
Since the bound starts as an equality, learning a new
layer never decreases the log prob of the data, provided
we start the learning from the tied weights that
implement the complementary prior.
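Why the bound starts as an equality, sketched in the same notation (v data vector, h hidden units, Q recognition distribution): the gap between the log probability and the bound is a KL divergence,

```latex
\log p(v) - \Big(\sum_{h} Q(h \mid v)\,\big[\log p(h) + \log p(v \mid h)\big] + H(Q)\Big)
\;=\; \mathrm{KL}\!\big(Q(h \mid v)\,\big\|\,p(h \mid v)\big) \;\ge\; 0 .
```

With tied weights implementing the complementary prior, Q(h|v) is the exact posterior, so the KL term is zero and the bound is tight. Training the new layer then only raises the bound, so the log prob of the data never falls below its starting value.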
Now that we have this guarantee, we can loosen the
restrictions and still feel confident.
Allow layers to vary in size.
Do not start the learning at each layer from the
weights in the layer below.