Why greedy learning works
Each time we learn a new layer, the inference at
the layer below becomes incorrect, but the
variational bound on the log probability of the data
improves provided we start the learning from the
tied weights that implement the complementary
prior.
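
The bound referred to here can be written out as a sketch. The symbols are assumptions introduced for illustration, not notation from the slide: v is a data vector, h is the state of the first hidden layer, Q(h|v) is the approximate posterior computed by the frozen lower layer, and p(h) is the prior over h that the higher layers define.

```latex
\log p(\mathbf{v}) \;\ge\;
  \sum_{\mathbf{h}} Q(\mathbf{h}\mid\mathbf{v})
    \bigl[\log p(\mathbf{h}) + \log p(\mathbf{v}\mid\mathbf{h})\bigr]
  \;-\; \sum_{\mathbf{h}} Q(\mathbf{h}\mid\mathbf{v}) \log Q(\mathbf{h}\mid\mathbf{v})
```

With tied weights, Q(h|v) is the true posterior, so the bound starts out tight. If Q(h|v) and p(v|h) are then held fixed, the only term that learning the next layer changes is the one containing log p(h), so improving the model of h cannot lower the bound, even though Q(h|v) is no longer the exact posterior.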
Now that we have a guarantee, we can loosen
the restrictions and still feel confident:
Allow layers to vary in size.
Do not start the learning at each layer from
the weights in the layer below.
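
The relaxed procedure might be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: each layer is taken to be a binary RBM trained with one step of contrastive divergence (CD-1) on full batches, hidden probabilities are passed upward as the next layer's data, and the names `train_rbm`, `train_stack`, and `layer_sizes` are hypothetical. Each layer has its own size and starts from small random weights rather than from the weights of the layer below, so the bound guarantee no longer applies strictly.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(data, n_hidden, rng, epochs=5, lr=0.1):
    """Train one binary RBM with CD-1 (assumed learning rule).

    Weights start from small random values, NOT from the layer below.
    Returns (W, b_vis, b_hid).
    """
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one reconstruction step.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # CD-1 updates, averaged over the batch.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid


def train_stack(data, layer_sizes, seed=0):
    """Greedily train a stack of RBMs, one layer at a time.

    layer_sizes may differ from layer to layer (the first relaxation).
    """
    rng = np.random.default_rng(seed)
    layers = []
    x = data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm(x, n_hidden, rng)
        layers.append((W, b_vis, b_hid))
        # Freeze this layer; its hidden probabilities become the
        # training data for the next layer (an assumption of this sketch).
        x = sigmoid(x @ W + b_hid)
    return layers


# Toy binary data: 64 samples, 20 visible units.
toy_rng = np.random.default_rng(1)
toy = (toy_rng.random((64, 20)) < 0.3).astype(float)
stack = train_stack(toy, layer_sizes=[12, 8, 5])
shapes = [W.shape for W, _, _ in stack]
```

Here `shapes` records one weight matrix per layer, each sized by the layer below and the layer being learned, which makes the varying layer sizes explicit.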