Training a deep network
First train a layer of features that receive input directly
from the pixels.
Then treat the activations of the trained features as if
they were pixels and learn features of features in a
second hidden layer.
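The two steps above can be sketched as a greedy layer-wise pretraining loop. This is a minimal illustration, assuming a simple RBM-style layer trained with one-step contrastive divergence on toy binary data; biases and other refinements are omitted for brevity, and all names (`train_rbm`, `greedy_pretrain`, the layer sizes) are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one layer of features with one-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    for _ in range(epochs):
        # positive phase: hidden probabilities given the data
        h_prob = sigmoid(data @ W)
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        # negative phase: one reconstruction step
        v_recon = sigmoid(h_samp @ W.T)
        h_recon = sigmoid(v_recon @ W)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

def greedy_pretrain(data, layer_sizes):
    """Stack layers: each layer's activations become the next layer's 'pixels'."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)  # treat the learned features as data for the next layer
    return weights

# toy "pixel" data: 100 binary vectors of length 16
pixels = (rng.random((100, 16)) < 0.5).astype(float)
stack = greedy_pretrain(pixels, [8, 4])
print([W.shape for W in stack])  # [(16, 8), (8, 4)]
```

Note how the second layer never sees the pixels at all: it only sees the first layer's feature activations, exactly as the text describes.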
It can be proved that each time we add another layer of
features we get a better model of the set of training data.
The proof is complicated. It uses variational free
energy, a method that physicists use for analyzing
complicated non-equilibrium systems.
But there is a simple intuitive explanation.