Training a deep network
First train a layer of features that receive input directly
from the pixels.
Then treat the activations of the trained features as if
they were pixels and learn features of features in a
second hidden layer.
It can be proved that each time we add another layer of
features we get a better model of the set of training
images.
i.e. we assign lower energy to the real data and
higher energy to all other possible images.
The proof is complicated. It uses variational free
energy, a method that physicists use for analyzing
complicated non-equilibrium systems.
But it is based on a neat equivalence.