What happens when the weights in higher layers
become different from the weights in the first layer?
The higher layers no longer implement a complementary prior.
So performing inference using the frozen weights in
the first layer is no longer correct.
Using this incorrect inference procedure gives a
variational lower bound on the log probability of the data.
We lose by the slackness of the bound.
The higher layers learn a prior that is closer to the
aggregated posterior distribution of the first hidden layer.
This improves the network’s model of the data.
Hinton, Osindero and Teh (2006) prove that this
improvement is always bigger than the loss.
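
The trade-off described above can be written out as a short sketch. The notation is an assumption, not taken from the slide: v is a data vector, h is the first hidden layer, q(h|v) is the approximate posterior computed with the frozen first-layer weights, p(v|h) is the first layer's generative model, and p(h) is the prior over h defined by the higher layers.

```latex
\begin{align}
\log p(v) &\ge \sum_{h} q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr]
            - \sum_{h} q(h \mid v)\,\log q(h \mid v) \\
\log p(v) &= \text{(right-hand side above)}
            + \mathrm{KL}\bigl(q(h \mid v)\,\|\,p(h \mid v)\bigr)
\end{align}
```

The KL term is the slackness of the bound: it is zero while the higher layers still implement a complementary prior, and it can become positive once their weights change. With q(h|v) and p(v|h) frozen, training the higher layers changes only the log p(h) term. Averaged over the training data, raising that term means fitting p(h) to the aggregated posterior of the first hidden layer.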