What happens when the weights in higher layers
become different from the weights in the first layer?
The higher layers no longer implement a complementary prior.
So performing inference using W0 transpose is no longer
correct.
Using this incorrect inference procedure still gives a variational
lower bound on the log probability of the data, but we lose by the
slackness of the bound.
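
For reference, a sketch of the bound in standard notation (the symbols are not from the transcript): if Q(h|v) is the distribution over first-hidden-layer states produced by inference with W0 transpose,

\[
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\big[\log p(h) + \log p(v \mid h)\big]
\;-\; \sum_{h} Q(h \mid v)\,\log Q(h \mid v),
\]

and the slackness is exactly \(\mathrm{KL}\big(Q(h \mid v)\,\|\,p(h \mid v)\big)\), the divergence between the approximate posterior and the true posterior.
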
The higher layers learn a prior that is closer to the aggregated
posterior distribution over the first hidden layer.
This improves the variational lower bound on the log probability
that the network's model assigns to the data.
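
One way to see this (a sketch, using the notation above): with the first-layer weights frozen, Q(h|v) and p(v|h) are fixed, so averaged over the N training cases the only term in the bound that the higher-layer weights can change is

\[
\frac{1}{N}\sum_{v}\sum_{h} Q(h \mid v)\,\log p(h)
\;=\; \mathbb{E}_{h \sim \bar{Q}}\big[\log p(h)\big],
\qquad
\bar{Q}(h) \;=\; \frac{1}{N}\sum_{v} Q(h \mid v),
\]

so fitting the prior p(h) to the aggregated posterior \(\bar{Q}\) raises this term, and with it the bound.
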
Hinton, Osindero and Teh (2006) prove that this improvement is
always bigger than the loss caused by the less accurate inference.
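
As an illustration of the greedy procedure described here, below is a minimal sketch (not the authors' code; binary units, CD-1 learning, and made-up layer sizes and data): the first RBM is trained and frozen, its recognition connections map each data vector to first-hidden-layer states, and a second RBM is then trained on those states, i.e. on the aggregated posterior, learning a better prior for the first hidden layer.

```python
# Minimal sketch of greedy layer-wise training with numpy (not the authors'
# code): binary units, CD-1 learning, placeholder sizes and random data.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.05, batch_size=100):
    """Train a binary RBM with one step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(n_epochs):
        for start in range(0, len(data), batch_size):
            v0 = data[start:start + batch_size]
            # Up pass (recognition): infer hidden states from the visibles.
            p_h0 = sigmoid(v0 @ W + b_hid)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
            # Down-up pass for the negative phase of CD-1.
            p_v1 = sigmoid(h0 @ W.T + b_vis)
            p_h1 = sigmoid(p_v1 @ W + b_hid)
            # Approximate maximum-likelihood update.
            W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
            b_vis += lr * (v0 - p_v1).mean(axis=0)
            b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# Placeholder binary data standing in for a real training set.
data = (rng.random((1000, 784)) < 0.5).astype(float)

# Train and freeze the first RBM.
W0, b_v0, b_h0 = train_rbm(data, n_hidden=500)

# Recognition with the frozen first-layer weights (the "W0 transpose"
# inference of the transcript, up to the parameterization used here) gives
# samples from the aggregated posterior over the first hidden layer.
p_h = sigmoid(data @ W0 + b_h0)
h_samples = (rng.random(p_h.shape) < p_h).astype(float)

# The second RBM models the aggregated posterior, i.e. it learns a better
# prior for the first hidden layer.
W1, b_v1, b_h1 = train_rbm(h_samples, n_hidden=500)
```

In the actual proof the higher-level RBM is initialized with the transpose of W0, so that it starts out implementing the complementary prior; the sketch above omits that initialization.
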