• The higher layers no longer implement a complementary prior.

  – So performing inference using the frozen weights in the first layer is no longer correct.

  – Using this incorrect inference procedure still gives a variational lower bound on the log probability of the data.
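
One standard way to write this bound, with Q(h|v) denoting the approximate posterior defined by the frozen first-layer weights (notation assumed here for illustration):

```latex
\[
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr]
\;-\; \sum_{h} Q(h \mid v)\,\log Q(h \mid v)
\]
```

The gap between the two sides is exactly \(\mathrm{KL}\bigl(Q(h \mid v)\,\|\,p(h \mid v)\bigr)\), which is zero only when the inference procedure is correct.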



• What we lose is the slackness of the bound: the KL divergence between the incorrect (frozen) posterior and the true posterior.



• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.


  – This improves the network’s model of the data.



• Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss.
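
The mechanism can be sketched numerically in a toy model: hold the recognition distribution Q(h|v) fixed (as the frozen first layer does) and move the prior p(h) toward the aggregated posterior; the variational bound on the data log probability can only go up. All the numbers below are hypothetical, and the two-value variables are a deliberate simplification of a real RBM layer.

```python
import math

# Toy first layer over binary v and binary h (hypothetical numbers).
# p(v|h): likelihood defined by the frozen first-layer weights.
p_v_given_h = {0: {0: 0.8, 1: 0.2},   # p(v | h=0)
               1: {0: 0.3, 1: 0.7}}   # p(v | h=1)

# Frozen recognition distribution Q(h|v) from the first layer.
Q = {0: {0: 0.9, 1: 0.1},   # Q(h | v=0)
     1: {0: 0.2, 1: 0.8}}   # Q(h | v=1)

data = [0, 1, 1, 1, 0, 1]   # observed v's (hypothetical)

def avg_bound(prior):
    """Average variational lower bound over the data for a given prior p(h):
    sum_h Q(h|v) [log p(h) + log p(v|h) - log Q(h|v)], averaged over v."""
    total = 0.0
    for v in data:
        for h in (0, 1):
            q = Q[v][h]
            if q > 0:
                total += q * (math.log(prior[h])
                              + math.log(p_v_given_h[h][v])
                              - math.log(q))
    return total / len(data)

uniform_prior = {0: 0.5, 1: 0.5}

# Aggregated posterior: the average of Q(h|v) over the data --
# the distribution the higher layers are being trained to model.
agg = {h: sum(Q[v][h] for v in data) / len(data) for h in (0, 1)}

# Moving the prior toward the aggregated posterior raises the bound.
print("bound, uniform prior:   ", avg_bound(uniform_prior))
print("bound, aggregated prior:", avg_bound(agg))
```

The prior-dependent term of the bound is a cross-entropy against the aggregated posterior, so the aggregated posterior is the prior that maximizes it; this is the sense in which a better prior in the higher layers improves the model even though the inference stays frozen.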

