• The higher layers no longer implement a complementary prior.
  – So performing inference using the frozen weights in the first layer is no longer correct.
  – Using this incorrect inference procedure gives a variational lower bound on the log probability of the data (written out below).
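As a sketch of that bound (the notation is assumed here, not on the slide): write v for a visible vector, h for the states of the first hidden layer, Q(h|v) for the recognition distribution defined by the frozen first-layer weights, p(v|h) for the first layer's generative weights, and p(h) for the prior over h implemented by the higher layers. The standard variational bound is then

\[
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr] \;+\; H\bigl(Q(h \mid v)\bigr),
\]

where H denotes entropy. Training the higher layers only changes the log p(h) term.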
• We lose by the slackness of the bound (quantified below).
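How much we lose has a closed form. By a standard identity, in the notation introduced above, subtracting the bound from the log probability leaves exactly the Kullback–Leibler divergence between the recognition distribution and the true posterior:

\[
\log p(v) \;-\; \text{bound} \;=\; \mathrm{KL}\bigl(Q(h \mid v)\,\big\|\,p(h \mid v)\bigr) \;\ge\; 0,
\]

which is zero precisely in the complementary-prior case, where Q(h|v) is the true posterior.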
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
  – This improves the network’s model of the data (as sketched below).
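To see why, note that in the notation above the only term of the bound that depends on the higher layers, averaged over training cases v_1, …, v_N, is the expected log prior, and by Gibbs' inequality it is maximized when the prior matches the aggregated posterior:

\[
\frac{1}{N}\sum_{n=1}^{N} \sum_{h} Q(h \mid v_n)\,\log p(h)
\quad\text{is maximized at}\quad
p(h) \;=\; \bar{Q}(h) \;\equiv\; \frac{1}{N}\sum_{n=1}^{N} Q(h \mid v_n).
\]

So training the higher layers on samples of h drawn from Q pulls their prior toward the aggregated posterior \(\bar{Q}\).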
• Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss (proof sketch below).
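A sketch of the argument, again in the assumed notation: at the moment the first-layer weights are frozen, the higher layers are initialized so that p(h) equals the prior implicitly defined by the first RBM, and for an RBM Q(h|v) is the exact posterior, so the bound starts out tight, equal to the first RBM's log probability of the data. Provided the higher layers are then trained so as to increase the bound (e.g. maximum-likelihood learning on samples of h from Q), the bound never decreases, and the true log probability always sits above it:

\[
\log p_{\mathrm{DBN}}(v) \;\ge\; \text{bound}_{\text{after}} \;\ge\; \text{bound}_{\text{init}} \;=\; \log p_{\mathrm{RBM}}(v).
\]

Any slack opened up by the now-incorrect inference is therefore paid for by the improved prior.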