• The higher layers no longer implement a complementary prior.
  – So performing inference using W0 transpose is no longer correct.
  – Using this incorrect inference procedure still gives a variational lower bound on the log probability of the data (see the decomposition below).
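The last sub-bullet is the standard variational argument. A sketch, with notation assumed here rather than taken from the slide: v is a visible vector, h is the first hidden layer, Q(h|v) is the approximate posterior computed by an up-pass through W0 transpose, and p(h|v) is the true posterior of the full model:

$$ \log p(v) \;=\; \underbrace{\sum_h Q(h\mid v)\,\big[\log p(v,h) - \log Q(h\mid v)\big]}_{\text{variational lower bound}} \;+\; \underbrace{\mathrm{KL}\big(Q(h\mid v)\,\big\|\,p(h\mid v)\big)}_{\text{slack},\;\ge\, 0} $$

Because the KL term is non-negative, the first term is always a lower bound on log p(v); it is tight exactly when Q matches the true posterior, which is what the complementary prior guaranteed.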
• We lose by the slackness of the bound, i.e. by the KL term in that decomposition.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
  – This improves the variational bound on the network’s model of the data (the prior term is isolated below).
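One way to see why fitting the aggregated posterior helps (again with assumed notation, for training cases v^(1), ..., v^(N)): averaging the bound over the data and grouping the terms that are frozen along with W0 isolates the only term the higher layers can change,

$$ \frac{1}{N}\sum_n \log p\big(v^{(n)}\big) \;\ge\; \underbrace{\frac{1}{N}\sum_n \Big(\mathbb{E}_{Q(h\mid v^{(n)})}\big[\log p(v^{(n)}\mid h)\big] + H\big(Q(h\mid v^{(n)})\big)\Big)}_{\text{fixed once } W_0 \text{ is frozen}} \;+\; \mathbb{E}_{\bar Q(h)}\big[\log p(h)\big] $$

where \bar{Q}(h) = (1/N) \sum_n Q(h | v^(n)) is the aggregated posterior. The last term equals -H(\bar{Q}) - KL(\bar{Q} || p), so it grows, and the bound improves, as the prior p(h) implemented by the higher layers approaches \bar{Q}.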
• Hinton, Osindero and Teh (2006, “A fast learning algorithm for deep belief nets”) prove that the improvement is always bigger than the loss.
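A minimal numerical sketch of the greedy procedure this argument justifies: train the first RBM, freeze W0, then train a second RBM whose training data are samples from the frozen approximate posterior, i.e. from the aggregated posterior. Everything here (CD-1, the toy data, the layer sizes) is an illustrative assumption, not the paper’s exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.1):
    """Train an RBM with one-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)   # visible biases
    b_h = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        # Positive phase: hidden activations given the data.
        p_h = sigmoid(data @ W + b_h)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Negative phase: one reconstruction step.
        p_v = sigmoid(h @ W.T + b_v)
        p_h_recon = sigmoid(p_v @ W + b_h)
        n = data.shape[0]
        W += lr * (data.T @ p_h - p_v.T @ p_h_recon) / n
        b_v += lr * (data - p_v).mean(axis=0)
        b_h += lr * (p_h - p_h_recon).mean(axis=0)
    return W, b_v, b_h

# Toy binary data standing in for the training set.
data = (rng.random((500, 20)) < 0.3).astype(float)

# Greedy step 1: train the first RBM, then freeze W0.
W0, b_v0, b_h0 = train_rbm(data, n_hidden=15)

# Inference with W0 transpose gives the approximate posterior Q(h|v);
# sampling it for every training case yields the aggregated posterior,
# which becomes the "data" for the next layer.
h_samples = (rng.random((500, 15)) < sigmoid(data @ W0 + b_h0)).astype(float)

# Greedy step 2: the second RBM learns a prior for the first hidden layer
# fitted to the aggregated posterior, improving the variational bound.
W1, b_v1, b_h1 = train_rbm(h_samples, n_hidden=10)
```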