What happens when the weights in higher layers
become different from the weights in the first layer?
The higher layers no longer implement a complementary prior.
So performing inference using W0 transpose is no longer
correct.
Using this incorrect inference procedure still gives a variational
lower bound on the log probability of the data, but we lose by the
slackness of the bound.
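
For reference, a sketch of the bound in standard notation (the symbols are not from the transcript): if Q(h|v) is the distribution over first-hidden-layer states produced by inference with W0 transpose,

\[
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\big[\log p(h) + \log p(v \mid h)\big]
\;-\; \sum_{h} Q(h \mid v)\,\log Q(h \mid v),
\]

and the slackness is exactly \(\mathrm{KL}\big(Q(h \mid v)\,\|\,p(h \mid v)\big)\), the divergence between the approximate posterior and the true posterior.
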
The higher layers learn a prior that is closer to the aggregated
posterior distribution over the first hidden layer.
This improves the variational lower bound on the log probability
that the network's model assigns to the data.
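
One way to see this (a sketch, using the notation above): with the first-layer weights frozen, Q(h|v) and p(v|h) are fixed, so averaged over the N training cases the only term in the bound that the higher-layer weights can change is

\[
\frac{1}{N}\sum_{v}\sum_{h} Q(h \mid v)\,\log p(h)
\;=\; \mathbb{E}_{h \sim \bar{Q}}\big[\log p(h)\big],
\qquad
\bar{Q}(h) \;=\; \frac{1}{N}\sum_{v} Q(h \mid v),
\]

so fitting the prior p(h) to the aggregated posterior \(\bar{Q}\) raises this term, and with it the bound.
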
Hinton, Osindero and Teh (2006) prove that this improvement is
always bigger than the loss caused by the less accurate inference.
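
As an illustration of the greedy procedure described here, below is a minimal sketch (not the authors' code; binary units, CD-1 learning, and made-up layer sizes and data): the first RBM is trained and frozen, its recognition connections map each data vector to first-hidden-layer states, and a second RBM is then trained on those states, i.e. on the aggregated posterior, learning a better prior for the first hidden layer.

```python
# Minimal sketch of greedy layer-wise training with numpy (not the authors'
# code): binary units, CD-1 learning, placeholder sizes and random data.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.05, batch_size=100):
    """Train a binary RBM with one step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(n_epochs):
        for start in range(0, len(data), batch_size):
            v0 = data[start:start + batch_size]
            # Up pass (recognition): infer hidden states from the visibles.
            p_h0 = sigmoid(v0 @ W + b_hid)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
            # Down-up pass for the negative phase of CD-1.
            p_v1 = sigmoid(h0 @ W.T + b_vis)
            p_h1 = sigmoid(p_v1 @ W + b_hid)
            # Approximate maximum-likelihood update.
            W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
            b_vis += lr * (v0 - p_v1).mean(axis=0)
            b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# Placeholder binary data standing in for a real training set.
data = (rng.random((1000, 784)) < 0.5).astype(float)

# Train and freeze the first RBM.
W0, b_v0, b_h0 = train_rbm(data, n_hidden=500)

# Recognition with the frozen first-layer weights (the "W0 transpose"
# inference of the transcript, up to the parameterization used here) gives
# samples from the aggregated posterior over the first hidden layer.
p_h = sigmoid(data @ W0 + b_h0)
h_samples = (rng.random(p_h.shape) < p_h).astype(float)

# The second RBM models the aggregated posterior, i.e. it learns a better
# prior for the first hidden layer.
W1, b_v1, b_h1 = train_rbm(h_samples, n_hidden=500)
```

In the actual proof the higher-level RBM is initialized with the transpose of W0, so that it starts out implementing the complementary prior; the sketch above omits that initialization.
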