NIPS 2007 Tutorial on Deep Belief Nets


What happens when the weights in higher layers become different from the weights in the first layer?

• The higher layers no longer implement a complementary prior.
    – So performing inference using the frozen weights in the first layer is no longer correct.
    – Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
        • We lose by the slackness of the bound.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
    – This improves the network’s model of the data.
        • Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss.
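
The bound and its slackness can be written out explicitly. The sketch below uses notation that is assumed here and does not appear on the slide: v is a data vector, h is the state of the first hidden layer, Q(h | v) is the approximate posterior computed with the frozen first-layer weights, p(v | h) is the frozen first-layer generative model, and p(h) is the prior over h defined by the higher layers.

```latex
% Variational lower bound obtained with the approximate posterior Q(h | v)
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr]
          \;-\; \sum_{h} Q(h \mid v)\,\log Q(h \mid v)

% Slackness of the bound: the KL divergence from Q to the true posterior
\log p(v) \;-\; \Bigl(\sum_{h} Q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr]
          - \sum_{h} Q(h \mid v)\,\log Q(h \mid v)\Bigr)
          \;=\; \mathrm{KL}\bigl(Q(\cdot \mid v)\,\|\,p(\cdot \mid v)\bigr) \;\ge\; 0
```

While the higher-layer weights are still tied to the first-layer weights, Q(h | v) is the true posterior, so the KL term is zero and the bound is tight. With Q(h | v) and p(v | h) held fixed, training the higher layers changes only the log p(h) term. Averaged over the training data, increasing that term moves p(h) toward the aggregated posterior, which raises the bound even as the KL term becomes non-zero.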