• Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.
• We do not start backpropagation until we already have sensible weights that do well at the task.
– So the initial gradients are sensible, and backprop only needs to perform a local search.
• Most of the information in the final weights comes from modeling the distribution of input vectors.
– The precious information in the labels is only used for the final fine-tuning: it slightly modifies the features, but it does not need to discover them.
– This type of backpropagation works well even if most of the training data is unlabeled. The unlabeled data is still very useful for discovering good features (a code sketch of the pretrain-then-fine-tune recipe follows below).
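The slides give no code, but the recipe can be made concrete. The sketch below is an illustrative assumption, not the course's own implementation: it uses PyTorch with synthetic stand-in data, pretrains each layer greedily as a small autoencoder on unlabeled inputs, and only then adds a label layer and fine-tunes the whole stack with backprop on the labeled examples. The layer sizes, optimizers, epoch counts, and data are all hypothetical.

# Minimal sketch (illustration only): greedy layer-wise pretraining with
# one-layer autoencoders, then supervised fine-tuning by backprop.
import torch
import torch.nn as nn

torch.manual_seed(0)
unlabeled_x = torch.randn(5000, 784)                    # plenty of unlabeled inputs
labeled_x = torch.randn(500, 784)                       # far fewer labeled examples
labeled_y = torch.randint(0, 10, (500,))

layer_sizes = [784, 256, 64]
encoders = []

# Phase 1: learn one layer at a time from the input distribution alone.
inputs = unlabeled_x
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(n_in, n_out), nn.Sigmoid())
    dec = nn.Linear(n_out, n_in)                        # throwaway decoder for this layer
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(50):                                 # train to reconstruct this layer's input
        opt.zero_grad()
        loss = ((dec(enc(inputs)) - inputs) ** 2).mean()
        loss.backward()
        opt.step()
    encoders.append(enc)
    inputs = enc(inputs).detach()                       # its codes become the next layer's data

# Phase 2: add a label layer; backprop fine-tunes already-sensible weights.
model = nn.Sequential(*encoders, nn.Linear(layer_sizes[-1], 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)     # small steps: only a local search
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    opt.zero_grad()
    loss_fn(model(labeled_x), labeled_y).backward()
    opt.step()

The small fine-tuning learning rate reflects the point above: because the pretrained weights are already sensible, backprop only has to make a local adjustment to the features rather than discover them from scratch, and the labels are touched only in this final phase.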