Why backpropagation works better after
greedy pre-training
Greedily learning one layer at a time scales well to really
big networks, especially if we have locality in each layer.
We do not start backpropagation until we already have
sensible feature detectors that should be very helpful
for the discrimination task.
So the initial gradients are sensible and backprop only
needs to perform a local search.
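To make the two-phase procedure concrete, here is a minimal sketch (not the exact recipe from the lecture): each layer is pre-trained unsupervised as a tied-weight autoencoder that models the distribution of its input, acting as a simple stand-in for whatever per-layer learner is used, and the stack is then fine-tuned by backpropagation through a new softmax output layer. All names (pretrain_layer, greedy_pretrain, finetune) and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(X, n_hidden, epochs=50, lr=0.5):
    """Unsupervised learning of one layer: a tied-weight autoencoder
    trained to reconstruct its own input (a simple stand-in for an RBM)."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.01, (n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b_h)            # encode
        R = sigmoid(H @ W.T + b_v)          # decode with the same (tied) weights
        dR = (R - X) * R * (1 - R)          # gradient of squared reconstruction error
        dH = (dR @ W) * H * (1 - H)
        W -= lr * (dR.T @ H + X.T @ dH) / len(X)
        b_v -= lr * dR.mean(axis=0)
        b_h -= lr * dH.mean(axis=0)
    return W, b_h, sigmoid(X @ W + b_h)     # weights, hidden bias, hidden activities

def greedy_pretrain(X, layer_sizes):
    """Train one layer at a time; each layer models the distribution of
    the activities produced by the layer below it. No labels are used."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b, H = pretrain_layer(H, n_hidden)
        layers.append((W, b))
    return layers

def finetune(layers, X, y, n_classes, epochs=200, lr=0.5):
    """Supervised fine-tuning: put a softmax layer on top of the
    pre-trained stack and backpropagate from the labels. Starting from
    sensible weights, this only has to perform a local search."""
    n_top = layers[-1][0].shape[1]
    V, c = rng.normal(0.0, 0.01, (n_top, n_classes)), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                           # one-hot targets
    for _ in range(epochs):
        acts = [X]                                     # forward pass, keep every activity
        for W, b in layers:
            acts.append(sigmoid(acts[-1] @ W + b))
        logits = acts[-1] @ V + c
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        d = (P - Y) / len(X)                           # softmax / cross-entropy gradient
        dV, dc = acts[-1].T @ d, d.sum(axis=0)
        d = d @ V.T
        V, c = V - lr * dV, c - lr * dc
        for i in range(len(layers) - 1, -1, -1):       # backprop through the stack
            W, b = layers[i]
            d = d * acts[i + 1] * (1 - acts[i + 1])
            layers[i] = (W - lr * acts[i].T @ d, b - lr * d.sum(axis=0))
            d = d @ W.T                                # uses the pre-update weights
    return layers, V, c
```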
Most of the information in the final weights comes from
modeling the distribution of input vectors.
The precious information in the labels is only used for
the final fine-tuning. The fine-tuning only modifies the
features slightly to get the category boundaries right; it
does not need to discover new features.
This type of backpropagation works well even if most of
the training data is unlabeled. The unlabeled data is
still very useful for discovering good features.
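Continuing the sketch above: pre-training can use every input vector, while the backpropagation fine-tuning only ever sees the small labeled subset. The data and sizes below are placeholders, purely to show the shape of the procedure.

```python
# Placeholder data: 2000 input vectors, only 100 of which are labeled.
X_all = rng.random((2000, 64))                        # all inputs: no labels required
labeled = rng.choice(2000, size=100, replace=False)   # indices of the labeled examples
y_small = rng.integers(0, 10, size=100)               # their class labels

stack = greedy_pretrain(X_all, layer_sizes=[128, 64])       # uses all 2000 vectors
stack, V, c = finetune(stack, X_all[labeled], y_small, 10)  # uses only the 100 labels
```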