Using backpropagation for fine-tuning
Greedily learning one layer at a time scales well to really
big networks, especially if we have locality in each layer.
We do not start backpropagation until we already have
sensible weights that do well at the task.
So the initial gradients are sensible and
backpropagation only needs to perform a local search.
Most of the information in the final weights comes from
modeling the distribution of input vectors.
The precious information in the labels is used only for
the final fine-tuning: it slightly adjusts the features,
it does not need to discover them.
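The two-stage recipe above can be sketched end to end: pretrain each layer greedily as a small autoencoder on unlabeled inputs, then attach a label layer and backpropagate through the whole stack. This is a toy NumPy sketch, not the actual experimental setup; the data, layer sizes, learning rates, and iteration counts are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 8 inputs, label = whether the inputs sum past a threshold.
X = rng.random((200, 8))
y = (X.sum(axis=1) > 4.0).astype(float).reshape(-1, 1)

sizes = [8, 16, 8]          # input -> hidden1 -> hidden2
Ws, bs = [], []

# Stage 1: greedy layer-wise pretraining.
# Each layer is trained as a one-hidden-layer autoencoder to model
# the distribution of its input vectors; no labels are used here.
H = X
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(0, 0.1, (n_in, n_out));  b = np.zeros(n_out)
    W_dec = rng.normal(0, 0.1, (n_out, n_in));  b_dec = np.zeros(n_in)
    for _ in range(200):
        Z = sigmoid(H @ W + b)            # encode
        R = Z @ W_dec + b_dec             # linear decode
        err = R - H                       # reconstruction error
        gW_dec = Z.T @ err / len(H)
        gb_dec = err.mean(axis=0)
        dZ = err @ W_dec.T * Z * (1 - Z)
        gW = H.T @ dZ / len(H)
        gb = dZ.mean(axis=0)
        W -= 0.5 * gW;  b -= 0.5 * gb
        W_dec -= 0.5 * gW_dec;  b_dec -= 0.5 * gb_dec
    Ws.append(W);  bs.append(b)
    H = sigmoid(H @ W + b)                # these features feed the next layer

# Stage 2: fine-tuning with backpropagation.
# Add a label layer and backprop through the whole stack; because the
# weights are already sensible, this is only a local search.
V = rng.normal(0, 0.1, (sizes[-1], 1));  c = np.zeros(1)
for _ in range(300):
    acts = [X]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(acts[-1] @ W + b))
    p = sigmoid(acts[-1] @ V + c)
    d = (p - y) / len(X)                  # sigmoid + cross-entropy gradient
    gV = acts[-1].T @ d;  gc = d.sum(axis=0)
    back = d @ V.T
    for i in reversed(range(len(Ws))):
        dZ = back * acts[i + 1] * (1 - acts[i + 1])
        back = dZ @ Ws[i].T               # propagate before updating weights
        Ws[i] -= 2.0 * acts[i].T @ dZ
        bs[i] -= 2.0 * dZ.sum(axis=0)
    V -= 2.0 * gV;  c -= 2.0 * gc

# Evaluate the fine-tuned network.
h = X
for W, b in zip(Ws, bs):
    h = sigmoid(h @ W + b)
acc = ((sigmoid(h @ V + c) > 0.5) == y).mean()
```

Note that the labels only enter in stage 2, and only nudge weights that already encode the input distribution, matching the argument above.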