Greedily learning one layer at a time scales well to really
big networks, especially if we have locality in each layer.
Most of the information in the final weights comes from
modeling the distribution of input vectors.
The precious information in the labels is only used for
the final fine-tuning.
We do not start backpropagation until we already have
sensible weights that already do well at the task.
So the learning is well-behaved and quite fast.