A solution to all of these problems
Use greedy unsupervised learning to find a sensible set of
weights one layer at a time. Then fine-tune with
backpropagation.
Greedily learning one layer at a time scales well to really
deep networks.
Most of the information in the final weights comes from
modeling the distribution of input vectors.
The precious information in the labels is only used for
the final fine-tuning.
We do not start backpropagation until we already have
sensible weights that already do well at the task.
So the fine-tuning is well-behaved and quite fast.