Why we need backpropagation
Networks without hidden units are very limited in the
input-output mappings they can model.
More layers of linear units do not help. Its still linear.
Fixed output non-linearities are not enough
We need multiple layers of adaptive non-linear hidden
units.  This gives us a universal approximator. But how
can we train such nets?
We need an efficient way of adapting all the weights,
not just the last layer. This is hard. Learning the
weights going into hidden units is equivalent to
learning features.
Nobody is telling us directly what hidden units should
do.