• Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.
• We do not start backpropagation until we already have sensible weights that do well at the task.
– So the initial gradients are sensible, and backprop only needs to perform a local search.
• Most of the information in the final weights comes from modeling the distribution of input vectors.
– The precious information in the labels is only used for the final fine-tuning: it slightly modifies the features, but it does not need to discover them.
– This type of backpropagation works well even if most of the training data is unlabeled. The unlabeled data is still very useful for discovering good features (a code sketch of the pretrain-then-fine-tune recipe follows below).
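The slides give no code, but the recipe can be made concrete. The sketch below is an illustrative assumption, not the course's own implementation: it uses PyTorch with synthetic stand-in data, pretrains each layer greedily as a small autoencoder on unlabeled inputs, and only then adds a label layer and fine-tunes the whole stack with backprop on the labeled examples. The layer sizes, optimizers, epoch counts, and data are all hypothetical.

# Minimal sketch (illustration only): greedy layer-wise pretraining with
# one-layer autoencoders, then supervised fine-tuning by backprop.
import torch
import torch.nn as nn

torch.manual_seed(0)
unlabeled_x = torch.randn(5000, 784)                    # plenty of unlabeled inputs
labeled_x = torch.randn(500, 784)                       # far fewer labeled examples
labeled_y = torch.randint(0, 10, (500,))

layer_sizes = [784, 256, 64]
encoders = []

# Phase 1: learn one layer at a time from the input distribution alone.
inputs = unlabeled_x
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(n_in, n_out), nn.Sigmoid())
    dec = nn.Linear(n_out, n_in)                        # throwaway decoder for this layer
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(50):                                 # train to reconstruct this layer's input
        opt.zero_grad()
        loss = ((dec(enc(inputs)) - inputs) ** 2).mean()
        loss.backward()
        opt.step()
    encoders.append(enc)
    inputs = enc(inputs).detach()                       # its codes become the next layer's data

# Phase 2: add a label layer; backprop fine-tunes already-sensible weights.
model = nn.Sequential(*encoders, nn.Linear(layer_sizes[-1], 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)     # small steps: only a local search
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    opt.zero_grad()
    loss_fn(model(labeled_x), labeled_y).backward()
    opt.step()

The small fine-tuning learning rate reflects the point above: because the pretrained weights are already sensible, backprop only has to make a local adjustment to the features rather than discover them from scratch, and the labels are touched only in this final phase.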