• Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.
• We do not start backpropagation until we have sensible weights that already do well at the task.
  – So the initial gradients are sensible and backpropagation only needs to perform a local search.
• Most of the information in the final weights comes from modeling the distribution of input vectors.
  – The precious information in the labels is only used for the final fine-tuning. Fine-tuning slightly modifies the features; it does not need to discover them.
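The greedy, label-free stage described in the bullets can be sketched roughly as follows. This is a minimal illustration, not the original implementation: it assumes each layer is a restricted Boltzmann machine trained with one-step contrastive divergence (CD-1), which the bullets themselves do not specify. The function names (`train_rbm`, `greedy_pretrain`, `hidden_probs`, `visible_probs`) and the hyperparameters are hypothetical. Note that no labels appear anywhere; they would only enter in a later backpropagation fine-tuning pass that starts from the returned weights.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b_hid):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hid))]


def visible_probs(h, W, b_vis):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b_vis[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_vis))]


def train_rbm(data, n_hidden, epochs=20, lr=0.1, rng=None):
    """Train a single layer with CD-1 (assumed layer model; uses no labels)."""
    rng = rng or random.Random(0)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    b_vis = [0.0] * n_visible
    b_hid = [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            # Positive phase: hidden units driven by the data vector.
            h0 = hidden_probs(v0, W, b_hid)
            h0_sample = [1.0 if rng.random() < p else 0.0 for p in h0]
            # Negative phase: one reconstruction step.
            v1 = visible_probs(h0_sample, W, b_vis)
            h1 = hidden_probs(v1, W, b_hid)
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
                b_vis[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                b_hid[j] += lr * (h0[j] - h1[j])
    return W, b_hid


def greedy_pretrain(data, layer_sizes, epochs=20, lr=0.1, seed=0):
    """Learn one layer at a time; each layer models the layer below it."""
    rng = random.Random(seed)
    stack = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b_hid = train_rbm(layer_input, n_hidden, epochs, lr, rng)
        stack.append((W, b_hid))
        # Hidden activities become the "data" for the next layer.
        layer_input = [hidden_probs(v, W, b_hid) for v in layer_input]
    return stack


# Unlabeled toy input vectors (made up for illustration).
unlabeled = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1],
             [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
stack = greedy_pretrain(unlabeled, layer_sizes=[4, 2])
```

The returned `stack` of `(weights, hidden biases)` pairs would then initialize a feedforward network, so that backpropagation on the labeled data starts from these features and only has to adjust them.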