 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
| • |
If we start with
a big learning rate, the bias and all of the
|
|
weights for one
of the output units may become very
|
|
|
negative.
|
|
|
– |
The
output unit is then very firmly off and it will never
|
|
|
produce
a significant error derivative.
|
|
|
– |
So
it will never recover (unless we have weight-
|
|
|
decay).
|
|
| • |
In
classification networks that use a squared error, the
|
|
|
best guessing
strategy is to make each output unit
|
|
|
produce an
output equal to the proportion of time it
|
|
|
should be a 1.
|
|
|
– |
The
network finds this strategy quickly and takes a
|
|
|
long
time to improve on it. So it looks like a local
|
|
|
minimum.
|
|