Extra problems that occur in multilayer non-linear networks
If we start with a big learning rate, the bias and all of the weights for one of the output units may become very negative.
The output unit is then very firmly off. For a saturating unit such as a logistic, the slope of the non-linearity is then close to zero, so the unit will never produce a significant error derivative.
So it will never recover (unless we have weight decay, which slowly shrinks the weights back toward zero).
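
The vanishing error derivative can be checked numerically. The sketch below assumes a logistic output unit and the squared error E = 0.5 * (target - y)^2; neither is stated explicitly above, and the function names and input values are hypothetical, chosen only for illustration.

```python
import math


def logistic(z):
    """Logistic non-linearity: maps a total input z to an output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def error_derivative_wrt_input(z, target):
    """dE/dz for E = 0.5 * (target - y)**2 with y = logistic(z).

    By the chain rule, dE/dz = -(target - y) * y * (1 - y).
    """
    y = logistic(z)
    return -(target - y) * y * (1.0 - y)


# Assumed total inputs: 0.0 for a unit in its sensitive range,
# -20.0 for a unit whose bias and weights have become very negative.
moderate = error_derivative_wrt_input(z=0.0, target=1.0)
saturated = error_derivative_wrt_input(z=-20.0, target=1.0)

print("derivative at z = 0:  ", moderate)
print("derivative at z = -20:", saturated)
```

Even though the saturated unit is almost completely wrong (its target is 1 and its output is close to 0), the factor y * (1 - y) makes its derivative many orders of magnitude smaller than that of the unit at z = 0, so gradient steps barely move it.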
In classification networks that use a squared error, the best guessing strategy (one that ignores the input) is to make each output unit always produce an output equal to the proportion of time it should be a 1.
The network finds this strategy quickly and takes a
long time to improve on it. So it looks like a local
minimum.
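
The claim about the best guessing strategy can be illustrated for a single output unit. This is a minimal sketch with made-up targets: it searches over constant outputs and compares the one with the lowest mean squared error to the proportion of 1s. The target list and the grid of candidates are assumptions for illustration only.

```python
# Hypothetical targets for one output unit: it should be 1 on 3 of 10 cases.
targets = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]


def mean_squared_error(output, targets):
    """Mean squared error when the unit produces the same output on every case."""
    return sum((t - output) ** 2 for t in targets) / len(targets)


proportion_of_ones = sum(targets) / len(targets)

# Try constant outputs 0.00, 0.01, ..., 1.00 and keep the one with lowest error.
candidates = [i / 100 for i in range(101)]
best_constant = min(candidates, key=lambda y: mean_squared_error(y, targets))

print("proportion of 1s:    ", proportion_of_ones)
print("best constant output:", best_constant)
```

The two values coincide because the mean squared error is a quadratic in the constant output whose minimum lies at the mean of the targets, which for 0/1 targets is the proportion of 1s.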