lec10


Extra problems that occur in multilayer non-

	linear networks

•

If we start with a big learning rate, the bias and all of the

weights for one of the output units may become very

negative.

–

The output unit is then very firmly off and it will never

produce a significant error derivative.

–

So it will never recover (unless we have weight-

decay).

•

In classification networks that use a squared error, the

best guessing strategy is to make each output unit

produce an output equal to the proportion of time it

should be a 1.

–

The network finds this strategy quickly and takes a

long time to improve on it. So it looks like a local

minimum.