Why early stopping works
When the weights are very small, every hidden unit is in its linear range.
So a net with a large layer of hidden units is effectively linear.
It has no more capacity than a linear net in which the inputs are directly connected to the outputs!
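A short worked sketch of why this holds, assuming logistic hidden units and linear output units (the slide does not name the unit type; for tanh units the same argument works with tanh(z) ≈ z). The symbols W_1, b_1, W_2, b_2 for the two layers' weights and biases are introduced here for illustration only.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}} \approx \frac{1}{2} + \frac{z}{4}
\qquad \text{for } |z| \ll 1

\mathbf{y} = W_2\,\sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2
\;\approx\; \tfrac{1}{4} W_2 W_1\,\mathbf{x}
\;+\; \underbrace{W_2\!\left(\tfrac{1}{2}\mathbf{1} + \tfrac{1}{4}\mathbf{b}_1\right) + \mathbf{b}_2}_{\text{constant}}
```

With small weights the output is therefore approximately an affine function of the input, whatever the number of hidden units.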
As the weights grow, the hidden units start using their non-linear ranges, so the capacity grows.
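A minimal numerical sketch of this effect, assuming tanh hidden units, random Gaussian weights, and a linear output layer (none of which the slide specifies). The hypothetical helper `nonlinearity` measures how much of the net's output cannot be explained by the best affine function of the inputs.

```python
import numpy as np

def nonlinearity(scale, n_in=5, n_hidden=50, n_out=3, n_samples=500, seed=0):
    """Relative residual left after fitting the net's outputs with the
    best affine map of its inputs (0 means the net is exactly affine).

    `scale` multiplies the random weights; all sizes are illustrative.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_in))
    W1 = scale * rng.standard_normal((n_in, n_hidden))
    W2 = scale * rng.standard_normal((n_hidden, n_out))

    # One hidden layer of tanh units followed by a linear output layer.
    Y = np.tanh(X @ W1) @ W2

    # Least-squares affine fit from inputs (plus a bias column) to outputs.
    A = np.hstack([X, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    residual = Y - A @ coef
    return np.linalg.norm(residual) / np.linalg.norm(Y)

if __name__ == "__main__":
    for scale in (0.01, 0.1, 1.0):
        print(f"weight scale {scale}: relative residual {nonlinearity(scale):.2e}")
```

Under these assumptions the residual should be close to zero for the smallest weight scale and grow as the scale increases, which is the sense in which capacity grows with the weights during training.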
[Figure: network diagram with layers labeled "outputs" and "inputs"]