The effect of weight-decay
It prevents the network from using weights that it does
not need.
This can often improve generalization a lot.
It helps to stop the network from fitting the sampling error.
It makes a smoother model in which the output
changes more slowly as the input changes.
If the network has two very similar inputs it prefers to put
half the weight on each rather than all the weight on one.
[Figure: two weight assignments for a pair of similar inputs: w/2 and w/2 versus w and 0.]
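
The preference for splitting the weight can be checked with a little arithmetic. The sketch below assumes the usual squared-weight (L2) form of weight decay, cost = (lam / 2) * sum of w^2, which the slide itself does not spell out. The names l2_penalty, output and lam are illustrative only. With two identical inputs, both weight assignments give the same output, but the split assignment pays half the penalty.

```python
def l2_penalty(weights, lam=1.0):
    """Assumed L2 weight-decay cost: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def output(weights, inputs):
    """Output of a single linear unit (no bias)."""
    return sum(w * x for w, x in zip(weights, inputs))


w = 2.0
x = [3.0, 3.0]          # two identical inputs (illustrative values)

split = [w / 2, w / 2]  # half the weight on each input
single = [w, 0.0]       # all the weight on one input

# Both assignments compute the same function on identical inputs...
same_output = output(split, x) == output(single, x)

# ...but the split assignment has the smaller penalty:
# (w/2)^2 + (w/2)^2 = w^2 / 2, versus w^2 + 0 = w^2.
penalty_split = l2_penalty(split)
penalty_single = l2_penalty(single)

print(same_output, penalty_split, penalty_single)
```

Since the data error is the same in both cases, the penalty term alone decides between them, so minimizing the total cost favours the w/2, w/2 solution.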