Combining networks
When the amount of training data is limited, we need to
avoid overfitting.
Averaging the predictions of many different networks
is a good way to do this.
It works best if the networks are as different as
possible.
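The benefit of averaging can be checked numerically. The sketch below is only an illustration under stated assumptions: the "networks" are simulated as the target plus independent noise, and the names `predictions`, `ensemble` and `disagreement` are made up for this example. It shows that, for squared error, the average error of the individual models equals the error of the averaged prediction plus the average disagreement between the models and their mean. The more the models differ, the more the average gains over a typical individual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 5 models, 200 cases. Each model's prediction is the target
# plus its own independent noise, a stand-in for "different" networks.
targets = rng.normal(size=200)
predictions = targets + rng.normal(scale=0.5, size=(5, 200))

# Combine the models by averaging their predictions case by case.
ensemble = predictions.mean(axis=0)

# Squared error averaged over models and cases.
avg_individual_error = ((predictions - targets) ** 2).mean()
# Squared error of the averaged prediction.
ensemble_error = ((ensemble - targets) ** 2).mean()
# How far the individual models are from their own average.
disagreement = ((predictions - ensemble) ** 2).mean()

print(avg_individual_error, ensemble_error + disagreement)
```

The two printed numbers agree up to rounding, so the averaged prediction is never worse than the average individual model, and the gap is exactly the disagreement term.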
If the data is really a mixture of several different
“regimes”, it helps to identify these regimes and use a
separate, simple model for each one.
We want to use the desired outputs to help cluster
cases into regimes. Clustering on the inputs alone is less
efficient, because it ignores how the input-output
relationship differs between regimes.
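A small experiment makes the point about using the desired outputs concrete. This is a minimal sketch, not the method the lecture goes on to develop: it uses a simple hard-assignment loop (fit one line per regime, then reassign each case to the line that predicts its target best), and the data, the two linear regimes and all variable names are assumptions for illustration. The two regimes cover the same input range, so clustering on `x` alone could not separate them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed data: two regimes share the same input range, so only the
# targets reveal which regime a case belongs to.
x = rng.uniform(-1.0, 1.0, size=400)
true_regime = rng.integers(0, 2, size=400)
y = np.where(true_regime == 0, x, -x) + rng.normal(scale=0.05, size=400)

# Baseline: one straight line fitted to all cases.
single_error = ((np.polyval(np.polyfit(x, y, 1), x) - y) ** 2).mean()

# Alternate between fitting one line per regime and reassigning each case
# to the line that predicts its target best.
assignment = rng.integers(0, 2, size=400)
lines = [np.zeros(2), np.zeros(2)]
for _ in range(20):
    for k in range(2):
        mask = assignment == k
        if mask.sum() >= 2:  # keep the old line if a regime is nearly empty
            lines[k] = np.polyfit(x[mask], y[mask], 1)
    residuals = np.stack([(np.polyval(line, x) - y) ** 2 for line in lines])
    assignment = residuals.argmin(axis=0)

# Error when each case is predicted by its best-fitting regime model.
regime_error = residuals.min(axis=0).mean()
print(single_error, regime_error)
```

Because the assignment step looks at the targets, two simple linear models fit the data far better than one. The sketch covers only the clustering step: choosing a regime for a new case, where the target is unknown, needs a separate mechanism that is not shown here.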