Mixtures of Experts
Can we do better than just averaging predictors in a way
that does not depend on the particular training case?
Maybe we can look at the input data for a particular
case to help us decide which model to rely on.
This may allow particular models to specialize in a subset of
the training cases. A model does not learn on cases for which it
is not picked, so it can ignore the cases it is not good at
modeling.
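
One way to picture "looking at the input to decide which model to rely on" is a small gating function that turns the input into one weight per expert. The sketch below is only an illustration under assumptions: the softmax gate, the linear scoring, and the names gate, mixture_predict, experts and gate_weights are all hypothetical and not taken from the text above. It also blends the experts with soft weights; a hard choice would instead take the expert with the largest weight or sample from the weights.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate(x, gate_weights):
    """Assumed gating function: one linear score per expert, then softmax."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in gate_weights]
    return softmax(scores)

def mixture_predict(x, experts, gate_weights):
    """Weight each expert's output by how much the gate trusts it on this input."""
    probs = gate(x, gate_weights)
    outputs = [f(x) for f in experts]
    return sum(p * y for p, y in zip(probs, outputs)), probs

# Two toy experts (hypothetical): one suited to negative inputs, one to positive.
experts = [lambda x: -x[0], lambda x: 2.0 * x[0]]
gate_weights = [[-5.0], [5.0]]  # the gate prefers expert 0 when x < 0, expert 1 when x > 0

pred_pos, probs_pos = mixture_predict([2.0], experts, gate_weights)
pred_neg, probs_neg = mixture_predict([-2.0], experts, gate_weights)
```

For the positive input the gate puts almost all of its weight on the second expert, and for the negative input on the first, so each expert is relied on only in its own region of input space.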
The key idea is to make each expert focus on predicting
the right answer for the cases where it is already doing
better than the other experts.
This causes specialization.
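
To see how this can cause specialization, here is a minimal sketch. It assumes one common choice of error function, E = sum_i p_i (t - y_i)^2, where p_i is the softmax gating weight of expert i, y_i is its output and t is the target. That error function and the function names are assumptions made for illustration; the text above does not specify them.

```python
def expert_gradients(target, outputs, probs):
    """dE/dy_i = -2 * p_i * (t - y_i): each expert is compared with the target
    on its own, and its learning signal is scaled by its gating weight."""
    return [-2.0 * p * (target - y) for p, y in zip(probs, outputs)]

def gate_logit_gradients(target, outputs, probs):
    """dE/dx_i = p_i * (e_i - E), where x_i is the softmax logit of expert i,
    e_i = (t - y_i)^2 is its own error and E is the mixture error."""
    errors = [(target - y) ** 2 for y in outputs]
    mix_error = sum(p * e for p, e in zip(probs, errors))
    return [p * (e - mix_error) for p, e in zip(probs, errors)]

target = 1.0
outputs = [0.9, 3.0]  # expert 0 is close to the target, expert 1 is far off

# With equal gating weights, gradient descent raises the logit of the better expert.
g_gate = gate_logit_gradients(target, outputs, [0.5, 0.5])

# Once the gate favors expert 0, expert 1 receives very little learning signal here.
g_experts = expert_gradients(target, outputs, [0.99, 0.01])
```

Under this assumed error, an expert whose error is below the mixture error gets a negative gradient on its gating logit, so gradient descent increases its weight on that case. The other experts then receive a smaller gradient on that case and are free to model other cases.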
If we always average all the predictors, each model is
trying to compensate for the combined error made by
all the other models.
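
The effect of averaging can be made explicit with a short derivation. It assumes a squared error on the averaged output of n predictors; the symbols t (target), y_i (output of predictor i) and \bar{y} (the average) are notation introduced here for illustration.

```latex
E = (t - \bar{y})^2, \qquad \bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j
\qquad\Longrightarrow\qquad
\frac{\partial E}{\partial y_i} = -\frac{2}{n}\,(t - \bar{y})
```

The gradient is the same for every predictor and depends only on the combined residual t - \bar{y}, not on how far y_i itself is from the target. For example, with t = 1, y_1 = 2 and y_2 = -1, the average is 0.5 and the gradient for both predictors is -0.5, so gradient descent increases y_1 and moves it further from the target in order to compensate for y_2.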