Learning a simplified HMoE
There is a very efficient way to learn an HMoE if we
make two assumptions:
Linear experts: make every expert give an output that is
a linear function of the input, and use a squared error cost.
This makes it possible to fit an expert non-iteratively, by
solving a weighted least-squares problem, if we know how
much responsibility it has for each training case.
Generalized linear gating networks: make each gating
network be a softmax applied to a linear transformation
of the input vector.
This makes it possible to fit each gating network
quickly if we know what probabilities it should output
for each case. The cost function is convex.
The fitting uses IRLS: iteratively reweighted least
squares.
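
The two fitting steps can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the slides' own implementation: it covers a single expert and a single two-way gate (the two-output special case of a softmax, written as a sigmoid), and the function names, the synthetic data, and the made-up responsibilities r are all hypothetical.

```python
import numpy as np

def fit_linear_expert(X, y, r):
    """Closed-form weighted least squares.

    Minimises sum_n r[n] * (y[n] - X[n] @ w)**2, where r[n] is the
    responsibility of this expert for training case n.
    """
    Xr = X * r[:, None]                       # scale each row by its responsibility
    return np.linalg.solve(X.T @ Xr, Xr.T @ y)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_gate_irls(X, t, n_iter=25):
    """IRLS (Newton's method) for a two-way gate p = sigmoid(X @ v).

    t[n] is the probability the gate should output for case n.
    The cross-entropy cost is convex in v.
    """
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ v)
        s = p * (1.0 - p)                     # per-case weights, recomputed each pass
        H = X.T @ (X * s[:, None])            # Hessian of the cross-entropy cost
        v = v + np.linalg.solve(H, X.T @ (t - p))
    return v

# Synthetic, noise-free data (assumption: purely for illustration).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # bias + 2 inputs
w_true = np.array([1.0, -2.0, 0.5])
v_true = np.array([0.5, -1.0, 0.3])
r = rng.uniform(0.1, 1.0, size=200)           # made-up responsibilities

w_hat = fit_linear_expert(X, X @ w_true, r)   # one linear solve, no iteration
v_hat = fit_gate_irls(X, sigmoid(X @ v_true)) # a few reweighted solves
```

The expert fit is a single linear solve, which is why it is non-iterative once the responsibilities are known. The gate fit repeats a similar solve with weights p(1 - p) that change on every pass, which is where the "reweighted" in IRLS comes from.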