• There is a very efficient way to learn an HMoE if we make two assumptions:
• Linear experts: make every expert give an output that is a linear function of the input, and use a squared error.
  – This makes it possible to fit an expert non-iteratively if we know how much responsibility it has for each training case.
• Generalized linear gating networks: make each gating network be a softmax applied to a linear transformation of the input vector.
  – This makes it possible to fit each gating network quickly if we know what probabilities it should output for each case. The cost function is convex.
  – The fitting uses IRLS (iteratively reweighted least squares).
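
The two fitting steps can be sketched in NumPy as follows. This is a minimal illustration, not the slide's own implementation: the function names are hypothetical, the gate is restricted to a two-way choice (where the softmax reduces to a logistic function) to keep IRLS short, and the small ridge term is an added assumption for numerical stability. The responsibilities `r` and target probabilities `t` are assumed to be already known.

```python
import numpy as np


def fit_linear_expert(X, y, r):
    """Closed-form weighted least squares for one linear expert.

    Minimises sum_n r[n] * (y[n] - X[n] @ w)**2, where r[n] is the
    responsibility of this expert for training case n.
    """
    Xr = X * r[:, None]                      # scale each row by its responsibility
    return np.linalg.solve(X.T @ Xr, Xr.T @ y)


def fit_two_way_gate_irls(X, t, n_iter=25, ridge=1e-8):
    """IRLS for a two-way gate with soft target probabilities t in [0, 1].

    Each iteration is a Newton step on the (convex) cross-entropy cost,
    written as a weighted least-squares solve. The ridge term is an
    assumption added here for numerical stability.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        a = X @ w                                 # linear transformation of the input
        p = 1.0 / (1.0 + np.exp(-a))              # gate output probability
        s = np.clip(p * (1.0 - p), 1e-10, None)   # per-case IRLS weights
        z = a - (p - t) / s                       # working response
        Xs = X * s[:, None]
        w = np.linalg.solve(X.T @ Xs + ridge * np.eye(X.shape[1]), Xs.T @ z)
    return w
```

The expert fit is a single linear solve, which is why no iteration is needed once the responsibilities are fixed. The gate fit repeats a similar solve with weights that change at every step, which is where "reweighted" in IRLS comes from.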