• If we use the simple gating networks that can be fitted rapidly by IRLS, is an HMoE really more powerful than a flat MoE?
  – The textbook says it is but doesn’t say why.
  – An HMoE that uses a binary tree has the same number of degrees of freedom in the path probabilities as a single flat softmax over all experts.
  – But does the dependence on the input vector make the two ways of doing the gating different?
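
One way to probe the last question numerically is sketched below. Everything in it is an assumption made for illustration, not something taken from the slide or the textbook: a scalar input x, a depth-2 binary tree with logistic gates, four experts, and arbitrary gate parameters. With four leaves, both schemes have three free probabilities at any fixed x. The check is on how the log-odds between two experts vary with x: a flat softmax with linear logits gives log-odds that are linear in x for every pair, so a second difference that is not zero signals a different functional form.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tree_path_probs(x, root, left, right):
    """Leaf probabilities for a depth-2 binary tree with logistic gates.

    Each gate is a hypothetical (weight, bias) pair applied to a scalar x.
    """
    g0 = sigmoid(root[0] * x + root[1])    # P(left branch at the root)
    g1 = sigmoid(left[0] * x + left[1])    # P(left leaf | left subtree)
    g2 = sigmoid(right[0] * x + right[1])  # P(left leaf | right subtree)
    return [g0 * g1, g0 * (1 - g1), (1 - g0) * g2, (1 - g0) * (1 - g2)]

def flat_softmax_probs(x, weights, biases):
    """Flat softmax gate with one linear logit per expert."""
    logits = [w * x + b for w, b in zip(weights, biases)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_odds(p, i, j):
    return math.log(p[i]) - math.log(p[j])

def second_difference(f, x, h=1.0):
    """Equals zero (up to rounding) whenever f is linear in x."""
    return f(x + h) - 2.0 * f(x) + f(x - h)

# Arbitrary illustrative parameters (assumptions, not fitted values).
ROOT, LEFT, RIGHT = (1.0, 0.0), (2.0, 0.5), (-1.5, 0.3)
W, B = [1.0, -0.5, 2.0, 0.0], [0.1, 0.2, -0.3, 0.0]

def tree(x):
    return tree_path_probs(x, ROOT, LEFT, RIGHT)

def flat(x):
    return flat_softmax_probs(x, W, B)

# Leaves 0 and 1 share a parent; leaves 0 and 2 sit in different subtrees.
tree_sibling = second_difference(lambda x: log_odds(tree(x), 0, 1), 0.0)
tree_cousin = second_difference(lambda x: log_odds(tree(x), 0, 2), 0.0)
flat_pair = second_difference(lambda x: log_odds(flat(x), 0, 2), 0.0)

print("tree, sibling leaves:", tree_sibling)
print("tree, cousin leaves: ", tree_cousin)
print("flat softmax pair:   ", flat_pair)
```

In this sketch the log-odds between sibling leaves stay linear in x, as in the flat softmax, but the log-odds between leaves in different subtrees pick up log-sigmoid terms from the lower gates. So the same number of degrees of freedom does not imply the same family of input-dependent gating functions. Whether that difference makes the HMoE more powerful in practice is a separate question that this check does not answer.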