Is an HMoE better than a flat MoE?
If we use the simple gating networks that can be fitted
rapidly by IRLS, is an HMoE really more powerful than a
flat MoE?
The textbook says it is but doesn’t say why.
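An HMoE that uses a binary tree over K experts has the
same number of degrees of freedom in the path
probabilities as a single flat softmax over all K experts:
K - 1 in both cases (one free probability per internal
node of the tree, versus K probabilities that sum to 1).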
But does the dependence on the input vector make
the two ways of doing the gating different?
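
One way to probe this question numerically is sketched below. It
assumes linear gating logits with no bias terms, four experts, and
a depth-2 binary tree; the weight values and the helper names
(flat_gate, tree_gate, log_odds_second_difference) are made up for
illustration and are not from the textbook. With a flat softmax the
log-odds between any two experts is linear in the input vector, so
its second difference along a line is zero. With the tree, the
log-odds between experts in different subtrees picks up log-sigmoid
terms, so it need not be linear.

```python
import numpy as np

def flat_gate(W, x):
    """Flat softmax over 4 experts. W has shape (4, d)."""
    z = W @ x
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tree_gate(V, x):
    """Depth-2 binary tree over 4 experts.
    V has shape (3, d): rows are the root, left-child and right-child gates.
    Each leaf probability is the product of the branch probabilities on its path."""
    g_root, g_left, g_right = sigmoid(V @ x)
    return np.array([
        g_root * g_left,
        g_root * (1.0 - g_left),
        (1.0 - g_root) * g_right,
        (1.0 - g_root) * (1.0 - g_right),
    ])

def log_odds_second_difference(gate, params, i, j, x0, u, h=1.0):
    """Second difference of log(p_i / p_j) at x0 - h*u, x0, x0 + h*u.
    It is zero whenever the log-odds is linear along that line."""
    vals = []
    for t in (-h, 0.0, h):
        p = gate(params, x0 + t * u)
        vals.append(np.log(p[i] / p[j]))
    return vals[0] - 2.0 * vals[1] + vals[2]

# Arbitrary illustrative parameters (assumptions, not fitted values).
W = np.array([[1.0, 0.5], [-0.5, 2.0], [0.3, -1.0], [0.0, 0.0]])
V = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 0.0]])
x0 = np.zeros(2)
u = np.array([1.0, 0.0])

# Compare expert 0 (left subtree) with expert 2 (right subtree).
flat_sd = log_odds_second_difference(flat_gate, W, 0, 2, x0, u)
tree_sd = log_odds_second_difference(tree_gate, V, 0, 2, x0, u)
print("flat softmax:", flat_sd)
print("binary tree: ", tree_sd)
```

Under these assumptions the flat value is zero up to rounding and
the tree value is not, which suggests the two gating families are
different as functions of the input even though they have the same
number of free probabilities at any single input. The sketch does
not show which of the two fits data better.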