lecture 7

Is an HMoE better than a flat MoE?

• If we use the simple gating networks that can be fitted rapidly by IRLS, is an HMoE really more powerful than a flat MoE?

  – The textbook says it is but doesn’t say why.

  – An HMoE that uses a binary tree has the same number of degrees of freedom in the path probabilities as a single flat softmax over all experts (K − 1 for K experts: one per internal node of the tree).

  – But does the dependence on the input vector make the two ways of doing the gating different?
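
One way to probe this last question numerically is sketched below, assuming four experts, linear gating logits without bias terms, and one logistic unit per internal node of the tree. The names `flat_gate`, `tree_gate`, and `log_odds_curvature`, and all weight values, are hypothetical choices for illustration and are not from the lecture. With a flat softmax, the log-odds between any two experts is linear in the input, so its second difference along a line vanishes; with the tree, the log-odds between experts in different subtrees picks up log-sigmoid terms and is generally curved.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def flat_gate(x, W):
    """Flat softmax over 4 experts; W has shape (4, d)."""
    z = W @ x
    z = z - z.max()  # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()


def tree_gate(x, v_root, v_left, v_right):
    """Binary-tree gate over 4 experts, one logistic unit per internal node."""
    g = sigmoid(v_root @ x)    # P(go left at the root)
    gl = sigmoid(v_left @ x)   # P(first expert | left subtree)
    gr = sigmoid(v_right @ x)  # P(third expert | right subtree)
    # Path probabilities are products of the branch probabilities.
    return np.array([g * gl, g * (1 - gl), (1 - g) * gr, (1 - g) * (1 - gr)])


def log_odds_curvature(gate, i, j, x0, direction, h=1.0):
    """Second difference of log(p_i / p_j) along the line x0 + t * direction.

    It is zero (up to rounding) when the log-odds is linear along that line.
    """
    def f(t):
        p = gate(x0 + t * direction)
        return np.log(p[i]) - np.log(p[j])

    return f(h) - 2.0 * f(0.0) + f(-h)


# Hypothetical weights, chosen only for illustration.
W = np.array([[1.0, 0.5], [-0.3, 0.2], [0.0, -1.0], [0.7, 0.7]])
v_root = np.array([0.5, -0.5])
v_left = np.array([1.0, 0.0])
v_right = np.array([0.0, 1.0])

x0 = np.zeros(2)
direction = np.array([1.0, 0.0])

# Compare expert 0 (left subtree) with expert 2 (right subtree).
flat_curv = log_odds_curvature(lambda x: flat_gate(x, W), 0, 2, x0, direction)
tree_curv = log_odds_curvature(
    lambda x: tree_gate(x, v_root, v_left, v_right), 0, 2, x0, direction
)
print("flat softmax:", flat_curv)
print("binary tree: ", tree_curv)
```

In this sketch the flat gate gives a second difference of zero up to rounding, while the tree gate gives a clearly nonzero value. That suggests the two gatings trace out different families of functions of the input, even though they have the same number of free probabilities at any single input. It does not by itself show that either family is more powerful.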