The generative model for an HMoE
First let the top-level gate choose one branch of
the gating tree, using the input to determine the
relative probabilities
Then use the next gating network down in the
chosen branch, etc.
Finally, generate an output from the expert at the
chosen leaf node.