A different (and better?) type of hierarchy for
a mixture of experts
Instead of just using a hierarchy of gating nets, also use
a hierarchy of experts.
Learn the whole system by greedy divide-and-conquer.
Start by learning a single expert.
Then make two slightly different copies of the expert,
and use EM to rapidly fit an MoE with one gating
network and two experts.
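
A minimal sketch of the first two steps, under assumptions the slide does not state: the experts are linear regressors with Gaussian noise of fixed width `noise`, the gating network is a logistic unit on the same inputs, and the gate is only partially optimised in each M-step (a generalised EM). All function names and parameters here are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_linear_expert(X, y, sample_weight=None):
    """(Weighted) least-squares fit of one linear expert y ~ X @ w."""
    if sample_weight is None:
        sample_weight = np.ones(len(y))
    s = np.sqrt(sample_weight)
    w, *_ = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)
    return w

def moe_predict(X, W, v):
    """Gate-weighted mixture of the two experts' predictions."""
    g1 = sigmoid(X @ v)                       # P(expert 1 | x)
    return (1.0 - g1) * (X @ W[0]) + g1 * (X @ W[1])

def split_and_fit(X, y, n_iter=50, noise=0.1, jitter=1e-2, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w0 = fit_linear_expert(X, y)              # start: a single expert
    # Two slightly different copies of that expert (assumed: Gaussian jitter).
    W = w0 + jitter * rng.standard_normal((2, X.shape[1]))
    v = np.zeros(X.shape[1])                  # zero gate weights: 50/50 split
    for _ in range(n_iter):
        # E-step: posterior responsibility of each expert for each case.
        g1 = sigmoid(X @ v)
        gate = np.stack([1.0 - g1, g1], axis=1)
        resid = y[:, None] - X @ W.T
        log_post = np.log(gate + 1e-12) - 0.5 * (resid / noise) ** 2
        log_post -= log_post.max(axis=1, keepdims=True)
        r = np.exp(log_post)
        r /= r.sum(axis=1, keepdims=True)
        # M-step (experts): responsibility-weighted least squares.
        for k in range(2):
            W[k] = fit_linear_expert(X, y, r[:, k] + 1e-8)
        # M-step (gate): a few gradient steps toward the responsibilities.
        for _ in range(20):
            v += lr * X.T @ (r[:, 1] - sigmoid(X @ v)) / len(y)
    return w0, W, v
```

With `n_iter=0` the function returns the model just after the copy step: the gate is uniform and the two copies differ from the single expert only by the jitter, so the mixture starts out predicting almost exactly what the single expert did.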
Now split each of these two experts. Use the previous
gating network as the initial top-level gating net and
add two new gating nets (with zero weights, so each
starts with an even split) at the next level down.
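
The splitting step can be sketched as follows, again assuming linear experts and logistic gates (the slide does not fix either) and using hypothetical names throughout. The point illustrated is that if the children are near-copies of their parent, the two-level model starts out computing nearly the same function as the fitted one-level model, so the next round of EM begins from where the last one ended.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def split_level(W, jitter=1e-2, seed=0):
    """Split each of the K experts into two slightly different children
    and create one new zero-weight gating net per parent."""
    rng = np.random.default_rng(seed)
    K, d = W.shape
    children = W[:, None, :] + jitter * rng.standard_normal((K, 2, d))
    new_gates = np.zeros((K, d))              # zero weights at the new level
    return children, new_gates

def predict_two_level(X, v_top, new_gates, children):
    """Prediction of a two-level hierarchy with two experts per branch."""
    top1 = sigmoid(X @ v_top)                 # previous gate, reused at the top
    top = np.stack([1.0 - top1, top1], axis=1)
    out = np.zeros(len(X))
    for k in range(2):
        low1 = sigmoid(X @ new_gates[k])      # 0.5 everywhere while weights are zero
        leaf = (1.0 - low1) * (X @ children[k, 0]) + low1 * (X @ children[k, 1])
        out += top[:, k] * leaf
    return out
```

Calling `split_level` with `jitter=0.0` gives exact copies, in which case `predict_two_level` reproduces the one-level mixture exactly; a small nonzero jitter is what lets EM pull the children apart.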