Using EM to fit a hierarchical mixture of experts (HMoE)
E-step: Compute the output of each expert and the
“prior” probabilities provided by each gating net. Then
combine these priors with the probability that each
expert assigns to the correct answer (the observed
target). By Bayes' rule, this gives posterior probabilities
for each expert and each gating net.
(see the Jordan and Jacobs paper for the math)
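The E-step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a two-level hierarchy, with the gating priors and the per-expert likelihoods of the target already computed and passed in as arrays. The function name `e_step` and the array layout are hypothetical, not taken from the paper.

```python
import numpy as np

def e_step(top_prior, nested_prior, expert_lik):
    """Posterior responsibilities for a two-level HMoE (illustrative sketch).

    top_prior:    (N, I)    prior from the top gating net for each branch i
    nested_prior: (N, I, J) prior from the gating net in branch i for expert j
    expert_lik:   (N, I, J) probability expert (i, j) assigns to the target
    """
    # Prior times likelihood for every expert: the unnormalised joint posterior.
    joint = top_prior[:, :, None] * nested_prior * expert_lik
    branch = joint.sum(axis=2)                 # evidence within each branch
    total = branch.sum(axis=1, keepdims=True)  # evidence for the whole model

    h_top = branch / total                     # posterior for the top gating net
    h_nested = joint / branch[:, :, None]      # posterior for each nested gating net
    h_joint = joint / total[:, :, None]        # posterior for each expert
    return h_top, h_nested, h_joint

# One datapoint, two branches, two experts per branch, uniform priors.
# The experts in the second branch explain the target three times better.
top_prior = np.array([[0.5, 0.5]])
nested_prior = np.array([[[0.5, 0.5], [0.5, 0.5]]])
expert_lik = np.array([[[1.0, 1.0], [3.0, 3.0]]])
h_top, h_nested, h_joint = e_step(top_prior, nested_prior, expert_lik)
```

In this toy case the top-level posterior moves from the uniform prior towards the second branch, while the nested posteriors stay uniform because the experts within each branch are equally good.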
M-step: Refit each expert to the data, weighting each
datapoint by the posterior probability that it came from
that expert. Refit each gating network to minimize the
cross-entropy between the “prior” that it provides and
the posterior distribution computed in the E-step.
Refitting a gating network requires iteratively reweighted
least squares (IRLS), which is iterative but converges
rapidly to the global optimum (of this sub-problem).
See page 207 of the textbook for IRLS.
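A minimal sketch of the two M-step refits follows, under simplifying assumptions that are not in the slide: each expert is a linear model fit by weighted least squares, and the gating net chooses between only two options, so it reduces to a logistic regression with soft targets. The names `refit_linear_expert` and `irls_binary_gate`, the optional `weights` argument, and the small `ridge` term are all hypothetical choices made for illustration.

```python
import numpy as np

def refit_linear_expert(X, y, h):
    """Weighted least squares: each datapoint counts in proportion to the
    posterior h that it came from this expert (assumes a linear expert)."""
    A = X.T @ (X * h[:, None])
    b = X.T @ (h * y)
    return np.linalg.solve(A, b)

def irls_binary_gate(X, h, weights=None, n_iter=20, ridge=1e-8):
    """Fit a two-way gating net by IRLS (Newton's method on the cross-entropy).

    X: (N, D) inputs, h: (N,) posterior from the E-step used as soft targets.
    weights: optional per-datapoint weights (assumption: e.g. the parent
    branch's posterior when refitting a nested gating net).
    """
    n, d = X.shape
    if weights is None:
        weights = np.ones(n)
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # the "prior" the gate provides
        grad = X.T @ (weights * (p - h))       # gradient of the cross-entropy
        r = weights * p * (1.0 - p)            # the IRLS reweighting
        hess = X.T @ (X * r[:, None]) + ridge * np.eye(d)
        w = w - np.linalg.solve(hess, grad)    # one Newton / IRLS update
    return w

# Toy check: targets generated from known parameters should be recovered.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_gate = np.array([0.5, -1.0])
h = 1.0 / (1.0 + np.exp(-X @ true_gate))       # stand-in for E-step posteriors
gate_w = irls_binary_gate(X, h)

true_expert = np.array([2.0, -3.0])
y = X @ true_expert
expert_w = refit_linear_expert(X, y, h)
```

Because the cross-entropy is convex in the gating weights, each Newton update heads towards the single global optimum of this sub-problem, which is why a handful of iterations is usually enough. A gating net with more than two outputs would need the multi-class (softmax) form of the same update.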