Lecture 7

Using EM to fit an HMoE

• E-step: Compute the output of each expert and the “prior” probabilities provided by each gating net. Then combine the prior probabilities with the probability that each expert assigns to the correct answer. This gives posterior probabilities for each expert and each gating net.

  – See the Jordan and Jacobs paper for the math.
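The E-step for a single datapoint can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation from the Jordan and Jacobs paper: it assumes a two-level hierarchy, experts with Gaussian output noise of fixed width `sigma`, and gating "priors" that have already been computed. The function names `gaussian_likelihood` and `e_step` and all argument names are hypothetical.

```python
import math

def gaussian_likelihood(y, mean, sigma=1.0):
    """Density that an expert predicting `mean` assigns to the correct answer y.
    (Gaussian output noise is an assumption made for this sketch.)"""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def e_step(top_prior, nested_prior, expert_means, y, sigma=1.0):
    """Posteriors for one datapoint in a two-level hierarchy.

    top_prior[i]       -- "prior" g_i from the top-level gating net
    nested_prior[i][j] -- "prior" g_{j|i} from the gating net of branch i
    expert_means[i][j] -- output of expert (i, j) for this input
    """
    # Combine the priors along each path with the expert's likelihood.
    joint = [
        [top_prior[i] * nested_prior[i][j]
         * gaussian_likelihood(y, expert_means[i][j], sigma)
         for j in range(len(nested_prior[i]))]
        for i in range(len(top_prior))
    ]
    total = sum(sum(row) for row in joint)

    # Posterior for each expert: h_ij.
    h_expert = [[v / total for v in row] for row in joint]
    # Posterior for the top gating net: h_i = sum_j h_ij.
    h_top = [sum(row) for row in h_expert]
    # Posterior for each nested gating net: h_{j|i} = h_ij / h_i.
    h_nested = [[v / h_top[i] for v in row] for i, row in enumerate(h_expert)]
    return h_expert, h_top, h_nested

# Usage: uniform priors, four experts with different predictions, target y = 1.
h_expert, h_top, h_nested = e_step(
    top_prior=[0.5, 0.5],
    nested_prior=[[0.5, 0.5], [0.5, 0.5]],
    expert_means=[[0.0, 1.0], [2.0, 3.0]],
    y=1.0,
)
```

With uniform priors, the expert whose output is closest to the target receives the largest posterior, which is the weight used when that expert is refitted.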

• M-step: Refit each expert to the data, with each datapoint weighted by the posterior probability that it came from that expert. Refit each gating network to minimize the cross-entropy between the “prior” that it provides and the posterior distribution computed in the E-step.

  – This requires iteratively reweighted least squares (IRLS), an iterative procedure that converges rapidly to the global optimum (of this sub-problem).

  – See the textbook, page 207, for IRLS.
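The two refits in the M-step can be sketched as follows. This is an illustrative simplification, not the general algorithm: it assumes scalar inputs, linear experts, and a two-way gating net that is a logistic function of the input, so the IRLS (Newton-Raphson) update reduces to a 2x2 linear solve. The names `refit_expert` and `refit_gate_irls` are hypothetical.

```python
import math

def refit_expert(xs, ys, h):
    """Weighted least squares for a linear expert y = a*x + b.
    h[n] is the posterior probability that datapoint n came from this expert."""
    sw = sum(h)
    mx = sum(w * x for w, x in zip(h, xs)) / sw
    my = sum(w * y for w, y in zip(h, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(h, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(h, xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def refit_gate_irls(xs, h, iters=20, damping=1e-8):
    """Fit a two-way gate g(x) = sigmoid(w0 + w1*x) to the soft targets h
    by minimizing the cross-entropy with Newton-Raphson (IRLS) steps."""
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        g = [sigmoid(w0 + w1 * x) for x in xs]
        r = [gi * (1.0 - gi) for gi in g]  # per-datapoint IRLS weights

        # Gradient of the cross-entropy with respect to (w0, w1).
        g0 = sum(gi - hi for gi, hi in zip(g, h))
        g1 = sum((gi - hi) * x for gi, hi, x in zip(g, h, xs))

        # Hessian entries; tiny damping keeps the 2x2 system invertible.
        h00 = sum(r) + damping
        h01 = sum(ri * x for ri, x in zip(r, xs))
        h11 = sum(ri * x * x for ri, x in zip(r, xs)) + damping
        det = h00 * h11 - h01 * h01

        # Newton step: w <- w - H^{-1} * gradient.
        w0 -= (h11 * g0 - h01 * g1) / det
        w1 -= (h00 * g1 - h01 * g0) / det
    return w0, w1

# Usage: the posteriors h would normally come from the E-step.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
h = [sigmoid(0.5 + 1.5 * x) for x in xs]
slope, intercept = refit_expert(xs, ys, h)
gate_w0, gate_w1 = refit_gate_irls(xs, h)
```

The expert refit has a closed-form solution, while the gate needs repeated Newton steps because the logistic function makes the cross-entropy non-quadratic in the weights. The cross-entropy is still convex in those weights, which is why the iteration reaches the global optimum of this sub-problem.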