• E-step: Compute the output of each expert and the “prior” probabilities provided by each gating net. Then combine the prior probabilities with the probability that each expert assigns to the correct answer. This gives posterior probabilities for each expert and each gating net.
  – (See the Jordan and Jacobs paper for the math.)
• M-step: Refit each expert to the data, weighted by the posterior probability that each datapoint came from that expert. Refit each gating network to minimize the cross-entropy between the “prior” that it provides and the posterior distribution computed in the E-step.
  – This requires IRLS (iteratively reweighted least squares), which is iterative but converges rapidly to the global optimum (of this sub-problem).
  – See the textbook, page 207, for IRLS.
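The two steps above can be sketched in code for the simplest possible case. This is only an illustration under stated assumptions, not the formulation in the Jordan and Jacobs paper: a single level with exactly two linear experts, Gaussian noise of fixed width `sigma`, and one logistic gate fitted by IRLS (Newton steps) against the soft posterior targets. The function names, the small L2 penalty `lam`, and the fixed number of IRLS iterations are all hypothetical choices made for numerical stability and brevity.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme activations.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def e_step(X, y, W, v, sigma=1.0):
    """Posterior probability of each of the two experts for each datapoint.

    X: (n, d) inputs, y: (n,) targets, W: (2, d) expert weights,
    v: (d,) gate weights. Returns an (n, 2) array whose rows sum to 1.
    """
    g1 = sigmoid(X @ v)                          # gate's "prior" for expert 1
    prior = np.column_stack([1.0 - g1, g1])
    resid = y[:, None] - X @ W.T                 # each expert's error, (n, 2)
    # log(prior * Gaussian likelihood), dropping constants shared by both experts
    log_joint = np.log(prior + 1e-12) - 0.5 * (resid / sigma) ** 2
    log_joint -= log_joint.max(axis=1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)

def m_step(X, y, post, v, n_irls=10, lam=1e-2):
    """Refit both experts and the gate given the E-step posteriors."""
    d = X.shape[1]
    W = np.empty((2, d))
    for k in range(2):
        # Weighted least squares: each datapoint counts in proportion to
        # the posterior probability that it came from expert k.
        Xw = X * post[:, k:k + 1]
        W[k] = np.linalg.solve(X.T @ Xw + 1e-6 * np.eye(d), Xw.T @ y)
    # Gate: minimize cross-entropy between its "prior" and the posterior
    # for expert 1, using IRLS (Newton's method on logistic regression).
    t = post[:, 1]
    for _ in range(n_irls):
        p = sigmoid(X @ v)
        r = p * (1.0 - p)                        # IRLS weights
        grad = X.T @ (t - p) - lam * v
        hess = (X * r[:, None]).T @ X + lam * np.eye(d)
        v = v + np.linalg.solve(hess, grad)
    return W, v
```

Alternating `e_step` and `m_step` until the posteriors stop changing gives one EM fit. With more than two experts the gate becomes a softmax, and the IRLS update becomes the multiclass version.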