Why EM converges
• There is a cost function that is reduced (or at worst left
unchanged) by both the E-step and the M-step.
                 Cost  =  expected energy  –  entropy
• The expected energy term measures how difficult it is to
generate each datapoint from the Gaussians it is assigned
to. On its own, this term would be happiest giving all the
responsibility for each datapoint to the most likely Gaussian
(as in K-means).
• The entropy term encourages “soft” assignments. It would
be happiest spreading the responsibility for each datapoint
equally between all the Gaussians.
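
The trade-off between the two terms can be checked numerically. The sketch below is only an illustration, not part of the original slides: it assumes a 1-D mixture of two Gaussians, takes the energy of assigning datapoint n to Gaussian k to be the negative log of (mixing proportion times Gaussian density), and uses made-up data and starting values. All function names (log_joint, cost, e_step, m_step, run_em) are hypothetical. It records the cost after every E-step and every M-step so that the sequence can be inspected.

```python
import numpy as np

def log_joint(x, pi, mu, var):
    # log[pi_k * N(x_n | mu_k, var_k)] for every datapoint n and Gaussian k.
    # Result has shape (N, K). The energy assumed here is the negative of this.
    return (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)

def cost(x, r, pi, mu, var):
    # Expected energy: responsibility-weighted negative log joint.
    energy = -np.sum(r * log_joint(x, pi, mu, var))
    # Entropy of the responsibilities, treating 0 * log(0) as 0.
    entropy = -np.sum(r * np.log(np.where(r > 0, r, 1.0)))
    return energy - entropy

def e_step(x, pi, mu, var):
    # Responsibilities are the posteriors over Gaussians for each datapoint.
    lj = log_joint(x, pi, mu, var)
    lj = lj - lj.max(axis=1, keepdims=True)  # stabilise the exponentials
    r = np.exp(lj)
    return r / r.sum(axis=1, keepdims=True)

def m_step(x, r):
    # Responsibility-weighted maximum-likelihood estimates of the parameters.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

def run_em(x, n_iter=20):
    # Illustrative starting point (an assumption): equal mixing proportions,
    # unit variances, and uniform ("maximally soft") responsibilities.
    pi = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])
    r = np.full((len(x), 2), 0.5)
    costs = [cost(x, r, pi, mu, var)]
    for _ in range(n_iter):
        r = e_step(x, pi, mu, var)          # changes only the responsibilities
        costs.append(cost(x, r, pi, mu, var))
        pi, mu, var = m_step(x, r)          # changes only the parameters
        costs.append(cost(x, r, pi, mu, var))
    return costs

# Synthetic data (made up for this sketch): two well-separated clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 0.5, 100)])

costs = run_em(x)
steps = np.diff(costs)
print("largest change in cost over any single step:", steps.max())
```

In this sketch the E-step holds the parameters fixed and picks the responsibilities that minimise the cost, while the M-step holds the responsibilities (and hence the entropy) fixed and picks the parameters that minimise the expected energy. That is why neither step should ever produce a positive entry in steps, apart from floating-point rounding.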