Mode averaging
If we generate from the model,
half the instances of a 1 at the
data layer will be caused by a
(1,0) at the hidden layer and half
will be caused by a (0,1).
So the recognition weights
will learn to produce (0.5,0.5).
This represents a distribution
that puts half its mass on
very improbable hidden
configurations.
It's much better to just pick one
mode. Picking a single mode is the
best recognition model you can get
if you assume that the posterior
over hidden states factorizes.
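The argument above can be checked numerically with a small sketch. It assumes the numbers in the figure are read as follows: two hidden units, each with bias -10, each connected by a weight of +20 to a single data unit with bias -20, all units logistic. The function names are hypothetical, and scoring a factorial recognition distribution q by KL(q || p) is an assumption about what "best" means here; the slide does not name the criterion.

```python
import itertools
import math

# Assumed reading of the figure: hidden biases -10, weights +20, data bias -20.
HIDDEN_BIAS = -10.0
WEIGHT = 20.0
VISIBLE_BIAS = -20.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def joint(h, v=1):
    """p(h, v) for a hidden configuration h = (h1, h2) and data value v."""
    p = 1.0
    p_hidden_on = sigmoid(HIDDEN_BIAS)
    for hi in h:
        p *= p_hidden_on if hi else 1.0 - p_hidden_on
    p_visible_on = sigmoid(VISIBLE_BIAS + WEIGHT * sum(h))
    return p * (p_visible_on if v else 1.0 - p_visible_on)


def true_posterior(v=1):
    """Exact p(h | v), by enumerating the four hidden configurations."""
    configs = list(itertools.product([0, 1], repeat=2))
    joints = {h: joint(h, v) for h in configs}
    z = sum(joints.values())
    return {h: j / z for h, j in joints.items()}


def factorial_q(q1, q2):
    """Factorial distribution with independent on-probabilities q1 and q2."""
    return {
        (a, b): (q1 if a else 1.0 - q1) * (q2 if b else 1.0 - q2)
        for a in (0, 1)
        for b in (0, 1)
    }


def kl(q, p):
    """KL(q || p), skipping configurations where q has zero mass."""
    return sum(qv * math.log(qv / p[h]) for h, qv in q.items() if qv > 0)


posterior = true_posterior(v=1)
kl_average = kl(factorial_q(0.5, 0.5), posterior)   # mode averaging
kl_one_mode = kl(factorial_q(1.0, 0.0), posterior)  # commit to (1,0)

for h, p in sorted(posterior.items()):
    print(h, p)
print("KL, mode averaging:", kl_average)
print("KL, single mode:   ", kl_one_mode)
```

Under these assumptions the true posterior puts almost all of its mass on (1,0) and (0,1), while the factorial (0.5,0.5) distribution puts a quarter of its mass on each of (0,0) and (1,1). Committing to one mode gives the smaller KL(q || p) of the two.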
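[Figure: two hidden units, each with bias -10, connected by weights of +20 to a single data unit with bias -20. Panel labels: "Mode averaging" and "True posterior".]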