Mode averaging
If we generate from the model,
half the instances of a 1 at the
data layer will be caused by a
(1,0) at the hidden layer and half
will be caused by a (0,1).
So the recognition weights
will learn to produce (0.5,0.5).
This represents a distribution
that puts half its mass on
very improbable hidden
configurations.
It's much better to just pick one
mode and pay one bit.
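
The "one bit" figure can be checked numerically. The sketch below assumes the numbers in the slide's figure are hidden biases of -10, weights of +20 and a data-unit bias of -20, with logistic units throughout; the function names are illustrative only. It computes the true posterior over the two hidden units given a 1 at the data layer, then measures KL(Q||P) in bits for the averaged recognition distribution (0.5,0.5) and for a distribution that commits to the single mode (1,0).

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Assumed reading of the figure: two hidden units, one data unit.
HIDDEN_BIAS = -10.0
WEIGHT = 20.0
VISIBLE_BIAS = -20.0


def posterior_given_visible_on():
    """True posterior P(h | v=1) over the four hidden configurations."""
    p_on = sigmoid(HIDDEN_BIAS)
    joint = {}
    for h in itertools.product([0, 1], repeat=2):
        prior = 1.0
        for unit in h:
            prior *= p_on if unit else 1.0 - p_on
        likelihood = sigmoid(VISIBLE_BIAS + WEIGHT * sum(h))
        joint[h] = prior * likelihood
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}


def factorial_q(q1, q2):
    """Factorial recognition distribution with P(h1=1)=q1, P(h2=1)=q2."""
    return {
        (a, b): (q1 if a else 1.0 - q1) * (q2 if b else 1.0 - q2)
        for a in (0, 1)
        for b in (0, 1)
    }


def kl_bits(q, p):
    """KL(q || p) in bits, treating 0 * log(0) as 0."""
    return sum(qv * math.log2(qv / p[h]) for h, qv in q.items() if qv > 0.0)


POSTERIOR = posterior_given_visible_on()
KL_AVERAGED = kl_bits(factorial_q(0.5, 0.5), POSTERIOR)
KL_ONE_MODE = kl_bits(factorial_q(1.0, 0.0), POSTERIOR)

print("posterior:", POSTERIOR)
print("KL(Q||P), Q=(0.5,0.5):", KL_AVERAGED, "bits")
print("KL(Q||P), Q=(1,0):    ", KL_ONE_MODE, "bits")
```

Under these assumed parameters the posterior puts almost all its mass on (1,0) and (0,1), so committing to one mode costs close to one bit, while the averaged distribution costs several bits because a quarter of its mass lands on each of (0,0) and (1,1).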
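[Figure: network with two hidden units, each with bias -10, connected by weights of +20 to a single data unit with bias -20.]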
[Figure: distribution P, with the minimum of KL(Q||P) and the minimum of KL(P||Q) marked.]
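
The difference between the two KL minima can be illustrated with a small grid search. This is only a sketch under assumptions not taken from the slide: P is taken to be an equal mixture of two unit-variance Gaussians at -3 and +3, Q is restricted to a single Gaussian, and both are discretised on a grid. Minimising KL(P||Q) should give a broad Q centred between the modes (mode averaging), while minimising KL(Q||P) should give a narrow Q sitting on one mode.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normalise(values):
    total = sum(values)
    return [v / total for v in values]


# Discretisation grid on [-10, 10] with spacing 0.05.
GRID = [-10.0 + 0.05 * i for i in range(401)]

# Assumed bimodal target: equal mixture of N(-3, 1) and N(+3, 1).
P = normalise(
    [0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0) for x in GRID]
)


def kl(a, b):
    """KL(a || b) in nats for two discrete distributions on the same grid."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


def best_gaussian(direction):
    """Grid-search the single Gaussian Q minimising the requested KL."""
    best = None
    for i in range(21):          # candidate means -5.0, -4.5, ..., 5.0
        mean = -5.0 + 0.5 * i
        for j in range(15):      # candidate stds 0.5, 0.75, ..., 4.0
            std = 0.5 + 0.25 * j
            q = normalise([normal_pdf(x, mean, std) for x in GRID])
            cost = kl(q, P) if direction == "Q||P" else kl(P, q)
            if best is None or cost < best[0]:
                best = (cost, mean, std)
    return best


REVERSE = best_gaussian("Q||P")   # (cost, mean, std) minimising KL(Q||P)
FORWARD = best_gaussian("P||Q")   # (cost, mean, std) minimising KL(P||Q)

print("min KL(Q||P): mean=%.2f std=%.2f" % (REVERSE[1], REVERSE[2]))
print("min KL(P||Q): mean=%.2f std=%.2f" % (FORWARD[1], FORWARD[2]))
```

Because the target is symmetric, the KL(Q||P) search may settle on either mode; which one it reports depends only on floating-point tie-breaking.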