














• The recognition weights are trained to invert the generative model in parts of the space where there is no data.
  – This is wasteful.



• The recognition weights follow the gradient of the wrong divergence. They minimize KL(P||Q), but the variational bound requires minimization of KL(Q||P).
  – This leads to incorrect mode-averaging.



• The posterior over the top hidden layer is very far from independent because the independent prior cannot eliminate explaining-away effects.
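
The mode-averaging point can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration, not part of the original material: P is a toy bimodal distribution on a 1-D grid, Q is restricted to a single discretized Gaussian, and a brute-force grid search stands in for gradient-based learning. Minimizing KL(P||Q) should spread Q across both modes, while minimizing KL(Q||P) should lock Q onto one mode.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for two discrete distributions on the same grid."""
    mask = a > 0
    # Floor b to avoid log(0) when a narrow Gaussian underflows in its tails.
    return float(np.sum(a[mask] * (np.log(a[mask]) - np.log(np.maximum(b[mask], 1e-300)))))

def gaussian_on_grid(x, mean, std):
    """Gaussian bump evaluated on the grid and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return w / w.sum()

x = np.linspace(-10.0, 10.0, 401)

# Toy bimodal target P (hypothetical stand-in for a multimodal posterior).
p = 0.5 * gaussian_on_grid(x, -3.0, 0.7) + 0.5 * gaussian_on_grid(x, 3.0, 0.7)

# Candidate unimodal approximations Q, indexed by (mean, std).
means = np.linspace(-5.0, 5.0, 101)
stds = np.linspace(0.3, 5.0, 48)

def best_fit(objective):
    """Return (score, mean, std) of the candidate Q with the lowest objective."""
    return min((objective(gaussian_on_grid(x, m, s)), m, s) for m in means for s in stds)

_, mean_pq, std_pq = best_fit(lambda q: kl(p, q))  # minimize KL(P||Q)
_, mean_qp, std_qp = best_fit(lambda q: kl(q, p))  # minimize KL(Q||P)

print(f"KL(P||Q) fit: mean={mean_pq:.2f}, std={std_pq:.2f}")
print(f"KL(Q||P) fit: mean={mean_qp:.2f}, std={std_qp:.2f}")
```

In this sketch the KL(P||Q) fit centers Q between the two modes, where P itself has almost no mass, which is the mode-averaging failure named in the second bullet. The KL(Q||P) fit instead selects a single mode.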


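The explaining-away effect in the last bullet can also be checked by brute force on a tiny model. This is a minimal sketch under assumed, hypothetical parameters (two binary hidden causes with independent priors, one binary visible unit, and hand-picked biases and weights); it is not taken from the original material. It compares the exact posterior over the hidden units with the product of its own marginals.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy parameters: each cause is rare a priori, and either one
# alone is enough to make the visible unit reasonably likely to turn on.
hidden_bias = -10.0
weight = 20.0
visible_bias = -20.0

p_on = sigmoid(hidden_bias)  # independent prior probability of each cause

# Exact posterior p(h1, h2 | v = 1) by enumerating all four hidden states.
posterior = np.zeros((2, 2))
for h1, h2 in itertools.product([0, 1], repeat=2):
    prior = (p_on if h1 else 1.0 - p_on) * (p_on if h2 else 1.0 - p_on)
    likelihood = sigmoid(visible_bias + weight * (h1 + h2))
    posterior[h1, h2] = prior * likelihood
posterior /= posterior.sum()

# Best factorial description with the same marginals.
marginal_h1 = posterior.sum(axis=1)
marginal_h2 = posterior.sum(axis=0)
factorial = np.outer(marginal_h1, marginal_h2)

print("exact posterior:\n", posterior)
print("product of marginals:\n", factorial)
```

Although the prior is factorial, the exact posterior puts almost all of its mass on "exactly one cause is on", whereas the product of marginals assigns substantial probability to both causes being on together. A recognition model that assumes independent hidden units cannot represent this anti-correlation.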