The flaws in the wake-sleep algorithm
The recognition weights are trained on fantasies
sampled from the generative model, so they learn
to invert it even in parts of the space where
there is no data.
This is wasteful.
The recognition weights follow the gradient of
the wrong divergence. With P the true posterior
and Q the recognition distribution, they minimize
KL(P||Q), but the variational bound requires
minimization of KL(Q||P).
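This leads to incorrect mode-averaging: a Q that
is too simple to represent the modes of P
separately spreads its mass across them, including
the low-probability region in between.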
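
The difference between the two divergences can be seen in a small 1-D toy. This is a sketch under stated assumptions, not the wake-sleep algorithm itself: the bimodal density `p` stands in for a true posterior P, a single Gaussian stands in for the recognition distribution Q, and the fit is a brute-force grid search. The names `normal_pdf`, `kl`, and `best_gaussian` are hypothetical helpers introduced for illustration.

```python
import numpy as np

# Grid used for numerical integration (assumed wide enough to hold P).
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian evaluated on x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy bimodal target standing in for the true posterior P.
p = 0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)

def kl(a, b):
    """Riemann-sum estimate of KL(a||b); eps guards against log(0)."""
    eps = 1e-300
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx)

def best_gaussian(objective):
    """Grid search over (mu, sigma) for the Gaussian Q minimizing objective(q)."""
    best = None
    for mu in np.linspace(-5.0, 5.0, 101):
        for sigma in np.linspace(0.2, 5.0, 49):
            score = objective(normal_pdf(x, mu, sigma))
            if best is None or score < best[0]:
                best = (score, mu, sigma)
    return best[1], best[2]

# KL(P||Q): the divergence the recognition weights minimize.
mu_fwd, sigma_fwd = best_gaussian(lambda q: kl(p, q))
# KL(Q||P): the divergence the variational bound calls for.
mu_rev, sigma_rev = best_gaussian(lambda q: kl(q, p))

print("KL(P||Q) fit: mu=%.2f sigma=%.2f" % (mu_fwd, sigma_fwd))
print("KL(Q||P) fit: mu=%.2f sigma=%.2f" % (mu_rev, sigma_rev))
```

In this toy, the KL(P||Q) fit is a broad Gaussian centred between the two modes, where P has almost no mass, while the KL(Q||P) fit is a narrow Gaussian sitting on one of the modes.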