The flaws in the wake-sleep algorithm
The recognition weights are trained to invert the
generative model in parts of the space where there is no
data.
This is wasteful.
The recognition weights follow the gradient of the wrong
divergence. They minimize KL(P||Q), but the variational
bound requires minimizing KL(Q||P).
This leads to incorrect mode-averaging.
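
The mode-averaging effect can be seen in a small numerical sketch. It is not the wake-sleep algorithm itself: it assumes a hypothetical one-dimensional bimodal distribution P (standing in for a true posterior) and a family Q of single Gaussians with fixed width, and finds the best mean by grid search under each divergence. The grid, the mode locations, the width, and all function names are illustrative assumptions.

```python
import numpy as np

def gaussian_on_grid(x, mu, sigma):
    """Discretized Gaussian, normalized to sum to 1 over the grid."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(a, b):
    """KL(a || b) for strictly positive discrete distributions."""
    return float(np.sum(a * (np.log(a) - np.log(b))))

x = np.linspace(-6.0, 6.0, 241)
sigma = 0.5

# Assumed bimodal P with two equally weighted modes at -3 and +3.
p = 0.5 * gaussian_on_grid(x, -3.0, sigma) + 0.5 * gaussian_on_grid(x, 3.0, sigma)

# Candidate unimodal approximations Q, one per mean.
candidate_means = np.linspace(-4.0, 4.0, 81)
qs = [gaussian_on_grid(x, mu, sigma) for mu in candidate_means]

# KL(P||Q): Q must cover all the mass of P, so its mean lands between the modes.
mu_forward = candidate_means[np.argmin([kl(p, q) for q in qs])]

# KL(Q||P): Q is penalized for putting mass where P is small, so it picks one mode.
mu_reverse = candidate_means[np.argmin([kl(q, p) for q in qs])]

print("best mean under KL(P||Q):", mu_forward)
print("best mean under KL(Q||P):", mu_reverse)
```

Under these assumptions the KL(P||Q) fit sits midway between the two modes, a region where P has almost no mass, while the KL(Q||P) fit settles on one of the modes.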
The posterior over the top hidden layer is very far from
independent because the independent prior cannot
eliminate explaining-away effects.
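
Explaining away can be illustrated with a toy model that is much smaller than a real top hidden layer. The sketch assumes two binary causes that are independent under the prior (each on with an assumed probability of 0.1) and an observed effect that is a deterministic OR of the two. All names and numbers are illustrative.

```python
from itertools import product

prior = 0.1  # assumed prior probability that each cause is on

def likelihood(a, b):
    """P(effect = 1 | a, b) for a deterministic OR of the two causes."""
    return 1.0 if (a or b) else 0.0

def prior_prob(state):
    """Independent Bernoulli prior for a single cause."""
    return prior if state else 1.0 - prior

# Unnormalized posterior over (a, b) given that the effect was observed.
joint = {
    (a, b): prior_prob(a) * prior_prob(b) * likelihood(a, b)
    for a, b in product((0, 1), repeat=2)
}
evidence = sum(joint.values())
posterior = {state: value / evidence for state, value in joint.items()}

p_a = posterior[(1, 0)] + posterior[(1, 1)]  # P(a = 1 | effect)
p_b = posterior[(0, 1)] + posterior[(1, 1)]  # P(b = 1 | effect)
p_both = posterior[(1, 1)]                   # P(a = 1, b = 1 | effect)

print("P(a=1|e) * P(b=1|e):", p_a * p_b)
print("P(a=1, b=1|e):      ", p_both)
```

The joint posterior probability that both causes are on is much smaller than the product of the marginals, so the causes are anti-correlated once the effect is observed even though the prior treats them as independent. A factorial recognition distribution cannot represent that dependence.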