The flaws in the wake-sleep algorithm
The recognition weights are trained to invert the
generative model in parts of the space where there is no
data.
This is wasteful.
The recognition weights do not follow the gradient of the
log probability of the data. Nor do they follow the
gradient of a bound on this probability.
This leads to incorrect mode-averaging.
The posterior over the top hidden layer is very far from
independent because the independent prior cannot
eliminate explaining-away effects.
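
The last two points can be checked numerically on a toy case. The sketch below is an illustration only, not the algorithm itself: it assumes a tiny sigmoid belief net with two independent hidden causes and one visible effect, and the parameter values (CAUSE_BIAS, EFFECT_BIAS, WEIGHT) are hypothetical choices made so that the effect is unlikely without a cause and either cause alone explains it. It computes the exact posterior over the two causes given that the effect is on, then compares it with the factorial distribution that has the same marginals, which is the best a factorial approximation Q can do when it is fitted by minimizing KL(P||Q).

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical parameters for the toy network (assumptions, not from the source).
CAUSE_BIAS = -10.0   # each hidden cause is rare a priori
EFFECT_BIAS = -20.0  # the effect almost never occurs without a cause
WEIGHT = 20.0        # each active cause adds this much to the effect's input


def joint(h1, h2, v=1):
    """Joint probability p(h1, h2, v) under the toy sigmoid belief net."""
    p_on = sigmoid(CAUSE_BIAS)
    p_h = 1.0
    for h in (h1, h2):
        p_h *= p_on if h else 1.0 - p_on  # independent prior over the causes
    p_v_on = sigmoid(EFFECT_BIAS + WEIGHT * (h1 + h2))
    return p_h * (p_v_on if v else 1.0 - p_v_on)


states = list(itertools.product([0, 1], repeat=2))

# Exact posterior p(h1, h2 | v = 1) by enumerating the four hidden states.
evidence = sum(joint(h1, h2) for h1, h2 in states)
posterior = {s: joint(*s) / evidence for s in states}

# Marginals of the true posterior.
m1 = sum(p for (h1, _), p in posterior.items() if h1)
m2 = sum(p for (_, h2), p in posterior.items() if h2)

# Factorial distribution with the same marginals.
factorial = {
    (h1, h2): (m1 if h1 else 1.0 - m1) * (m2 if h2 else 1.0 - m2)
    for h1, h2 in states
}

for s in states:
    print(s, "posterior = %.5f" % posterior[s], "factorial = %.5f" % factorial[s])
```

Under these assumed parameters the true posterior puts almost all of its mass on the two states in which exactly one cause is on: once one cause is known to be present, the other is explained away, so the causes are strongly anti-correlated even though the prior treats them as independent. The factorial distribution cannot express this. It averages the two modes and spreads its mass roughly evenly over all four states, including the two that the true posterior considers very unlikely.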