Fine-tuning with a contrastive divergence
version of the wake-sleep algorithm
Replace the top layer of the causal network by an RBM
This eliminates explaining away at the top level: the RBM's undirected connections act as a prior for the layer below.
It is nice to have an associative memory at the top.
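As a concrete illustration of the top-level module, here is a minimal sketch of a binary RBM trained with one step of contrastive divergence (CD-1). All names, sizes, and the learning rate are hypothetical choices for this sketch, not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM trained with one step of contrastive divergence (CD-1).

    A sketch only: sizes, initialization scale, and learning rate are arbitrary.
    """

    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)  # visible biases
        self.b_hid = np.zeros(n_hid)  # hidden biases
        self.lr = lr

    def sample_hid(self, v):
        # p(h=1 | v) and a binary sample from it
        p = sigmoid(v @ self.W + self.b_hid)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_vis(self, h):
        # p(v=1 | h) and a binary sample from it
        p = sigmoid(h @ self.W.T + self.b_vis)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd1_update(self, v0):
        # positive phase: hidden activities driven by the data
        ph0, h0 = self.sample_hid(v0)
        # negative phase: one alternating Gibbs step (reconstruction)
        pv1, v1 = self.sample_vis(h0)
        ph1, _ = self.sample_hid(v1)
        n = v0.shape[0]
        # CD-1 weight and bias updates: data statistics minus reconstruction statistics
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b_vis += self.lr * (v0 - v1).mean(axis=0)
        self.b_hid += self.lr * (ph0 - ph1).mean(axis=0)
        return v1  # the RBM's negative sample / reconstruction
```

Because the RBM has symmetric connections and a simple Gibbs sampler, its equilibrium states can serve as an associative memory over the top-level features.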
Replace the sleep phase by a top-down pass starting with
the state of the RBM produced by the wake phase.
This makes sure the recognition weights are trained in
the vicinity of the data.
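The modified sleep phase can be sketched for a tiny two-layer model with an RBM on top. This is a hedged illustration, not the original algorithm's exact implementation: all sizes, names (`R`, `G`, `W`, `up_down_step`), and update rules shown are simplified assumptions. The key point it demonstrates is that the top-down pass starts from the RBM state reached after the wake phase, rather than from a fantasy drawn from the prior.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# hypothetical sizes for a tiny two-layer model with an RBM on top
n_vis, n_hid, n_top, lr = 6, 4, 3, 0.05
R = rng.normal(0.0, 0.1, (n_vis, n_hid))  # recognition weights (bottom-up)
G = rng.normal(0.0, 0.1, (n_hid, n_vis))  # generative weights (top-down)
W = rng.normal(0.0, 0.1, (n_hid, n_top))  # top-level RBM weights

def up_down_step(v):
    global R, G, W
    # wake phase: bottom-up pass with the recognition weights
    h = sample(sigmoid(v @ R))
    # train the generative weights to reconstruct the data (delta rule)
    G += lr * h.T @ (v - sigmoid(h @ G))
    # one CD step in the top-level RBM, starting from the wake-phase state
    t_prob = sigmoid(h @ W)
    t = sample(t_prob)
    h_neg = sample(sigmoid(t @ W.T))
    W += lr * (h.T @ t_prob - h_neg.T @ sigmoid(h_neg @ W))
    # top-down pass from the RBM state (this replaces the sleep phase):
    # generate data from h_neg and train the recognition weights on it,
    # so they are trained in the vicinity of the real data
    v_gen = sample(sigmoid(h_neg @ G))
    R += lr * v_gen.T @ (h_neg - sigmoid(v_gen @ R))
    return v_gen
```

Because `h_neg` stays close to the hidden state the data produced, the generated `v_gen` stays near the data distribution, unlike a pure sleep-phase fantasy sampled from the prior.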
It also reduces mode averaging. If the recognition
weights prefer one mode, they will stick with that mode
even if the generative weights like some other mode
just as much, instead of blurring the two modes together.