Why does greedy learning work?
An aside: Averaging factorial distributions
If you average some factorial distributions, you
do NOT, in general, get a factorial distribution.
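
The claim above can be checked numerically. The following is a minimal sketch, assuming two binary variables and two made-up factorial distributions (the probabilities 0.9 and 0.1 and the helper names `factorial_joint` and `is_factorial` are illustrative choices, not from the lecture). It builds each joint table as an outer product of marginals, averages the two tables, and tests whether the average still equals the product of its own marginals.

```python
import numpy as np


def factorial_joint(p_a, p_b):
    """Joint table for two independent binary variables.

    Entry [i, j] is P(a=i) * P(b=j); p_a and p_b are P(a=1) and P(b=1).
    """
    a = np.array([1.0 - p_a, p_a])
    b = np.array([1.0 - p_b, p_b])
    return np.outer(a, b)


def is_factorial(joint):
    """True if the joint equals the product of its own marginals."""
    marg_a = joint.sum(axis=1)
    marg_b = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(marg_a, marg_b)))


# Two factorial distributions (illustrative values, an assumption).
q1 = factorial_joint(0.9, 0.9)  # both variables are probably on
q2 = factorial_joint(0.1, 0.1)  # both variables are probably off

# Equal-weight average of the two distributions.
mixture = 0.5 * (q1 + q2)

print(is_factorial(q1), is_factorial(q2))  # True True
print(is_factorial(mixture))               # False

# P(a=1, b=1) under the mixture vs. the product of its marginals:
# roughly 0.41 vs. 0.25, so the variables are correlated.
print(mixture[1, 1], mixture.sum(axis=1)[1] * mixture.sum(axis=0)[1])
```

Each component treats the two variables as independent, but knowing that one variable is on makes it more likely that the sample came from the first component, which in turn makes the other variable more likely to be on. That dependence is what breaks the factorial form.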
In an RBM, the posterior over the hidden units
is factorial for each visible vector.
But the aggregated posterior over all training
cases is not factorial (even if the data was
generated by the RBM itself).
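
The two statements above can be written out as follows. The notation is an assumption, not taken from the slides: v^(n) is the n-th of N training vectors, h_j is the j-th hidden unit, and q(h) denotes the aggregated posterior.

```latex
% Per-case posterior: factorial over hidden units.
p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v})

% Aggregated posterior: an average of factorial distributions.
q(\mathbf{h}) = \frac{1}{N} \sum_{n=1}^{N} p(\mathbf{h} \mid \mathbf{v}^{(n)})
             = \frac{1}{N} \sum_{n=1}^{N} \prod_j p(h_j \mid \mathbf{v}^{(n)})

% In general the sum over cases does not commute with the product over units:
q(\mathbf{h}) \neq \prod_j q(h_j),
\qquad
q(h_j) = \frac{1}{N} \sum_{n=1}^{N} p(h_j \mid \mathbf{v}^{(n)})
```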