Why does greedy learning work?
Each RBM converts its data distribution
into an aggregated posterior distribution
over its hidden units.
This splits the job of modeling its
data into two tasks:
Task 1: Learn generative weights
that can convert the aggregated
posterior distribution over the hidden
units back into the data distribution.
Task 2: Learn to model the
aggregated posterior distribution
over the hidden units.
The RBM does a good job of task 1
and a moderately good job of task 2.
Task 2 is easier (for the next RBM) than
modeling the original data because the
aggregated posterior distribution is
closer to a distribution that an RBM can
model perfectly.
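
In symbols (a sketch; the notation below is assumed, not on the slide):
with visible units v, hidden units h, and the first RBM's generative
weights W, the two tasks correspond to the two factors of

\[
  p(v) \;=\; \sum_{h} \underbrace{p(h)}_{\text{Task 2}} \;
             \underbrace{p(v \mid h, W)}_{\text{Task 1}}
\]

The first RBM fixes p(v | h, W); the next RBM is then trained to improve
the prior p(h) by modeling the aggregated posterior.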
[Diagram: a two-stage stack. Task 2: model the aggregated posterior
distribution on the hidden units. Task 1: convert that distribution back
into the data distribution on the visible units.]
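
A minimal sketch of the greedy procedure (an assumed NumPy implementation,
not code from the lecture): the first RBM is trained on the data with
one-step contrastive divergence (CD-1), its hidden activities are sampled
to form the aggregated posterior, and the next RBM treats those samples as
its own data. Sizes, learning rate, and epoch counts are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase: sample hidden states from the posterior given data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one step of Gibbs sampling (reconstruction).
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # CD-1 approximation to the log-likelihood gradient.
        batch = v0.shape[0]
        self.W   += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.b_v += self.lr * (v0 - pv1).mean(axis=0)
        self.b_h += self.lr * (ph0 - ph1).mean(axis=0)

# Toy binary data standing in for the data distribution on visible units.
data = (rng.random((500, 20)) < 0.3).astype(float)

# Task 1: the first RBM learns generative weights that map its
# aggregated posterior back to the data distribution.
rbm1 = RBM(n_visible=20, n_hidden=15)
for _ in range(50):
    rbm1.cd1_step(data)

# The aggregated posterior: hidden activities sampled for every data case.
posterior = (rng.random((500, 15)) < rbm1.hidden_probs(data)).astype(float)

# Task 2: the next RBM treats those samples as *its* data distribution,
# which is closer to something an RBM can model perfectly.
rbm2 = RBM(n_visible=15, n_hidden=10)
for _ in range(50):
    rbm2.cd1_step(posterior)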