Why does greedy learning work?
Each RBM converts its data distribution
into a posterior distribution over its
hidden units.
This divides the task of modeling its
data into two tasks:
Task 1: Learn generative weights
that can convert the posterior
distribution over the hidden units
back into the data.
Task 2: Learn to model the posterior
distribution over the hidden units.
The RBM does a good job of task 1
and a not-so-good job of task 2.
Task 2 is easier (for the next RBM) than
modeling the original data because the
posterior distribution is closer to a
distribution that an RBM can model.
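
To make the greedy procedure concrete, here is a minimal sketch of
layer-wise pretraining with binary RBMs trained by one step of
contrastive divergence (CD-1). The class and function names, layer
sizes, and learning rate are illustrative assumptions, not anything
specified on this slide.

    # Minimal sketch of greedy layer-wise RBM pretraining (assumed
    # details: binary units, CD-1, full-batch updates).
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class RBM:
        def __init__(self, n_visible, n_hidden):
            self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
            self.b_v = np.zeros(n_visible)   # visible biases
            self.b_h = np.zeros(n_hidden)    # hidden biases

        def hidden_probs(self, v):
            # Posterior over hidden units given visible data.
            return sigmoid(v @ self.W + self.b_h)

        def visible_probs(self, h):
            # Task 1: generative weights map hiddens back to the data.
            return sigmoid(h @ self.W.T + self.b_v)

        def cd1_update(self, v0, lr=0.1):
            # Positive phase: posterior statistics on the data.
            h0 = self.hidden_probs(v0)
            h0_sample = (rng.random(h0.shape) < h0).astype(float)
            # Negative phase: one step of alternating Gibbs sampling.
            v1 = self.visible_probs(h0_sample)
            h1 = self.hidden_probs(v1)
            # Approximate log-likelihood gradient (CD-1).
            self.W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            self.b_v += lr * (v0 - v1).mean(axis=0)
            self.b_h += lr * (h0 - h1).mean(axis=0)

    def greedy_pretrain(data, layer_sizes, epochs=10):
        # Train each RBM on the posterior of the layer below,
        # then freeze it and hand task 2 to the next RBM.
        rbms, v = [], data
        for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
            rbm = RBM(n_vis, n_hid)
            for _ in range(epochs):
                rbm.cd1_update(v)
            rbms.append(rbm)
            v = rbm.hidden_probs(v)   # posterior becomes next layer's data
        return rbms

    # Toy usage: 100 binary vectors of length 20, stacked 20-50-25.
    data = (rng.random((100, 20)) < 0.3).astype(float)
    stack = greedy_pretrain(data, [20, 50, 25])

Each loop iteration trains one RBM on the activity of the layer below,
then passes the inferred hidden probabilities upward: exactly the
handoff of task 2 described above.
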
[Diagram: a two-layer stack. Task 1: generative weights convert the
posterior distribution on the hidden units back into the data
distribution on the visible units. Task 2: the next RBM models the
posterior distribution on the hidden units.]