Why it's hard to learn one layer at a time
To learn W, we need the posterior
distribution in the first hidden layer.
Problem 1: The posterior is typically
intractable because of “explaining
away”.
Problem 2: The posterior depends
on the prior as well as the likelihood.
So to learn W, we need to know
the weights in higher layers, even
if we are only approximating the
posterior. All the weights interact.
Problem 3: We need to integrate
over all possible configurations of
the higher variables to get the prior
for the first hidden layer. Yuk!
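
The three problems can be written down in one place. This is only a sketch: the symbols v (data) and h^(1), h^(2), h^(3) (the hidden layers, bottom to top) are notation assumed here, not taken from the slide, and the factorization assumes a directed, top-down model with three hidden layers.

```latex
% Posterior over the first hidden layer (Bayes' rule).
% The likelihood depends on W; the prior does not.
p(h^{(1)} \mid v)
  = \frac{p(v \mid h^{(1)}; W)\, p(h^{(1)})}
         {\sum_{h'} p(v \mid h'; W)\, p(h')}

% Prior over the first hidden layer: marginalize out the higher layers.
p(h^{(1)})
  = \sum_{h^{(2)}} \sum_{h^{(3)}}
    p(h^{(1)} \mid h^{(2)})\, p(h^{(2)} \mid h^{(3)})\, p(h^{(3)})
```

The prior p(h^(1)) is where the higher-layer weights enter (Problem 2), and its sum runs over every joint configuration of the higher layers (Problem 3). The denominator of the posterior is a sum over every configuration of the first hidden layer, which is already expensive before explaining away is taken into account (Problem 1).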
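[Figure: three layers of hidden variables stacked above the data. W labels the weights between the data and the first hidden layer; the layers above it are marked as the prior.]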
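
To make "explaining away" concrete, here is a minimal sketch of a tiny sigmoid belief net with two hidden causes and one visible unit. All parameter values and function names are hypothetical, chosen for illustration only. The two causes are independent under the prior, but once the visible unit is observed to be on they become strongly anti-correlated, so the posterior does not factorize.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical parameters (not from the slide).
CAUSE_BIAS = -10.0    # each hidden cause is rare a priori
WEIGHT = 20.0         # each cause strongly drives the visible unit
VISIBLE_BIAS = -20.0  # the visible unit stays off unless a cause is on


def prior(h):
    """Factorial prior: the hidden causes are independent."""
    p_on = sigmoid(CAUSE_BIAS)
    return math.prod(p_on if hi else 1.0 - p_on for hi in h)


def likelihood_on(h):
    """p(visible = 1 | h) for a sigmoid unit."""
    return sigmoid(VISIBLE_BIAS + WEIGHT * sum(h))


def posterior_given_visible_on(n_hidden=2):
    """Exact posterior by brute-force enumeration.

    The enumeration has 2**n_hidden terms, which is why this only
    works for toy models.
    """
    configs = list(itertools.product([0, 1], repeat=n_hidden))
    joint = {h: prior(h) * likelihood_on(h) for h in configs}
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}


post = posterior_given_visible_on()
p_h1 = post[(1, 0)] + post[(1, 1)]  # marginal posterior of cause 1
p_h2 = post[(0, 1)] + post[(1, 1)]  # marginal posterior of cause 2
p_both = post[(1, 1)]               # joint posterior of both causes

print(f"p(h1=1 | v=1)       = {p_h1:.4f}")
print(f"p(h2=1 | v=1)       = {p_h2:.4f}")
print(f"p(h1=1, h2=1 | v=1) = {p_both:.2e}")
print(f"product of marginals = {p_h1 * p_h2:.4f}")
```

With these values each cause has a posterior probability of roughly 0.5, yet the probability that both are on is tiny: one active cause explains the observation and makes the other unnecessary. A factorial approximation to the posterior would miss this dependence.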