Learning multiple layers of features greedily

An analogy

• In a mixture model, we define the probability of a datavector to be p(v) = Σ_h p(h) p(v|h), where p(h) is the mixing proportion of component h.

• The learning rule for the mixing proportions is to make them match the posterior probability of using each Gaussian.

• The weights of an RBM implicitly define a mixing proportion for each possible hidden vector.

– To fit the data better, we can leave p(v|h) the same and make the mixing proportion of each hidden vector more like the posterior over hidden vectors.
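
The mixture-model side of the analogy can be sketched numerically. The example below is an illustration and not part of the original slides: it assumes a one-dimensional mixture of two Gaussians whose means and standard deviations are held fixed (so p(v|h) stays the same), and all names such as `update_mixing` are hypothetical. It moves the mixing proportions to the posterior probability of each component, averaged over the data, and compares the log-likelihood before and after.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    # Density of a 1-D Gaussian, evaluated elementwise with broadcasting.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def conditional(data, means, stds):
    # p(v|h) for every datapoint v (rows) and component h (columns).
    return gaussian_pdf(data[:, None], means[None, :], stds[None, :])

def log_likelihood(data, mixing, means, stds):
    # log of p(v) = sum_h p(h) p(v|h), summed over the dataset.
    return float(np.sum(np.log(conditional(data, means, stds) @ mixing)))

def update_mixing(data, mixing, means, stds):
    # Posterior p(h|v) for each datapoint, then averaged over the data.
    joint = conditional(data, means, stds) * mixing[None, :]
    posterior = joint / joint.sum(axis=1, keepdims=True)
    return posterior.mean(axis=0)

# Assumed toy data: 300 points near -2 and 100 points near 3.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 100)])

means = np.array([-2.0, 3.0])   # held fixed: p(v|h) does not change
stds = np.array([1.0, 1.0])
mixing = np.array([0.5, 0.5])   # initial mixing proportions p(h)

before = log_likelihood(data, mixing, means, stds)
new_mixing = update_mixing(data, mixing, means, stds)
after = log_likelihood(data, new_mixing, means, stds)

print("mixing proportions:", new_mixing)
print("log-likelihood before/after:", before, after)
```

Because only the mixing proportions change, this is the M-step of EM restricted to p(h), which cannot decrease the log-likelihood. In an RBM the hidden vectors play the role of the components, but there are exponentially many of them, so their mixing proportions cannot be tabulated and updated directly as they are here.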