Another explanation of the contrastive
divergence learning procedure
Think of an RBM as an infinite sigmoid belief net with
tied weights.
If we start at the data, alternating Gibbs sampling
computes samples from the posterior distribution in each
hidden layer of the infinite net.
In deeper layers the derivatives w.r.t. the weights are
very small, because the Gibbs chain is already close to its
equilibrium distribution there.
Contrastive divergence just ignores these small
derivatives in the deeper layers of the infinite net.
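One way to write this out, as a sketch rather than a full derivation: the layer superscripts (v^0, h^0, v^1, ...) and the angle brackets for expectations over the Gibbs chain are notation assumed here, not taken from the slide. With tied weights, each layer of the infinite net contributes one term to the derivative, and the terms telescope:

```latex
\[
\frac{\partial \log p(v^0)}{\partial w_{ij}}
  = \langle h_j^0 (v_i^0 - v_i^1) \rangle
  + \langle v_i^1 (h_j^0 - h_j^1) \rangle
  + \langle h_j^1 (v_i^1 - v_i^2) \rangle
  + \cdots
  = \langle v_i^0 h_j^0 \rangle - \langle v_i^\infty h_j^\infty \rangle
\]

% Keeping only the first two terms (one full Gibbs step) gives the CD-1 update:
\[
\Delta w_{ij} \propto \langle v_i^0 h_j^0 \rangle - \langle v_i^1 h_j^1 \rangle
\]
```

The dropped terms are the ones from the deeper layers, which is the approximation the slide describes.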
It's silly to compute the derivatives exactly when you
know the weights are going to change a lot.
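The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes binary units, omits biases, uses a single training vector, and takes one Gibbs step (CD-1). The names `cd1_update`, `sample_hidden`, and `sample_visible` are hypothetical helpers introduced for this example.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_hidden(v, W, rng):
    """Sample binary hidden units given visibles; W[i][j] links v_i to h_j."""
    n_hidden = len(W[0])
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(n_hidden)]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, rng):
    """Sample binary visible units given hiddens, using the same (tied) weights."""
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(W))]
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v0, W, lr, rng):
    """Return new weights after one CD-1 step; W itself is not modified."""
    h0 = sample_hidden(v0, W, rng)   # posterior sample at the data
    v1 = sample_visible(h0, W, rng)  # one-step reconstruction
    h1 = sample_hidden(v1, W, rng)   # hidden sample for the reconstruction
    # Keep only the first-layer statistics; deeper layers are ignored.
    return [[W[i][j] + lr * (v0[i] * h0[j] - v1[i] * h1[j])
             for j in range(len(h0))]
            for i in range(len(v0))]


# Usage: 3 visible units, 2 hidden units, weights start at zero.
rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(3)]
W = cd1_update([1, 0, 1], W, lr=0.1, rng=rng)
```

Stopping after one step means the chain never runs to equilibrium, which is the shortcut the slide describes: the contributions from later Gibbs steps are simply dropped.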