Explanation of the face movies

One of the neural networks developed by Ruslan Salakhutdinov, Simon Osindero and Geoff Hinton learns a generative model that looks like this:
1000 top-level binary units <--> 1000 binary units --> 2000 binary units --> 625 Gaussian pixels.
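To make the architecture concrete, here is a minimal NumPy sketch of the weight matrices such a network would need. The layer sizes come from the description above; the variable names, random initialisation and biases are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the description above (625 Gaussian pixels = a 25x25 patch).
n_top, n_h2, n_h1, n_pix = 1000, 1000, 2000, 625

# Symmetric weights of the top-level associative memory (1000 <--> 1000).
W_assoc = rng.normal(0.0, 0.01, size=(n_h2, n_top))

# Directed, top-down generative weights (1000 --> 2000 --> 625).
W_h2_h1 = rng.normal(0.0, 0.01, size=(n_h2, n_h1))
W_h1_pix = rng.normal(0.0, 0.01, size=(n_h1, n_pix))

# Generative biases for each layer.
b_top, b_h2, b_h1, b_pix = (np.zeros(n) for n in (n_top, n_h2, n_h1, n_pix))
```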

Generating unbiased samples from the model is a slow process. It involves many iterations of alternating Gibbs sampling to get a sample from the top-level associative memory, which consists of 1000 top-level binary units symmetrically connected to the 1000 binary units below them. Then we use the directed connections from the lower 1000 binary units to stochastically activate the 2000 binary units, which in turn generate the pixel intensities. Fortunately, the generative model can be learned without ever having to generate unbiased samples from it.
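The sketch below, which reuses the assumed variable names from the snippet above, shows roughly what that generation procedure looks like: alternating Gibbs sampling between the top two layers, followed by a single stochastic top-down pass through the directed connections. It is a schematic illustration only, not the authors' implementation.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    """Sample binary states with the given on-probabilities."""
    return (rng.random(p.shape) < p).astype(float)

def gibbs_top(top, n_steps):
    """Alternating Gibbs sampling in the top-level associative memory."""
    for _ in range(n_steps):
        h2 = sample_bernoulli(sigmoid(top @ W_assoc.T + b_h2))
        top = sample_bernoulli(sigmoid(h2 @ W_assoc + b_top))
    return h2

def generate_pixels(h2):
    """One stochastic top-down pass through the directed connections."""
    h1 = sample_bernoulli(sigmoid(h2 @ W_h2_h1 + b_h1))
    # The Gaussian pixel units receive a top-down input that is their mean intensity.
    return h1 @ W_h1_pix + b_pix
```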

Before the movie starts, the top-level units are initialised to random binary states, and it takes a while for the alternating Gibbs sampling in the top two layers to approach the equilibrium distribution. To reduce noise, we do not add Gaussian noise to the pixels during generation.
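In terms of the same sketch, producing one frame of the movie would look something like the following; the number of Gibbs steps is an arbitrary illustrative choice.

```python
# The top-level units start in random binary states before the movie begins.
top = sample_bernoulli(np.full(n_top, 0.5))

# Many alternating Gibbs steps are needed to get near the equilibrium distribution.
h2 = gibbs_top(top, n_steps=300)

# Use the top-down means directly as pixel intensities: no Gaussian noise is added.
frame = generate_pixels(h2).reshape(25, 25)
```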

Some of the generated images do not look like faces at all, but with a generative model we can throw away three-quarters of the images it generates at a cost of only two bits in the log probability assigned to each image.
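The two-bit figure is just the logarithm of the rejection factor: if only one generated image in four is kept, renormalising the model onto the accepted images multiplies each retained image's probability by 4, and log2(4) = 2 bits.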

The training data consists of 120,000 images derived from the 400 Olivetti face images of 40 different people by taking large square patches from these images. The patches vary in scale, orientation and position. This does a good job of screwing up a full-covariance Gaussian model of the pixels (or PCA), which works quite well on carefully aligned images.
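A rough sketch of how such patches could be produced is shown below. The use of scikit-learn to fetch the Olivetti faces, the 25x25 output size (chosen to match the 625 pixels), and the particular scale and rotation ranges are all assumptions made for illustration; they are not the authors' actual preprocessing.

```python
import numpy as np
from scipy.ndimage import rotate, zoom
from sklearn.datasets import fetch_olivetti_faces

rng = np.random.default_rng(1)
faces = fetch_olivetti_faces().images  # 400 images of shape 64x64

def random_patch(image, out_size=25):
    """Crop a square patch with random scale, orientation and position."""
    angle = rng.uniform(-20, 20)                        # assumed rotation range
    rotated = rotate(image, angle, reshape=False, mode='nearest')
    size = rng.integers(30, 57)                         # assumed scale range (pixels)
    top = rng.integers(0, rotated.shape[0] - size + 1)
    left = rng.integers(0, rotated.shape[1] - size + 1)
    patch = rotated[top:top + size, left:left + size]
    return zoom(patch, out_size / size)                 # resize to out_size x out_size

# 120,000 patches = 300 per original face image (chosen to match the stated total).
patches = np.stack([random_patch(img) for img in faces for _ in range(300)])
```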

Play the movie of the network generating images of face patches