2007 NIPS Tutorial on:
Deep Belief Nets
Some things you will learn in this tutorial
A spectrum of machine learning tasks
Historical background:
First generation neural networks
Second generation neural networks (~1985)
What is wrong with back-propagation?
Overcoming the limitations of back-propagation
Stochastic binary units (Bernoulli variables)
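For reference in the rest of the outline: a stochastic binary unit turns on with a probability given by the logistic of its total input,
\[
p(s_i = 1) \;=\; \frac{1}{1 + \exp\!\big(-b_i - \sum_j s_j w_{ji}\big)}
\]
where b_i is the unit's bias and w_{ji} is the weight on the connection from unit j.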
The learning rule for sigmoid belief nets
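The rule shown on this slide: after sampling hidden states from the posterior, each weight is nudged so that a unit's probability of turning on, p_i, moves toward its sampled binary state s_i,
\[
\Delta w_{ji} \;=\; \varepsilon\, s_j \,\big(s_i - p_i\big), \qquad p_i = \frac{1}{1 + \exp\!\big(-\sum_j s_j w_{ji}\big)}
\]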
Why it is usually very hard to learn sigmoid belief nets one layer at a time
Two types of generative neural network
Restricted Boltzmann Machines (Smolensky, 1986, called them “harmoniums”)
The Energy of a joint configuration (ignoring terms to do with biases)
Weights → Energies → Probabilities
Using energies to define probabilities
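These two slides use the standard RBM definitions (biases omitted, as noted). A joint configuration of visible vector v and hidden vector h has an energy, and the energies define a Boltzmann distribution:
\[
E(\mathbf{v},\mathbf{h}) \;=\; -\sum_{i \in \text{vis}} \sum_{j \in \text{hid}} v_i h_j w_{ij},
\qquad
p(\mathbf{v},\mathbf{h}) \;=\; \frac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}},
\qquad
p(\mathbf{v}) \;=\; \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}
\]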
A picture of the maximum likelihood learning algorithm for an RBM
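The algorithm pictured here ascends the log-likelihood gradient; in practice the expensive model expectation is replaced by the one-step contrastive divergence (CD-1) approximation:
\[
\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} \;=\; \langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}},
\qquad
\Delta w_{ij} \;=\; \varepsilon\big(\langle v_i h_j\rangle^{0} - \langle v_i h_j\rangle^{1}\big)
\]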
How to learn a set of features that are good for reconstructing images of the digit 2
How well can we reconstruct the digit images from the binary feature activations?
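A minimal NumPy sketch of CD-1 training on binarized digit images; the variable names, batching, and the choice of 500 hidden units are illustrative assumptions, not taken from the slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.05):
    # Positive phase: hidden probabilities and sampled binary states given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (np.random.rand(*h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one reconstruction of the visibles, then hidden probabilities again.
    v1 = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1 @ W + b_hid)
    # CD-1 update: <v h> measured on the data minus <v h> measured on the reconstructions.
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return ((v0 - v1) ** 2).mean()  # reconstruction error, useful for monitoring

# Illustrative setup: 28x28 digit images as 784 visible units, 500 binary hidden features.
W = 0.01 * np.random.randn(784, 500)
b_vis = np.zeros(784)
b_hid = np.zeros(500)
# err = cd1_update(binarized_batch, W, b_vis, b_hid)   # binarized_batch: N x 784 array of 0/1 pixels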
Three ways to combine probability density models (an underlying theme of the tutorial)
Training a deep network (the main reason RBM’s are interesting)
The generative model after learning 3 layers
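After three layers the model is a hybrid: the top two hidden layers form an undirected RBM that defines p(h², h³), and generation proceeds top-down through directed connections,
\[
p(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2, \mathbf{h}^3) \;=\; p(\mathbf{h}^2, \mathbf{h}^3)\, p(\mathbf{h}^1 \mid \mathbf{h}^2)\, p(\mathbf{v} \mid \mathbf{h}^1)
\]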
Why does greedy learning work? An aside: Averaging factorial distributions
Why does greedy learning work?
Which distributions are factorial in a directed belief net?
Why does greedy learning fail in a directed module?
Fine-tuning with a contrastive version of the “wake-sleep” algorithm
Show the movie of the network generating digits (available at www.cs.toronto/~hinton)
Examples of correctly recognized handwritten digits that the neural network had never seen before
Unsupervised “pre-training” also helps for models that have more data and better priors
Another view of why layer-by-layer learning works
An infinite sigmoid belief net that is equivalent to an RBM
Inference in a directed net with replicated weights
Learning a deep directed network
"Then freeze the first layer..."
What happens when the weights in higher layers become different from the weights in the first layer?
A stack of RBM’s (Yee-Whye Teh’s idea)
Overview of the rest of the tutorial
Fine-tuning for discrimination
Why backpropagation works better after greedy pre-training
First, model the distribution of digit images
Results on permutation-invariant MNIST task
Combining deep belief nets with Gaussian processes
Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
The root mean squared error in the orientation when combining GP’s with deep belief nets
The free-energy of a mean-field logistic unit
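As a reminder of the quantity involved (a hedged reconstruction, with temperature taken as 1): for a logistic unit with mean activity p, the free energy is the expected energy minus the entropy,
\[
F \;=\; p\,E_{\text{on}} + (1-p)\,E_{\text{off}} + p\ln p + (1-p)\ln(1-p)
\]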
An RBM with real-valued visible units
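The usual way to handle real-valued visible units is Gaussian visible units with binary hidden units; the energy becomes
\[
E(\mathbf{v},\mathbf{h}) \;=\; \sum_{i} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j w_{ij}
\]
where σ_i is the standard deviation of the Gaussian noise for visible unit i, and b_i, b_j are the visible and hidden biases.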
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
A comparison of methods for compressing digit images to 30 real numbers.
Do the 30-D codes found by the deep autoencoder preserve the class structure of the data?
Retrieving documents that are similar to a query document
How to compress the count vector
Performance of the autoencoder at document retrieval
Proportion of retrieved documents in same class as query
Finding binary codes for documents
How good is a shortlist found this way?
The conditional RBM model (Sutskever & Hinton, 2007)
Why the autoregressive connections do not cause problems
Generating from a learned model
An application to modeling motion capture data (Taylor, Roweis & Hinton, 2007)
Modeling multiple types of motion
Show Graham Taylor’s movies (available at www.cs.toronto/~hinton)
Generating the parts of an object
Semi-restricted Boltzmann Machines
Learning a semi-restricted Boltzmann Machine
Learning in Semi-restricted Boltzmann Machines
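A hedged sketch of the updates implied here, assuming w_ij is a visible-to-hidden weight and l_ik a lateral weight between visible units i and k: both are trained with the same contrastive data-minus-reconstruction statistics,
\[
\Delta w_{ij} \;\propto\; \langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{recon}},
\qquad
\Delta l_{ik} \;\propto\; \langle v_i v_k\rangle_{\text{data}} - \langle v_i v_k\rangle_{\text{recon}}
\]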
Results on modeling natural image patches using a stack of RBM’s (Osindero and Hinton)
Whitening the learning signal instead of the data
Towards a more powerful, multi-linear stackable learning module
Higher order Boltzmann machines (Sejnowski, ~1986)
A picture of a conditional, higher-order Boltzmann machine (Hinton & Lang, 1985)
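In a higher-order Boltzmann machine the energy contains multiplicative interactions among more than two units; with three-way terms, for example,
\[
E(\mathbf{s}) \;=\; -\sum_{i<j<k} s_i s_j s_k \, w_{ijk}
\]
In the conditional version pictured here, one group of units effectively gates the pairwise weights between the other two groups.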