2007 NIPS Tutorial on:
Deep Belief Nets

Some things you will learn in this tutorial

A spectrum of machine learning tasks

Historical background:
First generation neural networks

Second generation neural networks (~1985)

A temporary digression

What is wrong with back-propagation?

Overcoming the limitations of back-propagation

Belief Nets

Stochastic binary units
(Bernoulli variables)
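A stochastic binary (Bernoulli) unit turns on with a probability given by the logistic sigmoid of its total input. A sketch of the standard rule, with bias $b_i$ and incoming weights $w_{ij}$:

```latex
p(s_i = 1) \;=\; \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j\, w_{ij}\right)}
```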

Learning Deep Belief Nets

The learning rule for sigmoid belief nets
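Given binary samples from the posterior, the maximum-likelihood learning rule for a sigmoid belief net takes a simple delta-rule form. A sketch, with $\varepsilon$ the learning rate, $s_j$ a parent's binary state, $s_i$ the unit's state, and $p_i$ the probability that unit $i$ would turn on given the states of its parents:

```latex
\Delta w_{ji} \;=\; \varepsilon\, s_j\,(s_i - p_i),
\qquad
p_i \;=\; \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j\, w_{ji}\right)}
```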

Explaining away (Judea Pearl)

Why it is usually very hard to learn sigmoid belief nets one layer at a time

Two types of generative neural network

Restricted Boltzmann Machines
(Smolensky, 1986, called them “harmoniums”)

The Energy of a joint configuration
(ignoring terms to do with biases)
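With binary visible units $v_i$ and binary hidden units $h_j$, the energy of a joint configuration of an RBM (ignoring the bias terms, as the slide title says) is:

```latex
E(v,h) \;=\; -\sum_{i \in \mathrm{vis}} \sum_{j \in \mathrm{hid}} v_i\, h_j\, w_{ij}
```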

Weights → Energies → Probabilities

Using energies to define probabilities
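The energy function defines a Boltzmann distribution over joint configurations; the probability of a visible vector is obtained by summing out the hidden units. A sketch (the denominator is the partition function over all configurations):

```latex
p(v,h) \;=\; \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}},
\qquad
p(v) \;=\; \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}
```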

A picture of the maximum likelihood learning algorithm for an RBM

A quick way to learn an RBM
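The “quick way” is contrastive divergence (CD-1): one up-down-up pass of Gibbs sampling instead of running the chain to equilibrium. A minimal NumPy sketch, assuming a binary RBM (function and variable names are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    v0: (batch, n_vis) binary data; W: (n_vis, n_hid) weights.
    Returns updated copies of W, b_vis, b_hid.
    """
    # Positive phase: sample hidden states from the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One Gibbs step: reconstruct the visibles, then recompute
    # the hidden probabilities from the reconstruction.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)
    # Learning rule: Δw ∝ <v_i h_j>_data − <v_i h_j>_reconstruction
    dW = (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)
    return (W + lr * dW,
            b_vis + lr * (v0 - v1).mean(axis=0),
            b_hid + lr * (p_h0 - p_h1).mean(axis=0))
```

In practice one sweeps `cd1_update` over mini-batches of data; the same routine is reused to train each layer of a deep belief net greedily.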

How to learn a set of features that are good for reconstructing images of the digit 2


How well can we reconstruct the digit images from the binary feature activations?

Three ways to combine probability density models (an underlying theme of the tutorial)

Training a deep network
(the main reason RBMs are interesting)

The generative model after learning 3 layers

Why does greedy learning work? An aside: Averaging factorial distributions

Why does greedy learning work?

Why does greedy learning work?

Which distributions are factorial in a directed belief net?

Why does greedy learning fail in a directed module?

A model of digit recognition

Fine-tuning with a contrastive version of the “wake-sleep” algorithm

Show the movie of the network generating digits

(available at www.cs.toronto/~hinton)

Samples generated by letting the associative memory run with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.

Examples of correctly recognized handwritten digits
that the neural network had never seen before

How well does it discriminate on MNIST test set with no extra information about geometric distortions?

Unsupervised “pre-training” also helps for models that have more data and better priors

Another view of why layer-by-layer learning works

An infinite sigmoid belief net that is equivalent to an RBM

Inference in a directed net with replicated weights

"The learning rule for a..."

Learning a deep directed network

"Then freeze the first layer..."

How many layers should we use and how wide should they be?
(I am indebted to Karl Rove for this slide)

What happens when the weights in higher layers become different from the weights in the first layer?

A stack of RBMs
(Yee-Whye Teh’s idea)

The variational bound
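The bound in question is the standard variational lower bound on the log-probability of the data, with $Q(h \mid v)$ the approximate posterior produced by the recognition weights; greedy layer-by-layer learning is justified because adding a layer cannot decrease this bound (under the tied-weight initialization). A sketch:

```latex
\log p(v) \;\ge\; \sum_h Q(h \mid v)\,\log p(v,h)
\;-\; \sum_h Q(h \mid v)\,\log Q(h \mid v)
```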

Summary so far

Overview of the rest of the tutorial


Fine-tuning for discrimination

Why backpropagation works better after greedy pre-training

First, model the distribution of digit images

Results on permutation-invariant MNIST task

Combining deep belief nets with Gaussian processes

Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)

The training and test sets

The root mean squared error in the orientation when combining GPs with deep belief nets

Modeling real-valued data

The free-energy of a mean-field logistic unit

An RBM with real-valued visible units
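The usual choice for real-valued data is linear visible units with Gaussian noise and binary hidden units. A sketch of the energy function of such a Gaussian–binary RBM, with $\sigma_i$ the noise standard deviation of visible unit $i$:

```latex
E(v,h) \;=\; \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2}
\;-\; \sum_j b_j h_j
\;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}
```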

Deep Autoencoders
(Hinton & Salakhutdinov, 2006)

A comparison of methods for compressing digit images to 30 real numbers.

Do the 30-D codes found by the deep autoencoder preserve the class structure of the data?


Retrieving documents that are similar to a query document

How to compress the count vector

Performance of the autoencoder at document retrieval

Proportion of retrieved documents in same class as query



Finding binary codes for documents

Semantic hashing: Using a deep autoencoder as a hash-function for finding approximate matches (Salakhutdinov & Hinton, 2007)
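Semantic hashing treats the binary code of a document as a memory address: semantically similar documents land at nearby addresses, so approximate matches can be collected by probing every address within a small Hamming ball of the query's code. A minimal sketch of that lookup, assuming codes are stored as integers (names are illustrative, not from the paper):

```python
from collections import defaultdict
from itertools import combinations

def hamming_ball(code, n_bits, radius=1):
    """Yield every address within `radius` bit-flips of `code`."""
    yield code
    for r in range(1, radius + 1):
        for idxs in combinations(range(n_bits), r):
            flipped = code
            for i in idxs:
                flipped ^= 1 << i
            yield flipped

def build_index(doc_codes):
    """Map each binary code (an int) to the documents that hash to it."""
    index = defaultdict(list)
    for doc_id, code in doc_codes:
        index[code].append(doc_id)
    return index

def shortlist(index, query_code, n_bits, radius=1):
    """Collect every document whose code lies in the Hamming ball."""
    hits = []
    for addr in hamming_ball(query_code, n_bits, radius):
        hits.extend(index.get(addr, []))
    return hits
```

The point of the scheme is that no search is involved: retrieval cost depends only on the ball size, not on the size of the document collection.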

How good is a shortlist found this way?

Time series models

Time series models

The conditional RBM model
(Sutskever & Hinton 2007)

Why the autoregressive connections do not cause problems

Generating from a learned model

Stacking temporal RBMs

An application to modeling
motion capture data
(Taylor, Roweis & Hinton, 2007)

Modeling multiple types of motion

Show Graham Taylor’s movies

available at www.cs.toronto/~hinton

Generating the parts of an object

Semi-restricted Boltzmann Machines

Learning a semi-restricted Boltzmann Machine

Learning in Semi-restricted Boltzmann Machines

Results on modeling natural image patches using a stack of RBMs (Osindero and Hinton)

Without lateral connections

With lateral connections

A funny way to use an MRF

Why do we whiten data?

Whitening the learning signal instead of the data

Towards a more powerful, multi-linear stackable learning module

Higher-order Boltzmann machines (Sejnowski, ~1986)

A picture of a conditional,
 higher-order Boltzmann machine
(Hinton & Lang, 1985)

Using conditional higher-order Boltzmann machines to model image transformations (Memisevic and Hinton, 2007)

Readings on deep belief nets