CSC2515 Fall 2007
Introduction to Machine Learning

Lecture 8 Deep Belief Nets

Three ways to combine probability density models

Belief Nets

Stochastic binary neurons
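
As a minimal sketch of what a stochastic binary neuron computes, assuming the standard logistic form (the unit turns on with probability given by the logistic function of its bias plus its weighted input), the following Python may help. The function and parameter names are hypothetical and are not taken from the lecture.

```python
import math
import random


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_binary_unit(bias, inputs, weights, rng=random):
    """Sample one stochastic binary unit (hypothetical helper).

    Assumes p(s = 1) = sigmoid(bias + sum_j inputs[j] * weights[j]).
    Returns the sampled state (0 or 1) and the probability of being on.
    """
    total_input = bias + sum(s * w for s, w in zip(inputs, weights))
    p_on = sigmoid(total_input)
    state = 1 if rng.random() < p_on else 0
    return state, p_on


if __name__ == "__main__":
    rng = random.Random(0)
    state, p_on = sample_binary_unit(0.0, [1, 0], [2.0, -3.0], rng)
    print(state, p_on)
```

Only the first input is active here, so the probability of turning on is the logistic function of 2.0; the sampled state itself depends on the random seed.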

Learning Belief Nets

The learning rule for sigmoid belief nets

Explaining away (Judea Pearl)

Why it is usually very hard to learn sigmoid belief nets one layer at a time

Two types of generative neural network

Restricted Boltzmann Machines

The energy of a joint configuration (ignoring terms to do with biases)

Weights → Energies → Probabilities

Using energies to define probabilities
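
For reference, a sketch of the standard RBM relations these titles point to, with biases ignored. The symbols are assumptions, not notation taken from the slides: $v_i$ is the binary state of visible unit $i$, $h_j$ the binary state of hidden unit $j$, $w_{ij}$ the weight between them, and $Z$ the partition function.

```latex
E(v, h) = -\sum_{i,j} v_i \, h_j \, w_{ij}

p(v, h) = \frac{e^{-E(v, h)}}{Z},
\qquad
Z = \sum_{u, g} e^{-E(u, g)}

p(v) = \frac{\sum_{h} e^{-E(v, h)}}{Z}
```

The weights fix the energy of every joint configuration, and the energies fix the probabilities through the normalized exponential.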

A picture of the maximum likelihood learning algorithm for an RBM

A quick way to learn an RBM
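
The quick method referred to here is presumably one-step contrastive divergence (CD-1), which changes each weight in proportion to the difference between the visible-hidden correlation measured on the data and the same correlation after one reconstruction. Below is a minimal pure-Python sketch under that assumption. Biases are omitted, and all names (`cd1_update`, `sample_hidden`, `sample_visible`, `learning_rate`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, rng):
    """Sample hidden states given visible states; W[i][j] links visible i to hidden j."""
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(W[0]))]
    states = [1 if rng.random() < p else 0 for p in probs]
    return states, probs


def sample_visible(h, W, rng):
    """Sample visible states given hidden states."""
    probs = [sigmoid(sum(h[j] * W[i][j] for j in range(len(h))))
             for i in range(len(W))]
    states = [1 if rng.random() < p else 0 for p in probs]
    return states, probs


def cd1_update(v0, W, learning_rate, rng):
    """Apply one CD-1 weight update in place for a single training vector v0."""
    h0, _ = sample_hidden(v0, W, rng)        # data-driven hidden states
    v1, _ = sample_visible(h0, W, rng)       # one-step reconstruction
    _, h1_probs = sample_hidden(v1, W, rng)  # hidden probabilities on the reconstruction
    for i in range(len(W)):
        for j in range(len(W[0])):
            # <v_i h_j> on the data minus <v_i h_j> on the reconstruction
            W[i][j] += learning_rate * (v0[i] * h0[j] - v1[i] * h1_probs[j])
    return W


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[0.0, 0.0] for _ in range(3)]  # 3 visible units, 2 hidden units
    cd1_update([1, 0, 1], weights, 0.1, rng)
    print(weights)
```

Using hidden probabilities rather than sampled states for the reconstruction statistics is a common noise-reducing choice, not something the slide title specifies.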

How to learn a set of features that are good for reconstructing images of the digit 2

How well can we reconstruct the digit images from the binary feature activations?

Training a deep network

The generative model after learning 3 layers

Why does greedy learning work?

What does each RBM achieve?

A neural model of digit recognition

Fine-tuning with a contrastive divergence version of the “wake-sleep” algorithm

Show the movie of the network generating digits
(available at www.cs.toronto/~hinton)

Samples generated by letting the associative memory run with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.

Examples of correctly recognized handwritten digits that the neural network had never seen before

How well does it discriminate on the MNIST test set with no extra information about geometric distortions?

Another view of why layer-by-layer learning works

An infinite sigmoid belief net that is equivalent to an RBM

Inference in a directed net with replicated weights

"The learning rule for a..."

Learning a deep directed network

"Then freeze the first layer..."

What happens when the weights in higher layers become different from the weights in the first layer?

A stack of RBMs
(Yee-Whye Teh’s picture)

The variational bound

Summary so far

Fine-tuning for discrimination

Why backpropagation works better after greedy pre-training

First, model the distribution of digit images

Results on permutation-invariant MNIST task

Combining deep belief nets with Gaussian processes

Learning to extract the orientation of a face patch (Ruslan Salakhutdinov)

The training and test sets

The root mean squared error in the orientation when combining GPs with deep belief nets

Modeling real-valued data

The free energy of a mean-field logistic unit

An RBM with real-valued visible units

Deep Autoencoders
(Ruslan Salakhutdinov)

A comparison of methods for compressing digit images to 30 real numbers.

Do the 30-D codes found by the deep autoencoder preserve the class structure of the data?

Retrieving documents that are similar to a query document

How to compress the count vector

Performance of the autoencoder at document retrieval

Proportion of retrieved documents in same class as query

Finding binary codes for documents

Using a deep autoencoder as a hash-function for finding approximate matches

How good is a shortlist found this way?

Time series models

The conditional RBM model

Why the lateral connections do not cause problems

Generating from a learned model

Stacking temporal RBMs

An application to modeling motion capture data

Modeling multiple types of motion

Show Graham Taylor’s movies

Generating the parts of an object

Semi-restricted Boltzmann Machines

Learning in Semi-restricted Boltzmann Machines

Learning a semi-restricted Boltzmann Machine

Results on modeling natural image patches using a stack of RBMs (Osindero and Hinton)

Without lateral connections

With lateral connections

A funny way to use an MRF

Why do we whiten data?

Whitening the learning signal instead of the data

Higher order Boltzmann machines

A picture of a conditional, higher-order Boltzmann machine (1985)

Using conditional higher-order Boltzmann machines to model image transformations (Memisevic and Hinton, 2007)