CSC 2541 Fall 2016:

Differentiable Inference and Generative Models

Training a GAN [Images synthesized from a GAN]

Overview

In the last few years, new inference methods have allowed big advances in probabilistic generative models. These models let us generate novel images and text, find meaningful latent representations of data, take advantage of large unlabeled datasets, and even let us do analogical reasoning automatically. This course will tour recent innovations in inference methods such as recognition networks, black-box stochastic variational inference, and adversarial autoencoders. It will also cover recent advances in generative model design, such as deconvolutional image models, thought vectors, and recurrent variational autoencoders. The class will have a major project component.

Prerequisites

This course is designed to bring students to the current frontier of knowledge on these methods, so that ideally, their course projects can make a novel contribution. A previous background in machine learning such as CSC411 or ECE521 is strongly recommended. Linear algebra, basic multivariate calculus, basics of working with probability, and programming skills are required.

Where and When

Fall term 2016, Fridays 2:00-4:00pm
Room: Galbraith room 220
Instructor: David Duvenaud
Email: duvenaud@cs.toronto.edu (put "CSC2541" in the subject)
Office hours: Wednesdays 1:00-3:00pm, Room 384 Pratt
Teaching assistants: Tony Wu and Kamal Rai

What are generative models?

Generative modeling loosely refers to building a model of data, for instance p(image), that we can sample from. This is in contrast to discriminative modeling, such as regression or classification, which tries to estimate conditional distributions such as p(class | image).

Why generative models?

Even when we're only interested in making predictions, there are practical reasons to build generative models:

Data efficiency and semi-supervised learning - Generative models can reduce the amount of data required. As a simple example, building an image classifier p(class | image) requires estimating a very high-dimensional function, possibly requiring a lot of data or clever assumptions. In contrast, we could model the data as being generated from some low-dimensional or sparse latent variables z, as in \(p(image) = \int p(image | z) p(z) dz\). Then, to do classification, we only need to learn p(class | z), which will usually be a much simpler function. This approach also lets us take advantage of unlabeled data - also known as semi-supervised learning.
Model checking by sampling - Understanding complex regression and classification models is hard - it's often not clear what these models have learned from the data and what they missed. There is a simple way to sanity-check and inspect generative models - simply sample from them, and compare the sampled data to the real data to see if anything is missing.
Understanding - Generative models usually assume that each datapoint is generated from a (usually low-dimensional) latent variable. These latent variables are often interpretable, and sometimes can tell us about the hidden causes of a phenomenon. These latent variables can also sometimes let us do interesting things such as interpolating between examples.
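As a rough illustration of the latent-variable integral \(p(x) = \int p(x | z) p(z) dz\), the sketch below estimates it by simple Monte Carlo for a toy one-dimensional model. The model itself (z ~ N(0, 1), x | z ~ N(z, noise_std^2)) and all function names are assumptions chosen for illustration, picked because the exact marginal is known in closed form and can be used as a check.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def marginal_likelihood_mc(x, noise_std, num_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior samples."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)               # z ~ p(z) = N(0, 1)  (toy assumption)
        total += normal_pdf(x, z, noise_std)  # p(x | z) = N(z, noise_std^2)
    return total / num_samples


def marginal_likelihood_exact(x, noise_std):
    """For this toy model the integral is Gaussian: p(x) = N(0, 1 + noise_std^2)."""
    return normal_pdf(x, 0.0, math.sqrt(1.0 + noise_std ** 2))


if __name__ == "__main__":
    rng = random.Random(0)
    print(marginal_likelihood_mc(0.5, 0.5, 20000, rng))
    print(marginal_likelihood_exact(0.5, 0.5))
```

Sampling from the prior works here only because the latent space is one-dimensional. For images, almost all prior samples would explain a given datapoint poorly, which is one motivation for the inference methods covered in this course.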

Differentiable inference

We already know how to specify some expressive and flexible generative models, including entire languages of models that can express arbitrarily complicated structure. However, until recently such models were hard to apply to real datasets, because inference methods (such as Markov chain Monte Carlo methods) were not usually fast or scalable enough to run on large models or even medium-sized datasets.

The past few years have seen major progress in methods to train and do inference in generative models, loosely following four strands:

Variational autoencoders - Latent-variable models that use a neural network to do approximate inference. The recognition network looks at each datapoint x and outputs an approximate posterior on the latents q(z | x) for that datapoint.
Generative adversarial networks - A way to train generative models by optimizing them to fool a classifier, the discriminator network, that tries to distinguish between real data and data generated by the model.
Invertible density estimation - A way to specify complex generative models by transforming a simple latent distribution with a series of invertible functions. These approaches are restricted to a more limited set of possible operations, but sidestep the difficult integrals required to train standard latent variable models.
Autoregressive models - Another way to model p(x) is to break the model into a series of conditional distributions: \(p(x) = p(x_1) p(x_2|x_1) p(x_3 | x_2, x_1) \dots\). This is the approach used, for example, by recurrent neural networks. These models are also relatively easy to train, but the downside is that they don't support all of the same queries we can make of latent-variable models.
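To make the autoregressive factorization concrete, here is a minimal sketch for binary sequences. The conditional used here (a logistic function of the number of ones seen so far, with made-up weight and bias values) is purely an assumption for illustration; a recurrent network would replace it in practice. The point is that the product of normalized conditionals gives a normalized p(x) that is easy to both evaluate and sample from.

```python
import itertools
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def conditional_prob_one(prefix, weight=0.8, bias=-0.3):
    """p(x_i = 1 | x_1, ..., x_{i-1}); a hypothetical stand-in for an RNN."""
    return sigmoid(bias + weight * sum(prefix))


def log_prob(x):
    """log p(x) = sum_i log p(x_i | x_{<i}), the chain-rule factorization."""
    total = 0.0
    for i, xi in enumerate(x):
        p_one = conditional_prob_one(x[:i])
        total += math.log(p_one if xi == 1 else 1.0 - p_one)
    return total


def sample(length, rng):
    """Ancestral sampling: draw each x_i given the values already drawn."""
    x = []
    for _ in range(length):
        x.append(1 if rng.random() < conditional_prob_one(x) else 0)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    x = sample(5, rng)
    print(x, log_prob(x))
    # Probabilities of all length-4 sequences sum to one (up to rounding).
    print(sum(math.exp(log_prob(bits)) for bits in itertools.product([0, 1], repeat=4)))
```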

The common thread among these approaches that lets them scale to high-dimensional models is that their loss functions are end-to-end differentiable. This is in contrast to previous inference strategies such as MCMC or early variational inference strategies, which required alternating inference and optimization steps and didn't allow gradient-based tuning of the inference procedure.
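One common way to get such end-to-end gradients through a sampling step is to write the sample as a differentiable function of the parameters plus independent noise. The sketch below does this by hand for a toy objective; the objective \((z - 3)^2\), the Gaussian family, and the function name are assumptions for illustration, and an automatic differentiation library would normally compute the derivative.

```python
import random


def pathwise_gradient(mu, sigma, num_samples, rng):
    """Estimate d/dmu of E_{z ~ N(mu, sigma^2)}[(z - 3)^2] by differentiating through the sample."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on mu
        z = mu + sigma * eps       # sample written as a differentiable function of mu
        total += 2.0 * (z - 3.0)   # d/dmu (z - 3)^2 = 2 (z - 3) * dz/dmu, and dz/dmu = 1
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # The exact gradient for this toy objective is 2 * (mu - 3).
    print(pathwise_gradient(1.0, 1.0, 20000, rng), 2.0 * (1.0 - 3.0))
```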

These new inference schemes are allowing great progress in generative models of images and text.

Course Structure

After the first two lectures, each week a different student, or pair of students, will present on an aspect of these methods, using a couple of papers as reference. I'll provide guidance about the content of these presentations.

In-class discussion will center around:

Understanding the strengths and weaknesses of these methods.
Understanding the relationships between these methods, and with previous approaches.
Extensions or applications of these methods.
Experiments that might better illuminate their properties.

The hope is that these discussions will lead to actual research papers, or resources that will help others understand these approaches.

Grades will be based on:

Class presentations - 20%
Project proposal - 20% - Due Oct 14th
Project presentation - 20% - Nov 18th and 25th
Project report and code - 40% - Dec 10th

Project

Students can work on projects individually, in pairs, or even in triplets. The grade will depend on the ideas, how well you present them in the report, how clearly you position your work relative to existing literature, how illuminating your experiments are, and how well-supported your conclusions are.

Each group of students will write a short (around 2 pages) research project proposal, which ideally will be structured similarly to a standard paper. It should include a description of a minimum viable project, some nice-to-haves if time allows, and a short review of related work. You don't have to do what your project proposal says - the point of the proposal is mainly to have a plan and to make it easy for me to give you feedback.

Towards the end of the course everyone will present their project in a short, roughly 5 minute, presentation.

At the end of the class you'll hand in a project report (around 4 to 8 pages), ideally in the format of a machine learning conference paper such as NIPS.

Project report grading rubric

Schedule

Sept 16th: Overview Lecture notes Quiz

This lecture will outline the motivation for the course and give a rough picture of the state of the field.
- Latent variable models
- Unsupervised and semi-supervised learning
- Strengths and limitations of probabilistic graphical models
- How to specify generative models with neural networks
- History of generative models
Sept 23rd: Variational inference and recognition networks Lecture notes

This lecture will outline the main technical advance that has allowed latent-variable modeling to become practical: Variational autoencoders, in which the approximate inference procedure is specified by a neural network (or other differentiable procedure).

The difference between traditional variational methods and variational autoencoders is that in a variational autoencoder, the local approximate posterior, q(z_i|x_i), is produced by a closed-form differentiable procedure (such as a neural network), as opposed to a local optimization. This allows the model and inference strategy to be jointly optimized.
- Loss definition, KL divergence
- Stochastic Variational Inference
- Relation to standard autoencoders
- Pros and cons of different inference methods
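As a small worked example of the loss definition, the sketch below estimates the evidence lower bound, E_q[log p(x | z)] - KL(q(z | x) || p(z)), for a toy model where everything is Gaussian. The model (p(z) = N(0, 1), p(x | z) = N(z, 1)) and the function names are assumptions chosen so that the exact log marginal likelihood is available for comparison; the approximate posterior is passed in as a mean and standard deviation rather than produced by a recognition network.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * LOG_2PI - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def kl_to_standard_normal(mean, std):
    """KL( N(mean, std^2) || N(0, 1) ) in closed form."""
    return 0.5 * (std ** 2 + mean ** 2 - 1.0) - math.log(std)


def elbo_estimate(x, q_mean, q_std, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x | z)] minus the analytic KL term."""
    recon = 0.0
    for _ in range(num_samples):
        z = q_mean + q_std * rng.gauss(0.0, 1.0)  # z ~ q(z | x)
        recon += log_normal_pdf(x, z, 1.0)        # log p(x | z), toy likelihood
    return recon / num_samples - kl_to_standard_normal(q_mean, q_std)


def log_marginal_exact(x):
    """For this toy model, p(x) = N(0, 2)."""
    return log_normal_pdf(x, 0.0, math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(0)
    x = 1.0
    # The exact posterior here is N(x / 2, 1 / 2), so the bound is tight for that q.
    print(elbo_estimate(x, 0.5, math.sqrt(0.5), 20000, rng))
    print(elbo_estimate(x, -1.0, 1.0, 20000, rng))  # a poor q gives a looser bound
    print(log_marginal_exact(x))
```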
Readings:
Sept 30th: Autoregressive and invertible models Lecture notes

It's possible to directly specify fairly complex models without integrating over any latent variables, if the entire generative procedure is invertible, or if it directly specifies a normalized probability.

Recurrent network-based generative models:
Invertible generative models:
- Density estimation using Real NVP
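The idea behind invertible models can be sketched in one dimension with the change-of-variables formula, log p(x) = log p_z(f^{-1}(x)) + log |d f^{-1}(x) / dx|. The transformation below (an affine map followed by exp) is an assumption chosen for illustration and is not the Real NVP architecture; it only shows how an exact, normalized density falls out of an invertible map with a tractable derivative.

```python
import math


def standard_normal_logpdf(z):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z


def forward(z, mu, sigma):
    """Invertible map from latent z to data x > 0 (assumes sigma > 0)."""
    return math.exp(mu + sigma * z)


def inverse(x, mu, sigma):
    """Map data x > 0 back to the latent z."""
    return (math.log(x) - mu) / sigma


def log_density(x, mu, sigma):
    """Exact log p(x) via the change-of-variables formula."""
    z = inverse(x, mu, sigma)
    log_abs_det = -math.log(sigma) - math.log(x)  # derivative of inverse is 1 / (sigma * x)
    return standard_normal_logpdf(z) + log_abs_det


if __name__ == "__main__":
    mu, sigma, dx = 0.0, 0.5, 0.001
    # Midpoint Riemann sum over (0, 50): the density should integrate to about 1.
    total = sum(math.exp(log_density((i + 0.5) * dx, mu, sigma)) for i in range(50000)) * dx
    print(total)
```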
October 7th: Adversarial training Foundations slides Applications slides Frontiers slides

Adversarial training proposes a completely different training procedure for generative models, which relies on a 'discriminator' to find ways in which data generated by the model is unrealistic.

Foundations:
- Generative Adversarial Networks - The paper that started it all.
- Jensen-Shannon divergence, relation to VAEs, pros and cons - A discussion of the objective being optimized in the original GAN paper.
- f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization - Shows how to optimize many different objectives using adversarial training.
Applications:
- Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks - A way to break the problem down into smaller pieces.
- Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network - The current state of the art, shows off nice properties of GANs.
- Generative Adversarial Text to Image Synthesis
Frontiers and related methods:
- Improved Techniques for Training GANs - Useful tricks.
- Adversarial Autoencoders - An attempt to address the "hole" problem.
- Adversarial Feature Learning - A framework for learning approximately invertible generative procedures using adversarial training.
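To connect the adversarial objective to the Jensen-Shannon divergence mentioned above, the sketch below evaluates the original GAN value function, E_data[log D(x)] + E_model[log(1 - D(x))], on small discrete distributions. The distributions and function names are assumptions for illustration, and all probabilities are assumed strictly positive. With the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_model(x)), the value equals -log 4 + 2 JSD(p_data, p_model), as derived in the original GAN paper.

```python
import math


def optimal_discriminator(p_data, p_model):
    """D*(x) = p_data(x) / (p_data(x) + p_model(x)) for each outcome x."""
    return [pd / (pd + pm) for pd, pm in zip(p_data, p_model)]


def gan_value(p_data, p_model, disc):
    """E_data[log D(x)] + E_model[log(1 - D(x))] for a given discriminator."""
    real = sum(pd * math.log(d) for pd, d in zip(p_data, disc))
    fake = sum(pm * math.log(1.0 - d) for pm, d in zip(p_model, disc))
    return real + fake


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jensen_shannon(p, q):
    mix = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


if __name__ == "__main__":
    p_data = [0.5, 0.3, 0.2]   # hypothetical data distribution
    p_model = [0.2, 0.3, 0.5]  # hypothetical generator distribution
    best = optimal_discriminator(p_data, p_model)
    print(gan_value(p_data, p_model, best))
    print(-math.log(4.0) + 2.0 * jensen_shannon(p_data, p_model))
    # A discriminator that always outputs 0.5 does worse, scoring exactly -log 4.
    print(gan_value(p_data, p_model, [0.5, 0.5, 0.5]))
```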
October 14th: Structured encoder/decoders Slides

We have complete freedom in how we compute q(z | x). There is also currently a lot of exploration going on of different types of generative models, p(x, z).
- Importance-Weighted Autoencoders - The recognition network can return multiple weighted samples.
- Variational Inference with Normalizing Flows - The recognition network can include an invertible mapping to warp the initial Gaussian.
- Generating Sentences from a Continuous Space - The encoder and decoder can both be RNNs.
- DRAW: A Recurrent Neural Network For Image Generation - The encoder and decoder RNNs can read and write to an image.
- Convolution-deconvolution VAEs - The encoder and decoder can be a convnet and its pseudo-inverse.
- Auxiliary Deep Generative Models - The model can be augmented with extra random variables that are then integrated out.
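As one example from this list, the importance-weighted bound can be sketched on a toy Gaussian model. The bound averages K importance weights w_k = p(x, z_k) / q(z_k | x) inside the log, and K = 1 recovers the usual variational lower bound. The model (p(z) = N(0, 1), p(x | z) = N(z, 1)), the deliberately poor choice of q, and the function names are all assumptions for illustration.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def log_normal_pdf(x, mean, std):
    return -0.5 * LOG_2PI - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iwae_bound(x, q_mean, q_std, num_importance, rng):
    """Single estimate of log (1/K) sum_k p(x, z_k) / q(z_k | x), with z_k ~ q."""
    log_weights = []
    for _ in range(num_importance):
        z = q_mean + q_std * rng.gauss(0.0, 1.0)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        log_weights.append(log_joint - log_normal_pdf(z, q_mean, q_std))
    return log_mean_exp(log_weights)


def average_bound(x, q_mean, q_std, num_importance, repeats, rng):
    """Average many independent estimates to approximate the expected bound."""
    return sum(iwae_bound(x, q_mean, q_std, num_importance, rng) for _ in range(repeats)) / repeats


if __name__ == "__main__":
    rng = random.Random(0)
    x = 2.0
    print(average_bound(x, 0.0, 1.0, 1, 2000, rng))   # K = 1: standard lower bound
    print(average_bound(x, 0.0, 1.0, 50, 2000, rng))  # K = 50: tighter
    print(log_normal_pdf(x, 0.0, math.sqrt(2.0)))     # exact log p(x) for this model
```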
October 21st: Structured latent variables 3D Latent rep Slides AIR Slides SVAE Slides

At first, variational autoencoders had only vector-valued latent variables z, in which the different dimensions had no special meaning. People are starting to explore ways to put more meaningful structure on the latent description of data.
- Unsupervised Learning of 3D Structure from Images - The latent variables can specify a 3D shape, letting us take advantage of existing renderers.
- Attend, Infer, Repeat: Fast Scene Understanding with Generative Models - The latent variables can be a list or set of vectors.
- Gradient Estimation Using Stochastic Computation Graphs - Latent variables can be discrete, but this makes gradient estimation harder. Also see the original REINFORCE paper.
- Composing graphical models with neural networks for structured representations and fast inference - The prior on latent variables can be any tractable graphical model, and we can use this inference as part of the recognition step.
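To show why discrete latent variables make gradient estimation harder, here is a minimal sketch of the score-function (REINFORCE-style) estimator for a single Bernoulli variable. It uses the identity d/dtheta E[f(b)] = E[f(b) d/dtheta log p(b; theta)], which needs no derivative of f but typically has higher variance than gradients taken through a continuous sample. The test function f and the function name are assumptions for illustration.

```python
import random


def score_function_gradient(theta, f, num_samples, rng):
    """Estimate d/dtheta of E_{b ~ Bernoulli(theta)}[f(b)] without differentiating f."""
    total = 0.0
    for _ in range(num_samples):
        b = 1 if rng.random() < theta else 0
        # d/dtheta log p(b; theta): 1 / theta if b = 1, -1 / (1 - theta) if b = 0
        score = 1.0 / theta if b == 1 else -1.0 / (1.0 - theta)
        total += f(b) * score
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    f = lambda b: (b - 0.3) ** 2  # hypothetical objective
    # E[f(b)] = theta * f(1) + (1 - theta) * f(0), so the exact gradient is f(1) - f(0).
    print(score_function_gradient(0.6, f, 50000, rng), f(1) - f(0))
```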
October 28th: Conditional generation

Generative models can be used to produce novel content such as images and text. This ability is especially useful when we can generate data conditioned on it having certain desired properties (for instance: generate an image that would be likely to have the caption "a horse on a beach").
November 4th: Model-based reinforcement learning

The main drawback of contemporary deep reinforcement learning methods is that they require a lot of interaction with the system to become effective. Using unsupervised learning, we can break the problem into two parts: 1) Modeling the dynamics of the system, and 2) Finding a good policy or plan given those dynamics.
- Human-level control through deep reinforcement learning - First, as a control, we'll review non-model-based approaches. Slides
- Deep Visual Foresight for Planning Robot Motion
- Structured Inference Networks for Nonlinear State Space Models slides
- PILCO: A Model-Based and Data-Efficient Approach to Policy Search - A fully model-based approach using GPs slides
- Data-Efficient Reinforcement Learning in Continuous-State POMDPs - an extension with a filter in the loop.
- Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model slides
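The two-part split described above can be sketched on a toy problem: first fit a dynamics model from logged transitions, then plan with the fitted model. Everything here is an assumption for illustration (a noiseless scalar linear system, random exploration, a one-step planner, and all function names); the papers above use far richer models such as GPs and deep networks.

```python
import random


def collect_transitions(step_fn, num_steps, rng):
    """Roll out random actions and record (state, action, next_state) triples."""
    state, data = 1.0, []
    for _ in range(num_steps):
        action = rng.uniform(-1.0, 1.0)
        next_state = step_fn(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data


def fit_linear_dynamics(transitions):
    """Part 1: least-squares fit of next_state ~ a * state + b * action."""
    sss = sum(s * s for s, u, n in transitions)
    ssu = sum(s * u for s, u, n in transitions)
    suu = sum(u * u for s, u, n in transitions)
    ssn = sum(s * n for s, u, n in transitions)
    sun = sum(u * n for s, u, n in transitions)
    det = sss * suu - ssu * ssu  # determinant of the 2x2 normal equations
    a = (ssn * suu - ssu * sun) / det
    b = (sss * sun - ssu * ssn) / det
    return a, b


def plan_action(state, target, a, b):
    """Part 2: choose the action the fitted model predicts will reach the target."""
    return (target - a * state) / b


if __name__ == "__main__":
    true_step = lambda s, u: 0.9 * s + 0.5 * u  # hypothetical unknown system
    rng = random.Random(0)
    a_hat, b_hat = fit_linear_dynamics(collect_transitions(true_step, 50, rng))
    action = plan_action(2.0, 1.0, a_hat, b_hat)
    print(a_hat, b_hat, true_step(2.0, action))
```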
November 11th: Latent-variable language models

We'll discuss other ways to produce continuous representations of discrete objects, such as text.
- Word2Vec slides
- Breaking Sticks and Ambiguities with Adaptive Skip-gram - word2vec with multiple meanings for each word.
- Skip-Thought Vectors - Could be described as sentence2vec. slides
- Neural Machine Translation in Linear Time
- A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
- Program Synthesis for Character-Level Language Modeling
- LipNet: Sentence-level Lipreading slides
- Exploring the Limits of Language Modeling
November 18th: Project presentations
November 25th: Project presentations
December 2nd: Guest lecture by Roger Grosse
- Part 1: Building open-ended languages of models slides
- Part 2: Evaluating generative models slides
December 10th: Projects due