STA 4273 Winter 2021: Minimizing Expectations

Overview and Motivation

The COVID-19 pandemic has subsided, and you are going out to dinner for the first time in a year. You are trying to decide between two restaurants: your old favourite and a new popular one. Do you go with your trusted favourite or take a risk on the new restaurant? This is an example of decision-making under uncertainty. It is a problem that has been studied for a long time under various guises in many fields including statistics, economics, operations research, and computer science.

Decision-making under uncertainty is typically formalized as the problem of minimizing an expected cost (or maximizing an expected reward). The decision-maker takes an action by sampling from a distribution over actions, and it receives a cost for that action. The problem is to find the action distribution that minimizes the decision-maker's expected cost.
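As a rough illustration of this formalization, the restaurant choice can be sketched in a few lines of Python. The cost values and the names `MEAN_COST`, `expected_cost`, and `best_distribution` are hypothetical, chosen only for this example. The sketch also assumes the mean cost of each action is already known, which is rarely true in practice.

```python
# Hypothetical mean costs for each restaurant (illustrative numbers only).
MEAN_COST = {"old_favourite": 1.0, "new_restaurant": 0.7}


def expected_cost(q, mean_cost=MEAN_COST):
    """Expected cost E_{a ~ q}[c(a)] when action a is sampled from distribution q."""
    return sum(q[a] * mean_cost[a] for a in mean_cost)


def best_distribution(mean_cost=MEAN_COST):
    """Return a distribution that puts all of its mass on a cheapest action.

    The expected cost is linear in q, so a minimizer over all action
    distributions can always be found among the deterministic choices.
    """
    best = min(mean_cost, key=mean_cost.get)
    return {a: 1.0 if a == best else 0.0 for a in mean_cost}


# Compare a coin flip between the restaurants with the minimizing distribution.
uniform = {"old_favourite": 0.5, "new_restaurant": 0.5}
q_star = best_distribution()
print("uniform choice:", expected_cost(uniform))
print("minimizing choice:", expected_cost(q_star))
```

When the mean costs are known, the problem is trivial, as the sketch shows. It becomes interesting when the costs can only be observed by sampling actions.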

This problem may seem rather specific, but it appears throughout machine learning and statistics. The most prominent example is Bayesian inference, which can be cast in this paradigm as a variational optimization problem. More broadly, progress on optimizing expected values would benefit generative models of real-world data, neural network models with calibrated uncertainties, reinforcement learning algorithms, and many other applications.
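One way to see the connection to Bayesian inference is through a standard identity, given here as an illustration rather than as the course's own derivation. The symbols are assumptions introduced for this example: p(x, z) is a model's joint density, x is the observed data, z is the latent variable, and q(z) is any approximating distribution.

```latex
\[
\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
  = \mathbb{E}_{z \sim q}\big[\log q(z) - \log p(x, z)\big] + \log p(x).
\]
```

Because log p(x) does not depend on q, finding the q closest to the posterior in this sense amounts to minimizing the expectation on the right-hand side. Unlike the restaurant example, the cost inside the expectation here depends on q itself.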

This seminar course introduces students to the various methodological issues at stake in the problem of optimizing expected values and leads them in a discussion of its recent developments in machine learning. The course emphasizes the interplay between reinforcement learning and Bayesian inference. While most of the readings are applied or methodological, there are topics for more theoretically minded students. Students will be expected to present a paper, prepare code notebooks, and complete a final project on a topic of their choice.

This course's structure is heavily inspired by Learning to Search by Prof. Duvenaud.

Course Information

Teaching Staff

Instructor: Chris Maddison
Instructor Office Hours: 4:00PM–5:00PM on Thursdays via GatherTown
TAs: Cait Harrigan, Farnam Mansouri
Email (instructor and TAs): sta2473-min-expectations@cs.toronto.edu

Where and When

Thursdays 1:00PM–3:00PM via Zoom

Online Delivery

Class will be held synchronously online every week via Zoom. The lectures will be recorded for asynchronous viewing by enrolled students. All students are encouraged to attend class each week. Information on attending class, attending office hours, viewing recorded lectures, and using Piazza is available on Quercus.

Course videos and materials belong to your instructor, the University, and/or another source depending on the specific facts of each situation, and are protected by copyright. In this course you are permitted to download session videos and materials for your own academic use, but you should not copy, share, or use them for any other purpose without the explicit permission of the instructor. For questions about recording and use of videos in which you appear, please contact your instructor.

Assignments and Grading

Assignments for the course include a paper presentation and a final project. The marking scheme is as follows:

25% – Paper presentation and code notebook (handout, example notebook)
15% – Project proposal, due Feb. 22 (handout)
60% – Project report, due April 14 (handout)

Course Syllabus and Policies

The course information and policies can be found in the syllabus.

Schedule

This is a preliminary schedule, and it may change throughout the term. With the exception of the first two weeks, each week students will be presenting a recent paper from the literature. Every student will present a paper once during the course.

The weeks are organized into themes and associated with a list of recent references. No one is expected to read every paper on the list for each week, but there will be some recommended readings for the whole class.

#	Date	Topic	Notes
1	14/1	A common problem. Expected values are routinely optimized in statistics and machine learning. We review the basic terminology in this area and discuss some major applications, including generative models, (approximate) Bayesian inference, reinforcement learning, and control.	Readings: (Blei et al., 2018); Chap. 3 of (Sutton & Barto, 2018). Lecture: [slides]
2	21/1	Basic tools. Iterative methods are essential tools for optimization. We will introduce the basics of iterative methods, including stochastic gradient descent (SGD), value estimation, and policy iteration. The question that we will pose throughout the course is: what structure in the problem is being exploited by the method?	SGD readings: If you do not have time, read Chap. 2 & 3 of (Hazan, 2019) for a terse introduction (it is a working draft). If you have more time, read Chap. 3 & 4 of (Bottou et al., 2018), which is more thorough than what we will discuss in class. Dynamic programming readings: Sect. 1.1–1.4 of (Agarwal et al., 2020). Again, this is a working draft, so it is terse. Still, it's an efficient introduction. I will describe the context in class. Lecture: [slides], descent lemma
3	28/1	Gradient estimation I. Gradient information is very useful for optimization, and computing gradients is a key subroutine of many optimization methods. We will review basic gradient estimation techniques, including policy gradients and reparameterization gradients. Student presentations will focus on recent extensions of these methods.	Readings: (Mohamed et al., 2020). Student presentations on: (Schulman et al., 2016) [slides] [colab]; (Paulus et al., 2020); (Kool et al., 2020). Lecture: [slides]
4	4/2	Gradient estimation II. Gradient estimation is sometimes desirable in more exotic settings. Student presentations will focus on gradient estimation for off-policy settings, for higher-order derivatives, or for implicit distributions.	Readings: (Weber et al., 2019). Student presentations on: (Foerster et al., 2018) [slides] [colab]; (Li & Turner, 2018); (Imani et al., 2018) [slides] [colab]. Lecture: [slides]
5	11/2	Variational objectives I. Bayesian inference can be cast as a variational problem of minimizing an expectation. In recent years, this point of view has led to a variety of useful loss functions for deep generative models and principled information-theoretic regularization. Student presentations will focus on recent developments in this subfield.	Readings: (Kingma & Welling, 2019). Student presentations on: (Nielsen et al., 2020) [slides] [colab]; (Poole et al., 2019) [slides] [colab]. Lecture: [slides]
6	25/2	Variational objectives II. Student presentations will focus on applications to Bayesian neural networks, extensions to functional settings, and other developments. We will see how some of our efforts on gradient estimation can pay off.	Readings: (Blundell et al., 2015). Student presentations on: (Blundell et al., 2015) [slides]; (Sun et al., 2019) [slides] [colab]; (Naesseth et al., 2020) [slides] [colab]. Lecture: [slides]
7	4/3	Policy optimization I. For the next four classes, we will shift focus to reinforcement learning, but connections to Bayesian inference will be omnipresent. The standard setting for policy optimization assumes that an agent can collect data by interacting with an environment. Student presentations will focus on recent developments in the problem of (mostly) online policy optimization.	Readings: (Ghosh et al., 2020); (Neu et al., 2017). Student presentations on: (Abdolmaleki et al., 2018) [slides] [colab]; (Dadashi et al., 2019) [slides] [colab]; (Geist et al., 2019) [slides]. Lecture: [slides]
8	11/3	Offline policy evaluation. In modern applications of reinforcement learning, it is important to be able to evaluate a policy without interacting with the environment. Student presentations will focus on recent developments in the problem of offline policy evaluation.	Readings: (Levine et al., 2020). Student presentations on: (Liu et al., 2020) [slides] [colab]; (Le Paine et al., 2020) [slides] [colab]. Guest lecture: George Tucker
9	18/3	Policy optimization II. Learning optimal behaviour without interacting with an environment is very challenging. Student presentations will focus on recent developments in the problem of (mostly) offline policy optimization.	Readings: (Levine et al., 2020). Student presentations on: (Yu et al., 2020) [slides] [colab]; (Czarnecki et al., 2019) [slides] [colab]; (Nachum & Dai, 2020) [slides] [colab]. Lecture: [slides]
10	25/3	Search and policy optimization. Monte Carlo Tree Search revolutionized game-playing AIs. Student presentations will focus on connections between search and policy optimization.	Readings: (Browne et al., 2012). Student presentations on: (Xiao et al., 2019) [slides] [colab]; (Grill et al., 2020); (Silver et al., 2018) [slides] [colab]. Lecture: [slides]
11	1/4	Inference and control I. In the final two classes we will return to one of the central themes: the connections between control and inference.	Readings: (Weber et al., 2015); (O'Donoghue et al., 2020). Student presentations on: (Teh et al., 2017) [slides] [colab]; (Fellows et al., 2019) [slides] [colab]. Lecture: [slides]
12	8/4	Inference and control II. In the final two classes we will return to one of the central themes: the connections between control and inference.	Readings: (Weber et al., 2015); (O'Donoghue et al., 2020). Student presentations on: (Kim et al., 2020) [slides]. Guest lecture: Brendan O'Donoghue
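To give a flavour of the two basic techniques named in Week 3, the following is a minimal sketch of a score-function (policy-gradient style) estimator and a reparameterization estimator for the gradient of E[f(x)] with respect to the mean of a unit-variance Gaussian. The test function f(x) = x^2, the sample sizes, and all function names are assumptions made for illustration and are not taken from the course materials.

```python
import random


def f(x):
    """Illustrative cost function; the true gradient of E[f] w.r.t. mu is 2 * mu."""
    return x ** 2


def f_prime(x):
    """Derivative of f, needed only by the reparameterization estimator."""
    return 2.0 * x


def score_function_grad(mu, n, rng):
    """Average of f(x) * d/dmu log N(x; mu, 1), where the score is (x - mu)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, 1.0)
        total += f(x) * (x - mu)
    return total / n


def reparam_grad(mu, n, rng):
    """Average of f'(mu + eps) with eps ~ N(0, 1), using x = mu + eps."""
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        total += f_prime(mu + eps)
    return total / n


rng = random.Random(0)
mu = 1.5
sf_estimate = score_function_grad(mu, 100_000, rng)
rp_estimate = reparam_grad(mu, 100_000, rng)
print("score-function estimate:", sf_estimate)
print("reparameterization estimate:", rp_estimate)
print("exact gradient:", 2.0 * mu)
```

Both estimators are unbiased in this example. The score-function estimator needs only evaluations of f, while the reparameterization estimator needs its derivative and typically has lower variance here.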

Recent references

This is a selected list of recent references relevant to this course, organized by the topic for each week. Broken or incorrect links are likely; please let me know if you find one. This is not an exhaustive list of references. If you're intrigued by a subtopic, you should start here and follow the citation graph to find more related literature.

A common problem and basic tools

General reference:

Stochastic Simulation: Algorithms and Analysis, Søren Asmussen and Peter W. Glynn.

Variational inference:

Graphical Models, Exponential Families, and Variational Inference, Martin J. Wainwright and Michael I. Jordan.
Variational Inference: A Review for Statisticians, David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe.

Reinforcement learning:

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: Theory and Algorithms, Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun.
Dynamic Programming and Optimal Control, Dimitri P. Bertsekas.
Bayesian Reinforcement Learning: A Survey, Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar.
A Tutorial on Thompson Sampling, Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen.
Spinning Up in Deep RL, OpenAI.

Optimization:

Convex Optimization, Stephen Boyd and Lieven Vandenberghe.
Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, and Jorge Nocedal.
Monte Carlo Gradient Estimation in Machine Learning, Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih.

Gradient estimation I

Foundational work:

Likelihood ratio estimator (Glynn, 1990)
IPA (Glasserman & Cho, 1991)
REINFORCE (Williams, 1992)
Actor-critic with func. approx. (Sutton et al., 2000)
Variance Reduction (Greensmith et al., 2004)
Weak derivatives (Pflug, 2006)

Recent work on relaxed estimators:

Gumbel-Softmax (Jang et al., 2017)
Concrete Distribution (Maddison et al., 2017)
Invertible Gaussian (Potapczynski et al., 2020)
Stochastic Softmax Tricks (Paulus et al., 2020)

Some recent work on reparameterization gradients:

Generalized reparam. gradient (Ruiz et al., 2016)
Reparam. rejection sampling (Naesseth et al., 2017)
Non-diff. models (Lee et al., 2018)
Beyond reparam. (Jankowiak & Obermeyer, 2018)
Implicit reparam. gradients (Figurnov et al., 2018)
Stein and reparam. (Lin et al., 2019)
Correctness for non-diff. models (Lee et al., 2020)

Some recent work on online policy gradients:

NVIL (Mnih & Gregor, 2014)
DPG (Silver et al., 2014)
Local Expectations (Titsias & Lázaro-Gredilla, 2015)
MuProp (Gu et al., 2016)
GAE (Schulman et al., 2016)
A3C (Mnih et al., 2016)
REBAR (Tucker et al., 2017)
Q-Prop (Gu et al., 2017)
RELAX (Grathwohl et al., 2018)
Mirage of Action-Dep. Baselines (Tucker et al., 2018)
Unordered set estimator (Kool et al., 2020)

Gradient estimation II

Higher-order derivatives:

DiCE (Foerster et al., 2018)
Loaded DiCE (Foerster et al., 2019)

Implicit distributions:

Estimators for Implicit Models (Li & Turner, 2018)
A Spectral Approach (Shi et al., 2018)

Some recent work on offline policy gradients:

Off-policy Policy Gradient (Imani et al., 2018)
State Distribution Correction (Liu et al., 2019)
Gen. Off-Policy Actor-Critic (Zhang et al., 2019)

Generalized perspectives:

Stochastic Computation Graphs (Schulman et al., 2016)
Credit Assignment Techniques (Weber et al., 2019)

Variational objectives I

Some deep variational objectives:

VAE (Kingma & Welling, 2014)
DLGM (Rezende et al., 2014)
VampPrior (Tomczak & Welling, 2016)
Normalizing flows (Rezende & Mohamed, 2016)
SurVAE (Nielsen et al., 2020)
NVAE (Vahdat & Kautz, 2020)
Diffusion Models (Ho et al., 2020)
Provable smoothness guarantee (Domke, 2020)

Extended state space objectives:

Bridging the gap (Salimans et al., 2015)
IWAE (Burda et al., 2016)
FIVO (Maddison et al., 2017)
VSMC (Naesseth et al., 2017)
AESMC (Le et al., 2018)
HVAE (Caterini et al., 2018)
Divide and Couple (Domke & Sheldon, 2019)
IWHVI (Sobolev & Vetrov, 2019)
Revisiting (Lawson et al., 2019)

Gradient estimators for variational objectives:

VIMCO (Mnih & Rezende, 2016)
STL (Roeder et al., 2017)
Ensembling CVs (Geffner & Domke, 2017)
Tighter var. bounds (Rainforth et al., 2018)
DReG (Tucker et al., 2018)

Related variational objectives:

VIB (Alemi et al., 2017)
Variational bounds of MI (Poole et al., 2019)
DIB (Dubois et al., 2020)

Variational objectives II

Variational Bayesian neural networks:

Bayes with Backprop (Blundell et al., 2015)
Variational dropout (Kingma et al., 2015)
Dropout as Bayes (Gal & Ghahramani, 2016)
Risk vs. uncertainty (Osband, 2016)
Noisy Natural Gradient as VI (Zhang et al., 2018)

Exotic state spaces:

KLs & stochastic proc. (Matthews et al., 2016)
VGP (Tran et al., 2016)
SB–VAE (Nalisnick & Smyth, 2017)
Functional BNNs (Sun et al., 2019)
Functional NPs (Louizos et al., 2019)

Advanced training methods, other objectives:

Rényi Divergence VI (Li & Turner, 2016)
SVGD (Liu & Wang, 2016)
OPVI (Ranganath et al., 2016)
BVI (Guo et al., 2017)
SA–VAE (Kim et al., 2018)
REM (Dieng & Paisley, 2019)
TVO (Masrani et al., 2019)
MSC (Naesseth et al., 2020)

Offline policy evaluation

Importance Sampling:

Off-policy Policy Evaluation (Precup et al., 2000)
Doubly Robust OPE (Jiang & Li, 2016)
MAGIC (Thomas & Brunskill, 2016)
SWITCH (Wang et al., 2017)
Curse of Horizon (Liu et al., 2020)

Distribution Correction Estimation:

DualDICE (Nachum et al., 2019)
GenDICE (Zhang et al., 2020)
GradientDICE (Zhang et al., 2020)
CoinDICE (Dai et al., 2020)

Fitted Q-evaluation:

FQE (Le et al., 2019)
Hyperparameter Selection (Le Paine et al., 2020)

Policy optimization I

KL-regularized descent:

NPG (Kakade, 2001)
REPS (Peters et al., 2010)
TRPO (Schulman et al., 2015)
PPO (Schulman et al., 2017)
ACKTR (Wu et al., 2017)
MPO (Abdolmaleki et al., 2018)
V-MPO (Song et al., 2020)

KL-regularized RL:

A3C (Mnih et al., 2016)
PG & Q-learning (O'Donoghue et al., 2017)
RL with energy-based policies (Haarnoja et al., 2017)
SAC (Haarnoja et al., 2018)
PG & Soft Q-learning (Schulman et al., 2018)
DIAYN (Eysenbach et al., 2019)

Other views:

Value function polytope (Dadashi et al., 2019)
An operator view (Ghosh et al., 2020)
MDPO (Tomar et al., 2020)

Policy optimization II

Imitation learning:

DAGGER (Ross et al., 2011)
Actor-mimic (Parisotto et al., 2016)
Policy distillation (Rusu et al., 2016)
Mix&Match (Czarnecki et al., 2018)
Distilling PD (Czarnecki et al., 2019)

Batch RL:

BCQ (Fujimoto et al., 2018)
BEAR (Kumar et al., 2019)
BRAC (Wu et al., 2019)
MOPO (Yu et al., 2020)

Other views:

RL via Fenchel Duality (Nachum & Dai, 2020)

Search and policy optimization

Monte Carlo Tree Search:

MC Planning (Kocsis & Szepesvári, 2006)
A survey (Browne et al., 2012)
Optimism and MCTS (Munos, 2014)
AlphaGo (Silver et al., 2016)
AlphaZero (Silver et al., 2018)
MuZero (Schrittwieser et al., 2020)

Search and policy optimization:

Expert Iteration (Anthony et al., 2017)
PGS (Anthony et al., 2019)
MCTSPO (Ma et al., 2019)
MCTS as policy optimization (Grill et al., 2020)
DirPG (Lorberbom et al., 2020)
The role of planning (Hamrick et al., 2020)

Regularized MCTS:

MENTS (Xiao et al., 2019)
Convex reg. and MCTS

Inference and control

Bayesian RL:

Bayes-Adaptive RL with search (Guez et al., 2013)
Bayes RL Survey (Ghavamzadeh et al., 2015)
Posterior vs. optimism (Osband & Van Roy, 2017)
Tutorial on Thompson Sampling (Russo et al., 2018)
Bayesian RL with regret bounds (O'Donoghue, 2018)
Randomized value functions (Osband et al., 2019)
Making Sense (O'Donoghue et al., 2020)

RL as inference:

Control under uncertainty (Runggaldier, 1998)
Inference for MDPs (Toussaint & Storkey, 2006)
Computation of optimal actions (Todorov, 2009)
Control as inference (Kappen et al., 2012)
RL by approximate inference (Rawlik et al., 2013)
Adaptive IS for control (Kappen & Ruiz, 2015)
Distral (Teh et al., 2017)
A Survey (Levine, 2018)
VIREL (Fellows et al., 2019)
Analysis of KL Reg. (Vieillard et al., 2020)
SLAC (Lee et al., 2020)

Inference as control:

Systematic Stochastic Search (Mansinghka et al., 2009)
A* Sampling (Maddison et al., 2014)
Reinforced VI (Weber et al., 2015)
Softstar (Monfort et al., 2015)
BBVI with trust-region optimization (Regier et al., 2017)
Gradient-Free VI using Policy Search (Arenz et al., 2018)
Controlled SMC (Heng et al., 2019)
Trust Region Sequential VI (Kim et al., 2019)
Approximate Inference with MCTS (Buesing et al., 2020)
VI with Future Likelihood Estimates (Kim et al., 2020)