David Kristjanson Duvenaud

I'm an assistant professor at the University of Toronto, in both Computer Science and Statistics.

Previously, I was a postdoc in the Harvard Intelligent Probabilistic Systems group, working with Ryan Adams on optimization, synthetic chemistry, Bayesian inference, and neural networks. I did my Ph.D. at the University of Cambridge, where my advisors were Carl Rasmussen and Zoubin Ghahramani. My M.Sc. advisor was Kevin Murphy at the University of British Columbia, where I worked mostly on machine vision. I spent a summer at the Max Planck Institute for Intelligent Systems, and the two summers before that at Google Research, doing machine vision. I co-founded Invenia, an energy forecasting and trading firm where I still consult. I'm also a founding member of the Vector Institute.

Curriculum Vitae

E-mail: duvenaud@cs.toronto.edu



Link to hyperOpt2017 paper Stochastic Hyperparameter Optimization Through Hypernetworks

Models are usually tuned by nesting optimization of model weights inside the optimization of hyperparameters. We collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our method trains a neural net to output approximately optimal weights as a function of hyperparameters. This method converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.

Jonathan Lorraine, David Duvenaud
arxiv | bibtex | slides | code
Isolating Souces of Disentanglement in Variational Autoencoders

We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate our β-TCVAE (Total Correlation Variational Autoencoder), a refinement of the state-of-the-art β-VAE objective for learning disentangled representations, requiring no additional hyperparameters during training. We further propose a principled classifier-free measure of disentanglement called the mutual information gap (MIG). We perform extensive quantitative and qualitative experiments, in both restricted and non-restricted settings, and show a strong relation between total correlation and disentanglement, when the latent variables model is trained using our framework.

Tian Qi Chen, Xuechen Li, Roger Grosse, David Duvenaud
Noisy Natural Gradient as Variational Inference

Bayesian neural nets combine the flexibility of deep learning with uncertainty estimation, but are usually approximated using a fully-factorized Guassian. We show that natural gradient ascent with adaptive weight noise implicitly fits a variational Gassuain posterior. This insight allows us to train full-covariance, fully factorized, or matrix-variate Gaussian variational posteriors using noisy versions of natural gradient, Adam, and K-FAC, respectively, allowing us to scale to modern-size convnets. Our noisy K-FAC algorithm makes better predictions and has better-calibrated uncertainty than existing methods. This leads to more efficient exploration in active learning and reinforcement learning.

Guodong Zhang, Shengyang Sun, David Duvenaud, Roger Grosse

Selected papers

link to pdf Backpropagation through the Void: Optimizing control variates for black-box gradient estimation

We learn low-variance, unbiased gradient estimators for any function of random variables. We backprop through a neural net surrogate of the original function, which is optimized to minimize gradient variance during the optimization of the original objective. We train discrete latent-variable models, and do continuous and discrete reinforcement learning with an adaptive, action-conditional baseline.

Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud
International Conference on Learning Representations, 2018
arxiv | code | slides | bibtex
link to pdf Automatic chemical design using a data-driven continuous representation of molecules

We develop a molecular autoencoder, which converts discrete representations of molecules to and from a continuous representation. This allows gradient-based optimization through the space of chemical compounds. Continuous representations also let us generate novel chemicals by interpolating between molecules.

Rafa Gómez-Bombarelli, Jennifer Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Sheberla, Dennis, Jorge Aguilera-Iparraguirre, Timothy Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
American Chemical Society Central Science, 2018
arxiv | bibtex | slides | code
link to pdf Sticking the landing: Simple, lower-variance gradient estimators for variational inference

We give a simple recipe for reducing the variance of the gradient of the variational evidence lower bound. The entire trick is just removing one term from the gradient. Removing this term leaves an unbiased gradient estimator whose variance approaches zero as the approximate posterior approaches the exact posterior. We also generalize this trick to mixtures and importance-weighted posteriors.

Geoff Roeder, Yuhuai Wu, David Duvenaud
Neural Information Processing Systems, 2017
arxiv | bibtex | code
link to pdf Reinterpreting importance-weighted autoencoders

The standard interpretation of importance-weighted autoencoders is that they maximize a tighter, multi-sample lower bound than the standard evidence lower bound. We give an alternate interpretation: it optimizes the standard lower bound, but using a more complex distribution, which we show how to visualize.

Chris Cremer, Quaid Morris, David Duvenaud
ICLR Workshop track, 2017
arxiv | bibtex
link to pdf Composing graphical models with neural networks for structured representations and fast inference

We propose a general modeling and inference framework that combines the complementary strengths of probabilistic graphical models and deep learning methods. Our model family composes latent graphical models with neural network observation likelihoods. All components are trained simultaneously. We use this framework to automatically segment and categorize mouse behavior from raw depth video.

Matthew Johnson, David Duvenaud, Alex Wiltschko, Bob Datta, Ryan P. Adams
Neural Information Processing Systems, 2016
preprint | video | code | slides | bibtex | animation
link to pdf Probing the Compositionality of Intuitive Functions

How do people learn about complex functional structure? We propose that humans use compositionality: complex structure is decomposed into simpler building blocks. We formalize this idea using a grammar over Gaussian process kernels. We show that people prefer compositional extrapolations, and argue that this is consistent with broad principles of human cognition.

Eric Shulz, Joshua B. Tenenbaum, David Duvenaud, Maarten Speekenbrink, Samuel J. Gershman
Neural Information Processing Systems, 2016
preprint | video | bibtex
link to pdf Black-box stochastic variational inference in five lines of Python

We emphasize how easy it is to construct scalable inference methods using only automatic differentiation. We present code that computes stochastic gradients of the evidence lower bound for any differentiable posterior. For example, we do stochastic variational inference in a deep Bayesian neural network.

David Duvenaud, Ryan P. Adams
NIPS Workshop on Black-box Learning and Inference, 2015
preprint | code | bibtex | video
link to pdf Autograd: Reverse-mode differentiation of native Python

Autograd automatically differentiates native Python and Numpy code. It can handle loops, ifs, recursion and closures, and it can even take derivatives of its own derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it's efficient for gradient-based optimization. Check out the tutorial and the examples directory.

Dougal Maclaurin, David Duvenaud, Matthew Johnson
code | bibtex | slides
link to pdf Early Stopping as Nonparametric Variational Inference

Stochastic gradient descent samples from a nonparametric distribution, implicitly defined by the transformation of the initial distribution by an optimizer. We track the loss of entropy during optimization to get a scalable estimate of the marginal likelihood. This Bayesian interpretation of SGD gives a theoretical foundation for popular tricks such as early stopping and ensembling. We evaluate our marginal likelihood estimator on neural network models.

David Duvenaud, Dougal Maclaurin, Ryan P. Adams
Artificial Intelligence and Statistics, 2016
preprint | slides | code | bibtex
link to pdf Convolutional Networks on Graphs for Learning Molecular Fingerprints

We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the entire feature pipeline. This architecture generalizes standard molecular fingerprints. These data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Related work led to our Nature Materials paper.

David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafa Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
Neural Information Processing Systems, 2015
pdf | slides | code | bibtex
link to pdf Gradient-based Hyperparameter Optimization through Reversible Learning

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of the validation loss with respect to all hyperparameters by differentiating through the entire training procedure. This lets us optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural net architectures.

Dougal Maclaurin, David Duvenaud, Ryan P. Adams
International Conference on Machine Learning, 2015
pdf | slides | code | bibtex
link to pdf Probabilistic ODE Solvers with Runge-Kutta Means

We show that some standard differential equation solvers are equivalent to Gaussian process predictive means, giving them a natural way to handle uncertainty. This work is part of the larger probabilistic numerics research agenda, which interprets numerical algorithms as inference procedures so they can be better understood and extended.

Michael Schober, David Duvenaud, Philipp Hennig
Neural Information Processing Systems, 2014. Oral presentation.
pdf | slides | bibtex
Testing MCMC Code

When can we trust our experiments? We've collected some simple sanity checks that catch a wide class of bugs. Related: Richard Mann wrote a gripping blog post about the aftermath of finding a subtle bug in one of his landmark papers.

Roger Grosse, David Duvenaud
NIPS Workshop on Software Engineering for Machine Learning, 2014
link to pdf PhD Thesis: Automatic Model Construction with Gaussian Processes pdf | code | bibtex | kernel cookbook
link to pdf Automatic Construction and Natural-Language Description of Nonparametric Regression Models

We wrote a program which automatically writes reports summarizing automatically constructed models. A prototype for the automatic statistician project.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
Association for the Advancement of Artificial Intelligence (AAAI), 2014
pdf | code | slides | example report - airline | example report - solar | more examples | bibtex
link to pdf Avoiding Pathologies in Very Deep Networks

To suggest better neural network architectures, we analyze the properties different priors on compositions of functions. We study deep Gaussian processes, a type of infinitely-wide, deep neural net. We also examine infinitely deep covariance functions. Finally, we show that you get additive covariance if you do dropout on Gaussian processes.

David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani
Artificial Intelligence and Statistics, 2014
pdf | code | slides | video of 50-layer warping | video of 50-layer density | bibtex
link to pdf Raiders of the Lost Architecture: Kernels for Bayesian Optimization in Conditional Parameter Spaces

To optimize the overall architecture of a neural network along with its hyperparameters, we must be able to relate the performance of nets having differing numbers of hyperparameters. To address this problem, we define a new kernel for conditional parameter spaces that explicitly includes information about which parameters are relevant in a given structure.

Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, Michael Osborne
NIPS workshop on Bayesian optimization, 2013
pdf | code | bibtex
link to pdf Warped Mixtures for Nonparametric Cluster Shapes

If you fit a mixture of Gaussians to a single cluster that is curved or heavy-tailed, your model will report that the data contains many clusters! To fix this problem, we warp a latent mixture of Gaussians into nonparametric cluster shapes. The low-dimensional latent mixture model summarizes the properties of the high-dimensional density manifolds describing the data.

Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani
Uncertainty in Artificial Intelligence, 2013
pdf | code | slides | talk | bibtex
link to arxiv Structure Discovery in Nonparametric Regression through Compositional Kernel Search

How could an AI do statistics? To search through an open-ended class of structured, nonparametric regression models, we introduce a simple grammar which specifies composite kernels. These structured models often allow an interpretable decomposition of the function being modeled, as well as long-range extrapolation. Many common regression methods are special cases of this large family of models.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
International Conference on Machine Learning, 2013
pdf | code | short slides | long slides | bibtex
link to video HarlMCMC Shake

Two short animations illustrate the differences between a Metropolis-Hastings (MH) sampler and a Hamiltonian Monte Carlo (HMC) sampler, to the tune of the Harlem shake. This inspired several followup videos - benchmark your MCMC algorithm on these distributions!

by Tamara Broderick and David Duvenaud
youtube | code
link to pdf Active Learning of Model Evidence using Bayesian Quadrature

Instead of the usual Monte-Carlo based methods for computing integrals of likelihood functions, we instead construct a surrogate model of the likelihood function, and infer its integral conditioned on a set of evaluations. This allows us to evaluate the likelihood wherever is most informative, instead of running a Markov chain. This means fewer evaluations to estimate integrals.

Michael Osborne, David Duvenaud, Roman Garnett, Carl Rasmussen, Stephen Roberts, Zoubin Ghahramani
Neural Information Processing Systems, 2012
pdf | code | slides | related talk | bibtex
link to pdf Optimally-Weighted Herding is Bayesian Quadrature

We prove several connections between a numerical integration method that minimizes a worst-case bound (herding), and a model-based way of estimating integrals (Bayesian quadrature). It turns out that both optimize the same criterion, and that Bayesian Quadrature does this optimally.

Ferenc Huszár and David Duvenaud
Uncertainty in Artificial Intelligence, 2012. Oral presentation.
pdf | code | slides | talk | bibtex
link to pdf Additive Gaussian Processes

When functions have additive structure, we can extrapolate further than with standard Gaussian process models. We show how to efficiently integrate over exponentially-many ways of modeling a function as a sum of low-dimensional functions.

David Duvenaud, Hannes Nickisch, Carl Rasmussen
Neural Information Processing Systems, 2011
pdf | code | slides | bibtex
link to pdf Multiscale Conditional Random Fields for Semi-supervised Labeling and Classification

How can we take advantage of images labeled only by what objects they contain? By combining information across different scales, we use image-level labels (such as this image contains a cat) to infer what different classes of objects look like at the pixel-level, and where they occur in images. This work formed my M.Sc. thesis at UBC.

David Duvenaud, Benjamin Marlin, Kevin Murphy
Canadian Conference on Computer and Robot Vision, 2011
pdf | code | slides | bibtex