David Kristjanson Duvenaud

Publications | Videos | Misc | Talks

I'm an assistant professor at the University of Toronto, in both Computer Science and Statistics.

Previously, I was a postdoc in the Harvard Intelligent Probabilistic Systems group, working with Ryan Adams on model-based optimization, synthetic chemistry, Bayesian numerics, and neural networks. I did my Ph.D. at the University of Cambridge, where my advisors were Carl Rasmussen and Zoubin Ghahramani. My M.Sc. advisor was Kevin Murphy at the University of British Columbia, where I worked mostly on machine vision. I spent a summer at the Max Planck Institute for Intelligent Systems, and the two summers before that at Google Research, doing machine vision. I co-founded Invenia, an energy forecasting and trading firm where I still consult.

Curriculum Vitae

E-mail: duvenaud@cs.toronto.edu



Automatic chemical design using a data-driven continuous representation of molecules

We develop a molecular autoencoder, which converts discrete representations of molecules to and from a vector representation. This allows efficient gradient-based optimization through open-ended spaces of chemical compounds. Continuous representations also allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as interpolating between molecules.

Rafa Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
arXiv, 2016
preprint | bibtex | slides | code
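The latent-space operations mentioned above are just vector arithmetic. A minimal sketch, where `z_a` and `z_b` stand in for latent codes from a trained encoder (hypothetical here; the real model maps SMILES strings to and from this space):

```python
import numpy as np

# Hypothetical sketch: z_a and z_b stand in for latent codes produced by a
# trained molecular encoder. Decoding each point along the path would yield
# candidate molecules "between" the two endpoints.
def interpolate(z_start, z_end, num_steps=5):
    ts = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - t) * z_start + t * z_end for t in ts]

z_a = np.array([0.0, 1.0])
z_b = np.array([2.0, -1.0])
path = interpolate(z_a, z_b)
```

Once molecules live in a continuous space, gradient-based optimization and interpolation reduce to ordinary operations on vectors.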

Selected papers

Composing graphical models with neural networks for structured representations and fast inference

How can we combine the complementary strengths of probabilistic graphical models and neural networks? We compose latent graphical models with neural network observation likelihoods. For inference, we use recognition networks to produce local evidence potentials, then combine them using efficient message-passing algorithms. All components are trained simultaneously with a single stochastic variational inference objective. We use this framework to automatically segment and categorize mouse behavior from raw depth video.

Matthew Johnson, David Duvenaud, Alex Wiltschko, Bob Datta, Ryan P. Adams
Neural Information Processing Systems, 2016
preprint | video | code | slides | bibtex | animation
Probing the Compositionality of Intuitive Functions

How do people learn about complex functional structure? We propose that this is accomplished by harnessing compositionality: complex structure is decomposed into simpler building blocks. We formalize this idea within the framework of Bayesian regression using a grammar over Gaussian process kernels. We show that participants prefer compositional over non-compositional extrapolations and samples. We argue that the compositional nature of intuitive functions is consistent with broad principles of human cognition.

Eric Schulz, Joshua B. Tenenbaum, David Duvenaud, Maarten Speekenbrink, Samuel J. Gershman
Neural Information Processing Systems, 2016
preprint | video | bibtex
Black-box stochastic variational inference in five lines of Python

We emphasize how easy it is to construct scalable inference methods using only automatic differentiation. We present code that computes stochastic gradients of the evidence lower bound for any differentiable posterior. For example, we do stochastic variational inference in a deep Bayesian neural network.

David Duvenaud, Ryan P. Adams
NIPS Workshop on Black-box Learning and Inference, 2015
preprint | code | bibtex | video
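The idea can be sketched even without automatic differentiation by hand-deriving the reparameterization gradients for a Gaussian approximate posterior. A minimal sketch, where the target N(3, 1), step sizes, and batch size are all illustrative stand-ins rather than anything from the paper:

```python
import numpy as np

# Stochastic variational inference with the reparameterization trick, using a
# hand-derived gradient in place of autodiff. Target posterior is N(3, 1), so
# the gradient of its log-density is (3 - x).
def grad_log_target(x):
    return 3.0 - x

rng = np.random.default_rng(0)
mu, log_sigma = 0.0, 0.0
for step in range(2000):
    eps = rng.standard_normal(32)
    x = mu + np.exp(log_sigma) * eps                 # reparameterized samples
    g = grad_log_target(x)
    grad_mu = g.mean()                               # d ELBO / d mu
    grad_log_sigma = (g * eps).mean() * np.exp(log_sigma) + 1.0  # + entropy term
    mu += 0.05 * grad_mu
    log_sigma += 0.05 * grad_log_sigma
# (mu, exp(log_sigma)) should approach the true posterior parameters (3, 1)
```

With autograd, the two gradient lines collapse into a single call to `grad` on the Monte Carlo ELBO, which is the point of the paper's five-line version.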
Autograd: Reverse-mode differentiation of native Python

Autograd automatically differentiates native Python and Numpy code. It can handle loops, ifs, recursion and closures, and it can even take derivatives of its own derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it's efficient for gradient-based optimization. Check out the tutorial and the examples directory.

Dougal Maclaurin, David Duvenaud, Matthew Johnson
code | bibtex
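The core mechanism — recording local derivatives on a graph of operations, then sweeping backwards in topological order — fits in a few dozen lines. A toy sketch of that mechanism (not Autograd's actual implementation):

```python
import math

# Toy reverse-mode autodiff: each Var records its parents along with the
# local derivative of itself with respect to each parent.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    __radd__, __rmul__ = __add__, __mul__

def sin(x):
    return Var(math.sin(x.value), [(x, math.cos(x.value))])

def backward(output):
    # Build a topological order, then do one reverse sweep (backpropagation).
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(output)
    output.grad = 1.0
    for v in reversed(order):
        for parent, local_grad in v.parents:
            parent.grad += local_grad * v.grad

x = Var(1.3)
y = x * x + sin(x)     # d/dx = 2x + cos(x)
backward(y)
```

Reverse mode computes the gradient of a scalar output with respect to every input in a single backward sweep, which is why it is the right tool for optimization.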
Early Stopping as Nonparametric Variational Inference

Stochastic gradient descent samples from a nonparametric distribution, implicitly defined by the transformation of an initial distribution by a sequence of optimization updates. We track the change in entropy during optimization to get a scalable approximate lower bound on the marginal likelihood. This Bayesian interpretation of SGD gives a theoretical foundation for popular tricks such as early stopping and ensembling. We evaluate our marginal likelihood estimator on neural network models.

David Duvenaud, Dougal Maclaurin, Ryan P. Adams
Artificial Intelligence and Statistics, 2016
preprint | slides | code | bibtex
Convolutional Networks on Graphs for Learning Molecular Fingerprints

We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the entire feature pipeline. This architecture generalizes standard molecular fingerprints. These data-driven features are more interpretable, and have better predictive performance on a variety of tasks.

David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafa Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
Neural Information Processing Systems, 2015
pdf | slides | code | bibtex
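A minimal numpy sketch of the idea: each layer mixes every atom's features with its neighbors', then each atom "votes" softly (via a softmax) for fingerprint indices, replacing the hard hash of a circular fingerprint with a differentiable one. The shapes and random weights here are illustrative stand-ins, not the paper's exact architecture:

```python
import numpy as np

# Differentiable fingerprint sketch: neighbor aggregation followed by a soft
# "hash" into fingerprint indices, summed over atoms and layers. All weights
# are random stand-ins for learned parameters.
def neural_fingerprint(node_feats, adj, layer_weights, W_out):
    fp = np.zeros(W_out.shape[1])
    h = node_feats
    for W in layer_weights:
        h = np.tanh((h + adj @ h) @ W)             # aggregate neighbors, transform
        votes = np.exp(h @ W_out)
        votes /= votes.sum(axis=1, keepdims=True)  # softmax per atom
        fp += votes.sum(axis=0)
    return fp

rng = np.random.default_rng(0)
num_atoms, feat_dim, fp_len, depth = 6, 4, 16, 3
adj = np.zeros((num_atoms, num_atoms))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]:  # a 6-ring
    adj[i, j] = adj[j, i] = 1.0
feats = rng.standard_normal((num_atoms, feat_dim))
Ws = [rng.standard_normal((feat_dim, feat_dim)) for _ in range(depth)]
W_out = rng.standard_normal((feat_dim, fp_len))
fp = neural_fingerprint(feats, adj, Ws, W_out)
```

Because every operation is differentiable, the fingerprint can be trained end-to-end against a downstream prediction loss.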
Gradient-based Hyperparameter Optimization through Reversible Learning

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of the validation loss with respect to all hyperparameters by differentiating through the entire training procedure. This lets us optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural net architectures.

Dougal Maclaurin, David Duvenaud, Ryan P. Adams
International Conference on Machine Learning, 2015
pdf | slides | code | bibtex
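The paper computes these hypergradients exactly by reversing stochastic gradient descent; as a rough illustration of what a hypergradient is, here is a finite-difference stand-in on a tiny least-squares problem (all sizes and constants are illustrative):

```python
import numpy as np

# Validation loss as a function of a hyperparameter (the log learning rate).
# The paper differentiates through training exactly and reversibly; central
# finite differences stand in for that machinery here.
def validation_loss(log_lr, steps=20):
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -2.0, 0.5])
    X = rng.standard_normal((50, 3)); y = X @ w_true
    Xv = rng.standard_normal((20, 3)); yv = Xv @ w_true
    w, lr = np.zeros(3), np.exp(log_lr)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)   # gradient descent on train loss
    return 0.5 * np.mean((Xv @ w - yv) ** 2)

eps = 1e-4
log_lr = np.log(0.05)
hypergrad = (validation_loss(log_lr + eps) - validation_loss(log_lr - eps)) / (2 * eps)
# training is far from converged here, so a larger step size lowers val loss
```

Finite differences only scale to a handful of hyperparameters; differentiating through the training procedure itself is what makes thousands feasible.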
Probabilistic ODE Solvers with Runge-Kutta Means

We show that some standard differential equation solvers are equivalent to Gaussian process predictive means, giving them a natural way to handle uncertainty. This work is part of the larger probabilistic numerics research agenda, which interprets numerical algorithms as inference procedures so they can be better understood and extended.

Michael Schober, David Duvenaud, Philipp Hennig
Neural Information Processing Systems, 2014. Oral presentation.
pdf | slides | bibtex
PhD Thesis: Automatic Model Construction with Gaussian Processes
pdf | code | bibtex
Automatic Construction and Natural-Language Description of Nonparametric Regression Models

We wrote a program which automatically writes reports summarizing automatically constructed models. A prototype for the automatic statistician project.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
Association for the Advancement of Artificial Intelligence (AAAI), 2014
pdf | code | slides | example report - airline | example report - solar | more examples | bibtex
Avoiding Pathologies in Very Deep Networks

To suggest better neural network architectures, we analyze the properties of different priors on compositions of functions. We study deep Gaussian processes, a type of infinitely wide, deep neural net. We also examine infinitely deep covariance functions. Finally, we show that you get additive covariance if you do dropout on Gaussian processes.

David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani
Artificial Intelligence and Statistics, 2014
pdf | code | slides | video of 50-layer warping | bibtex
Raiders of the Lost Architecture: Kernels for Bayesian Optimization in Conditional Parameter Spaces

To optimize the overall architecture of a neural network along with its hyperparameters, we must be able to compare the performance of networks with differing numbers of hyperparameters. To address this problem, we define a new kernel for conditional parameter spaces that explicitly includes information about which parameters are relevant in a given structure.

Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, Michael Osborne
NIPS workshop on Bayesian optimization, 2013
pdf | code | bibtex
Warped Mixtures for Nonparametric Cluster Shapes

If you fit a mixture of Gaussians to a single cluster that is curved or heavy-tailed, your model will report that the data contains many clusters! To fix this problem, we warp a latent mixture of Gaussians into nonparametric cluster shapes. The low-dimensional latent mixture model summarizes the properties of the high-dimensional density manifolds describing the data.

Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani
Uncertainty in Artificial Intelligence, 2013
pdf | code | slides | talk | bibtex
Structure Discovery in Nonparametric Regression through Compositional Kernel Search

How could an AI do statistics? To search through an open-ended class of structured, nonparametric regression models, we introduce a simple grammar which specifies composite kernels. These structured models often allow an interpretable decomposition of the function being modeled, as well as long-range extrapolation. Many common regression methods are special cases of this large family of models.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
International Conference on Machine Learning, 2013
pdf | code | short slides | long slides | bibtex
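The grammar's building blocks are easy to write down: a few base kernels, closed under addition and multiplication. A sketch with arbitrary hyperparameter values (the paper searches over these structures automatically):

```python
import numpy as np

# Base kernels on 1-D inputs; the search grammar composes these with + and *.
def se(x, y, ell=1.0):                  # squared-exp: smooth local variation
    return np.exp(-0.5 * (x - y) ** 2 / ell ** 2)

def per(x, y, period=1.0, ell=1.0):     # periodic: exact repetition
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - y) / period) ** 2 / ell ** 2)

def lin(x, y):                          # linear: straight-line trends
    return x * y

# Example composite structures from this family:
locally_periodic = lambda x, y: se(x, y, ell=3.0) * per(x, y)   # SE * Per
trend_plus_seasonal = lambda x, y: lin(x, y) + per(x, y)        # Lin + Per

xs = np.linspace(0.0, 5.0, 30)
K = np.array([[locally_periodic(a, b) for b in xs] for a in xs])
```

Sums and products of valid kernels are themselves valid kernels, which is what lets a simple grammar span an open-ended model class.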
Active Learning of Model Evidence using Bayesian Quadrature

Instead of the usual Monte Carlo methods for computing integrals of likelihood functions, we construct a surrogate model of the likelihood function and infer its integral conditioned on a set of evaluations. This allows us to evaluate the likelihood wherever is most informative, instead of running a Markov chain, so fewer evaluations are needed to estimate integrals.

Michael Osborne, David Duvenaud, Roman Garnett, Carl Rasmussen, Stephen Roberts, Zoubin Ghahramani
Neural Information Processing Systems, 2012
pdf | code | slides | related talk | bibtex
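In one dimension with a squared-exponential kernel and a Gaussian input density, the surrogate-model estimate has a closed form. A sketch with an arbitrary lengthscale and evaluation grid (the paper chooses evaluation points actively rather than on a fixed grid):

```python
import numpy as np

# Bayesian quadrature sketch: put a GP prior on f, condition on a few
# evaluations, and integrate the posterior mean against N(x; 0, 1).
def bayes_quad(f, xs, ell=1.0, jitter=1e-6):
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / ell ** 2)
    # Kernel mean z_i = integral of k(x, x_i) N(x; 0, 1) dx, closed form
    # for the squared-exponential kernel.
    z = ell / np.sqrt(ell ** 2 + 1.0) * np.exp(-xs ** 2 / (2.0 * (ell ** 2 + 1.0)))
    return z @ np.linalg.solve(K + jitter * np.eye(len(xs)), f(xs))

xs = np.linspace(-3.0, 3.0, 9)
estimate = bayes_quad(np.square, xs)   # true value of E[x^2] under N(0,1) is 1
```

The estimate is a weighted sum of the function evaluations, with weights that come from the model rather than from random sampling.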
Optimally-Weighted Herding is Bayesian Quadrature

We prove several connections between a numerical integration method that minimizes a worst-case bound (herding), and a model-based way of estimating integrals (Bayesian quadrature). It turns out that both optimize the same criterion, and that Bayesian Quadrature does this optimally.

Ferenc Huszár and David Duvenaud
Uncertainty in Artificial Intelligence, 2012. Oral presentation.
pdf | code | slides | talk | bibtex
Additive Gaussian Processes

When functions have additive structure, we can extrapolate further than with standard Gaussian process models. We show how to efficiently integrate over exponentially-many ways of modeling a function as a sum of low-dimensional functions.

David Duvenaud, Hannes Nickisch, Carl Rasmussen
Neural Information Processing Systems, 2011
pdf | code | slides | bibtex
Multiscale Conditional Random Fields for Semi-supervised Labeling and Classification

How can we take advantage of images labeled only by what objects they contain? By combining information across different scales, we use image-level labels (such as this image contains a cat) to infer what different classes of objects look like at the pixel-level, and where they occur in images. This work formed my M.Sc. thesis at UBC.

David Duvenaud, Benjamin Marlin, Kevin Murphy
Canadian Conference on Computer and Robot Vision, 2011
pdf | code | slides | bibtex
Causal Learning without DAGs

When predicting the results of new actions, it's sometimes better to simply average over flexible conditional models than to attempt to identify a single causal structure as embodied by a directed acyclic graph (DAG).

David Duvenaud, Daniel Eaton, Kevin Murphy, Mark Schmidt
Journal of Machine Learning Research, W&CP, 2010
pdf | code | slides | poster | bibtex


Visualizing draws from a deep Gaussian process

By viewing deep networks as a prior on functions, we can ask which architectures give rise to which sorts of mappings.

Here we visualize a mapping drawn from a deep Gaussian process, using the input-connected architecture described in this paper.

mapping video | density video | code
Machine Learning to Drive

Andrew McHutchon and Carl Rasmussen are working on a model-based reinforcement learning system that can learn from small amounts of experience. For fun, we hooked up a 3D physics engine to the learning system and tried to get it to learn to drive a simple two-wheel car in a certain direction, starting with no knowledge of the dynamics. It took only about 10 seconds of practice to solve the problem, although the learning did not run in real time. Details are in the video description.

by Andrew McHutchon and David Duvenaud
youtube | related paper
HarlMCMC Shake

Two short animations illustrate the differences between a Metropolis-Hastings (MH) sampler and a Hamiltonian Monte Carlo (HMC) sampler, to the tune of the Harlem shake. This inspired several follow-up videos: benchmark your MCMC algorithm on these distributions!

by Tamara Broderick and David Duvenaud
youtube | code
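For reference, the MH half of the pair fits in a few lines; HMC instead proposes long, gradient-informed trajectories, which is exactly the difference the animations show. The target and tuning constants here are illustrative:

```python
import numpy as np

# Minimal random-walk Metropolis-Hastings sampler.
def metropolis_hastings(log_density, init, num_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, samples = init, []
    for _ in range(num_samples):
        proposal = x + step * rng.standard_normal()
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(lambda x: -0.5 * x ** 2, 0.0, 5000)  # N(0,1) target
```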
Evolution of Locomotion

A fun project from undergrad: using the genetic algorithm (a terrible algorithm!) to learn locomotion strategies. The plan was for the population to learn to walk, but instead they evolved falling, rolling and shaking strategies. Eventually they exploited numerical problems in the physics engine to achieve arbitrarily high fitness, without ever having learned to walk!



Kernel Cookbook

Have you ever wondered which kernel to use for Gaussian process regression? This tutorial goes through the basic properties of functions that you can express by choosing or combining kernels, along with lots of examples.



Fast Random Feature Expansions

The Johnson-Lindenstrauss lemma states that you can randomly project a collection of data points into a lower-dimensional space while mostly preserving pairwise distances. Recent developments have gone even further: non-linear randomized projections can be used to approximate kernel machines and scale them to datasets with millions of features and samples. In this talk we explore the theoretical aspects of the random projection method, and demonstrate its effectiveness on nonlinear regression problems. We also motivate and describe the recent Fastfood method.

With David Lopez-Paz, November 2013
slides | video | code
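The core trick from the talk, sketched for an RBF kernel with unit lengthscale (input dimension and feature count chosen arbitrarily): random cosine features whose inner product approximates the kernel.

```python
import numpy as np

# Random Fourier features: inner products of cos(Wx + b) features approximate
# the RBF kernel exp(-||x - y||^2 / 2).
rng = np.random.default_rng(0)
d, D = 5, 2000                      # input dim, number of random features
W = rng.standard_normal((D, d))     # frequencies ~ N(0, I) <-> unit lengthscale
b = rng.uniform(0.0, 2.0 * np.pi, D)

def features(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-0.5 * np.sum((x - y) ** 2))
approx = features(x) @ features(y)
```

The approximation error shrinks like one over the square root of the number of features, and a linear model on these features behaves like a kernel machine at a fraction of the cost.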
Sanity Checks

When can we trust our experiments? We've collected some simple sanity checks that catch a wide class of bugs. Roger Grosse and I also wrote a short tutorial on testing MCMC code. Related: Richard Mann wrote a gripping blog post about the aftermath of finding a subtle bug in one of his landmark papers.

Tea talk, April 2012
slides | paper
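One example of the kind of check the slides collect: compare every analytic gradient against centered finite differences before trusting any gradient-based result.

```python
import numpy as np

# Finite-difference gradient check: a cheap sanity check that catches a wide
# class of bugs in hand-written gradients.
def check_grad(f, grad_f, x, eps=1e-6, tol=1e-4):
    numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2.0 * eps)
                        for e in np.eye(len(x))])
    return float(np.max(np.abs(numeric - grad_f(x)))) < tol

x = np.array([0.5, -1.2, 2.0])
ok = check_grad(lambda v: np.sum(v ** 2), lambda v: 2.0 * v, x)
```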