# David Kristjanson Duvenaud

I'm an assistant professor at the University of Toronto, in both Computer Science and Statistics.

Previously, I was a postdoc in the Harvard Intelligent Probabilistic Systems group, working with Ryan Adams on optimization, synthetic chemistry, Bayesian inference, and neural networks. I did my Ph.D. at the University of Cambridge, where my advisors were Carl Rasmussen and Zoubin Ghahramani. My M.Sc. advisor was Kevin Murphy at the University of British Columbia, where I worked mostly on machine vision. I spent a summer at the Max Planck Institute for Intelligent Systems, and the two summers before that at Google Research, doing machine vision. I co-founded Invenia, an energy forecasting and trading firm where I still consult. I'm also a founding member of the Vector Institute.## Preprints

Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our estimator is based on gradients of a control variate that's jointly optimized with the original parameters to minimize variance. These estimators are applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models, and also give unbiased, adaptive, action-conditional analog of the advantage actor-critic reinforcement learning algorithm. Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaudarxiv | code | slides | bibtex | |

Automatic chemical design using a data-driven continuous representation of molecules
We develop a molecular autoencoder, which converts discrete representations of molecules to and from a continuous representation. This allows gradient-based optimization through the space of chemical compounds. Continuous representations also let us generate novel chemicals by interpolating between molecules. Rafa Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy Hirzel, Ryan P. Adams, Alán Aspuru-Guzikarxiv | bibtex | slides | code |

## Selected papers

Sticking the landing: Simple, lower-variance gradient estimators for variational inference
We give a simple recipe for reducing the variance of the gradient of the variational evidence lower bound.
The entire trick is just removing one term from the gradient.
Removing this term leaves an unbiased gradient estimator whose variance approaches zero as the approximate posterior approaches the exact posterior.
We also generalize this trick to mixtures and importance-weighted posteriors.
To appear in Neural Information Processing Systems, 2017arxiv | bibtex | code | |

Reinterpreting importance-weighted autoencoders
The standard interpretation of importance-weighted autoencoders is that they maximize a tighter, multi-sample lower bound than the standard evidence lower bound. We give an alternate interpretation: it optimizes the standard lower bound, but using a more complex distribution, which we show how to visualize. Chris Cremer, Quaid Morris, David DuvenaudICLR Workshop track, 2017arxiv | bibtex | |

Composing graphical models with neural networks for structured representations and fast inference
We propose a general modeling and inference framework that combines the complementary strengths of probabilistic graphical models and deep learning methods. Our model family composes latent graphical models with neural network observation likelihoods. All components are trained simultaneously. We use this framework to automatically segment and categorize mouse behavior from raw depth video. Matthew Johnson, David Duvenaud, Alex Wiltschko, Bob Datta, Ryan P. AdamsNeural Information Processing Systems, 2016preprint | video | code | slides | bibtex | animation | |

Probing the Compositionality of Intuitive Functions
How do people learn about complex functional structure? We propose that humans use compositionality: complex structure is decomposed into simpler building blocks. We formalize this idea using a grammar over Gaussian process kernels. We show that people prefer compositional extrapolations, and argue that this is consistent with broad principles of human cognition. Eric Shulz, Joshua B. Tenenbaum, David Duvenaud, Maarten Speekenbrink, Samuel J. GershmanNeural Information Processing Systems, 2016preprint | video | bibtex | |

Black-box stochastic variational inference in five lines of Python
We emphasize how easy it is to construct scalable inference methods using only automatic differentiation. We present code that computes stochastic gradients of the evidence lower bound for any differentiable posterior. For example, we do stochastic variational inference in a deep Bayesian neural network. David Duvenaud, Ryan P. AdamsNIPS Workshop on Black-box Learning and Inference, 2015preprint | code | bibtex | video | |

Autograd: Reverse-mode differentiation of native Python
Autograd automatically differentiates native Python and Numpy code. It can handle loops, ifs, recursion and closures, and it can even take derivatives of its own derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it's efficient for gradient-based optimization. Check out the tutorial and the examples directory. Dougal Maclaurin, David Duvenaud, Matthew Johnsoncode | bibtex | slides | |

Early Stopping as Nonparametric Variational Inference
Stochastic gradient descent samples from a nonparametric distribution, implicitly defined by the transformation of the initial distribution by an optimizer. We track the loss of entropy during optimization to get a scalable estimate of the marginal likelihood. This Bayesian interpretation of SGD gives a theoretical foundation for popular tricks such as early stopping and ensembling. We evaluate our marginal likelihood estimator on neural network models. David Duvenaud, Dougal Maclaurin, Ryan P. AdamsArtificial Intelligence and Statistics, 2016preprint | slides | code | bibtex | |

Convolutional Networks on Graphs for Learning Molecular Fingerprints
We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the entire feature pipeline. This architecture generalizes standard molecular fingerprints. These data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Related work led to our Nature Materials paper. David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafa Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. AdamsNeural Information Processing Systems, 2015pdf | slides | code | bibtex | |

Gradient-based Hyperparameter Optimization through Reversible Learning
Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable.
We compute exact gradients of the validation loss with respect to all hyperparameters by differentiating through the International Conference on Machine Learning, 2015pdf | slides | code | bibtex | |

Probabilistic ODE Solvers with Runge-Kutta Means
We show that some standard differential equation solvers are equivalent to Gaussian process predictive means, giving them a natural way to handle uncertainty. This work is part of the larger probabilistic numerics research agenda, which interprets numerical algorithms as inference procedures so they can be better understood and extended. Michael Schober, David Duvenaud, Philipp HennigNeural Information Processing Systems, 2014. Oral presentation.
pdf | slides | bibtex | |

Testing MCMC CodeWhen can we trust our experiments? We've collected some simple sanity checks that catch a wide class of bugs. Related: Richard Mann wrote a gripping blog post about the aftermath of finding a subtle bug in one of his landmark papers. Roger Grosse, David DuvenaudNIPS Workshop on Software Engineering for Machine Learning, 2014
| |

PhD Thesis: Automatic Model Construction with Gaussian Processes
- Introduction - GPs are hard to scale, but easy to train, and don't overfit (mostly).
- Expressing structure with kernels - How to build various kinds of models, with examples.
- Automatic model building - Machines can build custom models for a dataset by combining kernels.
- Automatic model description - Machines can also describe these models visually and with text.
- Deep Gaussian processes - A Bayesian version of deep neural nets, easy to analyze.
- Additive Gaussian processes - A simple way to decompose high-dimensional functions.
- Warped mixture models - Let you learn arbitrarily complicated cluster shapes.
| |

Automatic Construction and Natural-Language Description of Nonparametric Regression Models
We wrote a program which automatically writes reports summarizing automatically constructed models. A prototype for the automatic statistician project. James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, Zoubin GhahramaniAssociation for the Advancement of Artificial Intelligence (AAAI), 2014 pdf | code | slides | example report - airline | example report - solar | more examples | bibtex | |

Avoiding Pathologies in Very Deep Networks
To suggest better neural network architectures, we analyze the properties different priors on compositions of functions. We study deep Gaussian processes, a type of infinitely-wide, deep neural net. We also examine infinitely deep covariance functions. Finally, we show that you get additive covariance if you do dropout on Gaussian processes. David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani Artificial Intelligence and Statistics, 2014 pdf | code | slides | video of 50-layer warping | video of 50-layer density | bibtex | |

Raiders of the Lost Architecture: Kernels for Bayesian Optimization in Conditional Parameter Spaces
To optimize the overall architecture of a neural network along with its hyperparameters, we must be able to relate the performance of nets having differing numbers of hyperparameters. To address this problem, we define a new kernel for conditional parameter spaces that explicitly includes information about which parameters are relevant in a given structure. Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, Michael Osborne NIPS workshop on Bayesian optimization, 2013 pdf | code | bibtex | |

Warped Mixtures for Nonparametric Cluster Shapes
If you fit a mixture of Gaussians to a single cluster that is curved or heavy-tailed, your model will report that the data contains many clusters! To fix this problem, we warp a latent mixture of Gaussians into nonparametric cluster shapes. The low-dimensional latent mixture model summarizes the properties of the high-dimensional density manifolds describing the data. Tomoharu Iwata, David Duvenaud, Zoubin GhahramaniUncertainty in Artificial Intelligence, 2013
pdf | code | slides | talk | bibtex | |

Structure Discovery in Nonparametric Regression through Compositional Kernel Search
How could an AI do statistics? To search through an open-ended class of structured, nonparametric regression models, we introduce a simple grammar which specifies composite kernels. These structured models often allow an interpretable decomposition of the function being modeled, as well as long-range extrapolation. Many common regression methods are special cases of this large family of models. David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin GhahramaniInternational Conference on Machine Learning, 2013
pdf | code | short slides | long slides | bibtex | |

HarlMCMC Shake
Two short animations illustrate the differences between a Metropolis-Hastings (MH) sampler and a Hamiltonian Monte Carlo (HMC) sampler, to the tune of the Harlem shake. This inspired several followup videos - benchmark your MCMC algorithm on these distributions! by Tamara Broderick and David Duvenaudyoutube | code | |

Active Learning of Model Evidence using Bayesian Quadrature
Instead of the usual Monte-Carlo based methods for computing integrals of likelihood functions, we instead construct a surrogate model of the likelihood function, and infer its integral conditioned on a set of evaluations. This allows us to evaluate the likelihood wherever is most informative, instead of running a Markov chain. This means fewer evaluations to estimate integrals. Michael Osborne, David Duvenaud, Roman Garnett, Carl Rasmussen, Stephen Roberts, Zoubin Ghahramani Neural Information Processing Systems, 2012
pdf | code | slides | related talk | bibtex | |

Optimally-Weighted Herding is Bayesian Quadrature
We prove several connections between a numerical integration method that minimizes a worst-case bound (herding), and a model-based way of estimating integrals (Bayesian quadrature). It turns out that both optimize the same criterion, and that Bayesian Quadrature does this optimally. Ferenc Huszár and David DuvenaudUncertainty in Artificial Intelligence, 2012. Oral presentation.
pdf | code | slides | talk | bibtex | |

Additive Gaussian Processes
When functions have additive structure, we can extrapolate further than with standard Gaussian process models. We show how to efficiently integrate over exponentially-many ways of modeling a function as a sum of low-dimensional functions. David Duvenaud, Hannes Nickisch, Carl Rasmussen Neural Information Processing Systems, 2011
pdf | code | slides | bibtex | |

Multiscale Conditional Random Fields for Semi-supervised Labeling and Classification
How can we take advantage of images labeled only by what objects they contain?
By combining information across different scales, we use image-level labels (such as Canadian Conference on Computer and Robot Vision, 2011
pdf | code | slides | bibtex | |