David Kristjanson Duvenaud
I'm an assistant professor at the University of Toronto. My research focuses on constructing deep probabilistic models to help predict, explain and design things. Highlights include:
 Neural ODEs, a kind of continuous-depth neural network,
 Automatic chemical design using generative models to propose promising molecules,
 Gradient-based meta-learning by differentiation through gradient descent,
 Structured latent-variable models which can categorize mouse behavior without supervision,
 Convolutional networks on graphs for predicting properties of molecules.
Email: duvenaud@cs.toronto.edu
Teaching:

Current graduate students:

Preprints
FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models
Reversible generative models map points from a simple distribution to a complex distribution through an invertible neural network. Training these models requires restricting their architectures to allow cheap computation of Jacobian determinants. Alternatively, the Jacobian's trace can be used if the transformation is specified by an ordinary differential equation. In this paper, we use Hutchinson's trace estimator to give a scalable unbiased estimate of the log-density. The result is a continuous-time invertible generative model with unbiased density estimation and one-pass sampling, while allowing unrestricted neural network architectures. We demonstrate our approach on high-dimensional density estimation, image generation, and variational inference, improving the state-of-the-art among exact likelihood methods with efficient sampling.
Will Grathwohl, Ricky Tian Qi Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud
arxiv
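As a rough illustration of the trace estimator mentioned above, here is a minimal sketch in plain Python. It assumes Rademacher (plus or minus one) probe vectors and a small hand-written matrix; the names `hutchinson_trace`, `matvec`, and `A` are hypothetical and are not taken from the FFJORD code. The point is only that the trace can be estimated from matrix-vector products, without ever forming the diagonal explicitly.

```python
import random


def hutchinson_trace(matvec, dim, num_samples, rng):
    """Estimate tr(A) using only products v -> A v (A itself is never indexed)."""
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability,
        # so E[v v^T] is the identity and E[v^T A v] = tr(A).
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, Av))
    return total / num_samples


# Hypothetical 3x3 example matrix, standing in for a Jacobian.
A = [[2.0, 0.5, -1.0],
     [0.5, 1.0, 0.3],
     [-1.0, 0.3, 3.0]]


def matvec(v):
    """Multiply the example matrix A by the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


exact = sum(A[i][i] for i in range(3))
estimate = hutchinson_trace(matvec, 3, 20000, random.Random(0))
```

In a continuous-time flow the matrix would be the Jacobian of the dynamics, and the product would typically come from automatic differentiation rather than from an explicit matrix as it does here.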

Neural Ordinary Differential Equations
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
Ricky Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud
To appear in Neural Information Processing Systems, 2018. Oral presentation.
arxiv | bibtex | slides | implementation
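The forward pass described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: a fixed-step fourth-order Runge-Kutta integrator stands in for the adaptive black-box solver, the dynamics are a single hypothetical tanh layer, and training by backpropagating through the solver is not shown. The names `rk4_solve`, `tanh_dynamics`, `W`, and `b` are made up for this example.

```python
import math


def rk4_solve(f, h0, t0, t1, params, num_steps=100):
    """Integrate dh/dt = f(h, t, params) from t0 to t1 with fixed-step RK4."""
    h = list(h0)
    dt = (t1 - t0) / num_steps
    t = t0
    for _ in range(num_steps):
        k1 = f(h, t, params)
        k2 = f([hi + 0.5 * dt * k for hi, k in zip(h, k1)], t + 0.5 * dt, params)
        k3 = f([hi + 0.5 * dt * k for hi, k in zip(h, k2)], t + 0.5 * dt, params)
        k4 = f([hi + dt * k for hi, k in zip(h, k3)], t + dt, params)
        h = [hi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for hi, a, b, c, d in zip(h, k1, k2, k3, k4)]
        t += dt
    return h


def tanh_dynamics(h, t, params):
    """A one-layer tanh network that outputs the derivative of the hidden state."""
    weights, biases = params
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(weights, biases)]


# Hypothetical parameters; in a trained model these would be learned.
W = [[0.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0]

# The "output of the network" is the hidden state at the end of the interval.
h1 = rk4_solve(tanh_dynamics, [1.0, 0.0], 0.0, 1.0, (W, b))
```

Changing `num_steps` shows the precision-for-speed trade-off in its crudest form; an adaptive solver makes that choice per input instead.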

Stochastic Hyperparameter Optimization Through Hypernetworks
Models are usually tuned by nesting optimization of model weights inside the optimization of hyperparameters. We collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our method trains a neural net to output approximately optimal weights as a function of hyperparameters. This method converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.
Jonathan Lorraine, David Duvenaud
arxiv | bibtex | slides | code

Isolating Sources of Disentanglement in Variational Autoencoders
We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate our β-TCVAE (Total Correlation Variational Autoencoder), a refinement of the state-of-the-art β-VAE objective for learning disentangled representations, requiring no additional hyperparameters during training. We further propose a principled classifier-free measure of disentanglement called the mutual information gap (MIG). We perform extensive quantitative and qualitative experiments, in both restricted and non-restricted settings, and show a strong relation between total correlation and disentanglement, when the latent variables model is trained using our framework.
Ricky Tian Qi Chen, Xuechen Li, Roger Grosse, David Duvenaud
To appear in Neural Information Processing Systems, 2018. Oral presentation.
arxiv | bibtex
Selected papers
Noisy Natural Gradient as Variational Inference
Bayesian neural nets combine the flexibility of deep learning with uncertainty estimation, but are usually approximated using a fully-factorized Gaussian. We show that natural gradient ascent with adaptive weight noise implicitly fits a variational Gaussian posterior. This insight allows us to train full-covariance, fully factorized, or matrix-variate Gaussian variational posteriors using noisy versions of natural gradient, Adam, and K-FAC, respectively, allowing us to scale to modern-size convnets. Our noisy K-FAC algorithm makes better predictions and has better-calibrated uncertainty than existing methods. This leads to more efficient exploration in active learning and reinforcement learning.
Guodong Zhang, Shengyang Sun, David Duvenaud, Roger Grosse
International Conference on Machine Learning, 2018
arxiv | bibtex

Inference Suboptimality in Variational Autoencoders
Amortized inference allows latent-variable models to scale to large datasets. The quality of approximate inference is determined by two factors: a) the capacity of the variational distribution to match the true posterior and b) the ability of the recognition net to produce good variational parameters for each datapoint. We show that the recognition net giving bad variational parameters is often a bigger problem than using a Gaussian approximate posterior, because the generator can adapt to it.

Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
We learn low-variance, unbiased gradient estimators for any function of random variables. We backprop through a neural net surrogate of the original function, which is optimized to minimize gradient variance during the optimization of the original objective. We train discrete latent-variable models, and do continuous and discrete reinforcement learning with an adaptive, action-conditional baseline.
Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud
International Conference on Learning Representations, 2018
arxiv | code | slides | bibtex
Automatic chemical design using a data-driven continuous representation of molecules
We develop a molecular autoencoder, which converts discrete representations of molecules to and from a continuous representation. This allows gradient-based optimization through the space of chemical compounds. Continuous representations also let us generate novel chemicals by interpolating between molecules.
Rafa Gómez-Bombarelli, Jennifer Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
American Chemical Society Central Science, 2018
arxiv | bibtex | slides | code
Sticking the landing: Simple, lower-variance gradient estimators for variational inference
We give a simple recipe for reducing the variance of the gradient of the variational evidence lower bound.
The entire trick is just removing one term from the gradient.
Removing this term leaves an unbiased gradient estimator whose variance approaches zero as the approximate posterior approaches the exact posterior.
We also generalize this trick to mixtures and importance-weighted posteriors.
Neural Information Processing Systems, 2017
arxiv | bibtex | code
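To make the removed term concrete, here is a small worked sketch for a one-dimensional case chosen purely for illustration: the target is a standard normal and the approximate posterior is a Gaussian with mean `mu` and standard deviation `sigma`. The derivatives are written out by hand, and the function name `elbo_grad_mu` and its `drop_score_term` flag are hypothetical. The term being dropped is the derivative of log q with respect to its parameters with the sample held fixed, which has zero expectation.

```python
import random


def elbo_grad_mu(mu, sigma, eps, drop_score_term=False):
    """Single-sample gradient of the ELBO with respect to mu.

    Assumes target p(z) = N(0, 1) and q(z) = N(mu, sigma^2), with the
    reparameterized sample z = mu + sigma * eps.
    """
    z = mu + sigma * eps
    dlogp_dz = -z                        # d/dz log p(z)
    dlogq_dz = -(z - mu) / sigma ** 2    # d/dz log q(z), parameters held fixed
    score_mu = (z - mu) / sigma ** 2     # d/dmu log q(z), sample held fixed
    path_term = dlogp_dz - dlogq_dz      # dz/dmu = 1 for a Gaussian
    if drop_score_term:
        return path_term                 # estimator with the one term removed
    return path_term - score_mu          # ordinary total-derivative estimator


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


rng = random.Random(0)
eps_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# At the optimum (q equals p) the reduced estimator is exactly zero for every
# sample, while the ordinary estimator still has variance close to one.
standard_at_opt = [elbo_grad_mu(0.0, 1.0, e) for e in eps_samples]
reduced_at_opt = [elbo_grad_mu(0.0, 1.0, e, drop_score_term=True) for e in eps_samples]

# Away from the optimum both estimators agree in expectation.
standard_near = [elbo_grad_mu(0.2, 0.9, e) for e in eps_samples]
reduced_near = [elbo_grad_mu(0.2, 0.9, e, drop_score_term=True) for e in eps_samples]
```

In an automatic differentiation framework the same effect is usually obtained by blocking gradients through the parameters inside log q; the hand-derived version above just makes the two terms visible.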
Reinterpreting importance-weighted autoencoders
The standard interpretation of importance-weighted autoencoders is that they maximize a tighter, multi-sample lower bound than the standard evidence lower bound. We give an alternate interpretation: they optimize the standard lower bound, but using a more complex distribution, which we show how to visualize.
Chris Cremer, Quaid Morris, David Duvenaud
ICLR Workshop track, 2017
arxiv | bibtex
Composing graphical models with neural networks for structured representations and fast inference
We propose a general modeling and inference framework that combines the complementary strengths of probabilistic graphical models and deep learning methods. Our model family composes latent graphical models with neural network observation likelihoods. All components are trained simultaneously. We use this framework to automatically segment and categorize mouse behavior from raw depth video.
Matthew Johnson, David Duvenaud, Alex Wiltschko, Bob Datta, Ryan P. Adams
Neural Information Processing Systems, 2016
preprint | video | code | slides | bibtex | animation
Probing the Compositionality of Intuitive Functions
How do people learn about complex functional structure? We propose that humans use compositionality: complex structure is decomposed into simpler building blocks. We formalize this idea using a grammar over Gaussian process kernels. We show that people prefer compositional extrapolations, and argue that this is consistent with broad principles of human cognition.
Eric Schulz, Joshua B. Tenenbaum, David Duvenaud, Maarten Speekenbrink, Samuel J. Gershman
Neural Information Processing Systems, 2016
preprint | video | bibtex
Black-box stochastic variational inference in five lines of Python
We emphasize how easy it is to construct scalable inference methods using only automatic differentiation. We present code that computes stochastic gradients of the evidence lower bound for any differentiable posterior. For example, we do stochastic variational inference in a deep Bayesian neural network.
David Duvenaud, Ryan P. Adams
NIPS Workshop on Black-box Learning and Inference, 2015
preprint | code | bibtex | video
Autograd: Reverse-mode differentiation of native Python
Autograd automatically differentiates native Python and Numpy code. It can handle loops, ifs, recursion and closures, and it can even take derivatives of its own derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it's efficient for gradient-based optimization. Check out the tutorial and the examples directory.
Dougal Maclaurin, David Duvenaud, Matthew Johnson
code | bibtex | slides
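For readers who have not seen reverse-mode differentiation written out, the following toy sketch shows the general idea in plain Python. It is not Autograd's implementation and does not use the Autograd package: the `Var` class and this small `grad` helper are hypothetical, they support only scalar addition and multiplication, and unlike Autograd they cannot differentiate their own derivatives. The sketch does show why ordinary Python loops pose no difficulty: the computation is recorded as it runs, then replayed backwards.

```python
class Var:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def grad(fun):
    """Return a function computing d fun / dx for a scalar function of one input."""
    def gradfun(x):
        x_var = Var(float(x))
        out = fun(x_var)
        if not isinstance(out, Var):
            return 0.0  # the output did not depend on the input
        # Order the recorded nodes so that each one comes after its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(out)
        # Backward pass: push sensitivities from the output toward the input.
        out.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad
        return x_var.grad
    return gradfun


def cube_by_loop(x):
    """Compute x**3 with an ordinary Python loop."""
    result = 1.0
    for _ in range(3):
        result = result * x
    return result


slope = grad(cube_by_loop)(2.0)
```

One backward pass yields the derivative with respect to every input at once, which is the property that makes the reverse mode a good fit for gradient-based optimization.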
Early Stopping as Nonparametric Variational Inference
Stochastic gradient descent samples from a nonparametric distribution, implicitly defined by the transformation of the initial distribution by an optimizer. We track the loss of entropy during optimization to get a scalable estimate of the marginal likelihood. This Bayesian interpretation of SGD gives a theoretical foundation for popular tricks such as early stopping and ensembling. We evaluate our marginal likelihood estimator on neural network models.
David Duvenaud, Dougal Maclaurin, Ryan P. Adams
Artificial Intelligence and Statistics, 2016
preprint | slides | code | bibtex
Convolutional Networks on Graphs for Learning Molecular Fingerprints
We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the entire feature pipeline. This architecture generalizes standard molecular fingerprints. These data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Related work led to our Nature Materials paper.
David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafa Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
Neural Information Processing Systems, 2015
pdf | slides | code | bibtex
Gradient-based Hyperparameter Optimization through Reversible Learning
Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of the validation loss with respect to all hyperparameters by differentiating through the entire training procedure. This lets us optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural net architectures.
Dougal Maclaurin, David Duvenaud, Ryan P. Adams
International Conference on Machine Learning, 2015
pdf | slides | code | bibtex
Probabilistic ODE Solvers with Runge-Kutta Means
We show that some standard differential equation solvers are equivalent to Gaussian process predictive means, giving them a natural way to handle uncertainty. This work is part of the larger probabilistic numerics research agenda, which interprets numerical algorithms as inference procedures so they can be better understood and extended.
Michael Schober, David Duvenaud, Philipp Hennig
Neural Information Processing Systems, 2014. Oral presentation.
pdf | slides | bibtex
Testing MCMC Code
When can we trust our experiments? We've collected some simple sanity checks that catch a wide class of bugs. Related: Richard Mann wrote a gripping blog post about the aftermath of finding a subtle bug in one of his landmark papers.
Roger Grosse, David Duvenaud
NIPS Workshop on Software Engineering for Machine Learning, 2014
PhD Thesis: Automatic Model Construction with Gaussian Processes
 
Automatic Construction and Natural-Language Description of Nonparametric Regression Models
We wrote a program which automatically writes reports summarizing automatically constructed models. A prototype for the automatic statistician project.
James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
Association for the Advancement of Artificial Intelligence (AAAI), 2014
pdf | code | slides | example report (airline) | example report (solar) | more examples | bibtex
Avoiding Pathologies in Very Deep Networks
To suggest better neural network architectures, we analyze the properties of different priors on compositions of functions. We study deep Gaussian processes, a type of infinitely-wide, deep neural net. We also examine infinitely deep covariance functions. Finally, we show that you get additive covariance if you do dropout on Gaussian processes.
David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani
Artificial Intelligence and Statistics, 2014
pdf | code | slides | video of 50-layer warping | video of 50-layer density | bibtex
Raiders of the Lost Architecture: Kernels for Bayesian Optimization in Conditional Parameter Spaces
To optimize the overall architecture of a neural network along with its hyperparameters, we must be able to relate the performance of nets having differing numbers of hyperparameters. To address this problem, we define a new kernel for conditional parameter spaces that explicitly includes information about which parameters are relevant in a given structure.
Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, Michael Osborne
NIPS workshop on Bayesian optimization, 2013
pdf | code | bibtex
Warped Mixtures for Nonparametric Cluster Shapes
If you fit a mixture of Gaussians to a single cluster that is curved or heavy-tailed, your model will report that the data contains many clusters! To fix this problem, we warp a latent mixture of Gaussians into nonparametric cluster shapes. The low-dimensional latent mixture model summarizes the properties of the high-dimensional density manifolds describing the data.
Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani
Uncertainty in Artificial Intelligence, 2013
pdf | code | slides | talk | bibtex
Structure Discovery in Nonparametric Regression through Compositional Kernel Search
How could an AI do statistics? To search through an open-ended class of structured, nonparametric regression models, we introduce a simple grammar which specifies composite kernels. These structured models often allow an interpretable decomposition of the function being modeled, as well as long-range extrapolation. Many common regression methods are special cases of this large family of models.
David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
International Conference on Machine Learning, 2013
pdf | code | short slides | long slides | bibtex
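A grammar over kernels can be pictured as a handful of base kernels closed under addition and multiplication. The sketch below is a simplified illustration in plain Python and makes several assumptions: it uses three common one-dimensional base kernels with made-up hyperparameters, it expands an expression only by adding or multiplying a base kernel, and it leaves out the model fitting and scoring that an actual search would need. All function names here are hypothetical.

```python
import math


def squared_exp(lengthscale):
    """Smooth local variation."""
    return lambda x, y: math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def periodic(period, lengthscale):
    """Exactly repeating structure."""
    return lambda x, y: math.exp(
        -2.0 * math.sin(math.pi * abs(x - y) / period) ** 2 / lengthscale ** 2)


def linear():
    """Linear trends."""
    return lambda x, y: x * y


def add(k1, k2):
    """The sum of two kernels is a kernel (superposition of functions)."""
    return lambda x, y: k1(x, y) + k2(x, y)


def mul(k1, k2):
    """The product of two kernels is a kernel (one structure modulating another)."""
    return lambda x, y: k1(x, y) * k2(x, y)


def expand(name, kernel, bases):
    """One grammar step: combine the current kernel with every base kernel."""
    candidates = {}
    for base_name, base in bases.items():
        candidates["(" + name + " + " + base_name + ")"] = add(kernel, base)
        candidates["(" + name + " * " + base_name + ")"] = mul(kernel, base)
    return candidates


# Hypothetical hyperparameter values, chosen only for the example.
bases = {"SE": squared_exp(1.0), "Per": periodic(1.0, 1.0), "Lin": linear()}

# A locally periodic component plus a linear trend.
composite = add(mul(bases["SE"], bases["Per"]), bases["Lin"])

candidates = expand("SE", bases["SE"], bases)
```

Each composite expression reads as a description of the function, for instance "periodic structure that changes slowly, plus a trend", which is where the interpretable decomposition comes from.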
HarlMCMC Shake
Two short animations illustrate the differences between a Metropolis-Hastings (MH) sampler and a Hamiltonian Monte Carlo (HMC) sampler, to the tune of the Harlem shake. This inspired several follow-up videos. Benchmark your MCMC algorithm on these distributions!
by Tamara Broderick and David Duvenaud
youtube | code
Active Learning of Model Evidence using Bayesian Quadrature
Instead of the usual Monte Carlo-based methods for computing integrals of likelihood functions, we construct a surrogate model of the likelihood function, and infer its integral conditioned on a set of evaluations. This allows us to evaluate the likelihood wherever is most informative, instead of running a Markov chain. This means fewer evaluations to estimate integrals.
Michael Osborne, David Duvenaud, Roman Garnett, Carl Rasmussen, Stephen Roberts, Zoubin Ghahramani
Neural Information Processing Systems, 2012
pdf | code | slides | related talk | bibtex
Optimally-Weighted Herding is Bayesian Quadrature
We prove several connections between a numerical integration method that minimizes a worst-case bound (herding), and a model-based way of estimating integrals (Bayesian quadrature). It turns out that both optimize the same criterion, and that Bayesian Quadrature does this optimally.
Ferenc Huszár and David Duvenaud
Uncertainty in Artificial Intelligence, 2012. Oral presentation.
pdf | code | slides | talk | bibtex
Additive Gaussian Processes
When functions have additive structure, we can extrapolate further than with standard Gaussian process models. We show how to efficiently integrate over exponentially-many ways of modeling a function as a sum of low-dimensional functions.
David Duvenaud, Hannes Nickisch, Carl Rasmussen
Neural Information Processing Systems, 2011
pdf | code | slides | bibtex
Multiscale Conditional Random Fields for Semi-supervised Labeling and Classification
How can we take advantage of images labeled only by what objects they contain?
By combining information across different scales, we use image-level labels (such as
Canadian Conference on Computer and Robot Vision, 2011
pdf | code | slides | bibtex