David Kristjanson Duvenaud
I'm an associate professor at the University of Toronto. My research focuses on constructing deep probabilistic models to help predict, explain and design things. For example:
 Neural ODEs, a kind of continuous-depth neural network,
 Automatic chemical design using generative models,
 Gradient-based hyperparameter tuning,
 Structured latent-variable models for modeling video,
 and Convolutional networks on graphs.
Selected papers
Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations
We perform scalable approximate inference in a recently-proposed family of continuous-depth Bayesian neural networks. In this model class, uncertainty about separate weights in each layer produces dynamics that follow a stochastic differential equation (SDE). We demonstrate gradient-based stochastic variational inference in this infinite-parameter setting, producing arbitrarily-flexible approximate posteriors. We also derive a novel gradient estimator that approaches zero variance as the approximate posterior approaches the true posterior. This approach inherits the memory-efficient training and tunable precision of neural ODEs.
Winnie Xu, Ricky Tian Qi Chen, Xuechen Li, David Duvenaud
Artificial Intelligence and Statistics, 2022. paper  code  slides  bibtex

Complex Momentum for Learning in Games
We generalize gradient descent with momentum for learning in differentiable games to have complex-valued momentum. We give theoretical motivation for our method by proving convergence on bilinear zero-sum games for simultaneous and alternating updates. Our method gives real-valued parameter updates, making it a drop-in replacement for standard optimizers. We empirically demonstrate that complex-valued momentum can improve convergence in adversarial games, like generative adversarial networks, by showing we can find better solutions with an almost identical computational cost. We also show a practical generalization to a complex-valued Adam variant, which we use to train BigGAN to better inception scores on CIFAR-10.
Jonathan Lorraine, Paul Vicol, David Acuna, David Duvenaud
Artificial Intelligence and Statistics, 2022. paper  slides  bibtex

Meta-Learning to Improve Pre-Training
Pre-training large models is useful, but adds many hyperparameters, such as task weights or augmentations in SimCLR. We give a scalable, gradient-based way to tune these hyperparameters. Because exact pre-training gradients are intractable, we approximate them. Specifically, we compose implicit differentiation for the long, almost-converged pre-training stage, with backprop through training for the short fine-tuning stage. We applied approximate pre-training gradients to tune thousands of task weights for graph-based protein function prediction, and to learn an entire data augmentation neural net for contrastive learning on electrocardiograms.
Aniruddh Raghu, Jonathan Lorraine, Simon Kornblith, Matthew McDermott, David Duvenaud
Neural Information Processing Systems, 2021. paper  bibtex

Getting to the Point: Index Sets and Parallelism-Preserving Autodiff for Pointful Array Programming
We attempt to combine the clarity and safety of high-level functional languages with the efficiency and parallelism of low-level numerical languages. We treat arrays as eagerly-memoized functions on typed index sets, allowing abstract function manipulations, such as currying, to work on arrays. In contrast to composing primitive bulk-array operations, we argue for an explicit nested indexing style that mirrors application of functions to arguments. We also introduce a fine-grained typed effects system which affords concise and automatically-parallelized in-place updates.
Adam Paszke, Daniel D. Johnson, David Duvenaud, Dimitrios Vytiniotis, Alexey Radul, Matthew Johnson, Jonathan Ragan-Kelley, Dougal Maclaurin
International Conference on Functional Programming, 2021. Distinguished Paper Award. paper  repo  talk  bibtex

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high-dimensional discrete data. This approach outperforms variational autoencoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
Will Grathwohl, Milad Hashemi, Kevin Swersky, David Duvenaud, Chris Maddison
International Conference on Learning Representations, 2021. Outstanding Paper Award Honorable Mention. paper  slides  talk  bibtex

Teaching with Commentaries
We meta-learn information helpful for training on a particular task or dataset, leveraging recent work on implicit differentiation. We explore applications such as learning weights for individual training examples, parameterizing label-dependent data augmentation policies, and representing attention masks that highlight salient image regions.
Aniruddh Raghu, Maithra Raghu, Simon Kornblith, David Duvenaud, Geoffrey Hinton
International Conference on Learning Representations, 2021. paper  bibtex

No MCMC for me: Amortized Sampling for Fast and Stable Training of Energy-Based Models
Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. In this work, we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and more stable training. This allows us to extend JEM models to semi-supervised classification on tabular data from a variety of continuous domains.
Will Grathwohl*, Jacob Kelly*, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, David Duvenaud
International Conference on Learning Representations, 2021. paper  code  bibtex

Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering
We explore the use of exact per-sample Hessian-vector products and gradients to construct optimizers that are self-tuning and hyperparameter-free. Based on a dynamical model, we derive a curvature-corrected, noise-adaptive online gradient estimate. We prove that our model-based procedure converges in the noisy quadratic setting. Though we do not see similar gains in deep learning tasks, we match the performance of well-tuned optimizers. Our initial experiments indicate that when training deep nets our optimizer works too well, in a sense: it descends into regions of high variance and high curvature early on in the optimization, and gets stuck there.
Ricky Tian Qi Chen, Dami Choi, Lukas Balles, David Duvenaud, Philipp Hennig
NeurIPS Workshop on "I Can't Believe It's Not Better!", 2020. paper  slides  talk  bibtex

Learning Differential Equations that are Easy to Solve
Neural ODEs become expensive to solve numerically as training progresses. We introduce a differentiable surrogate for the time cost of standard numerical solvers using higher-order derivatives of solution trajectories. These derivatives are efficient to compute with Taylor-mode automatic differentiation. Optimizing this additional objective trades model performance against the time cost of solving the learned dynamics.
Jacob Kelly*, Jesse Bettencourt*, Matthew Johnson, David Duvenaud
Neural Information Processing Systems, 2020. paper  code  bibtex

Scalable Gradients for Stochastic Differential Equations
We generalize the adjoint sensitivity method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset.
Xuechen Li, Ting-Kam Leonard Wong, Ricky Tian Qi Chen, David Duvenaud
Artificial Intelligence and Statistics, 2020. paper  code  slides  talk  bibtex

Optimizing Millions of Hyperparameters by Implicit Differentiation
We use the implicit function theorem to scalably approximate gradients of the validation loss with respect to hyperparameters. This lets us train networks with millions of weights and millions of hyperparameters. For instance, we learn a data-augmentation network (where every weight is a hyperparameter tuned for validation performance) that outputs augmented training examples, from scratch. We also learn a distilled dataset where each feature in each datapoint is a hyperparameter, and tune millions of regularization hyperparameters.
Jonathan Lorraine, Paul Vicol, David Duvenaud
Artificial Intelligence and Statistics, 2020. paper  slides  bibtex
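As a rough illustration of the implicit-function-theorem hypergradient described above, here is a one-dimensional sketch. Everything in it is an assumption made for illustration: the toy losses, the names `inner_solve`, `hypergradient` and `val_loss_at`, and the use of an exact scalar Hessian inverse in place of any large-scale approximation. It is not the paper's code.

```python
# Toy problem (hypothetical):
#   training loss    L_T(w, lam) = 0.5 * (w - a)**2 + 0.5 * lam * w**2
#   validation loss  L_V(w)      = 0.5 * (w - b)**2
# Implicit function theorem at the training optimum w*(lam):
#   dL_V/dlam = -(dL_V/dw) * (d2L_T/dw2)^-1 * (d2L_T/dw dlam)


def inner_solve(lam, a, steps=2000, lr=0.1):
    """Minimize the training loss over w by plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        w -= lr * ((w - a) + lam * w)  # dL_T/dw
    return w


def hypergradient(lam, a, b):
    """Hypergradient of the validation loss via the implicit function theorem."""
    w = inner_solve(lam, a)
    dLv_dw = w - b        # dL_V/dw at the trained weight
    hess_ww = 1.0 + lam   # d2L_T/dw2 (a scalar here, so inversion is a division)
    mixed = w             # d2L_T/(dw dlam)
    return -dLv_dw * mixed / hess_ww


def val_loss_at(lam, a, b):
    """Validation loss after retraining from scratch at hyperparameter lam."""
    w = inner_solve(lam, a)
    return 0.5 * (w - b) ** 2
```

In this scalar case the Hessian inverse is exact; with millions of weights it would have to be approximated, which this sketch does not attempt. A finite-difference check such as `(val_loss_at(lam + eps, a, b) - val_loss_at(lam - eps, a, b)) / (2 * eps)` should agree with `hypergradient(lam, a, b)`.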

Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One
We show that you can reinterpret standard classification architectures as energy-based generative models and train them as such. Doing this allows us to achieve state-of-the-art performance at both generative and discriminative modeling in a single model. Adding this energy-based training also improves calibration, out-of-distribution detection, and adversarial robustness.
Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, Kevin Swersky
International Conference on Learning Representations, 2020. paper  code  video poster  bibtex

SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models
We introduce an unbiased estimator of the log marginal likelihood and its gradients for latent variable models. In an encoder-decoder architecture, the parameters of the encoder can be optimized to minimize the variance of this estimator. We show that models trained using our estimator give better test-set likelihoods than a standard importance-sampling based approach for the same average computational cost.
Yucen Luo, Alex Beatson, Mohammad Norouzi, Jun Zhu, David Duvenaud, Ryan P. Adams, Ricky Tian Qi Chen
International Conference on Learning Representations, 2020. paper  video poster  bibtex

Neural Networks with Cheap Differential Operators
We introduce a family of restricted neural network architectures that allow efficient computation of a family of differential operators involving dimension-wise derivatives, such as the divergence. Our proposed architecture has a Jacobian matrix composed of diagonal and hollow (zero-diagonal) components. We demonstrate these cheap differential operators on root-finding problems, exact density evaluation for continuous normalizing flows, and evaluating the Fokker-Planck equation.
Ricky Tian Qi Chen and David Duvenaud
Neural Information Processing Systems, 2019. paper  slides  bibtex

Efficient Graph Generation with Graph Recurrent Attention Networks
We propose a new family of efficient and expressive deep generative models of graphs. We use graph neural networks to generate new edges conditioned on the already-sampled parts of the graph, reducing dependence on node ordering and bypassing the bottleneck caused by the sequential nature of RNNs. We achieve state-of-the-art time efficiency and sample quality compared to previous models, and generate graphs of up to 5000 nodes.
Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, Charlie Nash, William L. Hamilton, David Duvenaud, Raquel Urtasun, Rich Zemel
Neural Information Processing Systems, 2019. paper  code  bibtex

Latent ODEs for Irregularly-Sampled Time Series
Time series with non-uniform intervals occur in many applications, and are difficult to model using standard recurrent neural networks. We generalize RNNs to have continuous-time hidden dynamics defined by ordinary differential equations. These models can naturally handle arbitrary time gaps between observations, and can explicitly model the probability of observation times using Poisson processes.
Yulia Rubanova, Ricky Tian Qi Chen, David Duvenaud
Neural Information Processing Systems, 2019. paper  code  bibtex

Residual Flows for Invertible Generative Modeling
Invertible residual networks provide transformations where only Lipschitz conditions rather than architectural constraints are needed for enforcing invertibility. We give a tractable unbiased estimate of the log density, and improve these models in other ways. The resulting approach, called Residual Flows, achieves state-of-the-art performance on density estimation amongst flow-based models.
Ricky Tian Qi Chen, Jens Behrmann, David Duvenaud, Jörn-Henrik Jacobsen
Neural Information Processing Systems, 2019. paper  code  slides  bibtex

Invertible Residual Networks
We show that standard ResNet architectures can be made invertible, allowing the same model to be used for classification, density estimation, and generation. Our approach only requires adding a simple normalization step during training. Invertible ResNets define a generative model which can be trained by maximum likelihood on unlabeled data. To compute likelihoods, we introduce a tractable approximation to the Jacobian log-determinant of a residual block. Our empirical evaluation shows that invertible ResNets perform competitively with both state-of-the-art image classifiers and flow-based generative models, something that has not been previously achieved with a single architecture.
Jens Behrmann, Will Grathwohl, Ricky Tian Qi Chen, David Duvenaud, Jörn-Henrik Jacobsen
International Conference on Machine Learning, 2019. paper  code  slides  bibtex

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions
Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer.
Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, Roger Grosse
International Conference on Learning Representations, 2019. paper  code  slides  bibtex

FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models
Training normalized generative models such as Real NVP or Glow requires restricting their architectures to allow cheap computation of Jacobian determinants. Alternatively, if the transformation is specified by an ordinary differential equation, then the Jacobian's trace can be used. We use Hutchinson's trace estimator to give a scalable unbiased estimate of the log-density. The result is a continuous-time invertible generative model with unbiased density estimation and one-pass sampling, while allowing unrestricted neural network architectures. We demonstrate our approach on high-dimensional density estimation, image generation, and variational inference, improving the state-of-the-art among exact likelihood methods with efficient sampling.
Will Grathwohl, Ricky Tian Qi Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud
International Conference on Learning Representations, 2019. Oral presentation. paper  slides  code  bibtex
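To make the trace-estimation step concrete, here is a minimal sketch of Hutchinson's estimator, which approximates tr(A) by averaging v^T A v over random probe vectors v with identity covariance (Rademacher signs here). The helper names `matvec` and `hutchinson_trace` are hypothetical, and an explicit matrix stands in for the Jacobian of the learned dynamics; in practice the product with the Jacobian would come from automatic differentiation rather than from a stored matrix.

```python
import random


def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix (illustration only)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


def hutchinson_trace(matvec_fn, dim, num_samples, rng):
    """Estimate tr(A) using only products A @ v, never the diagonal of A itself."""
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Av = matvec_fn(v)
        total += sum(v_i * av_i for v_i, av_i in zip(v, Av))
    return total / num_samples
```

The estimate is unbiased for any square matrix. For a diagonal matrix a single Rademacher probe already recovers the trace exactly, since every v_i squared equals 1; off-diagonal entries are what contribute variance.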

Explaining Image Classifiers by Counterfactual Generation
When an image classifier makes a prediction, which parts of the image are relevant and why? We can rephrase this question to ask: which parts of the image, if they were not seen by the classifier, would most change its decision? Producing an answer requires marginalizing over images that could have been seen but weren't. We can sample plausible image in-fills by conditioning a generative model on the rest of the image. We then optimize to find the image regions that most change the classifier's decision after in-fill. Our approach contrasts with ad-hoc in-filling approaches, such as blurring or injecting noise, which generate inputs far from the data distribution, and ignore informative relationships between different parts of the image. Our method produces more compact and relevant saliency maps, with fewer artifacts compared to previous methods.
Chun-Hao Chang, Elliot Creager, Anna Goldenberg, David Duvenaud
International Conference on Learning Representations, 2019. paper  code  slides  bibtex

Stochastic Hyperparameter Optimization Through Hypernetworks
Models are usually tuned by nesting optimization of model weights inside the optimization of hyperparameters. We collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our method trains a neural net to output approximately optimal weights as a function of hyperparameters. This method converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.
Jonathan Lorraine, David Duvenaud
paper  bibtex  slides  code

Neural Ordinary Differential Equations
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
Ricky Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud
Neural Information Processing Systems, 2018. Best paper award. paper  bibtex  slides  talk  code
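A forward pass of such a model can be sketched in a few lines: a small network defines dh/dt, and a solver integrates it from the input state to the output state. This is only a sketch under stated assumptions. The names `rk4_solve`, `make_dynamics` and `ode_block` are hypothetical, the dynamics are a single tanh layer, and a fixed-step Runge-Kutta solver stands in for the adaptive black-box solver. It does not implement the backpropagation method mentioned above.

```python
import math


def rk4_solve(f, h0, t0, t1, num_steps=100):
    """Integrate dh/dt = f(h, t) from t0 to t1 with classical fixed-step RK4."""
    h = list(h0)
    dt = (t1 - t0) / num_steps
    t = t0
    for _ in range(num_steps):
        k1 = f(h, t)
        k2 = f([x + 0.5 * dt * k for x, k in zip(h, k1)], t + 0.5 * dt)
        k3 = f([x + 0.5 * dt * k for x, k in zip(h, k2)], t + 0.5 * dt)
        k4 = f([x + dt * k for x, k in zip(h, k3)], t + dt)
        h = [x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for x, a, b, c, d in zip(h, k1, k2, k3, k4)]
        t += dt
    return h


def make_dynamics(W, b):
    """A one-layer 'network' giving the time derivative of the hidden state."""
    def f(h, t):
        return [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
                for row, b_i in zip(W, b)]
    return f


def ode_block(h0, W, b, t1=1.0):
    """Continuous-depth block: the output is the hidden state at time t1."""
    return rk4_solve(make_dynamics(W, b), h0, 0.0, t1)
```

Raising or lowering `num_steps` is the fixed-step analogue of trading numerical precision for speed; an adaptive solver would pick step sizes per input instead.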

Isolating Sources of Disentanglement in Variational Autoencoders
Variational autoencoders can be regularized to produce disentangled representations, in which each latent dimension has a distinct meaning. However, existing regularization schemes also hurt the model's ability to model the data. We show a simple method to regularize only the part that causes disentanglement. We also give a principled, classifier-free measure of disentanglement called the mutual information gap.
Ricky Tian Qi Chen, Xuechen Li, Roger Grosse, David Duvenaud
Neural Information Processing Systems, 2018. Oral presentation. paper  bibtex  slides  code

Noisy Natural Gradient as Variational Inference
Bayesian neural nets combine the flexibility of deep learning with uncertainty estimation, but are usually approximated using a fully-factorized Gaussian. We show that natural gradient ascent with adaptive weight noise implicitly fits a variational Gaussian posterior. This insight allows us to train full-covariance, fully factorized, or matrix-variate Gaussian variational posteriors using noisy versions of natural gradient, Adam, and K-FAC, respectively, allowing us to scale to modern-size convnets. Our noisy K-FAC algorithm makes better predictions and has better-calibrated uncertainty than existing methods. This leads to more efficient exploration in active learning and reinforcement learning.
Guodong Zhang, Shengyang Sun, David Duvenaud, Roger Grosse
International Conference on Machine Learning, 2018. paper  bibtex  video  code

Inference Suboptimality in Variational Autoencoders
Amortized inference allows latent-variable models to scale to large datasets. The quality of approximate inference is determined by two factors: a) the capacity of the variational distribution to match the true posterior and b) the ability of the recognition net to produce good variational parameters for each datapoint. We show that the recognition net giving bad variational parameters is often a bigger problem than using a Gaussian approximate posterior, because the generator can adapt to it.

Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
We learn low-variance, unbiased gradient estimators for any function of random variables. We backprop through a neural net surrogate of the original function, which is optimized to minimize gradient variance during the optimization of the original objective. We train discrete latent-variable models, and do continuous and discrete reinforcement learning with an adaptive, action-conditional baseline.
Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud
International Conference on Learning Representations, 2018. paper  code  slides  bibtex

Automatic chemical design using a data-driven continuous representation of molecules
We develop a molecular autoencoder, which converts discrete representations of molecules to and from a continuous representation. This allows gradient-based optimization through the space of chemical compounds. Continuous representations also let us generate novel chemicals by interpolating between molecules.
Rafa Gómez-Bombarelli, Jennifer Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
American Chemical Society Central Science, 2018. paper  bibtex  slides  code
Sticking the landing: Simple, lower-variance gradient estimators for variational inference
We give a simple recipe for reducing the variance of the gradient of the variational evidence lower bound.
The entire trick is just removing one term from the gradient.
Removing this term leaves an unbiased gradient estimator whose variance approaches zero as the approximate posterior approaches the exact posterior.
We also generalize this trick to mixtures and importance-weighted posteriors.
Neural Information Processing Systems, 2017. paper  bibtex  code
Reinterpreting importance-weighted autoencoders
The standard interpretation of importance-weighted autoencoders is that they maximize a tighter, multi-sample lower bound than the standard evidence lower bound. We give an alternate interpretation: they optimize the standard lower bound, but using a more complex distribution, which we show how to visualize.
Chris Cremer, Quaid Morris, David Duvenaud
ICLR Workshop track, 2017. paper  bibtex
Composing graphical models with neural networks for structured representations and fast inference
We propose a general modeling and inference framework that combines the complementary strengths of probabilistic graphical models and deep learning methods. Our model family composes latent graphical models with neural network observation likelihoods. All components are trained simultaneously. We use this framework to automatically segment and categorize mouse behavior from raw depth video.
Matthew Johnson, David Duvenaud, Alex Wiltschko, Bob Datta, Ryan P. Adams
Neural Information Processing Systems, 2016. paper  video  code  slides  bibtex  animation
Probing the Compositionality of Intuitive Functions
How do people learn about complex functional structure? We propose that humans use compositionality: complex structure is decomposed into simpler building blocks. We formalize this idea using a grammar over Gaussian process kernels. We show that people prefer compositional extrapolations, and argue that this is consistent with broad principles of human cognition.
Eric Schulz, Joshua B. Tenenbaum, David Duvenaud, Maarten Speekenbrink, Samuel J. Gershman
Neural Information Processing Systems, 2016. paper  video  bibtex
Autograd: Reverse-mode differentiation of native Python
Autograd automatically differentiates native Python and NumPy code. It can handle loops, ifs, recursion and closures, and it can even take derivatives of its own derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it's efficient for gradient-based optimization. Check out the tutorial and the examples directory.
Dougal Maclaurin, David Duvenaud, Matthew Johnson
code  bibtex  slides
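To show what reverse-mode differentiation of ordinary Python means, here is a toy scalar version written from scratch. It is not Autograd's implementation or API: the `Var` class, the `sin` wrapper and this `grad` function are hypothetical stand-ins that record each operation's local derivative during the forward pass and then sweep backwards through the recorded graph. Unlike Autograd, the sketch handles only scalars and cannot differentiate its own derivatives.

```python
import math


class Var:
    """A scalar value that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = _wrap(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = _wrap(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def _wrap(x):
    """Promote plain numbers to constant Vars."""
    return x if isinstance(x, Var) else Var(float(x))


def sin(x):
    """Sine of a Var, recording cos(x) as the local derivative."""
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def grad(fn):
    """Return a function that computes d fn(x) / dx by a backward sweep."""
    def grad_fn(x):
        inp = Var(float(x))
        out = fn(inp)
        # Order the graph so every node comes after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(out)
        out.grad = 1.0
        # Walk outputs-first, pushing each node's gradient to its parents.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad
        return inp.grad
    return grad_fn
```

Because the graph is recorded while the function runs, Python control flow needs no special handling: a `for` loop that multiplies `x` into an accumulator three times differentiates like x**4.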
Early Stopping as Nonparametric Variational Inference
Stochastic gradient descent samples from a nonparametric distribution, implicitly defined by the transformation of the initial distribution by an optimizer. We track the loss of entropy during optimization to get a scalable estimate of the marginal likelihood. This Bayesian interpretation of SGD gives a theoretical foundation for popular tricks such as early stopping and ensembling. We evaluate our marginal likelihood estimator on neural network models.
David Duvenaud, Dougal Maclaurin, Ryan P. Adams
Artificial Intelligence and Statistics, 2016. paper  slides  code  bibtex
Convolutional Networks on Graphs for Learning Molecular Fingerprints
We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the entire feature pipeline. This architecture generalizes standard molecular fingerprints. These data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Related work led to our Nature Materials paper.
David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafa Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
Neural Information Processing Systems, 2015. pdf  slides  code  bibtex
Gradient-based Hyperparameter Optimization through Reversible Learning
Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of the validation loss with respect to all hyperparameters by differentiating through the entire training procedure. This lets us optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural net architectures.
Dougal Maclaurin, David Duvenaud, Ryan P. Adams
International Conference on Machine Learning, 2015. pdf  slides  code  bibtex
Probabilistic ODE Solvers with Runge-Kutta Means
We show that some standard differential equation solvers are equivalent to Gaussian process predictive means, giving them a natural way to handle uncertainty. This work is part of the larger probabilistic numerics research agenda, which interprets numerical algorithms as inference procedures so they can be better understood and extended.
Michael Schober, David Duvenaud, Philipp Hennig
Neural Information Processing Systems, 2014. Oral presentation. pdf  slides  bibtex
PhD Thesis: Automatic Model Construction with Gaussian Processes
 
Automatic Construction and Natural-Language Description of Nonparametric Regression Models
We wrote a program which automatically writes reports summarizing automatically constructed models. A prototype for the automatic statistician project.
James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
Association for the Advancement of Artificial Intelligence (AAAI), 2014. pdf  code  slides  example report  airline  example report  solar  more examples  bibtex
Avoiding Pathologies in Very Deep Networks
To suggest better neural network architectures, we analyze the properties of different priors on compositions of functions. We study deep Gaussian processes, a type of infinitely-wide, deep neural net. We also examine infinitely deep covariance functions. Finally, we show that you get additive covariance if you do dropout on Gaussian processes.
David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani
Artificial Intelligence and Statistics, 2014. pdf  code  slides  video of 50-layer warping  video of 50-layer density  bibtex
Warped Mixtures for Nonparametric Cluster Shapes
If you fit a mixture of Gaussians to a single cluster that is curved or heavy-tailed, your model will report that the data contains many clusters! To fix this problem, we warp a latent mixture of Gaussians into nonparametric cluster shapes. The low-dimensional latent mixture model summarizes the properties of the high-dimensional density manifolds describing the data.
Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani
Uncertainty in Artificial Intelligence, 2013. pdf  code  slides  talk  bibtex
Structure Discovery in Nonparametric Regression through Compositional Kernel Search
How could an AI do statistics? To search through an open-ended class of structured, nonparametric regression models, we introduce a simple grammar which specifies composite kernels. These structured models often allow an interpretable decomposition of the function being modeled, as well as long-range extrapolation. Many common regression methods are special cases of this large family of models.
David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani
International Conference on Machine Learning, 2013. pdf  code  short slides  long slides  bibtex
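The idea of a grammar over kernels can be sketched as a handful of base kernels closed under addition and multiplication. The sketch below shows only that composition step. The function names and the particular parameterizations are illustrative assumptions, and the search over expressions (scoring each candidate model on data) is not shown.

```python
import math


def se(lengthscale=1.0):
    """Squared-exponential kernel: smooth local variation."""
    return lambda x, y: math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def periodic(period=1.0, lengthscale=1.0):
    """Periodic kernel: exactly repeating structure."""
    return lambda x, y: math.exp(
        -2.0 * math.sin(math.pi * (x - y) / period) ** 2 / lengthscale ** 2)


def linear():
    """Linear kernel: straight-line trends."""
    return lambda x, y: x * y


def add(k1, k2):
    """Sum of kernels: the function is a sum of independent components."""
    return lambda x, y: k1(x, y) + k2(x, y)


def mul(k1, k2):
    """Product of kernels: one structure modulated by another."""
    return lambda x, y: k1(x, y) * k2(x, y)


# Example expression: a linear trend plus a slowly changing periodic component.
trend_plus_seasonal = add(linear(), mul(se(2.0), periodic(1.0)))
```

Sums and products of valid kernels are themselves valid kernels, which is what lets such expressions be nested to any depth.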
HarlMCMC Shake
Two short animations illustrate the differences between a Metropolis-Hastings (MH) sampler and a Hamiltonian Monte Carlo (HMC) sampler, to the tune of the Harlem shake. This inspired several follow-up videos: benchmark your MCMC algorithm on these distributions!
by Tamara Broderick and David Duvenaud
video  code
Active Learning of Model Evidence using Bayesian Quadrature
Instead of the usual Monte-Carlo based methods for computing integrals of likelihood functions, we construct a surrogate model of the likelihood function, and infer its integral conditioned on a set of evaluations. This allows us to evaluate the likelihood wherever is most informative, instead of running a Markov chain. This means fewer evaluations to estimate integrals.
Michael Osborne, David Duvenaud, Roman Garnett, Carl Rasmussen, Stephen Roberts, Zoubin Ghahramani
Neural Information Processing Systems, 2012. pdf  code  slides  related talk  bibtex
Optimally-Weighted Herding is Bayesian Quadrature
We prove several connections between a numerical integration method that minimizes a worst-case bound (herding), and a model-based way of estimating integrals (Bayesian quadrature). It turns out that both optimize the same criterion, and that Bayesian Quadrature does this optimally.
Ferenc Huszár and David Duvenaud
Uncertainty in Artificial Intelligence, 2012. Oral presentation. pdf  code  slides  talk  bibtex
Additive Gaussian Processes
When functions have additive structure, we can extrapolate further than with standard Gaussian process models. We show how to efficiently integrate over exponentially-many ways of modeling a function as a sum of low-dimensional functions.
David Duvenaud, Hannes Nickisch, Carl Rasmussen
Neural Information Processing Systems, 2011. pdf  code  slides  bibtex
Multiscale Conditional Random Fields for Semi-supervised Labeling and Classification
How can we take advantage of images labeled only by what objects they contain?
By combining information across different scales, we use image-level labels (such as
Canadian Conference on Computer and Robot Vision, 2011. pdf  code  slides  bibtex