Drawing inspiration from many of my great math professors from undergrad,
this website will be poorly made and have lots of improperly formatted HTML.
This website is getting progressively more and more out of date. For an up-to-date list of my publications, please see my Google Scholar Page.
I completed my undergraduate degree in Mathematics at MIT in 2014.
I completed my PhD in the Machine Learning Group here at the University of Toronto. I am now a research scientist at DeepMind in New York City.
From February 2019 to June 2021, I worked part-time at Google Brain in Toronto.
I was awarded the Google PhD Fellowship in Machine Learning for 2021 but had to decline because I graduated before the fellowship would begin. I completed my PhD in summer 2021. Since then I have worked at Google DeepMind in New York City. In October 2023, I was promoted to Senior Research Scientist.
Over the past year I was a core member of the team that built Lyria, Google DeepMind's state-of-the-art generative model for music.
Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC: Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, Will Grathwohl
We examine compositionality in diffusion models, that is, composing multiple pre-trained diffusion models to create models for novel distributions. While simple in theory, we find that composing these models does not perform well in practice. We attribute this to failures in sampling and remedy the issue by using more advanced sampling techniques such as HMC. Further, we propose an energy-based parameterization of diffusion models, which enables the use of Metropolis-corrected MCMC samplers and further improves performance in compositional generation.
Denoising Diffusion Samplers: Francisco Vargas, Will Grathwohl, Arnaud Doucet
We begin to explore how ideas from diffusion models can be applied to the task of sampling from unnormalized probability distributions.
Score-based diffusion meets annealed importance sampling: Arnaud Doucet, Will Grathwohl, Alexander G Matthews, Heiko Strathmann
We draw connections between the long-standing method of annealed importance sampling and the recently successful diffusion models. We find that most AIS methods use a convenient (but sub-optimal) backwards transition kernel and show that, taking inspiration from diffusion models, we can learn a backwards kernel which notably improves the performance of AIS in many situations.
Learning to Navigate Wikipedia by Taking Random Walks: Manzil Zaheer, Kenneth Marino, Will Grathwohl, John Schultz, Wendy Shang, Sheila Babayan, Arun Ahuja, Ishita Dasgupta, Christine Kaeser-Chen, Rob Fergus
In this work we interpret Wikipedia as a graph in which each paragraph is a node connected to the paragraphs above and below it, as well as to those it links to through outgoing hyperlinks. We then train agents to navigate this graph using only the text at each node. We find that these agents (and their learned embeddings) are useful for downstream tasks such as fact-verification and question-answering.
Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models: Eli N. Weinstein, Alan N. Amin, Will S. Grathwohl, Daniel Kassler, Jean Disset, Debora Marks
We interpret common stochastic DNA synthesis procedures as probabilistic generative models. From this interpretation we can apply standard inference procedures to tune the parameters of these laboratory methods and accurately synthesize novel DNA sequences with greatly reduced cost compared to exact methods.
Oops I Took A Gradient: Scalable Sampling for Discrete Distributions: Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, Chris J. Maddison
ICML 2021. Long Oral Presentation. Outstanding Paper Award Honorable Mention. My Talk.
We present a new approach to MCMC sampling for discrete distributions. Our approach exploits a structure that is ubiquitous in discrete distributions of interest, namely gradients, which we use to inform proposals for Metropolis-Hastings. This new sampler greatly outperforms prior samplers for discrete distributions, such as Gibbs and the Hamming Ball sampler, which, for the first time, enables the training of deep EBMs on high-dimensional discrete data.
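As a rough illustration of gradient-informed proposals, here is a minimal sketch on a made-up three-variable binary model. It is not the code from the paper: the model (`A`, `W`), the function names, and the chain length are all assumptions chosen for the example. The gradient of the log-probability is used to estimate how much each single-bit flip would change the log-probability, a softmax over those estimates picks which bit to flip, and a Metropolis-Hastings correction keeps the chain exact.

```python
import math
import random
from itertools import product

# Hypothetical toy model over x in {0,1}^3 (an assumption for illustration):
# log p(x) = A.x + 0.5 * x^T W x + const, with W symmetric and zero diagonal.
A = [0.5, -0.3, 0.2]
W = [[0.0, 0.8, -0.4],
     [0.8, 0.0, 0.6],
     [-0.4, 0.6, 0.0]]
D = 3

def log_prob(x):
    linear = sum(A[i] * x[i] for i in range(D))
    pair = 0.5 * sum(W[i][j] * x[i] * x[j] for i in range(D) for j in range(D))
    return linear + pair

def grad_log_prob(x):
    # Gradient of log_prob, treating x as if it were real-valued.
    return [A[i] + sum(W[i][j] * x[j] for j in range(D)) for i in range(D)]

def flip_proposal(x):
    """Distribution over which bit to flip, informed by the gradient."""
    g = grad_log_prob(x)
    # First-order estimate of log_prob(flip_i(x)) - log_prob(x) for each i.
    d = [(1 - 2 * x[i]) * g[i] for i in range(D)]
    m = max(d)
    w = [math.exp((di - m) / 2.0) for di in d]  # softmax(d / 2)
    z = sum(w)
    return [wi / z for wi in w]

def mh_step(x, rng):
    q_fwd = flip_proposal(x)
    i = rng.choices(range(D), weights=q_fwd)[0]
    x_new = list(x)
    x_new[i] = 1 - x_new[i]
    q_rev = flip_proposal(x_new)
    # Metropolis-Hastings acceptance for the asymmetric flip proposal.
    log_accept = (log_prob(x_new) - log_prob(x)
                  + math.log(q_rev[i]) - math.log(q_fwd[i]))
    if rng.random() < math.exp(min(0.0, log_accept)):
        return x_new
    return x

def run_chain(n_steps=20000, seed=0):
    rng = random.Random(seed)
    x = [0] * D
    totals = [0] * D
    for _ in range(n_steps):
        x = mh_step(x, rng)
        for i in range(D):
            totals[i] += x[i]
    return [t / n_steps for t in totals]

def exact_mean():
    # Brute-force marginals, feasible only because the toy model is tiny.
    states = list(product([0, 1], repeat=D))
    weights = [math.exp(log_prob(s)) for s in states]
    z = sum(weights)
    return [sum(w * s[i] for w, s in zip(weights, states)) / z for i in range(D)]
```

Calling `run_chain()` and `exact_mean()` gives the sampled and brute-force marginals for comparison. In this toy model the log-probability is linear in each coordinate, so the gradient-based flip estimates happen to be exact; for general models they are only a first-order approximation.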
No MCMC for me: Amortized sampling for fast and stable training of energy-based models: Will Grathwohl, Jacob Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, David Duvenaud
We present a new method for training Energy-Based Models. Our method uses a generator to amortize the sampling typically used in EBM training. Key to our approach is a new, fast method to regularize the entropy of latent-variable generators. We demonstrate that training in this way is faster and more stable than MCMC-based training. This leads to improved performance of JEM models and allows JEM to be applied to semi-supervised learning for tabular data, outperforming Virtual Adversarial Training.
Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling: Will Grathwohl, Kuan-Chieh Wang, Jorn-Henrik Jacobsen, David Duvenaud, Richard Zemel.
We present a new method for training and evaluating unnormalized density models. Our method is based on estimating the Stein Discrepancy between our model and the data distribution. Unlike other discrepancy measures, the Stein Discrepancy only requires an unnormalized model and samples from the data distribution to evaluate. We train a neural network to estimate this discrepancy and show that it can be used for goodness-of-fit testing, model evaluation, and model training. Our method greatly outperforms previous kernel-based methods for estimating Stein Discrepancies.
Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One: Will Grathwohl, Jackson Wang, Jorn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, Kevin Swersky.
ICLR 2020. Oral Presentation. My Talk.
We show that you can reinterpret standard classification architectures as energy-based generative models and train them as such. Doing this allows us to achieve SOTA performance at BOTH generative and discriminative modeling in a single model. Adding this energy-based training also gives other surprising benefits such as improved calibration, mechanisms for out-of-distribution detection, and adversarial robustness!
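The reinterpretation can be sketched in a few lines, assuming the usual reading in which a classifier's logits are treated as unnormalized joint log-densities over (input, label) pairs. This is a minimal illustration with made-up logit values and hypothetical function names, not the training code from the paper. The softmax gives the familiar class probabilities, while the log-sum-exp of the same logits gives an unnormalized log-density (negative energy) for the input.

```python
import math

def logsumexp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def class_probs(logits):
    # Standard softmax classifier: p(y | x).
    lse = logsumexp(logits)
    return [math.exp(l - lse) for l in logits]

def energy(logits):
    # Energy of the input under the implied generative model:
    # E(x) = -logsumexp_y f(x)[y], so p(x) is proportional to exp(-E(x)).
    return -logsumexp(logits)

# Made-up logits for a single input (an assumption for illustration).
logits = [2.0, 0.5, -1.0]
# Adding a constant to every logit leaves the classifier unchanged...
shifted = [l + 3.0 for l in logits]
same_classifier = all(
    abs(p - q) < 1e-12 for p, q in zip(class_probs(logits), class_probs(shifted))
)
# ...but lowers the energy, i.e. raises the unnormalized density of the input.
energy_change = energy(shifted) - energy(logits)
```

The shift by a constant is the degree of freedom that an ordinary classifier ignores and that the energy-based view puts to use: `same_classifier` is true, while `energy_change` equals the negative of the shift.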
Understanding the Limitations of Conditional Generative Models: Ethan Fetaya, Jörn-Henrik Jacobsen, Will Grathwohl, Richard Zemel.
We examine the performance of conditional generative models for discriminative tasks and study why they fail to perform as well as purely discriminative models. We provide theoretical justification for this failing and provide a new dataset which demonstrates our theory.
Invertible Residual Networks: Jens Behrmann*, Will Grathwohl*, Ricky T. Q. Chen, David Duvenaud, Jorn-Henrik Jacobsen* (*equal contribution)
ICML 2019. Long Oral Presentation.
We make ResNets invertible without dimension-splitting heuristics. We demonstrate that these models can be used in building state-of-the-art generative and discriminative models.
Modeling Global Class Structure Leads to Rapid Integration of New Classes: Will Grathwohl, Eleni Triantafillou, Xuechen Li, David Duvenaud and Richard Zemel.
NeurIPS 2018 Workshop on Meta-Learning
NeurIPS 2018 Workshop on Continual Learning
Training Glow with Constant Memory Cost: Xuechen Li, Will Grathwohl.
NIPS 2018 Workshop on Bayesian Deep Learning
Gradient-Based Optimization of Neural Network Architecture: Will Grathwohl, Elliot Creager, Kamyar Ghasemipour, Richard Zemel.
ICLR 2018 Workshop.
Backpropagation through the Void: Optimizing control variates for black-box gradient estimation: Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud.
NIPS 2017 Deep Reinforcement Learning Symposium. Oral Presentation. Video of my talk found here.
Using and Abusing Gradients for Discrete MCMC and Energy-Based Models: ICLR 2021 Workshop on Energy-Based Models, May 2021. Link.
Using and Abusing Gradients for Discrete MCMC and Energy-Based Models: CMU Artificial Intelligence Seminar Series, Feb 2021. Link.
Your Brain on Energy-Based Models: Seminar on Theoretical Machine Learning, Institute for Advanced Study, March 2020. Video.
Your Classifier is Secretly an Energy-Based Model and You Should Treat it Like One: Generative Models and Uncertainty, Copenhagen, Denmark, October 2019.
A workshop on generative models organized at the Technical University of Denmark.
Awards and Fellowships
Borealis AI Graduate Fellowship: A $50,000, 2 year fellowship funding research in AI. Funded by the Royal Bank of Canada.
Huawei Prize: A financial award based on academic and research performance.
ICLR 2018 Travel Award
Best Paper Award: Symposium on Advances in Approximate Bayesian Inference 2018