Jonathan Peter Lorraine

I'm a Ph.D. candidate in machine learning at the University of Toronto and the Vector Institute. My research focuses on hyperparameter optimization, learning in games, and - more generally - nested optimization. During the Summer of 2022, I'll be a research intern at NVIDIA. Previously, I was a research intern at Google and Facebook (now Meta) AI.

Curriculum Vitae


Advisor: David Duvenaud

Location: Vector Institute, MaRS Center, Toronto


Jon's Google Scholar Jon's Twitter Jon's Linkedin


  • CSC401/CSC2511: Natural Language Computing (Winter 2022)
  • CSC413/CSC2516: Neural Networks and Deep Learning (Winter 2021)
  • CSC2515: Machine Learning (Winter 2021)
  • CSC311: Introduction to Machine Learning (Fall 2020)
  • CSC413/CSC2516: Neural Networks and Deep Learning (Winter 2020)
  • CSC2547: Learning to Search (Fall 2019)
  • CSC412/CSC2506: Probabilistic Learning and Reasoning (Winter 2019)
  • CSC411/CSC2515: Introduction to Machine Learning (Fall 2018)

  • Service

  • I was/am a reviewer for ICML2022/21, AISTATS2022/21, ICLR2021/20, AAMAS2022, and NeurIPS2021/20/18.

  • Papers

    Link to implicitBias paper pdf On Implicit Bias in Overparameterized Bilevel Optimization

    Many problems in machine learning involve bilevel optimization (BLO), including hyperparameter optimization, meta-learning, and dataset distillation. Bilevel problems involve inner and outer parameters, each optimized for its own objective. Often, at least one of the two levels is underspecified and there are multiple ways to choose among equivalent optima. Inspired by recent studies of the implicit bias induced by optimization algorithms in single-level optimization, we investigate the implicit bias of different gradient-based algorithms for jointly optimizing the inner and outer parameters. We delineate two standard BLO methods—cold-start and warm-start BLO—and show that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation. We also show that the solutions from warm-start BLO can encode a surprising amount of information about the outer objective, even when the outer optimization variables are low-dimensional. We believe that implicit bias deserves as central a role in the study of bilevel optimization as it has attained in the study of single-level neural net optimization.

    Paul Vicol, Jonathan Lorraine, Fabian Pedregosa, David Duvenaud, Roger Grosse
    ICML 2022
    proceedings | workshop talk | slides | poster | bibtex

    Link to LyapForDiversity paper pdf Lyapunov Exponents for Diversity in Differentiable Games

    Ridge Rider (RR) is an algorithm for finding diverse solutions to optimization problems by following eigenvectors of the Hessian ("ridges"). RR is designed for conservative gradient systems (i.e., settings involving a single loss function), where it branches at saddles - easy-to-find bifurcation points. We generalize this idea to non-conservative, multi-agent gradient systems by proposing a method - denoted Generalized Ridge Rider (GRR) - for finding arbitrary bifurcation points. We give theoretical motivation for our method by leveraging machinery from the field of dynamical systems. We construct novel toy problems where we can visualize new phenomena while giving insight into high-dimensional problems of interest. Finally, we empirically evaluate our method by finding diverse solutions in the iterated prisoners' dilemma and relevant machine learning problems including generative adversarial networks.

    Jonathan Lorraine, Paul Vicol, Jack Parker-Holder, Tal Kachman, Luke Metz, Jakob Foerster
    Oral AAMAS, 2022
    proceedings | arXiv | talk | slides short, long | bibtex | twitter thread

    Link to CM paper pdf Complex Momentum for Optimization in Games

    We generalize gradient descent with momentum for optimization in differentiable games to have complex-valued momentum. We give theoretical motivation for our method by proving convergence on bilinear zero-sum games for simultaneous and alternating updates. Our method gives real-valued parameter updates, making it a drop-in replacement for standard optimizers. We empirically demonstrate that complex-valued momentum can improve convergence in realistic adversarial games — like generative adversarial networks — by showing we can find better solutions with an almost identical computational cost. We also show a practical generalization to a complex-valued Adam variant, which we use to train BigGAN to better inception scores on CIFAR-10.

    Jonathan Lorraine, David Acuna, Paul Vicol, David Duvenaud
    AISTATS, 2022
    proceedings | arXiv | talk | slides | poster | bibtex | twitter thread

    Link to ICGN paper pdf Input Convex Gradient Networks

    The gradients of convex functions are expressive models of non-trivial vector fields. For example, Brenier's theorem yields that the optimal transport map between any two measures on Euclidean space under the squared distance is realized as a convex gradient, which is a key insight used in recent generative flow models. In this paper, we study how to model convex gradients by integrating a Jacobian-vector product parameterized by a neural network, which we call the Input Convex Gradient Network (ICGN). We theoretically study ICGNs and compare them to taking the gradient of an Input-Convex Neural Network (ICNN), empirically demonstrating that a single layer ICGN can fit a toy example better than a single layer ICNN. Lastly, we explore extensions to deeper networks and connections to constructions from Riemannian geometry.

    Jack Richter-Powell, Jonathan Lorraine, Brandon Amos
    Spotlight NeurIPS OTML Workshop, 2021
    arXiv | talk slideslive, youtube | slides | poster | bibtex

    Link to metaPretrain paper pdf Meta-Learning to Improve Pre-Training

    Pre-training (PT) followed by fine-tuning (FT) is an effective method for training neural networks, and has led to significant performance improvements in many domains. PT can incorporate various design choices such as task and data reweighting strategies, augmentation policies, and noise models, all of which can significantly impact the quality of representations learned. The hyperparameters introduced by these strategies therefore must be tuned appropriately. However, setting the values of these hyperparameters is challenging. Most existing methods either struggle to scale to high dimensions, are too slow and memory-intensive, or cannot be directly applied to the two-stage PT and FT learning process. In this work, we propose an efficient, gradient-based algorithm to meta-learn PT hyperparameters. We formalize the PT hyperparameter optimization problem and propose a novel method to obtain PT hyperparameter gradients by combining implicit differentiation and backpropagation through unrolled optimization. We demonstrate that our method improves predictive performance on two real-world domains. First, we optimize high-dimensional task weighting hyperparameters for multitask pre-training on protein-protein interaction graphs and improve AUROC by up to 3.9%. Second, we optimize a data augmentation neural network for self-supervised PT with SimCLR on electrocardiography data and improve AUROC by up to 1.9%.

    Aniruddh Raghu, Jonathan Lorraine, Simon Kornblith, Matthew McDermott, David Duvenaud
    NeurIPS, 2021
    proceedings | OpenReview | arXiv | talk | slides | poster | bibtex | twitter thread

    Link to UsingBifurcations paper pdf Using Bifurcations for Diversity in Differentiable Games

    Ridge Rider (RR) is an algorithm for finding diverse solutions to optimization problems by following eigenvectors of the Hessian ("ridges''). RR is designed for conservative gradient systems (i.e., settings involving a single loss function), where it branches at saddles - the only relevant bifurcation points. We generalize this idea to non-conservative, multi-agent gradient systems by identifying new types of bifurcation points and proposing a method to follow eigenvectors with complex eigenvalues. We give theoretical motivation for our method - denoted Game Ridge Rider (GRR) - by leveraging machinery from the field of dynamical systems. Finally, we empirically motivate our method by constructing novel toy problems where we can visualize new phenomena and by finding diverse solutions in the iterated prisoners' dilemma, where existing methods often converge to a single solution mode.

    Jonathan Lorraine, Jack Parker-Holder, Paul Vicol, Aldo Pacchiano, Luke Metz, Tal Kachman, Jakob Foerster
    ICML BOFM Workshop, 2021
    paper 1, 2, 3 | talk | slides | poster | bibtex

    Link to IFT paper pdf Optimizing Millions of Hyperparameters by Implicit Differentiation

    We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. We learn a data-augmentation network - where every weight is a hyperparameter tuned for validation performance - that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.

    Jonathan Lorraine, Paul Vicol, David Duvenaud
    AISTATS, 2020
    proceedings | arXiv | slides | slideslive talk 1, 2 | poster | code | blog | bibtex | twitter thread

    Link to JacNet paper pdf JacNet: Learning Functions with Structured Jacobians

    Neural networks are trained to learn an approximate mapping from an input domain to a target domain. Often, incorporating prior knowledge about the true mapping is critical to learning a useful approximation. With current architectures, it is difficult to enforce structure on the derivatives of the input-output mapping. We propose to directly learn the Jacobian of the input-output function with a neural network, which allows easy control of derivative. We focus on structuring the derivative to allow invertibility, and also demonstrate other useful priors can be enforced, such as k-Lipschitz. Using this approach, we are able to learn approximations to simple functions which are guaranteed to be invertible, and easily compute the inverse. We also show a similar results for 1-Lipschitz functions.

    Jonathan Lorraine, Safwan Hossain
    ICML INNF Workshop, 2019
    paper 1, 2 | poster | bibtex

    Link to STN paper pdf Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

    Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).

    Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, Roger Grosse
    ICLR, 2019
    OpenReview | arXiv | slides | poster | code | blog | bibtex

    Link to UNAS paper pdf Understanding Neural Architecture Search

    Automatic methods for generating state-of-the-art neural network architectures without human experts have generated significant attention recently. This is because of the potential to remove human experts from the design loop which can reduce costs and decrease time to model deployment. Neural architecture search (NAS) techniques have improved significantly in their computational efficiency since the original NAS was proposed. This reduction in computation is enabled via weight sharing such as in Efficient Neural Architecture Search (ENAS). However, recently a body of work confirms our discovery that ENAS does not do significantly better than random search with weight sharing, contradicting the initial claims of the authors. We provide an explanation for this phenomenon by investigating the interpretability of the ENAS controller’s hidden state. We are interested in seeing if the controller embeddings are predictive of any properties of the final architecture - for example, graph properties like the number of connections, or validation performance. We find models sampled from identical controller hidden states have no correlation in various graph similarity metrics. This failure mode implies the RNN controller does not condition on past architecture choices. Importantly, we may need to condition on past choices if certain connection patterns prevent vanishing or exploding gradients. Lastly, we propose a solution to this failure mode by forcing the controller’s hidden state to encode pasts decisions by training it with a memory buffer of previously sampled architectures. Doing this improves hidden state interpretability by increasing the correlation controller hidden states and graph similarity metrics.

    George Adam, Jonathan Lorraine
    arXiv | bibtex

    Link to hyperOpt2017 paper pdf Stochastic Hyperparameter Optimization Through Hypernetworks

    Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters. We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our process trains a neural network to output approximately optimal weights as a function of hyperparameters. We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.

    Jonathan Lorraine, David Duvenaud
    NIPS Meta-Learning Workshop, 2017
    arXiv | slides | poster | code | bibtex | twitter thread


  • Information Theory Tutorial - CSC413/2516 (Winter 2021)
  • Bayesian Optimization and Hyperparameter Optimization Tutorial - CSC2515 (Winter 2021)
  • CNN Tutorial - CSC2515 (Winter 2021)
  • JAX Tutorial - CSC2515 (Winter 2021)
  • Linear Algebra Tutorial - CSC2515 (Winter 2021)
  • Optimization Tutorial - CSC311 (Fall 2020)
  • Last updated July 14th, 2022