Jonathan Peter Lorraine


I'm a Ph.D. student in machine learning at the University of Toronto and at the Vector Institute. My research interests are in meta-learning, learning with multiple agents, the intersection of machine learning with game theory, and - more generally - nested optimization.

I recently finished my M.Sc.A.C. with a focus in data science.

Curriculum Vitae


Location: Vector Institute, MaRS Center, 661 University Ave., Suite 710, Toronto, ON M5G 1M1

Advisor: David Duvenaud


GitHub | Twitter | LinkedIn | Google Scholar


  • Teaching

  • Assistant - CSC2547: Learning to Search (Fall 2019)
  • Assistant - CSC412/CSC2506: Probabilistic Learning and Reasoning (Winter 2019)
  • Assistant - CSC411/CSC2515: Introduction to Machine Learning (Fall 2018)
  • Assistant - CSC165: Mathematical Expressions and Reasoning for Computer Science (Fall 2016)

  • Service

  • Reviewer for International Conference on Learning Representations 2020
  • Reviewer for Smooth Games Optimization and Machine Learning Workshop at NIPS 2018

  • Papers

    Optimizing Millions of Hyperparameters by Implicit Differentiation

    We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. We learn a data-augmentation network - where every weight is a hyperparameter tuned for validation performance - that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.

    Jonathan Lorraine, Paul Vicol, David Duvenaud
    arXiv | bibtex | slides | blog
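To illustrate the core computation, here is a minimal sketch of an IFT hypergradient with a Neumann-series approximation of the inverse Hessian, on a one-dimensional quadratic where everything is available in closed form. The toy losses, constants, and step sizes below are invented for illustration and are not from the paper.

```python
# Toy losses (invented for this sketch):
#   training:   L_T(w, lam) = 0.5*(w - a)**2 + 0.5*lam*w**2
#   validation: L_V(w)      = 0.5*(w - b)**2
a, b = 2.0, 1.0
lam = 0.5

def d_val_dw(w):
    return w - b

w_star = a / (1.0 + lam)  # argmin_w L_T(w, lam), known in closed form here

# Implicit function theorem:
#   dw*/dlam = -[d^2 L_T / dw^2]^{-1} [d^2 L_T / dw dlam]
hess = 1.0 + lam   # d^2 L_T / dw^2
mixed = w_star     # d^2 L_T / dw dlam

# Neumann-series approximation of the inverse Hessian:
#   H^{-1} ~= alpha * sum_{k=0}^{K-1} (I - alpha*H)^k
alpha, K = 0.5, 50
inv_hess = alpha * sum((1.0 - alpha * hess) ** k for k in range(K))

hypergrad = d_val_dw(w_star) * (-inv_hess * mixed)   # dL_V/dlam via the IFT
exact = d_val_dw(w_star) * (-(1.0 / hess) * mixed)
print(hypergrad, exact)  # the truncated series matches the exact inverse closely
```

The series converges whenever |1 - alpha * hess| < 1; in high dimensions each term would be a Hessian-vector product rather than a scalar, which is what keeps the approach inexpensive.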
    JacNet: Learning Functions with Structured Jacobians

    Neural networks are trained to learn an approximate mapping from an input domain to a target domain. Often, incorporating prior knowledge about the true mapping is critical to learning a useful approximation. With current architectures, it is difficult to enforce structure on the derivatives of the input-output mapping. We propose to directly learn the Jacobian of the input-output function with a neural network, which allows easy control of the derivative. We focus on structuring the derivative to allow invertibility, and also demonstrate that other useful priors, such as k-Lipschitz continuity, can be enforced. Using this approach, we learn approximations to simple functions that are guaranteed to be invertible, and whose inverses are easy to compute. We also show similar results for 1-Lipschitz functions.

    Jonathan Lorraine, Safwan Hossain
    ICML INNF Workshop, 2019.
    paper | bibtex | poster
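In one dimension the idea can be sketched as follows (a hypothetical toy, not the paper's architecture: the "network" predicting the 1x1 Jacobian is a stand-in with made-up parameters). A strictly positive parameterization of the derivative yields a monotone, hence invertible, function:

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def jac(x, theta=(0.7, -0.2)):
    # stand-in for a learned network predicting the (1x1) Jacobian;
    # softplus keeps it strictly positive, so the integral is monotone
    a, c = theta
    return softplus(a * x + c)

def f(x, n=1000):
    # f(x) = f(0) + integral_0^x jac(t) dt, trapezoid rule
    h = x / n
    total = 0.5 * (jac(0.0) + jac(x))
    for i in range(1, n):
        total += jac(i * h)
    return total * h

def f_inverse(y, lo=-10.0, hi=10.0, iters=60):
    # monotonicity (jac > 0 everywhere) makes bisection valid
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = 1.3
print(f_inverse(f(x)))  # recovers x
```

Invertibility here is a property of the construction, not something checked after training, which is the appeal of structuring the Jacobian directly.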
    Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

    Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing that the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).

    Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, Roger Grosse
    International Conference on Learning Representations, 2019.
    arXiv | bibtex | slides | poster | code | blog
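A scalar caricature of the alternating scheme (toy problem and constants invented for illustration; the paper's best-response model is a gated network, whereas here it is just a local linear fit):

```python
# Toy bilevel problem (invented for this sketch):
#   inner:  min_w 0.5*(w - a)**2 + 0.5*lam*w**2   ->  w*(lam) = a/(1 + lam)
#   outer:  min_lam 0.5*(w*(lam) - b)**2          ->  optimum at lam = a/b - 1
a, b = 2.0, 1.0

def best_response(lam):
    # closed-form inner solution, standing in for a training run
    return a / (1.0 + lam)

lam, lr, sigma = 2.0, 0.3, 0.01
for _ in range(200):
    # (1) refit a local linear best-response w_hat(l) = u + v*l
    #     from two perturbed hyperparameters
    l1, l2 = lam - sigma, lam + sigma
    w1, w2 = best_response(l1), best_response(l2)
    v = (w2 - w1) / (l2 - l1)
    u = w1 - v * l1
    # (2) hyperparameter step on the validation loss *through* w_hat
    w_hat = u + v * lam
    lam -= lr * (w_hat - b) * v
    lam = max(lam, 0.0)

print(lam)  # -> close to 1.0
```

The validation gradient flows through the best-response approximation rather than through the training loss, which is what lets the real method handle hyperparameters that are not differentiable in the usual sense.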
    Understanding Neural Architecture Search

    Automatic methods for generating state-of-the-art neural network architectures without human experts have attracted significant attention recently, owing to their potential to remove human experts from the design loop, which can reduce costs and decrease time to model deployment. Neural architecture search (NAS) techniques have improved significantly in their computational efficiency since the original NAS method was proposed. This reduction in computation is enabled by weight sharing, as in Efficient Neural Architecture Search (ENAS). However, a recent body of work confirms our finding that ENAS does not perform significantly better than random search with weight sharing, contradicting the initial claims of its authors. We provide an explanation for this phenomenon by investigating the interpretability of the ENAS controller's hidden state. We are interested in whether the controller embeddings are predictive of any properties of the final architecture - for example, graph properties like the number of connections, or validation performance. We find that models sampled from identical controller hidden states have no correlation under various graph similarity metrics. This failure mode implies the RNN controller does not condition on past architecture choices. Importantly, conditioning on past choices may be necessary when certain connection patterns prevent vanishing or exploding gradients. Lastly, we propose a solution to this failure mode by forcing the controller's hidden state to encode past decisions, training it with a memory buffer of previously sampled architectures. Doing so improves hidden state interpretability by increasing the correlation between controller hidden states and graph similarity metrics.

    George Adam, Jonathan Lorraine
    arXiv | bibtex
    Stochastic Hyperparameter Optimization Through Hypernetworks

    Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters. We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our process trains a neural network to output approximately optimal weights as a function of hyperparameters. We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.

    Jonathan Lorraine, David Duvenaud
    NIPS Meta-Learning Workshop, 2017.
    arXiv | bibtex | slides | poster | code
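A toy, deterministic variant of the idea (all constants invented for illustration; the paper uses stochastic optimization and neural hypernetworks, while here the hypernetwork is a quadratic fit trained over a fixed batch of sampled hyperparameters):

```python
# Toy problem (invented for this sketch):
#   training:   0.5*(w - a)**2 + 0.5*lam*w**2
#   validation: 0.5*(w - b)**2
a, b = 2.0, 1.0

def dtrain_dw(w, lam):
    return (w - a) + lam * w

# "Hypernetwork": a quadratic in t = lam/2 that outputs the weight,
# fit by gradient descent over a fixed batch of hyperparameters in [0, 2]
c = [0.0, 0.0, 0.0]
lams = [0.25 * i for i in range(9)]
lr = 0.1
for _ in range(20000):
    g = [0.0, 0.0, 0.0]
    for lam in lams:
        t = lam / 2.0
        phi = (1.0, t, t * t)
        w = sum(ci * pi for ci, pi in zip(c, phi))
        dw = dtrain_dw(w, lam) / len(lams)   # chain rule into the coefficients
        for j in range(3):
            g[j] += dw * phi[j]
    for j in range(3):
        c[j] -= lr * g[j]

def w_of(lam):
    t = lam / 2.0
    return c[0] + c[1] * t + c[2] * t * t

# choose lam by minimizing the validation loss *through* the hypernetwork
best_lam = min((0.001 * i for i in range(2001)),
               key=lambda l: (w_of(l) - b) ** 2)
print(best_lam)
```

The nested loop over training runs disappears: one model is fit across a range of hyperparameters, and the outer search only ever queries that model.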


    Maximizing the Trading Area of a New Facility

    Designed an algorithm for finding a point to add to a Voronoi diagram such that the associated Voronoi cell has maximal area. The algorithm was applied to compute an optimal LCBO placement in Toronto, and the location's value was confirmed by LCBO representatives. This work was completed as a research internship and was supported by NSERC.

    Dmitry Krass, Atsuo Suzuki
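A brute-force sketch of the underlying computation (hypothetical facility coordinates, not the data or algorithm from this project): estimate the area a candidate point's Voronoi cell would capture by Monte Carlo, then grid-search over candidates.

```python
import random
random.seed(1)

# Hypothetical instance: four existing facilities in the unit square
existing = [(0.2, 0.2), (0.8, 0.2), (0.2, 0.8), (0.8, 0.8)]
samples = [(random.random(), random.random()) for _ in range(5000)]

def d2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# squared distance from each sample to its nearest existing facility
nearest = [min(d2(s, f) for f in existing) for s in samples]

def cell_area(candidate):
    # Monte Carlo estimate of the candidate's Voronoi-cell area:
    # the fraction of uniform samples strictly closer to it than
    # to every existing facility
    wins = sum(1 for s, m in zip(samples, nearest) if d2(s, candidate) < m)
    return wins / len(samples)

# coarse grid search for the candidate with the largest cell
grid = [(x / 10, y / 10) for x in range(11) for y in range(11)]
best = max(grid, key=cell_area)
print(best, cell_area(best))
```

The real problem calls for exact geometric computation rather than sampling, but the objective being maximized is the same.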
    On Covering Location Problems on Networks with Edge Demand

    This paper considers two covering location problems on a network where the demand is distributed along the edges. The first is the classical maximal covering location problem. The second is the obnoxious version, where coverage should be minimized subject to distance constraints between the facilities. It is first shown that the finite dominating set for covering problems with nodal demand does not carry over to the case of edge-based demand. A solution approach for the single-facility problem is then presented. Afterwards, the multi-facility problem is discussed, and several discretization results for tree networks are presented for the case in which demand is constant on each edge; unfortunately, these results do not carry over to general networks, as a counterexample shows. To tackle practical problems, the conditional version of the problem is considered and a greedy heuristic is introduced. Numerical tests are then presented to underline the practicality of the proposed algorithms and to understand the conditions under which accurate modeling of edge-based demand and a continuous edge-based location space are particularly important.

    Oded Berman, Jörg Kalcsics, Dmitry Krass
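The flavour of maximal covering with edge demand can be sketched on a hypothetical one-dimensional instance with a simple greedy placement (far simpler than the network setting and heuristic studied in the paper):

```python
# Hypothetical 1-D instance (invented for illustration): uniform demand along
# a single segment [0, L]; a facility at x covers [x - r, x + r].
# Greedy: repeatedly place the facility with the largest marginal coverage.
L, r, n_fac = 10.0, 2.0, 2
candidates = [0.1 * i for i in range(101)]

def covered_length(facilities):
    # numerically integrate the covered demand along the segment
    n = 1000
    h = L / n
    return sum(h for i in range(n)
               if any(abs((i + 0.5) * h - x) <= r for x in facilities))

chosen = []
for _ in range(n_fac):
    best = max(candidates, key=lambda x: covered_length(chosen + [x]))
    chosen.append(best)

print(sorted(chosen), covered_length(chosen))
```

Because demand lives on the edge rather than at nodes, the objective is an integral over a continuum, which is exactly why the usual nodal discretization results fail.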
    Optimizing Facility Location and Design

    In this paper we develop a novel methodology to simultaneously optimize locations and designs for a set of new facilities facing competition from some preexisting facilities. Known as the Competitive Facility Location and Design Problem (CFLDP), this model was previously only solvable when a limited number of design scenarios was pre-specified. Our methodology removes this limitation and allows much more realistic models to be solved. The results are illustrated with a small case study.

    Robert Aboolian, Oded Berman, Dmitry Krass
    Last updated November 14th, 2019