We propose a method to select relevant development tasks to accurately assess the impact of AutoML system changes on holdout tasks with different distributions.

Our goal is to assess if AutoML system changes - i.e., to the search space or hyperparameter optimization - will improve the final model's performance on production tasks.
However, we cannot test the changes on production tasks.
Instead, we only have access to limited descriptors about tasks that our AutoML system previously executed, like the number of data points or features.
We also have a set of development tasks to test changes, ex., sampled from OpenML with no usage constraints.
However, the development and production task distributions are different leading us to pursue changes that only improve development and not production.
This paper proposes a method to leverage descriptor information about AutoML production tasks to select a filtered subset of the most relevant development tasks.
Empirical studies show that our filtering strategy improves the ability to assess AutoML system changes on holdout tasks with different distributions than development.

On Implicit Bias in Overparameterized Bilevel Optimization

We characterize implicit regularization from various bilevel optimization methods.

Many problems in machine learning involve bilevel optimization (BLO), including hyperparameter optimization, meta-learning, and dataset distillation.
Bilevel problems involve inner and outer parameters, each optimized for its own objective.
Often, at least one of the two levels is underspecified and there are multiple ways to choose among equivalent optima.
Inspired by recent studies of the implicit bias induced by optimization algorithms in single-level optimization, we investigate the implicit bias of different gradient-based algorithms for jointly optimizing the inner and outer parameters.
We delineate two standard BLO methods—cold-start and warm-start BLO—and show that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation.
We also show that the solutions from warm-start BLO can encode a surprising amount of information about the outer objective, even when the outer optimization variables are low-dimensional.
We believe that implicit bias deserves as central a role in the study of bilevel optimization as it has attained in the study of single-level neural net optimization.

Lyapunov Exponents for Diversity in Differentiable Games

We give a method for finding multiple solutions in games by branching optimization near bifurcations, found with Lyapunov exponent based objectives.

Ridge Rider (RR) is an algorithm for finding diverse solutions to optimization problems by following eigenvectors of the Hessian ("ridges").
RR is designed for conservative gradient systems (i.e., settings involving a single loss function), where it branches at saddles - easy-to-find bifurcation points.
We generalize this idea to non-conservative, multi-agent gradient systems by proposing a method - denoted Generalized Ridge Rider (GRR) - for finding arbitrary bifurcation points.
We give theoretical motivation for our method by leveraging machinery from the field of dynamical systems.
We construct novel toy problems where we can visualize new phenomena while giving insight into high-dimensional problems of interest.
Finally, we empirically evaluate our method by finding diverse solutions in the iterated prisoners' dilemma and relevant machine learning problems including generative adversarial networks.

We provide a generalization of gradient descent with momentum, that can improve convergence in adversarial games with near identical compute cost.

We generalize gradient descent with momentum for optimization in differentiable games to have complex-valued momentum.
We give theoretical motivation for our method by proving convergence on bilinear zero-sum games for simultaneous and alternating updates.
Our method gives real-valued parameter updates, making it a drop-in replacement for standard optimizers.
We empirically demonstrate that complex-valued momentum can improve convergence in realistic adversarial games — like generative adversarial networks — by showing we can find better solutions with an almost identical computational cost.
We also show a practical generalization to a complex-valued Adam variant, which we use to train BigGAN to better inception scores on CIFAR-10.

We model convex gradients by integrating a JVP parameterized by a neural network.

The gradients of convex functions are expressive models of non-trivial vector fields.
For example, Brenier's theorem yields that the optimal transport map between any two measures on Euclidean space under the squared distance is realized as a convex gradient, which is a key insight used in recent generative flow models.
In this paper, we study how to model convex gradients by integrating a Jacobian-vector product parameterized by a neural network, which we call the Input Convex Gradient Network (ICGN).
We theoretically study ICGNs and compare them to taking the gradient of an Input-Convex Neural Network (ICNN), empirically demonstrating that a single layer ICGN can fit a toy example better than a single layer ICNN.
Lastly, we explore extensions to deeper networks and connections to constructions from Riemannian geometry.

We propose a gradient-based method to meta-learn pre-training hyperparameters by combining iterative and implicit differentiation.

Pre-training (PT) followed by fine-tuning (FT) is an effective method for training neural networks, and has led to significant performance improvements in many domains.
PT can incorporate various design choices such as task and data reweighting strategies, augmentation policies, and noise models, all of which can significantly impact the quality of representations learned.
The hyperparameters introduced by these strategies therefore must be tuned appropriately.
However, setting the values of these hyperparameters is challenging.
Most existing methods either struggle to scale to high dimensions, are too slow and memory-intensive, or cannot be directly applied to the two-stage PT and FT learning process.
In this work, we propose an efficient, gradient-based algorithm to meta-learn PT hyperparameters.
We formalize the PT hyperparameter optimization problem and propose a novel method to obtain PT hyperparameter gradients by combining implicit differentiation and backpropagation through unrolled optimization.
We demonstrate that our method improves predictive performance on two real-world domains.
First, we optimize high-dimensional task weighting hyperparameters for multitask pre-training on protein-protein interaction graphs and improve AUROC by up to 3.9%.
Second, we optimize a data augmentation neural network for self-supervised PT with SimCLR on electrocardiography data and improve AUROC by up to 1.9%.

Using Bifurcations for Diversity in Differentiable Games

We give a method for finding multiple solutions in games by branching optimization near bifurcations.

Ridge Rider (RR) is an algorithm for finding diverse solutions to optimization problems by following eigenvectors of the Hessian ("ridges'').
RR is designed for conservative gradient systems (i.e., settings involving a single loss function), where it branches at saddles - the only relevant bifurcation points.
We generalize this idea to non-conservative, multi-agent gradient systems by identifying new types of bifurcation points and proposing a method to follow eigenvectors with complex eigenvalues.
We give theoretical motivation for our method - denoted Game Ridge Rider (GRR) - by leveraging machinery from the field of dynamical systems.
Finally, we empirically motivate our method by constructing novel toy problems where we can visualize new phenomena and by finding diverse solutions in the iterated prisoners' dilemma, where existing methods often converge to a single solution mode.

Optimizing Millions of Hyperparameters by Implicit Differentiation

We give a method to jointly tune hyperparameters and parameters which is only a few times more costly compute than standard training.

We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations.
We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm.
We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters.
We learn a data-augmentation network - where every weight is a hyperparameter tuned for validation performance - that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters.
Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.

JacNet: Learning Functions with Structured Jacobians

We provide a method to learn functions by parameterizing their derivative, allowing us to easily enforce constraints on the learned function.

Neural networks are trained to learn an approximate mapping from an input domain to a target domain.
Often, incorporating prior knowledge about the true mapping is critical to learning a useful approximation.
With current architectures, it is difficult to enforce structure on the derivatives of the input-output mapping.
We propose to directly learn the Jacobian of the input-output function with a neural network, which allows easy control of derivative.
We focus on structuring the derivative to allow invertibility, and also demonstrate other useful priors can be enforced, such as k-Lipschitz.
Using this approach, we are able to learn approximations to simple functions which are guaranteed to be invertible, and easily compute the inverse.
We also show a similar results for 1-Lipschitz functions.

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

We provide a hypernetwork architecture that scales hyperparameter optimization to modern networks.

Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters.
We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases.
We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer.
We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism.
We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function.
Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities.
Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values.
Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems.
We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).

We investigate failure modes of neural architecture search techniques, and propose potential solutions by modulating hidden state interpretability.

Automatic methods for generating state-of-the-art neural network architectures without human experts have generated significant attention recently.
This is because of the potential to remove human experts from the design loop which can reduce costs and decrease time to model deployment.
Neural architecture search (NAS) techniques have improved significantly in their computational efficiency since the original NAS was proposed.
This reduction in computation is enabled via weight sharing such as in Efficient Neural Architecture Search (ENAS).
However, recently a body of work confirms our discovery that ENAS does not do significantly better than random search with weight sharing, contradicting the initial claims of the authors.
We provide an explanation for this phenomenon by investigating the interpretability of the ENAS controller’s hidden state.
We are interested in seeing if the controller embeddings are predictive of any properties of the final architecture - for example, graph properties like the number of connections, or validation performance.
We find models sampled from identical controller hidden states have no correlation in various graph similarity metrics.
This failure mode implies the RNN controller does not condition on past architecture choices.
Importantly, we may need to condition on past choices if certain connection patterns prevent vanishing or exploding gradients.
Lastly, we propose a solution to this failure mode by forcing the controller’s hidden state to encode pasts decisions by training it with a memory buffer of previously sampled architectures.
Doing this improves hidden state interpretability by increasing the correlation controller hidden states and graph similarity metrics.

Stochastic Hyperparameter Optimization Through Hypernetworks

We give an algorithm to learn a differentiable loss for hyperparameters, which can scale to thousands of dimensions.

Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters.
We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters.
Our process trains a neural network to output approximately optimal weights as a function of hyperparameters.
We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks.
We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.