About Me


I'm a third-year PhD student at the University of Toronto and the Vector Institute, supervised by Roger Grosse and Geoffrey Hinton.

I'm interested in understanding the computational mechanisms that give rise to intelligence. To this end, I'm currently working on improving the generalization power and robustness of deep learning models, as well as understanding the generalization patterns that enable strong out-of-distribution performance. I'm also interested in developing techniques that let humans receive useful and verifiable information from powerful but untrustworthy agents, by placing verification at the heart of decision processes.

Resume   LinkedIn   Twitter


Education


Engineering Science, University of Toronto; Bachelor of Applied Science and Engineering
  - Specialized in Robotics.
  - 3.98 CGPA, First in Graduating Class in Engineering Science, 2019.

Publications


(NeurIPS 2022) - Path Independent Equilibrium Models Can Better Exploit Test-Time Computation

Cem Anil, Ashwini Pokle, Kaiqu Liang, Johannes Treutlein, Yuhuai Wu, Shaojie Bai, J. Zico Kolter, Roger Grosse

Designing networks capable of attaining better performance with an increased inference budget is important to facilitate generalization to harder problem instances. Recent efforts have shown promising results in this direction by making use of depth-wise recurrent networks. We show that a broad class of architectures named equilibrium models display strong upwards generalization, and find that stronger performance on harder examples (which require more iterations of inference to get correct) strongly correlates with the path independence of the system - its tendency to converge to the same steady-state behaviour regardless of initialization, given enough computation. Experimental interventions made to promote path independence result in improved generalization on harder problem instances, while those that penalize it degrade this ability. Path independence analyses are also useful on a per-example basis: for equilibrium models that have good in-distribution performance, path independence on out-of-distribution samples strongly correlates with accuracy. Our results help explain why equilibrium models are capable of strong upwards generalization and motivate future work that harnesses path independence as a general modelling principle to facilitate scalable test-time usage.
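
To make path independence concrete, below is a toy sketch in Python: a contractive fixed-point iteration (standing in for a trained equilibrium model; the map, dimensions, and iteration count are illustrative assumptions) is solved from two different initializations, and the distance between the resulting steady states serves as the path-independence diagnostic.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "equilibrium model": a contractive map z <- f(z, x), standing in
    # for a trained deep equilibrium network.
    W = rng.standard_normal((8, 8))
    W = 0.9 * W / np.linalg.norm(W, 2)  # spectral norm < 1 ensures contraction

    def f(z, x):
        return np.tanh(W @ z + x)

    def solve_fixed_point(x, z0, n_iters=200):
        z = z0
        for _ in range(n_iters):
            z = f(z, x)
        return z

    x = rng.standard_normal(8)
    # Path independence: the steady state should not depend on the initialization.
    z_a = solve_fixed_point(x, np.zeros(8))
    z_b = solve_fixed_point(x, rng.standard_normal(8))
    print("distance between steady states:", np.linalg.norm(z_a - z_b))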

(NeurIPS 2022) - Solving Quantitative Reasoning Problems with Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.

(NeurIPS 2022) - Exploring Length Generalization in Large Language Models

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur

The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These include theorem proving, solving quantitative mathematics problems, and reading/summarizing novels. In this paper, we run careful empirical studies exploring the length generalization capabilities of transformer-based language models. We first establish that naively finetuning transformers on length generalization tasks shows significant generalization deficiencies independent of model scale. We then show that combining pretrained large language models' in-context learning abilities with scratchpad prompting (asking the model to output solution steps before producing an answer) results in a dramatic improvement in length generalization. We run careful failure analyses on each of the learning modalities and identify common sources of mistakes that highlight opportunities in equipping language models with the ability to generalize to longer problems.
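
As a concrete illustration of scratchpad prompting, the sketch below builds a few-shot prompt for multi-digit addition in which every demonstration spells out its column-by-column steps before the final answer; the query instance is deliberately longer than any demonstration. The prompt format and task are illustrative assumptions, not the paper's verbatim setup.

    def addition_scratchpad(a: int, b: int) -> str:
        """Write out column-by-column addition steps, then the final answer."""
        steps, digits, carry = [], [], 0
        xs, ys = str(a)[::-1], str(b)[::-1]
        for i in range(max(len(xs), len(ys))):
            da = int(xs[i]) if i < len(xs) else 0
            db = int(ys[i]) if i < len(ys) else 0
            total = da + db + carry
            steps.append(f"{da} + {db} + carry {carry} = {total}, write {total % 10}")
            digits.append(total % 10)
            carry = total // 10
        if carry:
            digits.append(carry)
        answer = int("".join(map(str, digits))[::-1])
        return "\n".join(steps) + f"\nAnswer: {answer}"

    # Few-shot prompt: short demonstrations with scratchpads, then a longer query.
    prompt = ""
    for a, b in [(23, 58), (7, 95)]:
        prompt += f"Q: {a} + {b}\n{addition_scratchpad(a, b)}\n\n"
    prompt += "Q: 48217 + 96598\n"  # longer than any demonstration
    print(prompt)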

Learning to Give Checkable Answers with Prover-Verifier Games

Cem Anil, Guodong Zhang, Yuhuai Wu, Roger Grosse

Our ability to know when to trust the decisions made by machine learning systems has not kept up with the staggering improvements in their performance, limiting their applicability in high-stakes domains. We introduce Prover-Verifier Games (PVGs), a game-theoretic framework to encourage learning agents to solve decision problems in a verifiable manner. The PVG consists of two learners with competing objectives: a trusted verifier network tries to choose the correct answer, and a more powerful but untrusted prover network attempts to persuade the verifier of a particular answer, regardless of its correctness. The goal is for a reliable justification protocol to emerge from this game. We analyze variants of the framework, including simultaneous and sequential games, and narrow the space down to a subset of games which provably have the desired equilibria. We develop instantiations of the PVG for two algorithmic tasks, and show that in practice, the verifier learns a robust decision rule that is able to receive useful and reliable information from an untrusted prover. Importantly, the protocol still works even when the verifier is frozen and the prover's messages are directly optimized to convince the verifier.
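
The sketch below is a minimal toy instantiation of such a game in PyTorch: a verifier that only sees a short message must predict the label, while a more powerful prover that sees the full input learns what to communicate. For brevity it shows a cooperative variant; the task, architectures, and objectives are illustrative assumptions, and an adversarial prover would instead push the verifier toward a fixed verdict regardless of the label.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    DIM, MSG = 16, 4

    # Toy decision problem: the label is 1 iff the input's mean is positive.
    def make_batch(n=128):
        x = torch.randn(n, DIM)
        return x, (x.mean(dim=1) > 0).float()

    prover = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, MSG))
    # The verifier is weaker: it sees only the prover's short message.
    verifier = nn.Sequential(nn.Linear(MSG, 32), nn.ReLU(), nn.Linear(32, 1))

    opt_p = torch.optim.Adam(prover.parameters(), lr=1e-3)
    opt_v = torch.optim.Adam(verifier.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(1000):
        x, y = make_batch()
        # Verifier update: predict the true label from the (detached) message.
        v_loss = bce(verifier(prover(x).detach()).squeeze(1), y)
        opt_v.zero_grad(); v_loss.backward(); opt_v.step()
        # Prover update (cooperative): make the verifier answer correctly.
        p_loss = bce(verifier(prover(x)).squeeze(1), y)
        opt_p.zero_grad(); p_loss.backward(); opt_p.step()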

(NeurIPS 2021) - Learning to Elect

Cem Anil,* Xuchan Bao*

Voting systems have a wide range of applications including recommender systems, web search, product design and elections. Limited by the lack of general-purpose analytical tools, it is difficult to hand-engineer desirable voting rules for each use case. For this reason, it is appealing to automatically discover voting rules geared towards each scenario. In this paper, we show that set-input neural network architectures such as Set Transformers, fully-connected graph networks and DeepSets are both theoretically and empirically well-suited for learning voting rules. Our learning to elect framework can discover near-optimal voting rules that maximize different notions of social welfare and mimic a number of existing voting rules to compelling accuracy while remaining robust against drastic distribution shifts that include different elector distributions and larger elections. ( * equal contribution)
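
A minimal sketch of the set-input idea in PyTorch: a DeepSets-style network encodes each ballot independently, mean-pools over voters (making the learned rule invariant to voter order and agnostic to electorate size), and decodes the pooled summary into candidate scores. The ballot encoding and network sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DeepSetsVotingRule(nn.Module):
        """Permutation-invariant map from a set of ballots to candidate scores."""
        def __init__(self, n_candidates=4, hidden=64):
            super().__init__()
            self.phi = nn.Sequential(  # encodes each ballot independently
                nn.Linear(n_candidates, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            self.rho = nn.Sequential(  # decodes the pooled summary into scores
                nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_candidates))

        def forward(self, ballots):  # ballots: (batch, n_voters, n_candidates)
            pooled = self.phi(ballots).mean(dim=1)  # mean-pool over the voter set
            return self.rho(pooled)                 # scores; argmax is the winner

    # One election with 100 voters and 4 candidates (random ballot features
    # as a placeholder for a real ranking encoding).
    ballots = torch.rand(1, 100, 4)
    print("winner:", DeepSetsVotingRule()(ballots).argmax(dim=1).item())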

(NeurIPS 2019) - Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks

Qiyang Li,* Saminul Haque,* Cem Anil, James Lucas, Roger Grosse, Jörn-Henrik Jacobsen

Lipschitz constraints under L2 norm on deep neural networks are useful for provable adversarial robustness bounds, stable training, and Wasserstein distance estimation. In our earlier paper, we identify a key obstacle for training networks with a strict Lipschitz constraint - gradient norm attenuation - and develop methods to overcome this in the fully connected setting. In this paper, we extend our methods to convolutional networks. The architecture we develop can achieve tight Lipschitz constraints using an expressive parameterization of orthogonal convolutions, which we refer to as Block Convolutional Orthogonal Parameterization. Our model achieves state-of-the-art performance on provable robustness for image classification tasks. ( * equal contribution)
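
The norm-preservation property underlying this line of work can be checked directly. The sketch below uses PyTorch's built-in orthogonal parametrization of a square fully connected layer as a stand-in (BCOP itself orthogonalizes convolutions, which is not reproduced here) and verifies that neither activations nor gradients attenuate when passing through the layer.

    import torch
    import torch.nn as nn
    from torch.nn.utils.parametrizations import orthogonal

    # Stand-in for an orthogonal layer: an orthogonally parameterized
    # square fully connected layer.
    layer = orthogonal(nn.Linear(64, 64, bias=False))

    x = torch.randn(32, 64, requires_grad=True)
    y = layer(x)

    # Orthogonal weights preserve L2 norms on the forward pass...
    print("input norm :", x.norm(dim=1).mean().item())
    print("output norm:", y.norm(dim=1).mean().item())

    # ...and on the backward pass, so gradient norms do not attenuate.
    g_out = torch.randn_like(y)
    (g_in,) = torch.autograd.grad(y, x, grad_outputs=g_out)
    print("upstream gradient norm  :", g_out.norm(dim=1).mean().item())
    print("downstream gradient norm:", g_in.norm(dim=1).mean().item())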

(ICML 2019) - Sorting Out Lipschitz Function Approximation

Cem Anil,* James Lucas,* Roger Grosse

Training neural networks with a desired Lipschitz constant is useful for provable adversarial robustness, Wasserstein distance estimation and generalization. The challenge is to do this while retaining expressive power. In this paper, we first identify a pathology shared by previous attempts to build provably Lipschitz architectures, then develop a new architecture that overcomes this pathology. Our architecture makes use of a new activation function based on sorting - GroupSort. Empirically, GroupSort networks achieve tighter estimates of Wasserstein distance and can achieve provable adversarial robustness guarantees with little cost to accuracy. ( * equal contribution)
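
For concreteness, here is a minimal GroupSort implementation in PyTorch: units are split into contiguous groups along the feature dimension and sorted within each group (group size 2 recovers the MaxMin activation). Because sorting only permutes coordinates, the activation is norm-preserving and 1-Lipschitz.

    import torch

    def group_sort(x: torch.Tensor, group_size: int = 2) -> torch.Tensor:
        """Sort units within contiguous groups along the last dimension."""
        n = x.shape[-1]
        assert n % group_size == 0, "feature dimension must divide into groups"
        grouped = x.reshape(*x.shape[:-1], n // group_size, group_size)
        return grouped.sort(dim=-1).values.reshape(x.shape)

    x = torch.tensor([[3.0, -1.0, 0.5, 2.0]])
    print(group_sort(x))  # groups (3, -1) and (0.5, 2) -> [[-1, 3, 0.5, 2]]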


(ICLR 2019) - TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

Sicong Huang, Qiyang Li, Cem Anil, Xuchan Bao, Sageev Oore, Roger Grosse

In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. We introduce TimbreTron, which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.
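
A sketch of the pipeline's front end, assuming the librosa library (the file name and transform parameters are placeholders); the CycleGAN and WaveNet stages are only indicated in comments.

    import librosa
    import numpy as np

    # Load a sound sample and compute its Constant Q Transform: a log-frequency
    # time-frequency "image" in which a pitch shift becomes (approximately) a
    # vertical translation, which suits convolutional architectures.
    y, sr = librosa.load("piano_clip.wav", sr=16000)  # placeholder file name
    cqt = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
    log_mag = np.log1p(np.abs(cqt))  # magnitude image fed to the next stage

    # Remaining TimbreTron stages (not shown here):
    #   1. A CycleGAN translates the source-instrument CQT image to the
    #      target instrument's timbre.
    #   2. A conditional WaveNet synthesizes a waveform from the result.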

Workshop Publications


(CVPR 2018) - Training Deep Networks With Synthetic Data: Bridging the Reality Gap by Domain Randomization

Jonathan Tremblay,* Aayush Prakash,* David Acuna,* Mark Brophy,* Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, Stan Birchfield

We present a system for training deep neural networks for object detection using synthetic images. To handle the variability in real-world data, the system relies upon the technique of domain randomization, in which the parameters of the simulator - such as lighting, pose, object textures, etc. - are randomized in non-realistic ways to force the neural network to learn the essential features of the object of interest. We explore the importance of these parameters, showing that it is possible to produce a network with compelling performance using only non-artistically-generated synthetic data. With additional fine-tuning on real data, the network yields better performance than using real data alone. This result opens up the possibility of using inexpensive synthetic data for training neural networks while avoiding the need to collect large amounts of hand-annotated real-world data or to generate high-fidelity synthetic worlds - both of which remain bottlenecks for many applications. ( * equal contribution)
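
As a minimal sketch, drawing one domain-randomized scene configuration might look like the following; all parameter names and ranges are illustrative placeholders, not the paper's actual settings.

    import random

    def sample_scene_params():
        """Draw one randomized simulator configuration."""
        return {
            "light_intensity": random.uniform(0.2, 3.0),
            "light_azimuth_deg": random.uniform(0.0, 360.0),
            "object_yaw_deg": random.uniform(0.0, 360.0),
            "camera_distance_m": random.uniform(0.5, 4.0),
            "texture_id": random.randrange(5000),    # random, non-realistic textures
            "n_distractors": random.randint(0, 10),  # clutter objects in the scene
        }

    # Each synthetic training image is rendered under a fresh random draw, so
    # the detector cannot latch onto any one appearance and must learn the
    # object's essential features.
    for _ in range(3):
        print(sample_scene_params())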

Contact


Email: anilcem_at_cs_toronto_edu