Image of Andrew Li

Andrew Li
Machine Learning PhD Candidate

andrewli[@]cs[.]toronto[.]edu
Google Scholar
GitHub
CV

About me

I am a final-year PhD student in Computer Science at the University of Toronto, supervised by Sheila McIlraith. I am on the job market for research scientist/engineer roles.

My goal is to build reliable, multi-purpose agents that reason like humans in dynamic environments, drawing on techniques from deep reinforcement learning, symbolic reasoning, and language understanding. My current research uses formal languages to study generalization and language grounding in instruction-following agents. I am also interested in teaching agents to reason under partial observability.

Previously, I interned at Zoox, where I fine-tuned large models for autonomous driving using reinforcement learning.

Education

  • Ph.D. in Computer Science (candidate), 2021 - present
    University of Toronto
  • M.Sc. in Computer Science, 2019 - 2021
    University of Toronto
  • B.Math. in Computer Science, Combinatorics & Optimization, 2015 - 2019
    University of Waterloo

Selected Research

  • Ground-Compose-Reinforce: Tasking Reinforcement Learning Agents through Formal Language (preprint)

    Andrew Li, Toryn Klassen, Andrew Wang, Parand Alamdari, Sheila McIlraith

    Abstract: Grounding language in complex perception (e.g. pixels) and action is a key challenge when building situated agents that can interact with humans via language. In past works, this is often solved via manual design of the language grounding or by curating massive datasets relating language to elements of the environment. We propose Ground-Compose-Reinforce, a neurosymbolic framework for grounding formal language from data, and eliciting behaviours by directly tasking RL agents through this language. By virtue of data-driven learning, our framework avoids the manual design of domain-specific elements like reward functions or symbol detectors. By virtue of compositional formal language semantics, our framework achieves data-efficient grounding and generalization to arbitrary language compositions. Experiments on an image-based gridworld and a MuJoCo robotics domain show that our approach reliably maps formal language instructions to behaviours with limited data while end-to-end, data-driven approaches fail.

    Ground compose reinforce figure
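
    To give a flavour of the compositional idea, here is a toy Python sketch (not the paper's implementation): atomic symbols are grounded by small learned detectors, and the meaning of a composite formula is computed from the meanings of its parts. The detector architecture, the symbol names, and the fuzzy min/max semantics are illustrative assumptions.

      import torch
      import torch.nn as nn

      class SymbolDetector(nn.Module):
          """Maps a raw observation to the probability that an atomic symbol holds in it."""
          def __init__(self, obs_dim: int):
              super().__init__()
              self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

          def forward(self, obs: torch.Tensor) -> torch.Tensor:
              return torch.sigmoid(self.net(obs)).squeeze(-1)

      def evaluate(formula, detectors, obs):
          """Compositionally evaluate a formula such as ("and", "red", ("not", "locked"))."""
          if isinstance(formula, str):              # atomic symbol -> learned, data-driven grounding
              return detectors[formula](obs)
          op, *args = formula
          vals = [evaluate(arg, detectors, obs) for arg in args]
          if op == "and":
              return torch.min(torch.stack(vals), dim=0).values
          if op == "or":
              return torch.max(torch.stack(vals), dim=0).values
          if op == "not":
              return 1.0 - vals[0]
          raise ValueError(f"unknown operator: {op}")

      # Usage: derive a reward signal for "reach something red that is not locked"
      obs_dim = 16
      detectors = {"red": SymbolDetector(obs_dim), "locked": SymbolDetector(obs_dim)}
      obs = torch.randn(4, obs_dim)                 # a batch of observations
      reward = evaluate(("and", "red", ("not", "locked")), detectors, obs)  # shape (4,)
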
  • Reward Machines for Deep RL in Noisy and Uncertain Environments (NeurIPS 2024)

    Andrew Li, Zizhao Chen, Pashootan Vaezipoor, Toryn Klassen, Rodrigo Toro Icarte, Sheila McIlraith

    Abstract: Reward Machines provide an automaton-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing the underlying structure of a reward function, they enable the decomposition of an RL task, leading to impressive gains in sample efficiency. Although Reward Machines and similar formal specifications have a rich history of application towards sequential decision-making problems, they critically rely on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function—such ground-truth interpretations are elusive in the real world due in part to partial observability and noisy sensing. In this work, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that exploit task structure under uncertain interpretation of the domain-specific vocabulary. Through theory and experiments, we expose pitfalls in naive approaches to this problem while simultaneously demonstrating how task structure can be successfully leveraged under noisy interpretations of the vocabulary.

    ZARS framework figure
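
    The following toy sketch (not the paper's algorithm) shows the flavour of the setting: a reward machine advances on propositional events, and under noisy sensing the agent can only maintain a belief over the machine's state. The coffee-then-office machine, the one-event-per-step assumption, and the detector confidence are made up for illustration.

      # Transitions of a two-subgoal reward machine: get coffee, then deliver it to the office.
      RM_STATES = ["u0", "u1", "u2"]
      RM_TRANSITIONS = {
          ("u0", "coffee"): ("u1", 0.0),
          ("u1", "office"): ("u2", 1.0),
      }

      def rm_step(u, event):
          """Ground-truth update: advance the machine on an event and emit its reward."""
          return RM_TRANSITIONS.get((u, event), (u, 0.0))

      def belief_update(belief, event_probs):
          """Propagate a belief over RM states given a distribution over which single
          propositional event occurred this step; None means 'no event detected'."""
          new_belief = {u: 0.0 for u in RM_STATES}
          for u, b in belief.items():
              for event, prob in event_probs.items():
                  v = u if event is None else RM_TRANSITIONS.get((u, event), (u, 0.0))[0]
                  new_belief[v] += b * prob
          return new_belief

      # Usage: a noisy detector is 70% sure the agent just picked up coffee.
      belief = {"u0": 1.0, "u1": 0.0, "u2": 0.0}
      belief = belief_update(belief, {"coffee": 0.7, None: 0.3})
      print(belief)   # {'u0': 0.3, 'u1': 0.7, 'u2': 0.0}
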
  • Learning Belief Representations for Partially Observable Deep RL (ICML 2023)

    Andrew Wang*, Andrew Li*, Toryn Klassen, Rodrigo Toro Icarte, Sheila McIlraith

    Abstract: Many important real-world Reinforcement Learning (RL) problems involve partial observability and require policies with memory. Unfortunately, standard deep RL algorithms for partially observable settings typically condition on the full history of interactions and are notoriously difficult to train. We propose a novel deep, partially observable RL algorithm based on modelling belief states — a technique typically used when solving tabular POMDPs, but that has traditionally been difficult to apply to more complex environments. Our approach simplifies policy learning by leveraging state information at training time that may not be available at deployment time. We do so in two ways: first, we decouple belief state modelling (via unsupervised learning) from policy optimization (via RL); and second, we propose a representation learning approach to capture a compact set of reward-relevant features of the state. Experiments demonstrate the efficacy of our approach on partially observable domains requiring information seeking and long-term memory.

    Learning Belief Representations figure
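
    Below is a much-simplified sketch of the two-phase idea, with illustrative architecture sizes and losses rather than the paper's exact method: a recurrent history encoder is first fitted using privileged state information available only at training time, and a policy is then optimized on top of the resulting belief features.

      import torch
      import torch.nn as nn

      class BeliefEncoder(nn.Module):
          """Summarizes the observation-action history into a compact belief embedding."""
          def __init__(self, obs_dim, act_dim, belief_dim):
              super().__init__()
              self.rnn = nn.GRU(obs_dim + act_dim, belief_dim, batch_first=True)

          def forward(self, obs_seq, act_seq):
              h, _ = self.rnn(torch.cat([obs_seq, act_seq], dim=-1))
              return h                                    # belief embedding at every timestep

      obs_dim, act_dim, belief_dim, state_feat_dim = 8, 2, 32, 4
      encoder = BeliefEncoder(obs_dim, act_dim, belief_dim)
      state_head = nn.Linear(belief_dim, state_feat_dim)  # predicts reward-relevant state
                                                          # features (training time only)
      policy = nn.Sequential(nn.Linear(belief_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

      # Phase 1: fit the belief model with help from privileged state features, which are
      # available in the simulator at training time but not at deployment.
      obs_seq, act_seq = torch.randn(16, 50, obs_dim), torch.randn(16, 50, act_dim)
      state_feats = torch.randn(16, 50, state_feat_dim)
      belief = encoder(obs_seq, act_seq)
      belief_loss = nn.functional.mse_loss(state_head(belief), state_feats)

      # Phase 2: optimize the policy with any standard RL algorithm, conditioning it on
      # the (detached) belief embedding instead of the raw history.
      action = policy(belief.detach())
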
  • Learning to Follow Instructions in Text-Based Games (NeurIPS 2022)

    Mathieu Tuli, Andrew Li, Pashootan Vaezipoor, Toryn Klassen, Scott Sanner, Sheila McIlraith

    Abstract: Text-based games present a unique class of sequential decision making problem in which agents interact with a partially observable, simulated environment via actions and observations conveyed through natural language. Such observations typically include instructions that, in a reinforcement learning (RL) setting, can directly or indirectly guide a player towards completing reward-worthy tasks. In this work, we study the ability of RL agents to follow such instructions. We conduct experiments that show that the performance of state-of-the-art text-based game agents is largely unaffected by the presence or absence of such instructions, and that these agents are typically unable to execute tasks to completion. To further study and address the task of instruction following, we equip RL agents with an internal structured representation of natural language instructions in the form of Linear Temporal Logic (LTL), a formal language that is increasingly used for temporally extended reward specification in RL. Our framework both supports and highlights the benefit of understanding the temporal semantics of instructions and in measuring progress towards achievement of such a temporally extended behaviour. Experiments with 500+ games in TextWorld demonstrate the superior performance of our approach.

    Instruction following figure
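
    The sketch below illustrates LTL progression, the mechanism that lets an agent track "what remains to be done" as it acts. It covers only a tiny fragment of LTL (atoms, conjunction, and "eventually"), and the task and proposition names are invented for illustration.

      def prog(formula, true_props):
          """Progress a formula through one step in which the propositions in `true_props` held.
          Handles a small fragment only: atoms, 'and', and 'eventually'."""
          if formula in ("True", "False"):
              return formula
          if isinstance(formula, str):                  # atomic proposition
              return "True" if formula in true_props else "False"
          op, *args = formula
          if op == "and":
              a, b = prog(args[0], true_props), prog(args[1], true_props)
              if "False" in (a, b):
                  return "False"
              if a == "True":
                  return b
              if b == "True":
                  return a
              return ("and", a, b)
          if op == "eventually":                        # F p: either satisfied now, or still pending
              return "True" if prog(args[0], true_props) == "True" else formula
          raise ValueError(f"unsupported operator: {op}")

      # "Eventually take the coin, and eventually open the chest."
      task = ("and", ("eventually", "take_coin"), ("eventually", "open_chest"))
      task = prog(task, {"take_coin"})        # the agent just took the coin
      print(task)                             # ('eventually', 'open_chest') -- what remains to be done
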
  • Exploring Long-Horizon Reasoning with Deep RL in Combinatorially Hard Tasks (Decision Awareness in RL Workshop @ ICML 2022)

    Andrew Li, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila McIlraith

    Abstract: Deep reinforcement learning has shown promise in discrete domains requiring complex reasoning, including games such as Chess, Go, and Hanabi. However, this type of reasoning is less often observed in long-horizon, continuous domains with high-dimensional observations, where instead RL research has predominantly focused on problems with simple high-level structure (e.g. opening a drawer or moving a robot as fast as possible). Inspired by combinatorially hard optimization problems, we propose a set of robotics tasks which admit many distinct solutions at the high level, but require reasoning about states and rewards thousands of steps into the future for the best performance. Critically, while RL has traditionally suffered on complex, long-horizon tasks due to sparse rewards, our tasks are carefully designed to be solvable without specialized exploration. Nevertheless, our investigation finds that standard RL methods often neglect long-term effects due to discounting, while general-purpose hierarchical RL approaches struggle unless additional abstract domain knowledge can be exploited.

    Combinatorial tasks figure
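
    A quick back-of-the-envelope check (not taken from the paper) of the discounting point above: with a typical discount factor, a reward that arrives thousands of steps in the future contributes almost nothing to the discounted objective.

      # Why discounting hides long-horizon effects: a reward T steps away is weighted by gamma**T.
      gamma = 0.99
      for horizon in (100, 1_000, 5_000):
          print(f"gamma^{horizon} = {gamma ** horizon:.2e}")
      # gamma^100 = 3.66e-01
      # gamma^1000 = 4.32e-05
      # gamma^5000 = 1.50e-22
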
  • LTL2Action: Generalizing LTL Instructions for Multi-Task RL (ICML 2021)

    Pashootan Vaezipoor*, Andrew Li*, Rodrigo Toro Icarte, Sheila McIlraith

    Abstract: We address the problem of teaching a deep reinforcement learning (RL) agent to follow instructions in multi-task environments. Instructions are expressed in a well-known formal language – linear temporal logic (LTL) – and can specify a diversity of complex, temporally extended behaviours, including conditionals and alternative realizations. Our proposed learning approach exploits the compositional syntax and the semantics of LTL, enabling our RL agent to learn task-conditioned policies that generalize to new instructions, not observed during training. To reduce the overhead of learning LTL semantics, we introduce an environment-agnostic LTL pretraining scheme which improves sample-efficiency in downstream environments. Experiments on discrete and continuous domains target combinatorial task sets of up to ∼10^39 unique tasks and demonstrate the strength of our approach in learning to solve (unseen) tasks, given LTL instructions.

    LTL2Action figure
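
    A minimal sketch of what it means for a policy to be conditioned on an LTL instruction: the formula's syntax tree is recursively encoded into a task embedding that is fed to the policy alongside the observation. The GRU-based encoder, dimensions, and propositions below are illustrative assumptions standing in for the paper's formula encoders, and LTL progression is not shown here.

      import torch
      import torch.nn as nn

      OPS = ["and", "or", "not", "next", "until", "eventually", "always"]

      class FormulaEncoder(nn.Module):
          """Recursively encodes an LTL syntax tree into a fixed-size task embedding."""
          def __init__(self, vocab, dim=32):
              super().__init__()
              self.vocab = {p: i for i, p in enumerate(vocab)}
              self.ops = {o: i for i, o in enumerate(OPS)}
              self.atom_emb = nn.Embedding(len(vocab), dim)
              self.op_emb = nn.Embedding(len(OPS), dim)
              self.combine = nn.GRUCell(dim, dim)   # folds child embeddings into the operator embedding

          def forward(self, formula):
              if isinstance(formula, str):          # atomic proposition
                  return self.atom_emb(torch.tensor(self.vocab[formula]))
              op, *children = formula
              h = self.op_emb(torch.tensor(self.ops[op]))
              for child in children:
                  h = self.combine(self.forward(child).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
              return h

      # Task-conditioned policy: concatenate the task embedding with environment features.
      encoder = FormulaEncoder(vocab=["coffee", "office", "mail"])
      task = ("until", ("not", "mail"), ("and", "coffee", ("eventually", "office")))
      task_emb = encoder(task)
      obs = torch.randn(16)
      policy = nn.Sequential(nn.Linear(16 + 32, 64), nn.ReLU(), nn.Linear(64, 4))
      action_logits = policy(torch.cat([obs, task_emb]))
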
  • Interpretable Sequence Classification via Discrete Optimization (AAAI 2021)

    Maayan Shvo, Andrew Li, Rodrigo Toro Icarte, Sheila McIlraith

    Abstract: Sequence classification is the task of predicting a class label given a sequence of observations. In many applications such as healthcare monitoring or intrusion detection, early classification is crucial to prompt intervention. In this work, we learn sequence classifiers that favour early classification from an evolving observation trace. While many state-of-the-art sequence classifiers are neural networks, and in particular LSTMs, our classifiers take the form of finite state automata and are learned via discrete optimization. Our automata-based classifiers are interpretable—supporting explanation, counterfactual reasoning, and human-in-the-loop modification—and have strong empirical performance. Experiments over a suite of goal recognition and behaviour classification datasets show our learned automata-based classifiers to have comparable test performance to LSTM-based classifiers, with the added advantage of being interpretable.

    AAAI 2021 figure
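
    The toy automaton below is hand-written to show the form such a classifier takes (in the paper the automaton itself is learned from labelled traces via discrete optimization): every state carries a class label, so a prediction is available after each prefix of the sequence, which is what enables early classification.

      # DFA over observation symbols; each state is labelled with a class.
      TRANSITIONS = {
          ("q0", "login"): "q1",
          ("q1", "read"): "q1",
          ("q1", "delete"): "q2",
          ("q2", "delete"): "q2",
      }
      LABELS = {"q0": "benign", "q1": "benign", "q2": "suspicious"}

      def classify(sequence):
          """Return the predicted label after each observation in the sequence."""
          state, predictions = "q0", []
          for symbol in sequence:
              state = TRANSITIONS.get((state, symbol), state)   # stay put on unseen symbols
              predictions.append(LABELS[state])
          return predictions

      print(classify(["login", "read", "delete", "delete"]))
      # ['benign', 'benign', 'suspicious', 'suspicious']  -> flagged before the trace ends
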
  • Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior (KR2ML Workshop @ NeurIPS 2020)

    Phillip Christoffersen, Andrew Li, Rodrigo Toro Icarte, Sheila McIlraith

    Abstract: Many real-world reinforcement learning (RL) problems necessitate learning complex, temporally extended behavior that may only receive reward signal when the behavior is completed. If the reward-worthy behavior is known, it can be specified in terms of a non-Markovian reward function—a function that depends on aspects of the state-action history, rather than just the current state and action. Such reward functions yield sparse rewards, necessitating an inordinate number of experiences to find a policy that captures the reward-worthy pattern of behavior. Recent work has leveraged Knowledge Representation (KR) to provide a symbolic abstraction of aspects of the state that summarize reward-relevant properties of the state-action history and support learning a Markovian decomposition of the problem in terms of an automaton over the KR. Providing such a decomposition has been shown to vastly improve learning rates, especially when coupled with algorithms that exploit automaton structure. Nevertheless, such techniques rely on a priori knowledge of the KR. In this work, we explore how to automatically discover useful state abstractions that support learning automata over the state-action history. The result is an end-to-end algorithm that can learn optimal policies with significantly fewer environment samples than state-of-the-art RL on simple non-Markovian domains.

    KR2ML figure
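
    A toy contrast (with made-up events and a hand-specified automaton, whereas the paper learns both the abstraction and the automaton) between a non-Markovian reward, which must inspect the whole history, and its Markovian decomposition over an automaton state:

      def non_markovian_reward(history):
          """Reward that depends on the whole history: 1 when 'office' is reached after 'coffee'."""
          return 1.0 if history and history[-1] == "office" and "coffee" in history[:-1] else 0.0

      # The same behaviour becomes Markovian once the history is summarized by an automaton state.
      AUTOMATON = {("u0", "coffee"): "u1", ("u1", "office"): "u2"}

      def markovian_step(u, event):
          v = AUTOMATON.get((u, event), u)
          return v, (1.0 if (u, v) == ("u1", "u2") else 0.0)

      trace = ["move", "coffee", "move", "office"]
      u, total = "u0", 0.0
      for event in trace:
          u, r = markovian_step(u, event)
          total += r
      print(total)   # 1.0 -- the same reward the history-based function assigns on this trace
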
  • Bayesian Network Structure Learning with Side Constraints (PGM 2018)

    Andrew Li, Peter van Beek

    Abstract: Hybrid methods for Bayesian network structure learning that incorporate both observed data and expert knowledge have proven to be important in many fields. Previous studies have presented both exact and approximate hybrid methods for structure learning. In this paper, we propose an approximate method based on local search that is capable of efficiently handling a variety of prior knowledge constraints, including an important class of non-decomposable ancestral constraints that assert indirect causation between random variables. In our experiments, our proposed approximate method is able to significantly outperform an existing state-of-the-art approximate method in finding feasible solutions when hard constraints are imposed. Our approach is able to find near-optimal networks while scaling up to almost fifty random variables. In contrast, previous exact methods are unable to handle more than twenty random variables. Furthermore, we show that when prior knowledge is integrated, we are often able to produce a network much closer to the ground truth network, particularly when the amount of data is limited.

    PGM 2018 figure
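
    As a small illustration of why such constraints are awkward for standard structure-learning methods, here is a sketch (with invented variable names, and with the search itself omitted) of the feasibility check a candidate structure must pass: an ancestral constraint "X is an ancestor of Y" depends on the whole graph, which is what makes it non-decomposable.

      def is_ancestor(dag, x, y):
          """True if there is a directed path from x to y in the candidate DAG (adjacency lists)."""
          stack, seen = list(dag.get(x, [])), set()
          while stack:
              node = stack.pop()
              if node == y:
                  return True
              if node not in seen:
                  seen.add(node)
                  stack.extend(dag.get(node, []))
          return False

      def satisfies_constraints(dag, ancestral_constraints):
          """A candidate structure (or local-search move) is feasible only if every required
          ancestral relation still holds -- a check that needs the whole graph."""
          return all(is_ancestor(dag, x, y) for x, y in ancestral_constraints)

      # Candidate network: smoking -> tar -> cancer, genetics -> cancer
      dag = {"smoking": ["tar"], "tar": ["cancer"], "genetics": ["cancer"]}
      print(satisfies_constraints(dag, [("smoking", "cancer")]))   # True: an indirect causal path exists
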