Lunjun Zhang

I am a PhD student in the Machine Learning Group at the University of Toronto, advised by Prof. Raquel Urtasun.

I also work as a researcher at Waabi, developing self-driving technology for long-haul trucking.

Previously, I completed my undergraduate degree in Engineering Science at the University of Toronto, and spent my summers interning at the Vector Institute, the Montréal Institute for Learning Algorithms (Mila), and Uber Advanced Technologies Group.

Email  /  Google Scholar  /  Twitter  /  Github


I work at the intersection of robot learning and unsupervised learning. I am fascinated by how intelligent agents can learn from increasingly weaker forms of human supervision, and I aim to devise learning algorithms for robotics whose performance automatically improves with more data and compute, without being bottlenecked by human-in-the-loop labeling (such as object annotations for perception or handcrafted reward signals for control). Unsupervised iterative self-improvement seems to be the key to true scalability, as evidenced by PageRank, (masked) language modeling, and AlphaZero; what would be the equivalent for robotics?

My long-term goal is to build robots that learn and think for themselves, and to create strong AI through embodied intelligence. Achieving this goal will require innovations on multiple fronts: reformulating many aspects of intelligence into a single all-encompassing objective function so that few-shot and zero-shot capabilities can emerge; improving the data infrastructure (e.g., deploying robot fleets, leveraging simulation, mining the Internet) required to bootstrap common sense; and designing agents that learn from large amounts of unlabeled, passive experience in a way that takes advantage of scaling laws.

Unsupervised Reinforcement Learning
Understanding Hindsight Goal Relabeling Requires Rethinking Divergence Minimization
Lunjun Zhang, Bradly Stadie
arXiv preprint, 2022

"How can RL agents learn useful behaviors from unlabeled, reward-free trajectories, similar to how NLP models such as BERT and GPT-3 learn language from unlabeled text corpora?"

We give the first probabilistic interpretation and first-principles derivation of the reward function in Hindsight Experience Replay, and illustrate how its underlying mechanism of imitation learning enables agents to use arbitrary trajectories as sub-optimal demonstrations. Our framework reveals a largely unexplored design space for goal-reaching algorithms.
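The core relabeling mechanism can be sketched as follows. This is an illustrative example rather than the paper's implementation; the function names are hypothetical, and the "sample a future state from the same trajectory as the goal" strategy is the one popularized by Hindsight Experience Replay:

```python
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn, n_relabeled=4, rng=None):
    """Hindsight goal relabeling: states reached later in a trajectory are
    treated as goals the agent 'intended' to reach, so any reward-free
    trajectory becomes a (sub-optimal) goal-reaching demonstration.

    `trajectory` is a list of (state, action, next_state, goal) tuples;
    `reward_fn(achieved, goal)` scores the relabeled transition.
    """
    rng = rng or np.random.default_rng()
    relabeled = []
    T = len(trajectory)
    for t, (state, action, next_state, _orig_goal) in enumerate(trajectory):
        # Sample indices of future timesteps in the same trajectory.
        future_idx = rng.integers(t, T, size=min(n_relabeled, T - t))
        for idx in future_idx:
            new_goal = trajectory[idx][2]  # an achieved state becomes the goal
            reward = reward_fn(next_state, new_goal)
            relabeled.append((state, action, next_state, new_goal, reward))
    return relabeled
```

The relabeled transitions can then be fed to any off-policy goal-conditioned RL algorithm, with no externally provided reward signal required.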
World Model as a Graph: Learning Latent Landmarks for Planning
Lunjun Zhang, Ge Yang, Bradly Stadie
International Conference on Machine Learning (ICML), 2021 (Long Talk)
paper / poster / code / website

"How can we learn world models that endow agents with the ability to do temporally extended reasoning?"

We demonstrate how to combine RL with graph search to do long-horizon planning on a learnable graph-structured world model.
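As an illustrative sketch (not the paper's code), once a graph-structured world model provides landmarks and edge costs, long-horizon planning reduces to shortest-path search over that graph; here the `edges` dictionary and its costs are hypothetical stand-ins for learned latent landmarks and learned distance estimates:

```python
import heapq

def plan_over_landmarks(edges, start, goal):
    """Dijkstra shortest-path search over a landmark graph.

    `edges` maps each landmark to a list of (neighbor, cost) pairs, where
    the cost would come from a learned distance estimate between latent
    landmarks. Returns the landmark sequence from start to goal, or None
    if the goal is unreachable.
    """
    frontier = [(0.0, start, [start])]  # (cumulative cost, node, path so far)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in edges.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + edge_cost, neighbor, path + [neighbor]))
    return None
```

A low-level goal-conditioned policy would then be tasked with reaching each landmark on the returned path in turn, so the graph search handles temporally extended reasoning while RL handles local control.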