Wenyuan Zeng

  • PhD Student · University of Toronto
  • Research Scientist · Uber ATG

Hi, I'm a PhD student in the Department of Computer Science at the University of Toronto (since 2017), advised by Prof. Raquel Urtasun. I'm also a Research Scientist at Uber Advanced Technologies Group (Toronto), doing research related to autonomous vehicles, and a member of the Vector Institute.

Prior to joining the University of Toronto, I obtained my bachelor's degree in Mathematics and Physics from Tsinghua University. During my undergraduate studies, I was very fortunate to work with Prof. Zhiyuan Liu and Prof. Sanja Fidler.

I am interested in computer vision, robotics, and machine learning, specifically topics such as 3D perception, motion forecasting, and motion planning. My recent work focuses on developing interpretable and robust algorithms for autonomous driving.


  • 3 papers (1 oral, 1 spotlight) have been accepted by ECCV 2020.

  • 1 paper has been accepted by IROS 2020.

  • 2 papers (1 oral) have been accepted by CVPR 2020.

  • 1 paper (oral) has been accepted by CVPR 2019.

  • 2 papers have been accepted by ICML 2018.


(* indicates equal contribution.)
DSDNet: Deep Structured self-Driving Network
Wenyuan Zeng, Shenlong Wang, Renjie Liao, Yun Chen, Bin Yang, Raquel Urtasun
European Conference on Computer Vision (ECCV), Glasgow, 2020.
show abstract / paper

A unified framework for socially consistent multi-modal prediction and safe planning under uncertainty.

In this paper, we propose the Deep Structured self-Driving Network (DSDNet), which performs object detection, motion prediction, and motion planning with a single neural network. Towards this goal, we develop a deep structured energy based model which considers the interactions between actors and produces socially consistent multimodal future predictions. Furthermore, DSDNet explicitly exploits the predicted future distributions of actors to plan a safe maneuver by using a structured planning cost. Our sample-based formulation allows us to overcome the difficulty in probabilistic inference of continuous random variables. Experiments on a number of large-scale self driving datasets demonstrate that our model significantly outperforms the state-of-the-art.
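The sample-based formulation can be illustrated with a minimal sketch: instead of performing inference over continuous trajectories, score a discrete set of sampled candidates with an energy and normalize with a softmax to obtain a multimodal distribution. The energies below are hand-picked for illustration, not from the paper.

```python
import numpy as np

def trajectory_distribution(energies):
    """Convert per-sample energies into a normalized distribution
    over a discrete set of candidate trajectories (softmax of -E)."""
    z = -energies - np.max(-energies)   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical example: 4 candidate trajectories; two low-energy
# candidates produce two likely modes, mimicking multimodality.
energies = np.array([1.0, 1.1, 5.0, 5.2])
probs = trajectory_distribution(energies)
```

Because the support is a finite sample set, normalization is a sum rather than an intractable integral over continuous trajectory space.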

V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction
Tsun-Hsuan Wang, Sivabalan Manivasagam, Ming Liang, Bin Yang, Wenyuan Zeng, Raquel Urtasun
European Conference on Computer Vision (ECCV), Glasgow, 2020.
Oral Presentation

show abstract / paper

A vehicle-to-vehicle communication network for self-driving.

In this paper, we explore the use of vehicle-to-vehicle (V2V) communication to improve the perception and motion forecasting performance of self-driving vehicles. By intelligently aggregating the information received from multiple nearby vehicles, we can observe the same scene from different viewpoints. This allows us to see through occlusions and detect actors at long range, where the observations are very sparse or non-existent. We also show that our approach of sending compressed deep feature map activations achieves high accuracy while satisfying communication bandwidth requirements.

Weakly-supervised 3D Shape Completion in the Wild
Jiayuan Gu, Wei-Chiu Ma, Sivabalan Manivasagam, Wenyuan Zeng, Zihao Wang, Yuwen Xiong, Hao Su, Raquel Urtasun
European Conference on Computer Vision (ECCV), Glasgow, 2020.
Spotlight Presentation

show abstract / paper

Learning 3D shape reconstruction from unaligned and noisy partial point clouds.

3D shape completion for real data is important but challenging, since partial point clouds acquired by real-world sensors are usually sparse, noisy and unaligned. Different from previous methods, we address the problem of learning 3D complete shape from unaligned and real-world partial point clouds. To this end, we propose an unsupervised method to estimate both 3D canonical shape and 6-DoF pose for alignment, given multiple partial observations associated with the same instance. The network jointly optimizes canonical shapes and poses with multi-view geometry constraints during training, and can infer the complete shape given a single partial point cloud. Moreover, learned pose estimation can facilitate partial point cloud registration. Experiments on both synthetic and real data show that it is feasible and promising to learn 3D shape completion through large-scale data without shape and pose supervision.

End-to-end Contextual Perception and Prediction with Interaction Transformer
Luke Lingyun Li, Bin Yang, Ming Liang, Wenyuan Zeng, Mengye Ren, Sean Segal, Raquel Urtasun
International Conference on Intelligent Robots and Systems (IROS), Las Vegas, 2020.
show abstract / paper

Adapting the Transformer to model multi-agent interactions in trajectory prediction.

In this paper, we tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. Towards this goal, we design a novel approach that explicitly takes into account the interactions between actors. To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture, which we call the Interaction Transformer. Importantly, our model can be trained end-to-end, and runs in real-time. We validate our approach on two challenging real-world datasets: ATG4D and nuScenes. We show that our approach can outperform the state-of-the-art on both datasets. In particular, we significantly improve the social compliance between the estimated future trajectories, resulting in far fewer collisions between the predicted actors.
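As a rough illustration of the building block the Interaction Transformer adapts (not the paper's architecture), here is scaled dot-product self-attention over per-agent feature vectors, with identity query/key/value projections for simplicity:

```python
import numpy as np

def agent_attention(features):
    """Scaled dot-product self-attention over agent features.

    features: (n_agents, d) array; each row is one actor's embedding.
    Returns the updated features and the attention matrix, whose row i
    gives how much actor i attends to every other actor.
    """
    d = features.shape[1]
    scores = features @ features.T / np.sqrt(d)   # pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # rows sum to 1
    return attn @ features, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))                   # 5 actors, 8-dim embeddings
out, attn = agent_attention(feats)
```

In the full model these interactions are computed recurrently over time so that each actor's predicted trajectory accounts for its neighbors.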

LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World
Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang, Wei-Chiu Ma, Raquel Urtasun
Computer Vision and Pattern Recognition (CVPR), Seattle, 2020.
Oral Presentation

show abstract / paper

A high-fidelity LiDAR simulator for closed-loop evaluation of autonomous driving systems.

We tackle the problem of producing realistic simulations of LiDAR point clouds, the sensor of preference for most self-driving vehicles. We argue that, by leveraging real data, we can simulate the complex world more realistically compared to employing virtual worlds built from CAD/procedural models. Towards this goal, we first build a large catalog of 3D static maps and 3D dynamic objects by driving around several cities with our self-driving fleet. We can then generate scenarios by selecting a scene from our catalog and "virtually" placing the self-driving vehicle (SDV) and a set of dynamic objects from the catalog in plausible locations in the scene. To produce realistic simulations, we develop a novel simulator that captures both the power of physics-based and learning-based simulation. We first utilize ray casting over the 3D scene and then use a deep neural network to produce deviations from the physics-based simulation, producing realistic LiDAR point clouds. We showcase LiDARsim's usefulness for testing perception algorithms on long-tail events and for end-to-end closed-loop evaluation on safety-critical scenarios.

PnPNet: End-to-End Perception and Prediction with Tracking in the Loop
Ming Liang*, Bin Yang*, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, Raquel Urtasun
Computer Vision and Pattern Recognition (CVPR), Seattle, 2020.
show abstract / paper

The first single neural network that solves detection, tracking, and prediction in an end-to-end manner.

We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles. Towards this goal we propose PnPNet, an end-to-end model that takes as input sequential sensor data, and outputs at each time step object tracks and their future trajectories. The key component is a novel tracking module that generates object tracks online from detections and exploits trajectory level features for motion forecasting. Specifically, the object tracks get updated at each time step by solving both the data association problem and the trajectory estimation problem. Importantly, the whole model is end-to-end trainable and benefits from joint optimization of all tasks. We validate PnPNet on two large-scale driving datasets, and show significant improvements over the state-of-the-art with better occlusion recovery and more accurate future prediction.
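PnPNet learns its association step end-to-end; as a stand-in to illustrate what the data association problem looks like, here is a classical greedy nearest-neighbor matcher with a gating threshold (all positions and the `gate` value are invented for the example):

```python
import numpy as np

def associate(tracks, detections, gate=2.0):
    """Greedily match each track to its nearest unused detection.

    tracks, detections: (n, 2) arrays of 2D positions.
    Returns {track_index: detection_index}; detections farther than
    `gate` from every track are left unmatched (they would seed new
    tracks in a full tracker).
    """
    matches = {}
    used = set()
    for ti, t in enumerate(tracks):
        dists = np.linalg.norm(detections - t, axis=1)
        for di in np.argsort(dists):
            if di not in used and dists[di] <= gate:
                matches[ti] = int(di)
                used.add(int(di))
                break
    return matches

tracks = np.array([[0.0, 0.0], [10.0, 10.0]])
dets = np.array([[9.5, 10.2], [0.3, -0.1], [50.0, 50.0]])
m = associate(tracks, dets)
```

The learned version replaces Euclidean distance with affinities computed from trajectory-level features, which is what enables the joint optimization described above.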

End-to-end Interpretable Neural Motion Planner
Wenyuan Zeng*, Wenjie Luo*, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, Raquel Urtasun
Computer Vision and Pattern Recognition (CVPR), Long Beach, 2019.
Oral Presentation

show abstract / paper / talk

End-to-end motion planner with interpretable intermediate representations.

In this paper, we propose a neural motion planner for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Towards this goal, we design a holistic model that takes as input raw LIDAR data and an HD map and produces interpretable intermediate representations in the form of 3D detections and their future trajectories, as well as a cost volume defining the goodness of each position that the self-driving car can take within the planning horizon. We then sample a set of diverse physically possible trajectories and choose the one with the minimum learned cost. Importantly, our cost volume is able to naturally capture multi-modality. We demonstrate the effectiveness of our approach on real-world driving data captured in several cities in North America. Our experiments show that the learned cost volume can generate safer plans than all the baselines.
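The "sample, score against the cost volume, take the minimum" step can be sketched in a few lines. The cost volume below is a toy hand-made grid and the candidates are hard-coded waypoint lists, not outputs of the learned model:

```python
import numpy as np

def plan(cost_volume, candidates):
    """Pick the candidate trajectory with the lowest accumulated cost.

    cost_volume: (T, H, W) array; cost of occupying cell (y, x) at time t.
    candidates:  (K, T, 2) integer array of (y, x) waypoints per timestep.
    """
    costs = np.array([
        sum(cost_volume[t, y, x] for t, (y, x) in enumerate(traj))
        for traj in candidates
    ])
    return int(costs.argmin()), costs

# Toy cost volume: 3 timesteps on a 4x4 grid, an "obstacle" at column 3.
cv = np.zeros((3, 4, 4))
cv[:, :, 3] = 10.0
candidates = np.array([[(1, 1), (1, 2), (1, 3)],   # drives into the obstacle
                       [(1, 1), (2, 1), (3, 1)]])  # stays clear
best, costs = plan(cv, candidates)
```

In the real system the cost volume is predicted per frame by the network, so the planner inherits whatever multimodality the learned costs express.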

Learning to Reweight Examples for Robust Deep Learning
Mengye Ren, Wenyuan Zeng, Bin Yang, Raquel Urtasun
International Conference on Machine Learning (ICML), Stockholm, 2018.
show abstract / paper / talk / code

Learning to reweight training examples in an online manner to overcome noisy or class-imbalanced datasets.

Deep neural networks have been shown to be very powerful modeling tools for many supervised learning tasks involving complex input patterns. However, they can also easily overfit to training set biases and label noise. In addition to various regularizers, example reweighting algorithms are popular solutions to these problems, but they require careful tuning of additional hyperparameters, such as example mining schedules and regularization hyperparameters. In contrast to past reweighting methods, which typically consist of functions of the cost value of each example, in this work we propose a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions. To determine the example weights, our method performs a meta gradient descent step on the current mini-batch example weights (which are initialized from zero) to minimize the loss on a clean unbiased validation set. Our proposed method can be easily implemented on any type of deep network, does not require any additional hyperparameter tuning, and achieves impressive performance on class imbalance and corrupted label problems where only a small amount of clean validation data is available.
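The meta-gradient step can be worked out by hand for 1D least squares, which makes the mechanism concrete. The sketch below computes the gradient of the validation loss with respect to the per-example weights analytically (the paper does this with automatic differentiation on deep networks), then clips negatives and normalizes; the data, with one corrupted label, is invented for illustration:

```python
import numpy as np

def reweight(X, y, Xv, yv, w=0.0, lr=0.1):
    """One step of the learn-to-reweight idea on 1D least squares.

    Per-example weights eps start at zero. We differentiate the
    validation loss after a single weighted SGD step w.r.t. eps,
    clip negatives to zero, and normalize, so examples whose
    gradients hurt validation performance receive zero weight.
    """
    r = X * w - y                          # training residuals
    dw_deps = -lr * 2 * r * X              # d(w') / d(eps_i)
    gv = np.mean(2 * Xv * (Xv * w - yv))   # d(val loss)/dw at eps = 0
    dval_deps = gv * dw_deps               # chain rule
    weights = np.maximum(0.0, -dval_deps)  # keep only helpful examples
    s = weights.sum()
    return weights / s if s > 0 else weights

X = np.array([1.0, 2.0, 1.0])     # third example has a corrupted label
y = np.array([2.0, 4.0, -2.0])    # clean targets follow y = 2x
Xv = np.array([1.0, 3.0])         # small clean validation set
yv = np.array([2.0, 6.0])
w_ex = reweight(X, y, Xv, yv)     # → [0.2, 0.8, 0.0]
```

The corrupted example's meta-gradient points away from reducing validation loss, so it is clipped to exactly zero weight, while the clean examples share the mass.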

Differentiable Compositional Kernel Learning for Gaussian Processes
Shengyang Sun, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, Roger Grosse
International Conference on Machine Learning (ICML), Stockholm, 2018.
show abstract / paper / talk / code

Neural Kernel Network (NKN), a flexible family of kernels represented by a neural network.

The generalization properties of Gaussian processes depend heavily on the choice of kernel, and this choice remains a dark art. We present the Neural Kernel Network (NKN), a flexible family of kernels represented by a neural network. The NKN architecture is based on the composition rules for kernels, so that each unit of the network corresponds to a valid kernel. It can compactly approximate compositional kernel structures such as those used by the Automatic Statistician (Lloyd et al., 2014), but because the architecture is differentiable, it is end-to-end trainable with gradient-based optimization. We show that the NKN is universal for the class of stationary kernels. Empirically we demonstrate pattern discovery and extrapolation abilities of NKN on several tasks that depend crucially on identifying the underlying structure, including time series and texture extrapolation, as well as Bayesian optimization.
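The composition rules the NKN architecture is built on can be checked directly: nonnegative weighted sums of kernels and elementwise products of kernels are themselves valid (positive semidefinite) kernels, the latter by the Schur product theorem. This sketch verifies both closure rules numerically; it is an illustration of the rules, not the paper's network:

```python
import numpy as np

def rbf(X, lengthscale=1.0):
    """RBF (squared-exponential) Gram matrix for 1D inputs."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def linear(X):
    """Linear-kernel Gram matrix for 1D inputs."""
    return np.outer(X, X)

X = np.linspace(-2, 2, 20)
K1, K2 = rbf(X), linear(X)

# NKN-style unit: a nonnegative weighted sum, then an elementwise
# product. Both operations preserve positive semidefiniteness, so
# every unit of the network outputs a valid kernel.
K_sum = 0.7 * K1 + 0.3 * K2
K_prod = K1 * K2

min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
```

Because each unit stays inside the space of valid kernels, the mixing weights can be trained end-to-end with ordinary gradient descent.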

Incorporating Relation Paths in Neural Relation Extraction
Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, Maosong Sun
Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, 2017.
show abstract / paper / code

A path-based neural relation extractor which infers relations using inference chains.

Distantly supervised relation extraction has been widely used to find novel relational facts from plain text. To predict the relation between a pair of target entities, existing methods rely solely on direct sentences containing both entities. In fact, there are also many sentences containing only one of the target entities, which provide rich and useful information for relation extraction. To address this issue, we build inference chains between two target entities via intermediate entities, and propose a path-based neural relation extraction model to encode the relational semantics from both direct sentences and inference chains. Experimental results on real-world datasets show that our model can make full use of those sentences containing only one target entity, and achieves significant and consistent improvements on relation extraction as compared with baselines.
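The inference chains themselves are easy to picture: two-hop paths through an intermediate entity connecting the head and tail. This toy sketch extracts such chains from a set of triples (the facts and the entity name are invented); the paper's model then encodes these chains with a neural network alongside direct sentences:

```python
from collections import defaultdict

def inference_chains(facts, head, tail):
    """Find two-hop relation paths head -r1-> mid -r2-> tail.

    facts: iterable of (head, relation, tail) triples.
    Returns a list of (r1, mid, r2) chains, the kind of evidence a
    path-based extractor encodes in addition to direct sentences.
    """
    out_edges = defaultdict(list)
    for h, r, t in facts:
        out_edges[h].append((r, t))
    chains = []
    for r1, mid in out_edges[head]:
        for r2, t in out_edges[mid]:
            if t == tail:
                chains.append((r1, mid, r2))
    return chains

facts = [
    ("Melbourne", "city_of", "Australia"),
    ("A. Person", "born_in", "Melbourne"),    # invented entity
    ("A. Person", "works_in", "Sydney"),
    ("Sydney", "city_of", "Australia"),
]
chains = inference_chains(facts, "A. Person", "Australia")
```

Here neither supporting sentence needs to mention both "A. Person" and "Australia" directly; the relation is inferred by composing the two hops.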