Xujie Si's Homepage

Xujie Si

Assistant Professor

Department of Computer Science
University of Toronto

Email: email.
CV · Google Scholar · DBLP

About

I am an Assistant Professor in the Department of Computer Science at the University of Toronto. I am also a faculty affiliate at the Vector Institute and an affiliate member at Mila - Quebec AI Institute, where I hold a Canada CIFAR AI Chair.

I spent two wonderful years (2021-2022) as an Assistant Professor in the School of Computer Science at McGill University. Before that, I finished my Ph.D. in Computer and Information Science at the University of Pennsylvania, advised by Mayur Naik. I obtained my M.S. in computer science from Vanderbilt University and my B.E. (with Honors) from Nankai Unversity.

News

Oct. 2024 Two works on code refinement with LLMs and neuro-symbolic AI got accepted at NeurIPS'24.
July 2024 Zhaoyu's work on deep learning for theorem proving got accepted at COLM'24.
Apr. 2024 Logan and Zhaoyu's work on autoformalizing Euclidean geometry got accepted at ICML'24.
Sep. 2023 Sissi and Zhaoyu's work on learning reliable logical ruless got accepted at NeurIPS'23.
Sep. 2023 Qidong and Allen's work on fuzzing dynamic deep learning compilers got accepted at APLAS'23.
Apr. 2023 Allen's work on learning reliable neural specifications got accepted at ICML'23 (oral).

Research

My research to date has focused on improving software quality by advancing the state-of-the-art in static analysis, verification, synthesis, and testing using AI-based techniques ranging from symbolic reasoning, constraint solving, statistical, and probabilistic models, to recent approaches for deep learning and reinforcement learning. At the same time, these program reasoning challenges serve as a rich resource of inspirations for me to advance AI techniques, e.g. scalable inference, logical rule learnging, representation learning, and reinforcement learning with extremely sparse rewards.

Combining Statistical and Logical Methods in Program Reasoning

Real-world programs are large and complex, which involves many third party libraries that may lack specification or even source code, and new features or patches are being introduced regularly. Plus well-known limitations such as undecidability, static analysis tools are forced to make approximations, which in turn result in large numbers of false alarms. My research combines statistical and logical methods to address these complexities and uncertainties.

Automatically learning API specifications from "big code". [USENIX Security'16]
This work uses statistical methods to infer API specifications from complex software systems like Linux kenerl, OpenSSL library, PHP and Python. The resulting open-source prototype APISan has found 76 previously unknown API misuse bugs in these systems. APISan does not need access to API's source code or binary; instead, it observes how and in which context APIs are used.
Improving analysis accuracy by learning from user feedback. [OOPSLA'17, MAPL17, PLDI'19]
A sound static analysis produces false alarms that are unavoidable by the analysis designer. The analysis user, who usually implements the code, has much more information than the analysis designer. Incorporating feedback from the users opens up a new dimension to improve the analysis. However, user can occasionally make mistakes. The insight is to reduce rule-based static analysis into probabilistic models, such as Markov logic network and Bayesian network, so that even noisy feedback can improve analysis accuracy significantly.

Scalable Inference and Rule Learning

Inference and learning are two key components of machine learning applications. After reducing static analysis into probabilistic models, user feedback is incorporated by performing Maximum A Posteriori probability estimate, or MAP inference. Off-the-shelf inference engines cannot scale to static analysis applications. My research improves inference by a systematic constraint solving process. Furthermore, inference helps to learn rule weights but not rules that usually designed by experts. To learn logical rules from data, my research proposes a scalable logic program synthesis framework, which not only improves static analysis but also applies to other domains like big-data analytics and software defined networks.

Scalable inference via systematic constraint solving. [CP'16, CAV'17]
MAP inference of Markov logic network is essentially a MaxSAT problem, which can be solved in two phases: first, ground logical rules into a set of weighted constraints; then solve weighted constraints by a MaxSAT solver. However, naively doing so will be extremely inefficient. Both phases can be optimized. From the perspective of application, constraints do not have to be solved entirely, which allows us to have a combined top-down (exploits laziness) and bottom-up (exploits eagerness) grounding. From the perspective of the underlying solver, which is invoked many times instead of only once, so that incremental solving is feasible and efficient.
Synthesizing rich declarative logic programs. [FSE'18]
Declarative logic programs, in particular Datalog programs, have witnessed promising applications in various domains including static analysis due to its rich expressivity, and also for the exact same reason, its synthesis is a fairly challenging task. This work proposes a systematic template augmentation technique, which enables synthesis of arbitrary rules but without significantly enlarging the search space. Furthermore, we synergistically combines a bi-directional search over version space with active learning so that efficient search is achieved and the number of required examples is minimized.
Scalable synthesis via numerical relaxation. [ML4P'18, IJCAI'19]
Inspired by the fact that numerical relaxation technique is widely used and remarkably successful in optimization problems, this work proposes a numerical relaxation technique for logic programs, which is fundamentally different from traditional search based approaches. This work demonstrates that gradient-based and stochastic approaches can be used to learn logic programs very efficiently.

Deep Learning for Program Reasoning

Deep learning sparks a remarkable revolution in many challenging fields of science and engineering. Deep learning combined with reinforcement learning significantly outperforms human experts on video and board games by directly learning from raw pixels and self-play. However, can this technique help to solve programming challenges? My thesis work aims to answer this very question, which involves addressing two fundamental AI challenges: representation learning of structured data with rich semantics, and the interplay between neural networks and symbolic reasoning.

Program verification by deep reinforcement learning. [NeurIPS'18, CAV'20]
This work concerns attacking the fundamental challenge of automated software verification, which is finding a strong enough loop invariant for a given loop. We develop an end-to-end learning framework, Code2Inv, which takes a piece of code as input, then automatically learns and improves a neural policy by interacting with a checker, and finally produces an invariant so that the verification can be done successfully. To achive so, we prose a novel representation learning technique for programs so that the semantics (rather than syntax) can be captured by neural networks, and two reward mechansims so that reinforcement learning could make progress from extremely sparse reward.
Meta-learning on synthesis. [ICLR'19]
This work investigates the capability of deep learning and reinforcement learning in a much broader domain, namely syntax guided program synthesis (SyGuS), where each synthesis task has its own domain specific language, making manually designing heuristics for each task prohibitive. This also makes learning across different tasks extremely difficult as generalizing across different languages (i.e. learn from one language and then test on a different language) is near the limit of machine learning and has been rarely explored so far. Our attempt in this meta-learning direction and demonstrates quite promising results — the learned neural representation as well as policy can be transferred to (i.e. help to accelerate) synthesis tasks that use similar but different grammars.
Learning-aided reasoning. [FMCAD'21]
Though end-to-end learning has significant potential as shown by my recent research, it is impractical to replace classic approaches with end-to-end learning for various reasons including interpretability, training difficulty, and perhaps more importantly data inefficiency. Instead, deep learning should be integrated with classic approaches so that there is no need to learn everything from scratch. I envision that the future program reasoning techniques and the underlying constraint solving techniques (e.g. SAT, SMT) will be equipped with a learning component, which can be automatically and continuously improved. One of my ongoing work is introducing learning-based techniques into the state-of-the-art software verification and testing tool chain such as SeaHorn and KLEE.

Learning and Reasoning over Perceptual Data

Unartificial intelligence, like human intelligence, often directly process perceptual data (e.g., images, sounds, handwritten texts). Although these data do have interesting structures and rich semantics, automatically extracting them out remains an unsolved challenge. Do we (as homo sapiens) need to solve this challenge? The answer is unfortunately yes, even if every homo sapien masters Computational Thinking, e.g., becomes an excellent algorithm designer and professional programmer, because algorithm design and programming will only work after some "seeminly easy" manual pre-processing. This line of my research is to tackle this "seemingly easy" part jointly with algorithmic learning (i.e., learning efficient and interpretable algorithms).

Learning and reasoning over raw pixels. [NeurIPS'21a, NeurIPS'21b]
Deep neural networks, especially convolutational neural networks (CNN), have achieved remarkable successes in processing images. Beyond object recognition, images often contain rich semantics (e.g., physics laws, traffic laws, a lovely story, etc.). Neural approaches like CNNs are unlikely to capture such rich semantics. We believe either more specialized (e.g., logic insipired) architectures or a neuro-symboliic design which directly incorporates a logical component are needed. SG-SANet (NeurIPS'21a) is a prototype that learn and solve constraints from raw pixels in an end-to-end fashion, which is possible due to one key component called Symbol Grounding, which explicitly encourages (but without direct supervision) the model to associate a group of pixels with a unique symbol. Scallop (NeurIPS'21b) is another prototype that integerates neural components with probabilistic logical rules. Scallop enables logical reasoning over raw pixels as well as an external knowledge base, and some immediate benefits are interpretable predictions and great data efficiency.