As deep learning models are now widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inference becomes crucial for efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations (e.g., double buffering). In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering directly in the tensor programs. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity (e.g., allowing program-statement-level optimizations). We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that allows developers to focus on scheduling each operator individually and automates fusion after scheduling, greatly reducing the engineering effort required for operator fusion. The proposed paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces the tuning time. With the proposed paradigm, we implement a deep learning compiler, Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms the state-of-the-art DNN inference framework ONNX Runtime and the compiler TVM equipped with the schedulers AutoTVM and Ansor by up to 1.48x (1.22x on average). It also reduces tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. We have open-sourced Hidet at https://www.github.com/hidet-org/hidet.
@inproceedings{hidet2023ding,
author = {Ding, Yaoyao and Yu, Cody Hao and Zheng, Bojian and Liu, Yizhi and Wang, Yida and Pekhimenko, Gennady},
title = {Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs},
isbn = {9781450399166},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3575693.3575702},
doi = {10.1145/3575693.3575702},
booktitle = {Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
pages = {370--384},
year = {2023},
numpages = {15},
keywords = {tensor computation, deep learning systems, programming models, compilation, systems for machine learning},
location = {Vancouver, BC, Canada},
series = {ASPLOS 2023},
}
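As a rough illustration of the task-mapping idea described in the abstract above, the short Python sketch below shows how a mapping can assign each worker (e.g., a CUDA thread) an ordered list of tasks in a 2-D task grid. The names spatial and repeat and the composition rule here are illustrative assumptions made for this sketch, not Hidet's actual API.

def spatial(rows, cols):
    # each of rows * cols workers owns exactly one task: worker w -> (w // cols, w % cols)
    def assign(worker):
        return [(worker // cols, worker % cols)]
    assign.num_workers, assign.task_shape = rows * cols, (rows, cols)
    return assign

def repeat(r, c, inner):
    # each worker of `inner` additionally loops (row-major) over an r x c grid of
    # copies of inner's task grid, so it executes r * c times as many tasks
    ih, iw = inner.task_shape
    def assign(worker):
        return [(ti * ih + i, tj * iw + j)
                for ti in range(r) for tj in range(c)
                for (i, j) in inner(worker)]
    assign.num_workers, assign.task_shape = inner.num_workers, (r * ih, c * iw)
    return assign

# 32 workers cooperatively cover an 8 x 16 task grid; each worker executes 4 tasks.
mapping = repeat(2, 2, spatial(4, 8))
for w in range(mapping.num_workers):
    for i, j in mapping(w):
        pass  # the task body goes here, e.g. C[i, j] = dot(A[i, :], B[:, j])

Because the mapping itself spells out both the assignment and the iteration order, the same loop nest can be retargeted (e.g., swapping the spatial and repeat factors) without rewriting the task body, which is the flexibility the paradigm aims for.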
2021
IOS: Inter-Operator Scheduler for CNN Acceleration
To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization. However, a single operator can no longer fully utilize the available parallelism given the rapid advances in high-performance hardware, resulting in a large gap between the peak performance and the achieved performance. This performance gap is more severe under smaller batch sizes. In this work, we extensively study the parallelism between operators and propose the Inter-Operator Scheduler (IOS) to automatically schedule the parallel execution of multiple operators through a novel dynamic programming algorithm. IOS consistently outperforms state-of-the-art libraries (e.g., TensorRT) by 1.1 to 1.5x on modern CNN benchmarks.
@inproceedings{ios2021ding,
author = {Ding, Yaoyao and Zhu, Ligeng and Jia, Zhihao and Pekhimenko, Gennady and Han, Song},
booktitle = {Proceedings of Machine Learning and Systems},
editor = {Smola, A. and Dimakis, A. and Stoica, I.},
pages = {167--180},
title = {IOS: Inter-Operator Scheduler for CNN Acceleration},
volume = {3},
year = {2021},
}
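To make the scheduling idea above concrete, here is a toy dynamic-programming sketch that partitions a list of mutually independent operators into consecutive parallel stages. It is a heavily simplified stand-in, not the actual IOS algorithm (which searches over subsets of an operator DAG and profiles the cost of each candidate stage on the GPU); the analytic cost model and latency numbers below are assumptions made purely for illustration.

from functools import lru_cache

PARALLEL_UNITS = 2      # crude stand-in for the GPU's ability to overlap kernels
STAGE_OVERHEAD = 0.05   # assumed fixed per-stage launch/synchronization overhead (ms)

def stage_cost(latencies):
    # IOS measures a candidate stage by profiling its concurrent execution on the
    # GPU; this analytic formula is only a toy substitute for that measurement
    return max(max(latencies), sum(latencies) / PARALLEL_UNITS) + STAGE_OVERHEAD

def schedule(latencies):
    n = len(latencies)

    @lru_cache(maxsize=None)
    def best(i):
        # cheapest schedule for operators i..n-1, returned as (cost, stages)
        if i == n:
            return 0.0, ()
        candidates = []
        for j in range(i + 1, n + 1):           # operators i..j-1 form one stage
            rest_cost, rest_stages = best(j)
            cost = stage_cost(latencies[i:j]) + rest_cost
            candidates.append((cost, (tuple(range(i, j)),) + rest_stages))
        return min(candidates)

    return best(0)

cost, stages = schedule((1.0, 0.9, 0.2, 0.15))   # assumed per-operator latencies in ms
print(cost, stages)

The recurrence is the essence: the best schedule starting at operator i tries every possible first stage and reuses the optimal schedule of the remaining operators, so the search explores all stage partitions without enumerating them explicitly.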
2020
GAN Compression: Efficient Architectures for Interactive Conditional GANs
Conditional Generative Adversarial Networks (cGANs) have enabled controllable image synthesis for many vision and graphics applications. However, recent cGANs are 1-2 orders of magnitude more compute-intensive than modern recognition CNNs. For example, GauGAN consumes 281G MACs per image, compared to 0.44G MACs for MobileNet-v3, making it difficult for interactive deployment. In this work, we propose a general-purpose compression framework for reducing the inference time and model size of the generator in cGANs. Directly applying existing compression methods yields poor performance due to the difficulty of GAN training and the differences in generator architectures. We address these challenges in two ways. First, to stabilize GAN training, we transfer knowledge of multiple intermediate representations of the original model to its compressed model and unify unpaired and paired learning. Second, instead of reusing existing CNN designs, our method finds efficient architectures via neural architecture search. To accelerate the search process, we decouple the model training and search via weight sharing. Experiments demonstrate the effectiveness of our method across different supervision settings, network architectures, and learning methods. Without losing image quality, we reduce the computation of CycleGAN by 21x, Pix2pix by 12x, MUNIT by 29x, and GauGAN by 9x, paving the way for interactive image synthesis.
@inproceedings{li2020gan,
title = {GAN Compression: Efficient Architectures for Interactive Conditional GANs},
author = {Li, Muyang and Lin, Ji and Ding, Yaoyao and Liu, Zhijian and Zhu, Jun-Yan and Han, Song},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2020},
}
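To give a flavor of the intermediate-representation distillation mentioned above, the PyTorch sketch below projects selected student feature maps with learnable 1x1 convolutions and matches them to the teacher's feature maps with an MSE loss. Layer selection, loss weighting, and the exact projection used in the paper are simplified assumptions here.

import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # one 1x1 projection per distilled layer, mapping student widths to teacher widths
        self.projections = nn.ModuleList(
            nn.Conv2d(s, t, kernel_size=1) for s, t in zip(student_channels, teacher_channels)
        )

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        for proj, fs, ft in zip(self.projections, student_feats, teacher_feats):
            # the teacher is frozen, so its activations are detached from the graph
            loss = loss + nn.functional.mse_loss(proj(fs), ft.detach())
        return loss

# Usage: collect intermediate activations from both generators (e.g., with forward
# hooks), then add this term to the usual GAN and reconstruction objectives.
distill = FeatureDistillLoss(student_channels=[32, 64], teacher_channels=[64, 128])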
2019
CityFlow: A Multi-Agent Reinforcement Learning Environment for Large Scale City Traffic Scenario
Traffic signal control is an emerging application scenario for reinforcement learning. Besides being an important problem that affects people's daily commute, traffic signal control poses unique challenges for reinforcement learning in terms of adapting to dynamic traffic environments and coordinating thousands of agents, including vehicles and pedestrians. A key factor in the success of modern reinforcement learning is a good simulator that can generate a large number of data samples for learning. The most commonly used open-source traffic simulator, SUMO, is, however, not scalable to large road networks and large traffic flows, which hinders the study of reinforcement learning on traffic scenarios. This motivates us to create a new traffic simulator, CityFlow, with fundamentally optimized data structures and efficient algorithms. CityFlow supports flexible definitions of road networks and traffic flows based on synthetic and real-world data. It also provides a user-friendly interface for reinforcement learning. Most importantly, CityFlow is more than twenty times faster than SUMO and is capable of supporting city-wide traffic simulation with an interactive renderer for monitoring. Besides traffic signal control, CityFlow can serve as the base for other transportation studies and create new possibilities to test machine learning methods in the intelligent transportation domain.
@inproceedings{cityflow2019zhang,
author = {Zhang, Huichu and Feng, Siyuan and Liu, Chang and Ding, Yaoyao and Zhu, Yichen and Zhou, Zihan and Zhang, Weinan and Yu, Yong and Jin, Haiming and Li, Zhenhui},
title = {CityFlow: A Multi-Agent Reinforcement Learning Environment for Large Scale City Traffic Scenario},
year = {2019},
isbn = {9781450366748},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3308558.3314139},
doi = {10.1145/3308558.3314139},
booktitle = {The World Wide Web Conference},
pages = {3620--3624},
numpages = {5},
keywords = {Microscopic Traffic Simulation, Reinforcement Learning Platform, Mobility},
location = {San Francisco, CA, USA},
series = {WWW '19},
}
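The snippet below sketches how a reinforcement-learning loop might drive CityFlow through its Python engine interface. The config path, intersection id, number of signal phases, and the random stand-in policy are placeholder assumptions, and the config is assumed to enable RL-controlled traffic lights.

import random
import cityflow

# the engine loads a road network and traffic flow defined in a JSON config
eng = cityflow.Engine("examples/config.json", thread_num=1)

for step in range(300):
    # observation: per-lane queue lengths, a common state for signal-control agents
    waiting = eng.get_lane_waiting_vehicle_count()

    # action: a random phase stands in for a learned policy's output
    phase = random.randrange(4)
    eng.set_tl_phase("intersection_1_1", phase)

    # advance the simulation by one step
    eng.next_step()

    # a simple reward: negative total queue length after the step
    reward = -sum(eng.get_lane_waiting_vehicle_count().values())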