- Ph.D. student, supervised by Prof. Gennady Pekhimenko
SysLab, Department of Computer Science
University of Toronto
Bahen Centre for Information Technology
40 St. George Street, Room 5206
EMAIL: serailhydra AT cs DOT toronto DOT edu
I am a Ph.D. student in the Department of Computer Science at the University of Toronto, working with Prof. Gennady Pekhimenko. I lead the TBD (Training Benchmark for DNNs) project, which profiles modern DNN training workloads across a range of software and hardware environments. In this project we actively collect state-of-the-art DNN models and build tools to extract important, intuitive performance metrics. This is a long-term project. Please visit our project website for more details.
I am currently contributing to the MLPerf benchmark, working on the cloud version of the Deep Speech 2 inference benchmark. MLPerf is a broad benchmark suite for measuring the DNN computation performance of ML frameworks, ML hardware accelerators, and ML cloud platforms. It is now the most prestigious benchmark in the community and has contributors from tens of companies and universities.
I am also participating in Project Fiddle at Microsoft Research, working with Amar Phanishayee. My sub-project uses dependency graph analysis to profile large-scale DNN training, with the goal of answering performance-related what-if questions.
I am broadly interested in systems, architecture, and machine learning. I am currently working on profiling and optimizing DNN training across various software and hardware environments.
- Ph.D., University of Toronto (2015.9 to present)
- M.Sc., McGill University (2013.9 to 2015.8)
- Bachelor's degree, Shanghai Jiao Tong University, ACM Class (2009.9 to 2013.6)
- Zhu, H., Akrout, M., Zheng, B., Pelegris, A., Phanishayee, A., Schroeder, B., & Pekhimenko, G. (2018). Benchmarking and Analyzing Deep Neural Network Training. In IEEE International Symposium on Workload Characterization 2018. [pdf]
- Zhu, H., Zheng, B., Schroeder, B., Pekhimenko, G., & Phanishayee, A. DNN-Train: Benchmarking and Analyzing DNN Training. In SysML 2018. [pdf]
- El-Sayed, N., Zhu, H., & Schroeder, B. (2017, June). Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations. In Distributed Computing Systems (ICDCS), 2017 IEEE 37th International Conference
- Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T., ... & Zhang, Z. (2013, April). Timestream: Reliable stream computation in the cloud. In Proceedings of the 8th ACM European Conference on Computer Systems