About Me
I am Mohammad Mozaffari, an ML Researcher at ElastixAI and a PhD candidate in the Computer Science Department at the University of Toronto, supervised by Professor Maryam Mehri Dehnavi. I received my B.Sc. in Electrical Engineering with a minor in Computer Engineering from the University of Tehran.
My research interests broadly span machine learning, optimization, and sparsity. In particular, I'm interested in developing new algorithms that leverage sparsity in the training and inference of large-scale machine learning models. I am also interested in enhancing distributed second-order optimization methods to improve the convergence rate of training.
A significant focus of my recent work is the "Compression Trinity" for Large Language Models (LLMs): the interplay of sparsity, quantization, and low-rank approximations, and how it can make these powerful models more efficient. I've dedicated a separate page to these concepts and my related publications. You can explore it here: The Compression Trinity for LLMs.
Publications
Experience
ML Researcher at ElastixAI
Dec 2025 - Present
Manager: Mahyar Najibi
- Developing and implementing advanced model compression techniques for Large Language Models, focusing on the integration of sparsity, quantization, and low-rank approximations.
- Optimizing training and inference pipelines to improve the computational efficiency of large-scale machine learning models on distributed GPU systems.
- Collaborating with the engineering team to deploy high-performance kernels that reduce memory footprint and latency for production-level LLMs.
Research Intern at Autodesk
Aug 2022 - Dec 2022
Manager: Massimiliano Meneghin
- Proposed and implemented CUDA optimizations, reducing simulation time for a multi-GPU fluid dynamics model from 4 hours to 3.2 hours through code profiling and kernel-level enhancements.
- Designed and applied kernel fusion strategies, reducing memory bandwidth consumption by 30% and enhancing computational efficiency in large-scale simulations.
- Collaborated with a team of 3 engineers, utilizing NVIDIA Nsight Systems/Compute to identify and resolve performance bottlenecks, optimizing data flow across multi-GPU nodes and reducing latency by 20%.
Research Intern at the University of Tehran
Aug 2020 - Jul 2021
Supervisor: Professor Maryam Sabbaghiyan
- Developed a mathematical model of spatiotemporal variations in user behavior, improving the accuracy of network traffic predictions by 15% in simulations.
- Implemented machine learning techniques to optimize bandwidth allocation, resulting in a 10% reduction in data transfer latency in test scenarios.
- Gained proficiency in Python and multithreaded programming, creating parallel data processing scripts that reduced analysis time from 2 hours to 90 minutes for large datasets.
Media & Outreach
Guest Interview on Executive Code Podcast
Jul 2025
I was featured on the Executive Code Podcast, where I discussed my work on SLiM and broader strategies for compressing large language models through sparsity, quantization, and low-rank approximation.