About Me
I am Mohammad Mozaffari, an ML Researcher at ElastixAI. I received my PhD in Computer Science from the University of Toronto, supervised by Professor Maryam Mehri Dehnavi, and earned my B.Sc. in Electrical Engineering, with a minor in Computer Engineering, from the University of Tehran.
My research focuses on the "Compression Trinity" for Large Language Models: the interplay of sparsity, quantization, and low-rank approximations to make LLMs faster and smaller. My work has been featured by NVIDIA Research and the official PyTorch blog. You can explore it here: The Compression Trinity for LLMs.
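To make the three axes concrete, here is a toy NumPy sketch that applies 2:4 magnitude pruning, symmetric int8 quantization, and a rank-4 SVD to the same random weight matrix and prints each approximation's relative error. Everything in it (the matrix size, the rank, the bit width) is illustrative and not drawn from any of my papers.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)  # toy weight matrix

# Sparsity: 2:4 semi-structured pruning keeps, in every group of four
# consecutive weights, the two with the largest magnitude.
groups = W.reshape(-1, 4)
keep = np.argsort(-np.abs(groups), axis=1)[:, :2]
mask = np.zeros_like(groups)
np.put_along_axis(mask, keep, 1.0, axis=1)
W_sparse = (groups * mask).reshape(W.shape)

# Quantization: symmetric per-tensor int8 (scale, round, clip, dequantize).
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dequant = W_int8.astype(np.float32) * scale

# Low-rank approximation: keep only the top-r singular directions.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 4
W_lowrank = (U[:, :r] * S[:r]) @ Vt[:r]

for name, approx in [("2:4 sparse", W_sparse),
                     ("int8", W_dequant),
                     (f"rank-{r}", W_lowrank)]:
    err = np.linalg.norm(W - approx) / np.linalg.norm(W)
    print(f"{name:>10}: relative error {err:.3f}")
```

In practice the three are combined rather than applied in isolation, which is exactly the interplay the work above studies.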
Publications
Invited Talks
- PATCH: Learnable Tile-Level Hybrid Sparsity for LLMs — NVIDIA Research, Seattle, Oct 2025
- Compression Trinity: Interplay of Sparsity, Quantization, and Low-Rank Approximation for LLMs — Cerebras, Toronto, Mar 2025
- Efficient LLM Training and Inference: Sparsity, Quantization, and Low-Rank Approximation — Google DeepMind, Seattle, Mar 2025
- Enabling Semi-structured Sparsity in LLMs — NVIDIA Research, Seattle, Mar 2024
- Communication-Efficient Second-Order Optimization Methods — Rutgers University, New Jersey, Nov 2023
Media & Outreach
- When Quantization Isn't Enough: Why 2:4 Sparsity Matters — Official PyTorch Blog
- Smaller Models, Same Power: How SLiM Shrinks LLMs Without Retraining — Guest Interview, Executive Code Podcast, Jul 2025
  YouTube · Spotify · Apple Podcasts
Mentorship
Mentored 7 undergraduate and master's students on projects related to LLM compression. Two mentees were admitted to Stanford for graduate studies.
Experience
ML Researcher at ElastixAI, Dec 2025 – Present
- Research and develop compression techniques for efficient deployment of large language models.
- Investigate Mixture-of-Experts architectures, including token routing, kernel design, and dispatch optimization (a toy routing sketch follows this list).
- Collaborate directly with the CTO and CEO on research direction and production integration.
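As background for the routing bullet above, here is a minimal NumPy sketch of top-k token routing and dispatch, the generic Mixture-of-Experts pattern. It is not ElastixAI's implementation: the expert count, dimensions, and identity "experts" are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 6, 8, 4, 2

tokens = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
W_gate = rng.standard_normal((d_model, n_experts)).astype(np.float32)

# Router: softmax over expert logits, then pick each token's top-k experts.
logits = tokens @ W_gate
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
topk = np.argsort(-probs, axis=1)[:, :k]

# Dispatch and combine: each expert processes its assigned tokens, and
# outputs are summed back per token, weighted by the router probability.
out = np.zeros_like(tokens)
for e in range(n_experts):
    token_ids = np.nonzero((topk == e).any(axis=1))[0]
    if token_ids.size == 0:
        continue
    gate = probs[token_ids, e][:, None]
    out[token_ids] += gate * tokens[token_ids]  # placeholder expert: identity

print(out.shape)  # (6, 8): one combined output per token
```

The per-expert gather/scatter inside the loop is the dispatch step whose kernel-level cost the bullet above refers to.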
Research Intern at Autodesk, Aug 2022 – Dec 2022
- Reduced multi-GPU simulation time from 4h to 3.2h (a 20% reduction) via CUDA kernel optimization and profiling.
- Designed kernel fusion and memory coalescing strategies, reducing memory-bandwidth usage by 30% (a toy fusion sketch follows this list).
- Profiled inter-GPU synchronization and dataflow with Nsight Systems to identify and remove bottlenecks.
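The fusion work above was hand-written CUDA; as a quick, runnable illustration of why fusion cuts memory traffic, the PyTorch sketch below (my own toy example, using torch.compile rather than manual kernels) fuses two elementwise ops so the intermediate tensor never round-trips through memory.

```python
import torch

# Eagerly, these two elementwise ops launch two kernels: the first
# writes the intermediate `y` to memory, the second reads it back.
def scale_shift(x):
    y = x * 2.0
    return y + 1.0

# torch.compile can fuse the pair into a single kernel, so `y` stays
# on-chip and the tensor is read and written only once.
fused = torch.compile(scale_shift)

x = torch.randn(1 << 20)
assert torch.allclose(scale_shift(x), fused(x))
```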