About Me
I am Mohammad Mozaffari, an ML Researcher at ElastixAI. I received my PhD in Computer Science from the University of Toronto, supervised by Professor Maryam Mehri Dehnavi, and earned my B.Sc. in Electrical Engineering, with a minor in Computer Engineering, from the University of Tehran.
My research focuses on the "Compression Trinity" for Large Language Models: the interplay of sparsity, quantization, and low-rank approximations to make LLMs faster and smaller. My work has been featured by NVIDIA Research and the official PyTorch blog. You can explore it here: The Compression Trinity for LLMs.
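To make the three axes concrete, here is a toy NumPy sketch that applies 2:4 magnitude pruning, symmetric int8 quantization, and a rank-4 SVD to the same random weight matrix and prints each approximation's relative error. Everything in it (the matrix size, the rank, the bit width) is illustrative and not drawn from any of my papers.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)  # toy weight matrix

# Sparsity: 2:4 semi-structured pruning keeps, in every group of four
# consecutive weights, the two with the largest magnitude.
groups = W.reshape(-1, 4)
keep = np.argsort(-np.abs(groups), axis=1)[:, :2]
mask = np.zeros_like(groups)
np.put_along_axis(mask, keep, 1.0, axis=1)
W_sparse = (groups * mask).reshape(W.shape)

# Quantization: symmetric per-tensor int8 (scale, round, clip, dequantize).
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dequant = W_int8.astype(np.float32) * scale

# Low-rank approximation: keep only the top-r singular directions.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 4
W_lowrank = (U[:, :r] * S[:r]) @ Vt[:r]

for name, approx in [("2:4 sparse", W_sparse),
                     ("int8", W_dequant),
                     (f"rank-{r}", W_lowrank)]:
    err = np.linalg.norm(W - approx) / np.linalg.norm(W)
    print(f"{name:>10}: relative error {err:.3f}")
```

In practice the three are combined rather than applied in isolation, which is exactly the interplay the work above studies.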
Publications
Invited Talks
- PATCH: Learnable Tile-Level Hybrid Sparsity for LLMs — NVIDIA Research, Seattle, Oct 2025
- Compression Trinity: Interplay of Sparsity, Quantization, and Low-Rank Approximation for LLMs — Cerebras, Toronto, Mar 2025
- Efficient LLM Training and Inference: Sparsity, Quantization, and Low-Rank Approximation — Google DeepMind, Seattle, Mar 2025
- Enabling Semi-structured Sparsity in LLMs — NVIDIA Research, Seattle, Mar 2024
- Communication-Efficient Second-Order Optimization Methods — Rutgers University, New Jersey, Nov 2023
Media & Outreach
- When Quantization Isn't Enough: Why 2:4 Sparsity Matters — Official PyTorch Blog
- Smaller Models, Same Power: How SLiM Shrinks LLMs Without Retraining — Guest Interview, Executive Code Podcast, Jul 2025
  YouTube · Spotify · Apple Podcasts
Mentorship
Mentored 7 undergraduate and master's students on projects related to LLM compression. Two mentees were admitted to Stanford for graduate studies.
Experience
ML Researcher at ElastixAI, Dec 2025 – Present
- Research and develop compression techniques for efficient deployment of large language models.
- Investigate Mixture-of-Experts architectures, including token routing, kernel design, and dispatch optimization (a toy routing sketch follows this list).
- Collaborate directly with the CTO and CEO on research direction and production integration.
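As background for the routing bullet above, here is a minimal NumPy sketch of top-k token routing and dispatch, the generic Mixture-of-Experts pattern. It is not ElastixAI's implementation: the expert count, dimensions, and identity "experts" are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 6, 8, 4, 2

tokens = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
W_gate = rng.standard_normal((d_model, n_experts)).astype(np.float32)

# Router: softmax over expert logits, then pick each token's top-k experts.
logits = tokens @ W_gate
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
topk = np.argsort(-probs, axis=1)[:, :k]

# Dispatch and combine: each expert processes its assigned tokens, and
# outputs are summed back per token, weighted by the router probability.
out = np.zeros_like(tokens)
for e in range(n_experts):
    token_ids = np.nonzero((topk == e).any(axis=1))[0]
    if token_ids.size == 0:
        continue
    gate = probs[token_ids, e][:, None]
    out[token_ids] += gate * tokens[token_ids]  # placeholder expert: identity

print(out.shape)  # (6, 8): one combined output per token
```

The per-expert gather/scatter inside the loop is the dispatch step whose kernel-level cost the bullet above refers to.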
Research Intern at Autodesk, Aug 2022 – Dec 2022
- Reduced multi-GPU simulation time from 4h to 3.2h (a 20% reduction) via CUDA kernel optimization and profiling.
- Designed kernel fusion and memory coalescing strategies, reducing memory-bandwidth usage by 30% (a toy fusion sketch follows this list).
- Profiled inter-GPU synchronization and dataflow with Nsight Systems to identify and remove bottlenecks.
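The fusion work above was hand-written CUDA; as a quick, runnable illustration of why fusion cuts memory traffic, the PyTorch sketch below (my own toy example, using torch.compile rather than manual kernels) fuses two elementwise ops so the intermediate tensor never round-trips through memory.

```python
import torch

# Eagerly, these two elementwise ops launch two kernels: the first
# writes the intermediate `y` to memory, the second reads it back.
def scale_shift(x):
    y = x * 2.0
    return y + 1.0

# torch.compile can fuse the pair into a single kernel, so `y` stays
# on-chip and the tensor is read and written only once.
fused = torch.compile(scale_shift)

x = torch.randn(1 << 20)
assert torch.allclose(scale_shift(x), fused(x))
```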