Welcome to the Compression Trinity Hub
As Large Language Models (LLMs) continue to grow in size and capability, their deployment and inference become increasingly resource-intensive. Enter the Compression Trinity: Sparsity, Quantization, and Low-Rank Approximation, three complementary methods that, when combined, can significantly reduce model size, accelerate computation, and lower energy consumption while preserving most of the model's accuracy.
Through my research, I've developed novel approaches that integrate these techniques to push the boundaries of what's possible in model compression. Here, you'll find insights from my papers—MKOR, SLoPe, and SLiM—each contributing unique advancements to the field. Whether you're a researcher, engineer, or enthusiast, join me in exploring how these methods are shaping the future of efficient AI.
Sparsity
Sparsity is a cornerstone of model compression: it eliminates less critical weights or activations to produce a leaner, more efficient model. In the realm of LLMs, where parameter counts reach into the billions, sparsity offers a path to drastic reductions in memory footprint and computational demand. By leveraging hardware that supports sparse operations, we can achieve significant speedups in both training and inference.
My work in this area includes:
- SLoPe: Introduces a novel "Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining" method. By employing a double-pruned backward pass and integrating low-rank adapters in the final stages of pretraining, SLoPe enhances the accuracy of sparse LLMs while accelerating their training and inference.
- SLiM: Incorporates hardware-friendly sparsity patterns, such as NVIDIA's 2:4 sparsity, to ensure that the benefits of sparsity translate into real-world performance gains on modern GPUs (a minimal example of the 2:4 pattern is sketched below).
These contributions demonstrate how advanced sparsity techniques can be tailored to the unique challenges of LLMs, paving the way for more scalable and efficient AI systems.
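To make the 2:4 pattern concrete, here is a minimal PyTorch sketch that keeps the two largest-magnitude weights in every group of four. It uses a plain magnitude criterion purely for illustration; it is not the pruning strategy from SLoPe or SLiM.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the two smallest-magnitude weights in every group of four.

    Illustrative magnitude-based 2:4 pruning; the last dimension must be
    divisible by 4.
    """
    rows, cols = weight.shape
    assert cols % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)        # split columns into groups of 4
    keep = groups.abs().topk(k=2, dim=-1).indices      # 2 largest-magnitude entries per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)                       # 1 where a weight is kept, 0 elsewhere
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_24 = prune_2_4(w)
# Every group of 4 consecutive weights now has exactly 2 nonzeros, which is
# the structure that NVIDIA's sparse tensor cores can accelerate.
print((w_24.reshape(8, -1, 4) != 0).sum(dim=-1))
```

In practice the pruned weights would be stored in a compressed format and multiplied with the hardware's sparse kernels; the dense mask here is only to make the pattern visible.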
Quantization
Quantization is a powerful technique that reduces the precision of model parameters and activations, transforming high-precision floating-point numbers into lower-bit representations like 8-bit integers. This not only compresses the model size but also accelerates inference by enabling faster arithmetic operations, especially on hardware optimized for low-precision computations.
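For intuition, here is a minimal sketch of symmetric, per-tensor 8-bit uniform quantization in PyTorch. It is a generic textbook scheme, not SLIM-Quant; the scale choice and rounding are the simplest possible ones.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor uniform quantization to int8 (illustrative only)."""
    scale = weight.abs().max() / 127.0                      # map the largest |w| to 127
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate floating-point weight from its int8 code."""
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())   # worst-case error is roughly scale / 2
```

Real deployments typically quantize activations as well and fuse the dequantization into the matrix multiply; the point here is only that each weight is replaced by an integer code plus a shared scale.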
However, quantization can introduce accuracy degradation, particularly when applied uniformly across the model. My research tackles this challenge head-on:
- SLiM: Features "SLIM-Quant," a probabilistic approach to uniform quantization that recasts the non-convex quantization problem as a convex optimization problem. This reformulation significantly reduces quantization error, achieving accuracy comparable to more complex group quantization methods while maintaining computational efficiency.
By pushing the boundaries of what's possible with uniform quantization, SLiM makes it feasible to deploy highly compressed LLMs without sacrificing performance.
Low-Rank Approximation
Low-rank approximation is a compression technique that decomposes large weight matrices into products of much smaller matrices, capturing the essential information with far fewer parameters. The closely related Low-Rank Adaptation (LoRA) attaches small trainable low-rank matrices alongside existing weights. Both ideas are particularly valuable for LLMs, where the sheer size of weight matrices can be a major bottleneck.
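As a concrete (and intentionally generic) example, a rank-r approximation of a weight matrix can be obtained with a truncated SVD; the sketch below is plain PyTorch and not a method from any of the papers.

```python
import torch

def low_rank_factor(weight: torch.Tensor, rank: int):
    """Return thin factors A (m x r) and B (r x n) with A @ B approximating weight."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

W = torch.randn(1024, 1024)
A, B = low_rank_factor(W, rank=64)
# Storage drops from 1024 * 1024 parameters to 2 * 1024 * 64 (an 8x reduction);
# approximation quality depends on how quickly the spectrum of W decays.
print(torch.linalg.norm(W - A @ B) / torch.linalg.norm(W))
```

For a random Gaussian matrix the singular values decay slowly, so the relative error above stays high; the factorization pays off when most of a matrix's energy concentrates in a few directions.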
In my research, low-rank adapters play a crucial role in mitigating the accuracy loss introduced by other compression techniques:
- SLiM: Introduces "SLIM-LoRA," a one-shot low-rank adaptation method that compensates for the errors introduced by quantization and sparsity. By using a novel saliency function, SLIM-LoRA mathematically derives optimal low-rank adapter values without the need for expensive retraining (a generic version of this error-compensation idea is sketched at the end of this section).
- SLoPe: Utilizes "Lazy Low-Rank Adapters" added in the final stages of pretraining to boost model capacity without significantly increasing computational overhead.
Additionally, while not directly focused on LoRA, MKOR employs rank-1 updates in its optimization process, which can be seen as a form of low-rank adjustment to the model's parameters.
These approaches showcase the versatility of low-rank methods in enhancing model efficiency and accuracy, making them an indispensable part of the compression toolkit.
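To give a rough sense of the one-shot error-compensation idea, the sketch below approximates the gap between dense and compressed weights with a low-rank adapter. It is a generic SVD-based illustration under simplified assumptions, not the saliency-based derivation used in SLIM-LoRA.

```python
import torch

def error_adapter(w_dense: torch.Tensor, w_compressed: torch.Tensor, rank: int):
    """Low-rank adapter L @ R approximating the compression error (illustrative only)."""
    error = w_dense - w_compressed                       # what compression destroyed
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    L = U[:, :rank] * S[:rank]
    R = Vh[:rank, :]
    return L, R                                          # w_compressed + L @ R ~ w_dense

# Toy example: "compress" by zeroing half of the weights, then compensate.
w = torch.randn(256, 256)
w_c = w * (torch.rand_like(w) > 0.5).float()
L, R = error_adapter(w, w_c, rank=32)
print(torch.linalg.norm(w - w_c))            # error before compensation
print(torch.linalg.norm(w - (w_c + L @ R)))  # strictly smaller after compensation
```

At inference the adapter is kept as two thin factors, so the extra compute and memory scale with the rank rather than with the size of the full weight matrix.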
Stay Updated with the Latest Findings
The field of model compression is rapidly evolving, with new techniques and optimizations emerging regularly. To keep you informed about the latest advancements and insights from my research, I regularly publish blog posts on this website. These posts delve into recent findings, experimental results, and practical tips for applying compression techniques to your own models.
Check out the latest posts in the sidebar of this page!
Acknowledgements
I would like to express my sincere gratitude to Dr. Amir Yazdanbakhsh and Prof. Maryam Mehri Dehnavi for their constant support and invaluable feedback on my research. Their guidance and insights have been instrumental in shaping and advancing this work in LLM compression.