The Compression Trinity Framework
Large Language Models (LLMs) have become foundational to modern AI, powering everything from chatbots to code generation. But their remarkable capabilities come at a steep cost: a single inference pass through a 70B-parameter model requires hundreds of gigabytes of memory and trillions of floating-point operations. Training is orders of magnitude more expensive still. This makes compression not just a nice-to-have, but a prerequisite for practical deployment.
Three families of compression techniques have emerged as dominant: sparsity (removing parameters), quantization (reducing parameter precision), and low-rank approximations (factoring weight matrices into smaller ones). Historically, these methods have been studied and applied in isolation. This is a mistake.
The core argument of the Compression Trinity framework is that these three pillars are fundamentally complementary because they target distinct hardware bottlenecks:
- Sparsity primarily reduces the computational (FLOP) load.
- Quantization primarily reduces memory bandwidth requirements.
- Low-rank approximations exploit and compress parameter redundancy.
Applying any single pillar in isolation quickly hits diminishing returns. Pushing sparsity too high (e.g., 87.5% to achieve 8× compression) destroys accuracy because the remaining weights cannot compensate. Pushing quantization too aggressively (e.g., 2-bit) collapses the representation capacity. But combining moderate sparsity with moderate quantization can achieve the same compression ratio with far less accuracy loss, because the error sources are largely orthogonal.
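The budget arithmetic behind this argument is easy to check. The sketch below computes effective bits per original parameter, deliberately ignoring sparse-index and quantization-scale metadata (which add real overhead in practice):

```python
def effective_bits_per_param(sparsity: float, bits: float) -> float:
    """Average storage per original parameter, ignoring sparse-index
    and quantization-scale metadata (a deliberate simplification)."""
    return (1.0 - sparsity) * bits

def compression_ratio(sparsity: float, bits: float, dense_bits: float = 16) -> float:
    """Compression ratio relative to a dense FP16 baseline."""
    return dense_bits / effective_bits_per_param(sparsity, bits)

# Single pillar: 87.5% sparsity alone reaches 8x, but accuracy collapses.
print(compression_ratio(sparsity=0.875, bits=16))  # 8.0
# Multi-pillar: 50% (2:4) sparsity + 4-bit weights reach the same 8x.
print(compression_ratio(sparsity=0.5, bits=4))     # 8.0
```

Both configurations land on the same 8× ratio; the framework's claim is that the second one pays far less accuracy for it.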
The Failure of Isolation
The following table demonstrates this failure concretely. At an equivalent ~8× compression ratio on LLaMA-2-7B, single-pillar approaches catastrophically fail, while a multi-pillar approach retains usable accuracy:
This gap is not a fluke; it reflects a fundamental limitation. Each pillar, when pushed to extremes, saturates its own dimension of compression while leaving other dimensions untouched. The trinity framework argues for balanced compression across all three dimensions simultaneously.
Why Compression Is a Hardware Problem
To understand why these three pillars are complementary, it helps to understand what actually limits performance on modern accelerators. The dominant abstraction is the roofline model, which characterizes any workload by two potential bottlenecks:
- Compute-bound: The workload is limited by the number of floating-point operations per second (FLOP/s) the hardware can execute. Large batch matrix multiplications typically fall here.
- Memory-bound: The workload is limited by how fast data can be moved from memory to the compute units (GB/s). Autoregressive token generation with small batch sizes is almost always memory-bound.
The arithmetic intensity of an operation, defined as the ratio of FLOPs performed to bytes transferred, determines which regime it falls into:

\[ I = \frac{\text{FLOPs}}{\text{bytes transferred}} \]

If the arithmetic intensity exceeds the hardware's compute-to-bandwidth ratio, the operation is compute-bound; otherwise, it is memory-bound. For a matrix multiplication \(Y = XW\) with \(W \in \mathbb{R}^{m \times n}\) and batch size \(b\), the intensity is roughly \(\frac{2bmn}{(mn + b(m+n)) \cdot \text{bytes per element}}\). With batch size 1, the weight traffic \(mn\) dominates the denominator, so in FP16 the intensity collapses to roughly 1 FLOP per byte, orders of magnitude below the compute-to-bandwidth ratio of modern accelerators. This puts single-token generation firmly in the memory-bound regime.
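A rough roofline calculator makes this concrete. It counts weight as well as activation traffic; the peak numbers below are illustrative A100-class figures (~312 TFLOP/s FP16 against ~2 TB/s of bandwidth), not a property of any specific deployment:

```python
def arithmetic_intensity(m: int, n: int, batch: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for Y = XW with X of shape (batch, m) and W of shape (m, n).
    Counts weight, input, and output traffic; FP16 means 2 bytes per element."""
    flops = 2 * batch * m * n  # a multiply-accumulate counts as 2 FLOPs
    bytes_moved = bytes_per_elem * (m * n + batch * (m + n))
    return flops / bytes_moved

def regime(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Roofline rule: compare intensity to the hardware's FLOP-per-byte ratio."""
    return "compute-bound" if intensity > peak_flops / peak_bandwidth else "memory-bound"

# A 4096x4096 linear layer on hardware with a ~156 FLOP/byte balance point:
print(regime(arithmetic_intensity(4096, 4096, batch=1), 312e12, 2e12))    # memory-bound
print(regime(arithmetic_intensity(4096, 4096, batch=512), 312e12, 2e12))  # compute-bound
```

At batch size 1 the weight traffic dominates, and the intensity works out to only about 1 FLOP per byte.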
Where Time Actually Goes
In both training and inference, linear layers dominate execution time. Profiling studies show that linear layers account for approximately 51.8% of training time and 34.3% of inference time for LLMs. Attention mechanisms, layer norms, and communication overhead account for the rest. This makes linear layers the primary target for compression.
Mapping Pillars to Bottlenecks
With this hardware context, the complementarity of the three pillars becomes clear:
- Sparsity skips zero-valued multiply-accumulate operations, directly reducing the FLOPs required. This helps in compute-bound regimes (large-batch training, prefill). Semi-structured patterns like 2:4 sparsity are natively accelerated on NVIDIA Ampere and Hopper Sparse Tensor Cores, yielding up to 2× speedup on supported operations.
- Quantization reduces the number of bytes that must be moved from memory per parameter. Going from FP16 to INT4 cuts memory traffic by 4×, directly alleviating the memory-bandwidth bottleneck that dominates autoregressive inference.
- Low-rank approximations reduce the effective number of parameters by decomposing \(W \in \mathbb{R}^{m \times n}\) into \(LR\) where \(L \in \mathbb{R}^{m \times r}\) and \(R \in \mathbb{R}^{r \times n}\) with \(r \ll \min(m,n)\). This shrinks memory footprint and can reduce FLOPs, especially when the original matrix has high intrinsic redundancy.
Because these three techniques target largely orthogonal axes of inefficiency, their benefits compound when applied together rather than conflicting with each other.
Pillar 1: Sparsity
Sparsity compresses a model by identifying and removing (zeroing out) weights that contribute least to the model's output. A weight matrix with sparsity ratio \(s\) has a fraction \(s\) of its entries set to zero. The challenge is deciding which weights to remove and how to compensate for their loss.
Types of Sparsity
Sparsity patterns differ in their regularity, which directly affects hardware acceleration:
- Unstructured sparsity allows any individual weight to be zeroed. This offers maximum flexibility for preserving accuracy but creates irregular memory access patterns that are notoriously hard to accelerate on parallel hardware like GPUs. Recent kernel-level advances such as SpInfer and FlashLLM have begun enabling efficient unstructured sparse execution on NVIDIA and AMD GPUs, and specialized accelerators like Cerebras' wafer-scale engines deliver near-ideal speedups for unstructured patterns.
- Structured sparsity removes entire rows, columns, or blocks. This is hardware-friendly but often too coarse, removing entire feature dimensions and causing significant accuracy loss.
- Semi-structured (N:M) sparsity bridges the gap: exactly \(N\) out of every \(M\) consecutive weights are non-zero. The 2:4 pattern (50% sparsity) is natively supported by NVIDIA Ampere and Hopper Sparse Tensor Cores, offering up to 2× speedup on matrix multiplications with a fine-grained pattern that preserves accuracy far better than coarse structured pruning.
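A minimal magnitude-based sketch of the 2:4 constraint on a flattened weight row (real methods such as SparseGPT use better saliency scores and weight compensation, but must respect the same layout):

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude entries in every consecutive group of four,
    yielding a valid 2:4 semi-structured pattern (50% sparsity)."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # The two largest-|w| positions in the group survive.
        keep = set(sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2])
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

row = [0.1, -2.0, 0.3, 1.5,  0.05, 0.9, -0.7, 0.2]
print(prune_2_of_4(row))  # [0.0, -2.0, 0.0, 1.5, 0.0, 0.9, -0.7, 0.0]
```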
The Layer-wise vs. Global Dilemma
Most modern one-shot pruning methods operate layer-wise: they prune and reconstruct each linear layer independently, minimizing the local output error \(\|WX - \hat{W}X\|^2\). This is computationally tractable but fundamentally limited because minimizing local errors does not guarantee minimizing the global, end-to-end model error. Errors can compound across layers, and a locally optimal pruning decision may be globally suboptimal.
True end-to-end approaches optimize the pruning mask by backpropagating through the entire model. While more expensive, they can find significantly better masks, particularly for unstructured sparsity at high compression ratios.
Surveying the Landscape
OBS-Family: One-shot Layer-wise Methods
The Optimal Brain Surgeon (OBS) framework provides the mathematical foundation for most modern pruning methods. It uses second-order information (the Hessian of the loss with respect to weights) to determine which weights to remove and how to optimally update the remaining weights to compensate.
- SparseGPT adapts OBS for LLMs by approximating the Hessian using calibration data and processing weights in blocks. It supports both unstructured and semi-structured patterns and was the first method to prune 175B-parameter models in one shot.
- Wanda simplifies this further by using a pruning metric based on the product of weight magnitude and input activation norm (\(|w_{ij}| \cdot \|x_j\|\)), requiring no weight updates. It is extremely fast but can be less accurate than methods that perform weight reconstruction.
- OPTIMA reformulates the post-pruning weight reconstruction as a set of parallel row-wise quadratic programs (QPs) with a shared Hessian. By solving these QPs exactly rather than approximately, OPTIMA achieves up to 2.53% higher accuracy than prior layer-wise methods, creating sparse models that are more robust to subsequent quantization.
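The Wanda criterion above is simple enough to sketch directly. Here `act_norms[j]` is assumed to hold the L2 norm of input feature \(j\) over a calibration set, and scores are compared within each output row, as in the paper:

```python
def wanda_scores(W, act_norms):
    """Saliency of each weight: |w_ij| * ||x_j||."""
    return [[abs(w) * act_norms[j] for j, w in enumerate(row)] for row in W]

def prune_by_score(W, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for row, srow in zip(W, scores):
        k = int(len(row) * sparsity)      # number of weights to drop in this row
        cutoff = sorted(srow)[k]          # the k lowest scores fall below this
        pruned.append([w if s >= cutoff else 0.0 for w, s in zip(row, srow)])
    return pruned

W = [[0.5, -1.0, 0.2, 0.8]]
act_norms = [4.0, 0.1, 3.0, 1.0]          # feature 1 is rarely active
scores = wanda_scores(W, act_norms)
print(prune_by_score(W, scores, sparsity=0.5))  # [[0.5, 0.0, 0.0, 0.8]]
```

Note that the largest weight (-1.0) is pruned anyway because its input feature is nearly inactive; this is exactly what distinguishes Wanda from plain magnitude pruning.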
Learnable Mask Methods
Instead of using a fixed heuristic to select the pruning mask, these methods learn the mask end-to-end by backpropagating through the full model:
- MaskLLM learns 2:4 sparse masks using Gumbel-Softmax sampling, achieving higher accuracy than heuristic-based 2:4 pruning, though it is constrained to the semi-structured pattern.
- PATCH introduces a learnable hybrid sparsity pattern that mixes dense and 2:4 sparse tiles within each weight matrix. By learning which tiles should be dense vs. sparse, it balances accuracy and hardware speedup, achieving 1.18×–1.38× inference speedup on LLaMA-2-7B with up to 2.96% higher accuracy than MaskLLM.
- LEAP extends learnable masks to fully unstructured sparsity. It replaces fixed pruning rules with a single learnable per-weight mask parameter optimized directly against the end-to-end model loss. This decouples pruning from second-order approximations and hardware constraints, achieving up to 5.40% higher accuracy over state-of-the-art unstructured pruning methods.
Pillar 2: Quantization
Quantization reduces the numerical precision of model parameters. Weights typically stored in 16-bit floating-point (FP16/BF16) can be compressed to 8-bit integers (INT8), 4-bit integers (INT4), or even lower. This yields direct memory savings (e.g., 4-bit is 4× smaller than FP16) and enables the use of faster integer compute units on modern GPUs.
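A minimal sketch of symmetric per-group quantization (production kernels pack integers and handle zero-points and group metadata; this only illustrates the round-trip):

```python
def quantize_group(values, bits=4):
    """Symmetric quantization: one FP scale per group, signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                         # 7 for INT4
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero groups
    return [round(v / scale) for v in values], scale

def dequantize_group(q, scale):
    return [qi * scale for qi in q]

vals = [0.10, -0.35, 0.70, 0.02]
q, scale = quantize_group(vals)
print(q)  # [1, -4, 7, 0]
recon = dequantize_group(q, scale)
# The round-trip error is bounded by roughly half a quantization step.
print(max(abs(a - b) for a, b in zip(vals, recon)))
```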
Weight-only vs. Weight-and-Activation
Weight-only quantization compresses only the model's weights, dequantizing them on-the-fly during inference. This is the most common approach for LLM deployment because weights dominate memory while activations are transient. Methods like GPTQ and AWQ fall in this category.
Weight-and-activation quantization (e.g., SmoothQuant) also quantizes intermediate activations, enabling fully integer matrix multiplications. This can yield larger speedups but is more challenging because activations often contain outlier values that are hard to represent in low precision.
The Outlier Problem
LLM activations and weights often contain outlier channels, a small number of dimensions with values orders of magnitude larger than the rest. Naive quantization allocates the same bit-width uniformly, wasting precision on the majority of small values while failing to represent the outliers. This is the central challenge that separates effective quantization methods from naive ones.
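A small numeric illustration (with made-up values) of how a single outlier poisons a shared scale, and how a SqueezeLLM-style dense-plus-sparse split avoids it by keeping the outlier in full precision:

```python
def quant_dequant(values, bits=4):
    """Symmetric quantization round-trip: returns the reconstructed values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

small = [0.01, -0.02, 0.015, 0.03]
channel = small + [8.0]               # one outlier, orders of magnitude larger

# Naive: the outlier stretches the scale, so every small value collapses to 0.
naive = quant_dequant(channel)
print(naive[:4])                      # [0.0, 0.0, 0.0, 0.0]

# Outlier-aware: quantize the well-behaved values, store the outlier in FP.
aware = quant_dequant(small) + [8.0]

err_naive = sum(abs(a - b) for a, b in zip(channel, naive))
err_aware = sum(abs(a - b) for a, b in zip(channel, aware))
print(err_naive > 10 * err_aware)     # True
```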
Formats and Hardware Support
Beyond standard integer formats (INT8, INT4), several specialized formats have emerged:
- FP4/FP8: Low-precision floating-point formats that retain a small exponent, better capturing the dynamic range of weight distributions. NVIDIA Hopper GPUs natively support FP8.
- NF4 (NormalFloat4): A 4-bit format where quantization levels are optimally spaced for normally distributed data, used in QLoRA.
- MXFP (Microscaling): A block-scaled format where groups of values share a common exponent, providing a balance between dynamic range and precision.
Surveying the Landscape
- GPTQ adapts the Optimal Brain Quantization (OBQ) framework for LLMs, quantizing weights layer-by-layer using second-order information to minimize quantization error. It was the first method to quantize 175B models to 4-bit with acceptable accuracy, and remains a widely used baseline.
- AWQ takes a different approach: instead of modifying the quantization process, it scales weight channels by their activation saliency before quantization. This simple preprocessing step significantly reduces quantization error for important channels.
- QuIP# uses random orthogonal transformations (Hadamard matrices) to make weight matrices more "incoherent" (uniformly distributed) before quantization, combined with lattice codebooks for near-optimal vector quantization. This achieves strong results at very low bit-widths (2-bit).
- SqueezeLLM decomposes weight matrices into a uniformly quantized dense component and a sparse component that captures outliers, combining the memory efficiency of quantization with the flexibility of sparse representations.
- AQLM uses additive multi-codebook quantization, representing each weight group as a sum of entries from multiple learned codebooks. Combined with fine-tuning, it achieves strong results at extreme (2-bit) compression.
Pillar 3: Low-Rank Approximations
Low-rank approximations exploit the observation that weight matrices in LLMs are often over-parameterized: their effective rank is much lower than their full dimensions. A matrix \(W \in \mathbb{R}^{m \times n}\) can be approximated as \(W \approx LR\) where \(L \in \mathbb{R}^{m \times r}\) and \(R \in \mathbb{R}^{r \times n}\) with rank \(r \ll \min(m, n)\). This reduces the parameter count from \(mn\) to \(r(m + n)\) and replaces one large matrix multiplication with two smaller ones.
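The parameter arithmetic is easy to verify; the factorization only pays off when \(r < \frac{mn}{m+n}\). The layer size below is a typical LLM projection dimension, used purely for illustration:

```python
def low_rank_savings(m: int, n: int, r: int):
    """Parameter counts for W (m x n) versus factors L (m x r) and R (r x n)."""
    dense, factored = m * n, r * (m + n)
    return dense, factored, dense / factored

dense, factored, ratio = low_rank_savings(4096, 4096, 256)
print(dense, factored, ratio)  # 16777216 2097152 8.0

# Break-even rank: above this, the "low-rank" form is actually larger.
print(4096 * 4096 // (4096 + 4096))  # 2048
```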
The SVD Foundation
The Singular Value Decomposition (SVD) provides the theoretically optimal low-rank approximation (in the Frobenius norm sense). Given \(W = U\Sigma V^T\), the best rank-\(r\) approximation is obtained by keeping only the \(r\) largest singular values. However, the optimal rank-\(r\) approximation of \(W\) in isolation may not be optimal when considering the data distribution that flows through the layer, since not all directions in weight space are equally important for the model's actual inputs.
LoRA and Efficient Adaptation
LoRA (Low-Rank Adaptation) popularized the idea that fine-tuning adjustments to LLMs are inherently low-rank. Rather than updating all parameters, LoRA freezes the pretrained weights and adds a trainable low-rank decomposition \(\Delta W = BA\) where \(B \in \mathbb{R}^{m \times r}\) and \(A \in \mathbb{R}^{r \times n}\). This dramatically reduces the number of trainable parameters during fine-tuning while achieving competitive performance with full fine-tuning.
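A pure-Python sketch of the LoRA forward pass \(y = x\,(W + \frac{\alpha}{r} BA)\), with toy shapes and values; the \(\alpha/r\) scaling follows the original paper:

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, B, A, alpha=1.0, r=1):
    """y = xW + (alpha / r) * x B A. W stays frozen; only B and A are trained."""
    base = matmul(x, W)
    update = matmul(matmul(x, B), A)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(rb, ru)] for rb, ru in zip(base, update)]

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (identity, for clarity)
B = [[1.0], [0.0]]             # m x r with r = 1
A = [[0.5, 0.0]]               # r x n
print(lora_forward(x, W, B, A))  # [[1.5, 2.0]]
```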
Variants like GaLore extend this principle to pretraining by projecting gradients into a low-rank subspace, reducing the memory footprint of optimizer states.
Low-Rank in Compression
Beyond adaptation, low-rank decompositions serve a critical role in the compression trinity as error compensators. When sparsity and quantization introduce errors, a small low-rank adapter \(\Delta W\) can be computed to minimize the combined error:

\[ \min_{\operatorname{rank}(\Delta W) \le r} \left\| W X - (\hat{W}_{sq} + \Delta W) X \right\|_F^2 \]

where \(\hat{W}_{sq}\) denotes the sparse-and-quantized weight.
This is the key insight behind methods like SLiM, which uses the low-rank pillar not as a standalone compressor but as the "glue" that compensates for the errors introduced by the other two pillars.
When Low-Rank Works (and When It Doesn't)
Low-rank approximations are most effective when the weight matrix has a rapidly decaying singular value spectrum, meaning most of the "information" is concentrated in a small number of directions. This is common in deeper layers and in over-parameterized models. However, they are less effective for inherently high-rank matrices (e.g., embedding layers or layers that need to preserve many independent features). Applying low-rank too aggressively to such layers destroys accuracy. Effective methods either selectively apply low-rank (only to layers where it helps) or use it in a complementary role alongside sparsity and quantization.
Combining the Pillars: Joint Compression
The real challenge in combining compression techniques is not conceptual but mathematical: errors compound. Applying sparsity first, then quantization, or vice versa, can produce worse results than either technique alone because each perturbation interacts with the other.
The Compounded Error Problem
Consider applying sparsity to obtain \(\hat{W}_s\) and then quantization to obtain \(\hat{W}_{sq}\). The total error decomposes as:

\[ W - \hat{W}_{sq} = (W - \hat{W}_s) + (\hat{W}_s - \hat{W}_{sq}) \]
If the sparsity error \((W - \hat{W}_s)\) and quantization error \((\hat{W}_s - \hat{W}_{sq})\) are correlated or aligned, the total error can be much larger than the sum of their individual magnitudes. This is exactly what happens in practice when methods are applied sequentially without coordination.
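A toy vector example of why alignment matters: two unit-norm errors combine to norm 2 when aligned but only \(\sqrt{2}\) when orthogonal:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

e_sparse = [1.0, 0.0, 0.0]         # sparsity error direction
e_quant_aligned = [1.0, 0.0, 0.0]  # quantization error aligned with it
e_quant_ortho = [0.0, 1.0, 0.0]    # quantization error orthogonal to it

print(norm(add(e_sparse, e_quant_aligned)))  # 2.0    -- errors add linearly
print(norm(add(e_sparse, e_quant_ortho)))    # ~1.414 -- errors add in quadrature
```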
Holistic Joint Compression: SLiM
SLiM provides a holistic solution to the joint compression problem by integrating all three pillars in a single optimization. Given a weight matrix \(W\), it simultaneously determines:
- A 2:4 sparse mask and the quantized representation of the surviving weights
- An optimal low-rank adapter \(\Delta W\) computed to minimize the joint error from sparsity and quantization together
The key insight is that the low-rank adapter is not computed against the original weight \(W\), but against the combined sparse-quantized residual. This allows it to compensate specifically for the compounded error. On LLaMA-2-7B with 2:4 sparsity + 4-bit quantization, SLiM achieves an accuracy improvement of up to 5.8% over sequential application of sparsity and quantization.
Efficient Post-Compression Recovery: BEAM
Even with careful joint compression, some accuracy loss is inevitable. Traditional fine-tuning to recover this accuracy requires the full training pipeline: large datasets, multiple GPUs, and significant compute. BEAM proposes a lightweight alternative. Instead of fine-tuning end-to-end, BEAM splits the model into transformer blocks and independently minimizes each block's output error compared to the uncompressed model. This block-wise approach:
- Runs on a single GPU even for large models (since only one block is loaded at a time)
- Requires only ~128 calibration samples and ~32 epochs per block
- Improves accuracy by up to 4.34% over compressed baselines
- Is agnostic to the compression method used (works with Wanda, SparseGPT, GPTQ, SLiM, etc.)
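A scalar caricature of the block-wise idea (BEAM optimizes full transformer blocks; here each "block" is a single scalar weight fit to its uncompressed teacher by gradient descent on a few calibration inputs):

```python
def blockwise_recovery(w_teacher, w_student, samples, lr=0.1, epochs=32):
    """Minimize one block's output error against the uncompressed block,
    using only a small calibration set. Steps along the gradient of
    (w_student*x - w_teacher*x)^2 with respect to w_student."""
    for _ in range(epochs):
        for x in samples:
            err = w_student * x - w_teacher * x
            w_student -= lr * 2 * err * x
    return w_student

calib = [0.5, -1.0, 0.8]   # stands in for ~128 calibration samples
recovered = blockwise_recovery(w_teacher=1.0, w_student=0.6, samples=calib)
print(abs(recovered - 1.0) < 1e-3)  # True: the block matches its teacher again
```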
The Trinity Across the LLM Life-cycle
The principles of the compression trinity are not applied uniformly; they must be adapted to the unique constraints and objectives of each stage in an LLM's life-cycle.
1. Accelerating Pretraining
Pretraining is the most computationally expensive phase, often consuming thousands of GPU-hours. Compression during pretraining must be careful not to limit the model's ability to learn.
- SLoPe jointly applies sparsity and low-rank from the start of pretraining. It introduces a "double-pruned" backward pass that prunes both activations and gradients for 2:4 sparsity acceleration, then inserts "lazy" low-rank adapters in the final training iterations to recover accuracy lost to sparsity. This yields 1.54× faster inference and 1.25× faster training while maintaining accuracy; the method was featured on the official PyTorch blog.
- MKOR demonstrates the generality of the trinity principles by applying them to the optimization process itself. It is a second-order optimizer that uses sparsity to approximate second-order information efficiently, low-rank (rank-1) updates to maintain and invert the Kronecker-factored preconditioner, and is implemented in stable 16-bit precision (leveraging quantization principles). This achieves up to 2.57× training speedup on distributed systems.
2. Post-Training Compression for Inference
Given a pretrained model, post-training compression aims to compress it in one shot with minimal accuracy loss and no expensive retraining. This is the most active area of research and where the failure of isolation is most visible.
- SLiM provides the complete one-shot solution integrating all three pillars, as discussed in the joint compression section above.
- OPTIMA strengthens the sparsity pillar, producing highly accurate sparse models that are more robust to subsequent quantization.
- PATCH strengthens the sparsity pillar with a hardware-aware hybrid pattern.
- BEAM provides efficient post-compression recovery without full fine-tuning.
3. Elastic Inference and Adaptive Serving
A newer direction in compression is elastic inference: rather than compressing to a single fixed configuration, the goal is to produce models that can dynamically adjust their size and cost at serving time. This enables a single checkpoint to serve different latency and throughput targets.
- SLICE automates sub-model selection for Matryoshka-style architectures. It treats the selection of per-layer width/precision configurations as a differentiable optimization problem, using Gumbel-Softmax relaxation to search over the exponentially large space of valid configurations. Given a target budget (memory, latency, or parameter count), SLICE finds configurations that consistently outperform uniform or hand-tuned baselines.
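A minimal sketch of the Gumbel-Softmax relaxation that this style of search builds on (the per-layer logits below are hypothetical placeholders, not SLICE's actual parameterization):

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """Soft, differentiable stand-in for sampling one option from `logits`:
    add Gumbel noise, then apply a temperature-tau softmax. As tau -> 0,
    the output approaches a one-hot selection."""
    noisy = [l - math.log(-math.log(random.random())) for l in logits]
    z = [v / tau for v in noisy]
    m = max(z)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
# Hypothetical logits over three width/precision configurations for one layer.
probs = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
print(abs(sum(probs) - 1.0) < 1e-9)  # True: a valid soft selection
```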
4. Emerging: End-to-End Unstructured Pruning
The newest frontier in sparsity is moving beyond layer-wise heuristics to true end-to-end optimization of unstructured sparsity masks. This is motivated by the growing availability of hardware and kernels that can efficiently execute unstructured sparse patterns.
- LEAP replaces all fixed pruning rules with a single learnable per-weight mask, optimized end-to-end. This achieves up to 5.40% higher accuracy over layer-wise methods, demonstrating that the gap between local and global optimization is substantial and exploitable.
Practical Guide & Resources
When to Use What
Choosing the right compression strategy depends on your deployment constraints:
- Memory-constrained serving (e.g., single GPU inference): Start with 4-bit weight-only quantization (GPTQ or AWQ). This is the lowest-effort, highest-impact optimization for most deployment scenarios. If accuracy is still insufficient, add 2:4 sparsity via SLiM for a joint approach.
- Latency-sensitive serving (e.g., real-time chat): Combine quantization with 2:4 sparsity to reduce both memory bandwidth and compute. SLoPe-pretrained models or SLiM-compressed models are designed for this regime.
- Extreme compression (8×+): Do not try to achieve this with a single pillar. Use a multi-pillar approach like SLiM (sparsity + quantization + low-rank) to distribute the compression budget across dimensions.
- Maximum accuracy at moderate compression: Use OPTIMA or LEAP for high-quality sparsity, then layer on quantization. Follow up with BEAM for lightweight accuracy recovery.
- Training acceleration: Use SLoPe for sparse pretraining with 2:4 acceleration, or MKOR for more efficient optimization.
- Flexible deployment across hardware tiers: Use elastic approaches like SLICE to produce models that adapt their footprint to available resources.
Code Repositories
- SLiM — Joint sparsity + quantization + low-rank compression
- SLoPe — Sparse plus low-rank pretraining
- MKOR — Kronecker-factored second-order optimizer
- OPTIMA — QP-based optimal pruning
- PATCH — Learnable tile-level hybrid sparsity
- LEAP — End-to-end learnable unstructured pruning
- BEAM — Block-wise error minimization post-compression
Recommended Reading
Limitations and Open Problems
Despite significant progress, several fundamental challenges remain:
- Activation compression: The trinity focuses on weight compression, but activations (especially the KV cache during long-context inference) can dominate memory. Extending the framework to jointly compress weights and activations remains an open problem.
- Hardware-software co-design: Most compression methods are designed for existing hardware. Truly optimal compression would co-design the sparsity pattern, quantization format, and low-rank structure with the hardware's compute and memory hierarchy. Current hardware supports only a limited set of patterns (e.g., 2:4 sparsity, specific INT/FP formats).
- Scaling laws for compressed models: We lack reliable scaling laws that predict how compressed models will perform. Do the same Chinchilla-optimal training recipes apply when training with sparse or quantized weights from scratch? How does the optimal compression ratio scale with model size?
- Task-specific vs. general compression: Most methods optimize for perplexity or average zero-shot accuracy. But models compressed for general benchmarks may lose disproportionate performance on specific downstream tasks (e.g., code generation, mathematical reasoning). Understanding and mitigating this task-specific degradation is an open area.
- Dynamic and conditional compression: Rather than statically compressing a model, future approaches may dynamically adjust the compression level per-token or per-layer based on the difficulty of the input. Elastic methods like SLICE are a first step in this direction.
- Theoretical foundations: While empirical results are strong, theoretical understanding of why joint compression works so well, and precise characterization of when it fails, is still limited.
Acknowledgements
This hub draws on research conducted under the supervision and mentorship of Dr. Amir Yazdanbakhsh and Prof. Maryam Mehri Dehnavi. Their guidance and insights have been instrumental in shaping and advancing this work in LLM compression.