Introduction

Deployment of modern Large Language Models (LLMs) is bottlenecked by their massive memory and compute requirements, making their widespread, efficient deployment financially prohibitive. To address this, model compression techniques, most notably weight sparsity, have become essential.

Sparsity manifests in three primary forms: structured, semi-structured, and unstructured. Structured and semi-structured variants are natively supported by NVIDIA and AMD GPUs but often prove inaccurate, sacrificing critical model quality for modest compression gains. Unstructured sparsity, by contrast, retains far superior accuracy, making it the gold standard for high-fidelity pruning, and recent breakthroughs like SpInfer and FlashLLM have enabled its efficient execution on NVIDIA and AMD hardware. Moreover, Cerebras' specialized wafer-scale engines deliver near-ideal speedups tailored to unstructured patterns.

Yet, despite this potential, unstructured sparsity's successful application remains contingent on developing highly effective pruning schemes. Critically, current state-of-the-art unstructured pruning methods consistently sacrifice too much accuracy to be considered a viable path for model deployment.

The prevailing paradigm for unstructured pruning, typified by methods following the Optimal Brain Surgeon (OBS) principle (e.g., Wanda, SparseGPT, Thanos, ADMM), relies on a fundamentally flawed layer-wise pruning strategy. These techniques minimize local, layer-wise error in the hope of achieving a low global error. However, this local optimization often fails to minimize the global end-to-end model error, leading to significant, unacceptable accuracy drops, especially for smaller LLMs where the loss budget is tight. This trade-off between compression and performance has left the field without a high-accuracy, deployable solution.

While true end-to-end pruning approaches are the theoretical ideal, they are either impractical for LLMs due to reliance on expensive second-order information (Optimal Brain Damage, or OBD) or are constrained to highly structured sparsity patterns like 2:4 (MaskLLM, PATCH). The latter methods, while practical, cannot be easily adapted to the more accurate, fully unstructured setting, further limiting their applicability for achieving maximal compression gains.

To overcome these significant limitations, we propose LEAP, a novel, practical end-to-end unstructured pruning framework. LEAP replaces complex, heuristic, or fixed pruning rules with a single, learnable, per-weight mask parameter that is optimized directly to minimize the end-to-end model loss. This novel formulation successfully decouples pruning from costly second-order methods and hardware-specific constraints, enabling high-quality unstructured pruning across all LLMs. Through this process, LEAP achieves up to 5.40% higher average accuracy over state-of-the-art unstructured pruning methods across common LLM benchmarks.

The code for LEAP is available at our GitHub repository.

Related Work

Model compression techniques are essential for the efficient deployment of modern Large Language Models (LLMs), with core approaches including quantization, low-rank approximation, and pruning. Within pruning, the choice of sparsity pattern and the underlying algorithm dictate the final memory savings and inference speedups.

Sparsity Patterns and Hardware Acceleration

While unstructured sparsity was historically limited by irregular memory access, recent hardware and software innovations have unlocked its potential. Advances like FlashLLM and SpInfer introduce specialized tensor-core-aware sparse matrix multiplication kernels that achieve significant speedups on conventional GPUs. Furthermore, specialized hardware such as Cerebras' wafer-scale engines is purpose-built for unstructured sparsity, delivering near-linear speedups. These hardware breakthroughs shift the bottleneck entirely from inference acceleration to the development of pruning algorithms that can induce high-quality unstructured masks without excessive accuracy loss.

Pruning Algorithms and The Layer-Wise Paradigm

Pruning algorithms have evolved from early second-order methods to contemporary scalable heuristics. Pioneering methods like Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS) laid the theoretical foundation by using Hessian approximations to minimize global error increases during weight removal. However, these elegant approaches are computationally prohibitive for LLMs, scaling quadratically with parameters and requiring full-dataset second-order information.

The prevailing paradigm for scalable unstructured pruning, therefore, draws from OBS principles but applies them greedily on a layer-wise basis to circumvent global costs. Wanda selects the sparsity pattern for each layer based on the product of weight magnitudes and input activation norms over a small calibration dataset. SparseGPT, on the other hand, uses the layer-wise Hessian computed on the calibration dataset to jointly find the optimal mask and update the weights in each layer. More recent work such as Thanos improves on SparseGPT with multi-column pruning to reduce approximation errors, and ADMM refines the layer-wise approach by alternating between weight and mask optimization. Despite these gains in scalability, this layer-wise greediness often propagates sub-optimal errors across the network, consistently causing non-trivial accuracy drops on downstream benchmarks, a deficit that is particularly acute for smaller models with tight loss budgets.

True end-to-end pruning, which optimizes the mask holistically against global loss, remains elusive for the unstructured setting. As discussed, OBD/OBS variants are too expensive for LLMs. Practical end-to-end training strategies like MaskLLM and PATCH have successfully applied learnable mask optimization, but they are rigidly confined to semi-structured patterns (e.g., 2:4) to maintain hardware compatibility, capping their ultimate compression ratios and fidelity below unstructured ideals.

In summary, the field is at a critical juncture: while hardware maturity now fully supports unstructured sparsity, algorithmic gaps persist. Layer-wise methods sacrifice too much accuracy, and global alternatives are either cost-prohibitive or rigidly structured. This landscape underscores the urgent need for a lightweight, fully learnable end-to-end framework that can unlock unstructured pruning's full potential, precisely the deficiency that LEAP addresses.

LEAP: Learnable End-to-End Adaptive Pruning of LLMs

LEAP replaces layer-wise, locally optimized pruning with a fully learnable, end-to-end scheme that optimizes a per-weight mask directly against the language modeling loss. Instead of treating pruning as a one-shot post-processing step, LEAP makes pruning decisions first-class learnable variables that are aligned with \(\mathcal{L}_{\text{LM}}\), rather than fixed heuristics applied after the fact. This shift raises a central question: how to formulate unstructured pruning as a stable, differentiable, and scalable learning problem for LLMs.

Per-Weight Mask Parameterization

Unlike MaskLLM and PATCH, which assign a small set of learnable logits to structured groups (e.g., N:M patterns or tiles), LEAP assigns a dedicated scalar parameter to every weight and lets the model decide, end-to-end, which individual weights survive. Concretely, for a weight matrix \(W \in \mathbb{R}^{m \times n}\), LEAP introduces a parallel parameter matrix \(P \in \mathbb{R}^{m \times n}\), and defines a stochastic mask via a Gumbel–sigmoid relaxation:

\[ M = \sigma\left(\frac{\alpha P + g}{\tau}\right), \quad g = -\log(-\log(u)), u \sim \text{Uniform}(0,1) \]

where \(\sigma(\cdot)\) is the sigmoid function, \(\alpha\) is a scale factor, and \(\tau\) is a temperature hyperparameter that controls how sharply the probabilities approach binary decisions. This mapping turns each real-valued parameter \(P_{i,j}\) into a probability in (0, 1) and injects controlled randomness via Gumbel noise, enabling exploration of alternative sparsity patterns early in training while maintaining differentiability through the reparameterization trick. The effective pruned weight matrix is then

\[ \tilde{W} = M \odot W, \]

and in practice LEAP uses soft masks throughout training, which keeps gradients well behaved for large LLMs and avoids the instability often observed with hard sampling and straight-through estimators.
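The parameterization above can be sketched in a few lines of PyTorch. This is a minimal illustration of the Gumbel-sigmoid relaxation, not the released implementation; the function name and toy shapes are our own.

```python
import torch

def gumbel_sigmoid_mask(P, alpha=1.0, tau=1.0, eps=1e-9):
    """Soft mask M = sigmoid((alpha * P + g) / tau) with Gumbel noise g."""
    u = torch.rand_like(P).clamp(eps, 1 - eps)   # u ~ Uniform(0, 1)
    g = -torch.log(-torch.log(u))                # standard Gumbel sample
    return torch.sigmoid((alpha * P + g) / tau)

# The soft mask multiplies the (frozen) weights elementwise.
W = torch.randn(4, 4)                            # pretrained weights, not updated
P = torch.zeros(4, 4, requires_grad=True)        # learnable mask logits
M = gumbel_sigmoid_mask(P, alpha=1.0, tau=1.0)
W_tilde = M * W                                  # effective pruned weights
```

Because the mask stays soft, gradients flow from the language modeling loss back into `P` through the reparameterized sample, with no straight-through estimator required.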

Mask Initialization

We initialize the mask logits using an established pruning method (e.g., Wanda) to ensure a reasonable initial loss for the model; we choose Wanda for its efficient and lightweight pruning procedure. Logits corresponding to retained (non-zero) elements are assigned a substantial positive value, while logits corresponding to pruned (zero) elements are set to the negative of that value. We refer to this hyperparameter as the initial mask strength.
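This initialization can be sketched as follows; the function name and the strength value 6.0 are illustrative assumptions, not values from the paper.

```python
import torch

def init_mask_logits(binary_mask, strength=6.0):
    """Map a {0,1} mask from a one-shot pruner (e.g., Wanda) to +/- strength logits.

    Kept weights (mask = 1) get +strength; pruned weights (mask = 0) get -strength.
    """
    return strength * (2.0 * binary_mask.float() - 1.0)
```

A large initial strength makes the sigmoid output start close to the one-shot pruner's mask, so training begins from a sensible loss rather than a random sparsity pattern.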

Stabilization via Scale and Temperature

End-to-end mask learning on LLMs can be volatile if the sampling distribution remains too stochastic for too long. Building on stabilization ideas used in differentiable pruning and learnable sparsity methods like MaskLLM and PATCH, LEAP employs two lightweight schedules to gradually anneal the sampling process from exploratory to decisive. First, a dynamic scaler \(\alpha\) is ramped up over training (for example, from 1 to a larger value such as 10 or higher), which amplifies the logits \(P\) and pushes mask probabilities closer to 0 or 1 as training progresses. Second, a temperature scheduler decays \(\tau\) from a moderate initial value (e.g., 1–2) down to a small value (e.g., \(10^{-2}\)), tightening the sigmoid and effectively making the mask distribution sharper. Together, these schedules implement a smooth transition: early iterations explore a diverse set of candidate sparsity patterns, while later iterations refine and lock in a high-fidelity mask that aligns with the global loss landscape, without resorting to non-differentiable hard thresholds during optimization.
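The two schedules can be sketched as simple step-dependent functions. The linear ramp and exponential decay below are plausible choices consistent with the ranges quoted above, not necessarily the exact schedules used.

```python
def alpha_schedule(step, total_steps, alpha_start=1.0, alpha_end=10.0):
    """Linearly ramp the logit scaler alpha from alpha_start to alpha_end."""
    t = step / max(total_steps - 1, 1)
    return alpha_start + t * (alpha_end - alpha_start)

def tau_schedule(step, total_steps, tau_start=2.0, tau_end=1e-2):
    """Exponentially decay the temperature tau from tau_start to tau_end."""
    t = step / max(total_steps - 1, 1)
    return tau_start * (tau_end / tau_start) ** t
```

Early in training (small `alpha`, large `tau`) the mask probabilities stay near 0.5 and the Gumbel noise dominates, encouraging exploration; late in training the ratio `alpha * P / tau` grows, pushing each probability toward 0 or 1.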

Global Sparsity Control

With a mask parameterization in place, LEAP must still enforce a target sparsity ratio at the model level. Similar to prior learnable sparsity approaches, LEAP controls sparsity by adding an explicit regularization term on the effective masks, but crucially applies it globally rather than layer-wise. Let \(\rho\) denote the desired sparsity ratio (e.g., \(\rho = 0.5\) for 2× compression), and let \(\tilde{M}_i\) and \(W_i\) denote the mask and weight tensors for layer \(i\). LEAP adds a sparsity regularizer

\[ \mathcal{L}_{\text{sparsity}} = \lambda_1 \left| \frac{\sum_i \lVert \tilde{M}_i \rVert_1}{\sum_i \lVert W_i \rVert_0} - \rho \right|, \]

where \(\lambda_1\) is a large positive coefficient that enforces the global sparsity budget tightly in practice. This global formulation fixes the overall sparsity while allowing individual layers to adapt their densities based on how the end-to-end optimization values their contribution, which empirically tends to outperform rigid, uniform per-layer sparsity constraints that ignore inter-layer importance.
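The global regularizer can be sketched in PyTorch as below; for a dense weight tensor, \(\lVert W_i \rVert_0\) is simply its element count. The function name and default coefficient are assumptions for illustration.

```python
import torch

def sparsity_loss(masks, weights, rho, lam1=100.0):
    """L_sparsity = lam1 * | sum_i ||M_i||_1 / sum_i ||W_i||_0 - rho |,
    computed globally over all layers rather than per layer."""
    kept = sum(m.abs().sum() for m in masks)       # sum_i ||M_i||_1
    total = sum(w.numel() for w in weights)        # sum_i ||W_i||_0 for dense W
    return lam1 * (kept / total - rho).abs()
```

Because only the global ratio is constrained, one layer may end up denser than another as long as the aggregate budget is met, which is exactly the inter-layer flexibility the text describes.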

Weight-Aware Regularization

Prior learnable sparsity methods have shown that explicitly biasing masks toward retaining larger-magnitude weights can improve stability and final accuracy. Following this insight, LEAP incorporates a negative weight regularization term applied to the pruned weights:

\[ \mathcal{L}_{\text{weight}} = -\lambda_2 \sum_i \lVert \tilde{W}_i \rVert_1, \]

where \(\lambda_2 > 0\) is typically large (for example, on the order of 10). This term encourages the optimization to preserve larger-magnitude, higher-saliency weights, which stabilizes the mask learning process and reduces the risk of converging to degenerate sparse solutions that keep many small, noisy weights while discarding a few critical ones.

Total Optimization Objective

Putting these components together, LEAP learns the mask parameters \(P\) (and optionally fine-tunes \(W\)) by minimizing a three-term objective over a small calibration set \(X\):

\[ \mathcal{L} = \mathcal{L}_{\text{LM}}(\tilde{W}; X) + \lambda_1 \left| \frac{\sum_i \lVert \tilde{M}_i \rVert_1}{\sum_i \lVert W_i \rVert_0} - \rho \right| - \lambda_2 \sum_i \lVert \tilde{W}_i \rVert_1. \]

Here, \(\mathcal{L}_{\text{LM}}\) is the standard language modeling loss evaluated on the sparsified network, \(\mathcal{L}_{\text{sparsity}}\) enforces the desired global sparsity, and \(\mathcal{L}_{\text{weight}}\) biases the solution toward retaining important weights. In practice, this objective can be optimized with standard optimizers such as SGD or AdamW on a general text corpus (such as C4 or SlimPajama) for a few thousand iterations, which is sufficient for LEAP to learn high-quality unstructured masks that deliver substantial parameter reduction with superior accuracy compared to other pruning methods.
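Putting the three terms together, the total objective might be assembled as in the sketch below; `leap_objective` and the default coefficients are illustrative assumptions based on the text, not the released code.

```python
import torch

def leap_objective(lm_loss, masks, weights_tilde, weights,
                   rho, lam1=100.0, lam2=10.0):
    """L = L_LM + lam1 * |density - rho| - lam2 * sum_i ||W_tilde_i||_1."""
    density = sum(m.abs().sum() for m in masks) / sum(w.numel() for w in weights)
    l_sparsity = lam1 * (density - rho).abs()
    l_weight = -lam2 * sum(wt.abs().sum() for wt in weights_tilde)
    return lm_loss + l_sparsity + l_weight
```

Each training step samples soft masks, runs a forward pass to get `lm_loss`, evaluates this objective, and backpropagates into the mask logits only.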

Note that LEAP does not update the weights of the model; it optimizes only the mask. Joint optimization of weights and masks is a natural extension, but we leave it unexplored to save memory and compute.

Experiments

Model, Dataset, and Evaluation. We evaluate LEAP across diverse transformer architectures, including Qwen-2.5, LLaMA-3.2 and Gemma-3 model families, spanning 500M to 3B parameters. Following the dataset size and configurations in MaskLLM and PATCH, masks are trained for 2000 steps with a batch size of 256 on sequences with a length of 4096 tokens from the SlimPajama dataset.

Following previous LLM compression work, we evaluate the models on eight zero-shot downstream tasks: PIQA, ARC-Easy, ARC-Challenge, Winogrande, OpenBookQA, RACE, HellaSwag, and MMLU using the Language Model Evaluation Harness framework. Additionally, we evaluate the models on a language modeling task using the WikiText2 dataset with a sequence length of 4096.

Baselines. To evaluate LEAP against established pruning techniques, we compare it with Wanda, SparseGPT, Thanos, and ADMM. For one-shot pruning methods, following the default configurations in each paper, we prune the models using 128 samples from the C4 dataset.

Model Quality Results

The following tables report the accuracy of the different pruning methods on the Gemma-3, LLaMA-3.2, and Qwen-2.5 model families. Our results indicate that LEAP consistently outperforms the other pruning methods, including ADMM, the state-of-the-art unstructured sparsity method. LEAP improves the average accuracy of the models across the downstream tasks by up to 5.40% (LLaMA-3.2 1B at 60% sparsity).

In general, LEAP's improvements over the baselines are more significant at higher sparsity ratios. This is because the errors introduced by layer-wise pruning methods grow with sparsity, leaving a wider gap for LEAP to close.

Sparsity Allocation

We analyze the allocation of sparsity across transformer blocks under a global target. The accompanying figure illustrates the pruning ratio for each transformer block within the models. Our findings suggest that, in contrast to methods employing 2:4 sparsity, such as PATCH, which exhibit significant inter-block variations in the sparsity ratio, unstructured sparsity results in a uniform distribution of the sparsity ratio across blocks.

Figure 1: Sparsity allocation across transformer blocks.

Hyperparameters

To save compute, we tuned our hyperparameters on our smallest model, Qwen-2.5 0.5B, and reused the best values for the larger models. The following table summarizes the best hyperparameters used in our experiments.

Conclusions

In this work, we introduced LEAP (Learnable End-to-End Adaptive Pruning), a novel framework that overcomes the limitations of current unstructured LLM pruning by shifting from sub-optimal layer-wise heuristics to a fully end-to-end learning objective. LEAP parameterizes unstructured sparsity as a set of per-weight mask logits, which are optimized directly to minimize the global language modeling loss using a stable, differentiable Gumbel-sigmoid relaxation. Crucially, LEAP incorporates global sparsity control and weight-aware regularization to stabilize the learning process, enabling it to discover high-fidelity unstructured masks within just 2,000 iterations.

Our comprehensive experiments across the Gemma, LLaMA, and Qwen model families consistently demonstrated that LEAP substantially outperforms state-of-the-art unstructured pruning baselines (Wanda, SparseGPT, Thanos, ADMM). For instance, LEAP achieved up to 5.40% higher average accuracy on LLaMA-3.2 at 60% sparsity, significantly closing the performance gap between compressed and dense LLMs, especially at aggressive compression ratios where layer-wise methods fail. By unlocking the high-accuracy potential of unstructured sparsity, LEAP represents a critical step towards the efficient and high-fidelity deployment of large language models on diverse hardware platforms.

Citation

If you use LEAP in your research, please cite our work:

@misc{mozaffari2025leap,
  author = {Mozaffari, Mohammad and Hourri, Younes},
  title = {LEAP: Learnable End-to-End Adaptive Pruning of LLMs},
  year = {2025},
  month = {December},
  day = {17},
  howpublished = {\url{https://www.cs.toronto.edu/~mmozaffari/compression-trinity/leap/index.html}},
  note = {Blog post}
}
