Introduction
Large language models (LLMs) achieve state-of-the-art performance on natural language tasks but require extensive memory and computational resources due to their billions of parameters. Model compression techniques such as pruning and quantization reduce these demands while aiming to preserve accuracy, though they often trade away some performance relative to the uncompressed model. They also face significant practical limitations, most notably high retraining costs, which we address in this work.
Pruning and quantization reduce the computational and memory demands of LLMs but necessitate costly retraining on large datasets to restore accuracy. This retraining incurs significant computational expense and a memory footprint up to 4× that of the original model, often requiring multiple GPUs even for moderate-sized LLMs. Such resource demands render these methods impractical for many applications, motivating the development of a more efficient compression strategy.
To reduce the retraining overhead of model compression, we propose BEAM, a block-wise error minimization mechanism that optimizes models with minimal compute and memory costs. BEAM splits the model into multiple blocks and trains each block's weights to minimize errors introduced by compression, enhancing efficiency without extensive retraining. Compatible with pruning and quantization methods like Wanda, SparseGPT, GPTQ, and SLiM, BEAM improves accuracy by up to 4.34% over compressed models, with its methodology detailed in the following sections.
The code for BEAM is available at our GitHub repository.
Preliminaries
The transformer architecture, foundational to most large language models (LLMs), consists of stacked blocks that are structurally identical but carry distinct weights, making them prime targets for compression techniques that reduce model size without compromising accuracy. In model compression, the goal is to shrink the model while preserving its behavior; the most common approach minimizes, for each linear layer, the error between its output before and after compression. However, this layer-wise error minimization often necessitates a resource-intensive fine-tuning step to maintain model accuracy, which we discuss next.
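For reference, this layer-wise objective (as used by one-shot methods such as GPTQ and SparseGPT) is typically written as

\[ \min_{\widehat{W}} \left\| W X - \widehat{W} X \right\|_F^2 , \]

where \(W\) is the original layer weight, \(\widehat{W}\) is its compressed counterpart, and \(X\) is a batch of calibration inputs.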
Fine-tuning, a necessary step to recover accuracy after model compression, is resource-intensive due to the need to update the entire model at once. During this process, the model is trained on the original dataset with a lower learning rate to prevent catastrophic forgetting. This requires storing large gradients, activations, and optimizer states in high precision, leading to a memory overhead of up to 4× the original model's size. Additionally, the computational cost is substantial, as fine-tuning demands hundreds of millions of tokens to reduce the accuracy gap between the original and compressed models. These challenges highlight the need for more efficient methods to optimize compressed models, such as the block-wise error minimization approach we propose.
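A rough accounting of where this overhead comes from, assuming 32-bit weights, gradients, and Adam moments (activation memory comes on top of this):

\[ \underbrace{4}_{\text{weights}} + \underbrace{4}_{\text{gradients}} + \underbrace{4}_{\text{Adam } m} + \underbrace{4}_{\text{Adam } v} = 16 \text{ bytes per parameter} \;\approx\; 4\times \text{ the 4 bytes per parameter of the model itself.} \]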
BEAM
Instead of relying on the similarity between the model's output and the ground truth, BEAM leverages intermediate representations to guide the training of compressed models, offering a novel approach to model compression. In traditional training from scratch, the optimal intermediate values are unknown, but BEAM utilizes those from the original model to inform the compression process.1 This method allows for more targeted and efficient training of the compressed model.
BEAM employs a block-wise error minimization strategy, dividing the model into individual transformer blocks and training each block's weights to minimize the error introduced by compression, significantly reducing memory and compute requirements. The error is calculated by comparing intermediate values of the original and compressed models, then backpropagated to adjust the block's weights. This approach enables optimization on a single GPU even for large models and achieves substantial accuracy improvements with only hundreds of thousands of tokens, making it adaptable to various compression techniques discussed next.
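In the notation of the pseudo-code given later, the objective minimized for each block is simply the mean squared error between the outputs of the compressed and original blocks on the same input:

\[ \min_{B^C_i} \; \left\| B^C_i(X_{i-1}) - B^O_i(X_{i-1}) \right\|_2^2 , \]

where \(B^O_i\) and \(B^C_i\) denote the original and compressed blocks, \(X_{i-1}\) is the input to block \(i\), and only the weights of \(B^C_i\) are updated.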
BEAM adapts seamlessly to different compression methods like sparsity and quantization without requiring changes to the underlying techniques. When applying sparsity, it maintains the same sparsity mask during block optimization, and for quantization, it uses straight-through estimators to backpropagate errors through the quantization function. This flexibility allows BEAM to integrate with methods such as Wanda, SparseGPT, GPTQ, and SLiM effortlessly.
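As a rough illustration, here is how a fixed sparsity mask and a straight-through estimator for quantization might be wired into the block optimization. The class and function names are hypothetical, and the single per-tensor scale is a simplification of the group-wise schemes used by GPTQ and AbsMax:

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Fake-quantize in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, weight, scale):
        # Round to the quantization grid defined by `scale` (symmetric 4-bit range as an example).
        return torch.clamp(torch.round(weight / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: ignore the rounding in the backward pass.
        return grad_output, None


def compressed_weight(weight, sparsity_mask=None, scale=None):
    """Apply the fixed sparsity mask and/or fake quantization to a trainable weight."""
    w = weight
    if sparsity_mask is not None:
        # The mask computed by the pruning method (e.g., Wanda, SparseGPT) stays fixed,
        # so only the surviving weights receive gradient updates.
        w = w * sparsity_mask
    if scale is not None:
        w = STEQuantize.apply(w, scale)
    return w
```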
Additionally, a multi-block version of BEAM optimizes several blocks simultaneously, reducing fine-tuning time while enabling optimization of larger model portions. This capability captures inter-layer interactions during error minimization, and in the extreme case where all blocks are optimized together, BEAM mirrors traditional fine-tuning. This flexibility lets users balance optimization scope and efficiency, with independent block groups enabling further enhancements like parallelization.
The independent optimization of different block groups in BEAM allows for parallelization, significantly accelerating the tuning process when extensive computational resources are available. This scalability enhances BEAM's efficiency for large-scale compression tasks. The following section details BEAM's implementation, providing practical insights into its application.
Implementation Details
BEAM begins by taking both the original and compressed models as input, splitting them into transformer blocks to efficiently generate training data. It calculates and stores intermediate values from the original model to create an optimization dataset for the compressed model. Following approaches like SLiM and Wanda, BEAM uses 128 samples of sequence length 2048 from the C4 dataset and iterates over them for 32 epochs; the hyperparameters governing this optimization are discussed next.
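An illustrative sketch of how such a calibration set could be assembled (the C4 subset name, field names, and sampling details here are assumptions rather than BEAM's actual data pipeline):

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_calibration_batches(model_name, n_samples=128, seq_len=2048, seed=0):
    """Sample n_samples sequences of length seq_len from C4 for calibration."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
    batches, generator = [], torch.Generator().manual_seed(seed)
    for sample in dataset:
        ids = tokenizer(sample["text"], return_tensors="pt").input_ids
        if ids.shape[1] < seq_len:
            continue  # skip documents that are too short
        start = torch.randint(0, ids.shape[1] - seq_len + 1, (1,), generator=generator).item()
        batches.append(ids[:, start:start + seq_len])
        if len(batches) == n_samples:
            break
    return batches
```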
For optimization, BEAM employs the Adam optimizer with hyperparameters chosen to balance performance and efficiency. The learning rate is selected from the range 1e-3 to 1e-6 based on validation performance on the first 10 training samples and is paired with a linear scheduler ending at 1e-3, and each block is trained with a batch size of 2048 tokens (equal to the sequence length). These settings ensure effective compression, as evidenced by the experimental results that follow.
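A minimal sketch of how this per-block learning-rate search could be implemented, assuming a generic PyTorch block whose forward returns a tensor (the function name, candidate grid, and number of trial steps are illustrative):

```python
import copy
import torch

def search_learning_rate(block, val_inputs, val_targets,
                         candidates=(1e-3, 1e-4, 1e-5, 1e-6), steps=10):
    """Pick the learning rate that gives the lowest block-output MSE after a few trial steps."""
    best_lr, best_loss = None, float("inf")
    for lr in candidates:
        trial = copy.deepcopy(block)  # do not disturb the block during the search
        optimizer = torch.optim.Adam(trial.parameters(), lr=lr)
        for x, y in zip(val_inputs[:steps], val_targets[:steps]):
            loss = torch.nn.functional.mse_loss(trial(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            loss = sum(torch.nn.functional.mse_loss(trial(x), y)
                       for x, y in zip(val_inputs, val_targets)) / len(val_inputs)
        if loss < best_loss:
            best_lr, best_loss = lr, loss.item()
    return best_lr
```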
The following pseudo-code summarizes the BEAM algorithm.
BEAM Algorithm
\[ \begin{array}{rll} & \textbf{Input: } \\ & \quad M^O \text{: Original Model},\quad M^C \text{: Compressed Model}, \quad D \text{: Calibration Dataset}, \\ & \quad n \text{: Number of Blocks}, \quad g \text{: Block Granularity} \\ & \textbf{Procedure } \text{BEAM-Optimization}(M^O, M^C, D) \\ 1: & \quad X_0 = D \\ & \\ 2: & \quad \textbf{for } i \in \{1, 2, \ldots, n // g\} \textbf{ do} \\ & \quad \quad \beamcomment{\text{// Extract the corresponding blocks from the models}} \\ 3: & \quad \quad B^O_i = \text{Block}(M^O, (i - 1) * g : i * g) \\ 4: & \quad \quad B^C_i = \text{Block}(M^C, (i - 1) * g : i * g) \\ & \\ & \quad \quad \beamcomment{\text{// Generate training data for each block}} \\ 5: & \quad \quad X_i = \text{Forward}(B^O_i, X_{i-1}) \\ & \\ & \quad \quad \beamcomment{\text{// Optimize each block}} \\ 6: & \quad \quad lr = \text{SearchLR}(B^C_i, X_i[0:10]) \\ 7: & \quad \quad \text{Optimizer} = \text{Adam}(lr) \\ 8: & \quad \quad \text{MinimizeMSE}(B^C_i, X_{i-1}, X_i, \text{Optimizer}) \\ & \\ & \quad \quad \beamcomment{\text{// Generate the input of next block using the updated compressed model}} \\ 9: & \quad \quad X_i = \text{Forward}(B^C_i, X_{i-1}) \\ 10: & \quad \textbf{end for} \\ \end{array} \]
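For readers who prefer code, the loop above could be translated into PyTorch roughly as follows. The get_blocks helper, the BlockGroup wrapper, and the reuse of the search_learning_rate sketch from earlier are assumptions, and details such as attention masks, rotary position embeddings, and device placement are omitted:

```python
import torch

class BlockGroup(torch.nn.Module):
    """A group of consecutive transformer blocks treated as one unit."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, hidden):
        for block in self.blocks:
            out = block(hidden)
            # Hugging Face decoder layers return tuples; keep only the hidden states.
            hidden = out[0] if isinstance(out, tuple) else out
        return hidden


def get_blocks(model):
    # Assumption: a LLaMA-style Hugging Face model whose decoder blocks live in model.model.layers.
    return list(model.model.layers)


def beam_optimize(original_model, compressed_model, calib_inputs, granularity=1, epochs=32):
    """Block-wise error minimization: train each compressed block group to match the original."""
    orig_blocks, comp_blocks = get_blocks(original_model), get_blocks(compressed_model)
    xs = [x.detach() for x in calib_inputs]  # inputs to the first block group

    for i in range(0, len(orig_blocks), granularity):
        orig_group = BlockGroup(orig_blocks[i:i + granularity])
        comp_group = BlockGroup(comp_blocks[i:i + granularity])

        # Targets: the original blocks' outputs on the current inputs.
        with torch.no_grad():
            targets = [orig_group(x) for x in xs]

        # Optimize the compressed group to match those outputs (LR search as sketched earlier).
        lr = search_learning_rate(comp_group, xs[:10], targets[:10])
        optimizer = torch.optim.Adam(comp_group.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in zip(xs, targets):
                loss = torch.nn.functional.mse_loss(comp_group(x), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # The next group's inputs come from the *updated* compressed blocks.
        with torch.no_grad():
            xs = [comp_group(x) for x in xs]
```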
Experiments
We evaluate BEAM with the Wanda, SparseGPT, and SLiM pruning methods and the GPTQ, SLiM-Quant, and AbsMax quantization methods. GPTQ and AbsMax use a group size of 32, while SLiM-Quant quantizes each weight tensor with a single parameter. We evaluate these methods with and without BEAM on the MMLU, PIQA, Arc-Easy, Arc-Challenge, WinoGrande, and OpenBookQA zero-shot downstream tasks and report the average accuracy achieved. Additionally, we report the perplexity of each method on the WikiText-2 dataset.
The complete set of results with task-specific performance is available at our W&B report.
Accuracy Results
The following tables show the accuracy of the compressed models with and without BEAM on the zero-shot downstream tasks. The results show that BEAM improves the accuracy of the compressed models by up to 4.34% over the same models without BEAM. These accuracy improvements are comparable to those obtained with the optional low-rank adapter fine-tuning used in SLiM, but at a fraction of the time.
Language Modeling Results
The following table shows the perplexity of the compressed models with and without BEAM on the WikiText-2 dataset. The results show that BEAM significantly improves the perplexity of the compressed models.
These results show that BEAM reduces perplexity more substantially than it improves accuracy. Since the accuracy experiments are multiple-choice tasks while perplexity measures language-modeling quality, this suggests that BEAM is especially effective at improving the fluency of the model.
Timing Results
We report the fine-tuning time of BEAM on a single A100-40GB GPU. The results show that BEAM can fine-tune models with up to 12B parameters in less than 4 hours. This is far less than the fine-tuning time of the original models: in practice, fine-tuning a 12B model on only 300,000 tokens takes over 36 days on a single A100-40GB GPU! The fine-tuning time of BEAM is comparable to the compression time of the original model, making it a practical solution for boosting the performance of compressed models.
Additionally, the results show that scaling the number of blocks optimized together does not affect the end-to-end fine-tuning time. This allows the block granularity in BEAM to be tuned for the highest accuracy within a given fine-tuning budget.
Effects of Number of Samples
One hyperparameter of BEAM is the number of samples used for fine-tuning, and we evaluate its effect on the accuracy of the fine-tuned model. Using more samples lets BEAM see a larger variety of examples during fine-tuning, but at the cost of more CPU memory usage or disk offloading to store the intermediate values of the original model. Our experiments show that increasing the number of samples from 128 to 2048, while keeping the total number of training steps constant (4096 steps per block), does not have a significant effect on the accuracy of the model. This shows that BEAM is robust to the number of samples used for fine-tuning.
Conclusion and Future Work
In this article, we introduce BEAM, an efficient block-wise error minimization mechanism designed with minimal computational and memory overhead. BEAM serves as an add-on for various pruning and quantization techniques, including Wanda, SparseGPT, GPTQ, and SLiM, achieving up to 4.34% accuracy improvement in compressed models. Moving forward, we aim to explore BEAM's impact on additional compression methods and investigate its potential for training models from scratch. Furthermore, BEAM can be adapted for distributed settings to reduce fine-tuning time, an implementation we reserve for future work.
Acknowledgement
I would like to thank Dr. Amir Yazdanbakhsh for his invaluable insights, feedback, and guidance on this work. His expertise was instrumental in shaping this blog post.
Citation
If you use BEAM in your research, please cite our work:
@misc{mozaffari2025beam,
  author       = {Mozaffari, Mohammad},
  title        = {BEAM: Blockwise Error Minimization for One-shot Compression of LLMs},
  year         = {2025},
  month        = {June},
  day          = {12},
  howpublished = {\url{https://www.cs.toronto.edu/~mmozaffari/compression-trinity/beam/index.html}},
  note         = {Blog post}
}
Footnotes
- Other works such as QUIP# and Wanda++ have utilized different block-wise tunings for their model compression approaches as well; however, BEAM is the first method that supports arbitrary combinations of sparsity, quantization, and low-rank adapters, with the additional capability of multi-block optimization.