Introduction
Large language models have reached impressive levels of performance, but deploying them efficiently remains awkward and brittle. In practice, teams are forced into a coarse trade-off: either serve a large, expensive model that pushes hardware limits, or fall back to a smaller distilled variant with noticeably weaker reasoning. This leads to a fragmented deployment stack, with separate models trained and maintained for different budgets. A 7B model for edge devices, a 13B model for single GPUs, and a 70B model for cloud inference all coexist, even though they target the same task. What practitioners actually want is a single model that can adapt its footprint to the available compute without retraining.
Matryoshka-style architectures move in that direction. Introduced by MatFormer and later adopted by models such as Gemma-3n, these architectures impose a nested structure on the model so that smaller sub-models can be extracted directly from a larger parent. The same idea has recently been extended beyond width and depth. MatQuant applies nesting to precision, enabling layers that support multiple bit-widths using shared weights. In principle, this turns a single checkpoint into a family of models spanning a wide range of sizes and costs.
The problem is that flexibility alone does not tell us how to use it. Allowing each layer to independently vary in width, depth, or precision creates an enormous configuration space. Even a modest model with 32 layers and four options per layer admits more than $4^{32}$ possible sub-models. Exploring this space by hand is hopeless, and brute-force search is out of the question. As a result, most existing work relies on simple heuristics, such as uniformly shrinking all layers or applying the same bit-width everywhere. These choices are easy to implement, but they rarely correspond to the best accuracy-efficiency trade-off.
We introduce SLICE (Selecting Layer-wise Configurations for Matryoshka-Style LLMs) to address this gap. SLICE treats sub-model selection as a constrained optimization problem rather than a manual design choice. Using a learnable mask-based selection mechanism, the model learns which layers should retain capacity and which can be aggressively reduced to meet a target budget. The target can be expressed in terms of memory, parameter count, or latency. Across a range of settings, SLICE consistently finds configurations that outperform uniform or hand-tuned baselines at the same cost. These results suggest that elastic architectures are only half the solution. The other half is a principled way to decide how, and where, to cut.
SLICE: Automated Configuration Search via Differentiable Relaxation
The interactions between LLM layers are highly non-linear; removing capacity from an early layer might catastrophically affect a later layer, or vice versa. Finding the optimal configuration is essentially a discrete search problem, choosing specific up-projection dimension ratios (e.g., $0.5, 0.75, 1.0$) for every layer. While "mix-and-match" heuristics offer a manual workaround, they are computationally inefficient due to the exponential size of the search space. To automate this, SLICE moves away from fixed heuristics and instead learns the optimal configuration dynamically during training.
SLICE treats the rank selection for each layer as a multi-class classification problem. For a given layer, we define a set of candidate intermediate dimensions (e.g., $50\%$, $75\%$, or $100\%$ of the original dimension). We assign a set of learnable parameters to each layer corresponding to these classes.
Standard discrete sampling (selecting the option with the highest probability) is non-differentiable; it breaks the flow of gradients required for backpropagation. To circumvent this, we employ Gumbel-Softmax relaxation. This reparameterization trick allows us to sample configurations stochastically during the forward pass while maintaining a valid path for gradient updates in the backward pass. Effectively, the model exists in a "superposition" of configurations during training, gradually collapsing into a single discrete choice.
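As a rough illustration, a single layer's relaxed rank selection can be sketched in NumPy as below. The logits, candidate ratios, and seed are made up for the example; the post does not specify the actual framework, straight-through details, or hyperparameters.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed (soft) one-hot sample over candidate rank ratios.

    logits: unnormalized scores, one per candidate rank ratio.
    tau:    temperature; high -> near-uniform, low -> near one-hot.
    """
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))        # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())             # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.array([0.2, 0.5, 1.0])      # hypothetical scores for ranks 0.5, 0.75, 1.0
soft = gumbel_softmax(logits, tau=1.0, rng=rng)
ranks = np.array([0.5, 0.75, 1.0])
effective_rank = float(soft @ ranks)    # the layer's "superposed" rank during training
```

During training the soft vector mixes all candidate ranks; as the temperature drops, it collapses toward a single one-hot choice.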
To ensure SLICE explores the search space thoroughly before converging on a final architecture, we utilize two synchronized schedulers that control the "sharpness" of our selection:
- Temperature Annealing: We start training with a high temperature in the Gumbel-Softmax function. This creates a high-entropy distribution, forcing the model to explore "less probable" scenarios and effectively smoothing the optimization landscape. As training progresses, we anneal the temperature down to $0.05$, forcing the soft probabilities to approach a hard, one-hot decision.
- Logit Scaling: Concurrently, we multiply the inputs of the Gumbel-Softmax by a scalar that increases linearly throughout training. This acts as a forcing function to sharpen the classification outputs, ensuring that by the end of the process, the model has made a distinct, unambiguous choice for every layer's rank.
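Only the final temperature ($0.05$) is stated above; assuming linear schedules and illustrative start/end values for the other knobs, the two synchronized schedulers could look like:

```python
def schedules(step, total_steps, tau_start=2.0, tau_end=0.05,
              scale_start=1.0, scale_end=10.0):
    """Synchronized sharpening schedules for the Gumbel-Softmax.

    Returns (temperature, logit_scale) at the given training step.
    The temperature anneals down to tau_end while the logit
    multiplier grows linearly, jointly pushing the per-layer
    distributions toward hard, one-hot decisions.
    Note: tau_start, scale_start, and scale_end are assumed values.
    """
    frac = step / total_steps
    tau = tau_start + (tau_end - tau_start) * frac
    scale = scale_start + (scale_end - scale_start) * frac
    return tau, scale
```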
If we optimized solely for language modeling performance, SLICE would simply select the maximum rank for every layer. To enforce efficiency, we incorporate a resource-aware regularization term. Since the Feed-Forward Networks (FFNs) account for the vast majority of parameters in an LLM (approximately $80\%$), the average rank of these layers serves as a precise proxy for the total model size. We allow the user to define a `target_rank` (the desired average capacity) and penalize deviations using the following objective:
$$ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LLM}} + \lambda \cdot \left| \operatorname{average}(\text{ranks}) - \text{target\_rank} \right| $$

This formulation compels SLICE to act as an intelligent broker: trading off capacity in redundant layers to preserve high ranks in the most critical layers, all while strictly adhering to the user's size constraints.
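One plausible reading of the objective (the post does not spell out how per-layer ranks are computed during training) is to take each layer's rank as its expected value under the soft selection probabilities:

```python
import numpy as np

def slice_loss(llm_loss, layer_rank_probs, candidate_ranks, target_rank, lam=1.0):
    """Total objective: LM loss plus a budget-deviation penalty.

    layer_rank_probs: per-layer soft probabilities over candidate ranks
                      (e.g., from the Gumbel-Softmax), shape (L, K).
    candidate_ranks:  the K rank ratios, e.g., [0.5, 0.75, 1.0].
    target_rank:      the user's desired average capacity.
    """
    expected_ranks = layer_rank_probs @ candidate_ranks  # (L,) expected rank per layer
    avg_rank = expected_ranks.mean()
    return llm_loss + lam * abs(avg_rank - target_rank)

# With uniform probabilities the average rank is (0.5 + 0.75 + 1.0) / 3 = 0.75,
# so a target of 0.75 incurs no penalty.
probs = np.full((32, 3), 1.0 / 3.0)
ranks = np.array([0.5, 0.75, 1.0])
loss = slice_loss(2.0, probs, ranks, target_rank=0.75)
```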
Experiments
Setup and Benchmarks. To validate SLICE, we evaluated its performance across a broad suite of reasoning and knowledge-intensive tasks. Using the Language Model Evaluation Harness, we computed the average accuracy across 8 standard benchmarks: PIQA, ARC-Easy, ARC-Challenge, Winogrande, OpenBookQA, RACE, HellaSwag, and MMLU. We also monitored Perplexity (PPL) on WikiText2 to ensure the compressed models maintained language modeling coherence.
Baseline: A Fair "Mix-and-Match" Simulation. Comparing SLICE against a single manual heuristic (e.g., "prune every layer by 25%") would be insufficient. We wanted to compare SLICE against the entire space of valid configurations. To achieve this, we developed a rigorous random generator that simulates the "mix-and-match" approach. For any given target rank $r$, we solve for the exact fractions of layers ($t_1, t_2, t_3$) assigned to the rank ratios $0.5$, $0.75$, and $1.0$ such that the parameter budget is strictly met:
$$ 0.5 \cdot t_1 + 0.75 \cdot t_2 + 1.0 \cdot t_3 = r $$

subject to:

$$ t_1 + t_2 + t_3 = 1 $$

By sampling uniformly from the valid solution space defined by these constraints, we generated a cloud of "Synthetic" configurations. These represent the various outcomes a practitioner might achieve through trial-and-error.
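The post does not include the generator's code; a minimal sketch of one way to sample from this constrained space is to eliminate $t_1$ and $t_3$ using the two equations, leaving a single free variable $t_2$ with a bounded feasible interval:

```python
import numpy as np

def sample_mix_and_match(r, rng):
    """Uniformly sample layer fractions (t1, t2, t3) for rank ratios
    (0.5, 0.75, 1.0) satisfying
        0.5*t1 + 0.75*t2 + 1.0*t3 = r   and   t1 + t2 + t3 = 1.

    Substituting t3 = 1 - t1 - t2 gives t1 = 2*(1 - r) - 0.5*t2 and
    t3 = 2*r - 1 - 0.5*t2, so non-negativity bounds the free variable:
        t2 in [0, min(1, 4*(1 - r), 4*r - 2)]   (feasible for 0.5 <= r <= 1).
    """
    t2_max = min(1.0, 4.0 * (1.0 - r), 4.0 * r - 2.0)
    t2 = rng.uniform(0.0, t2_max)
    t1 = 2.0 * (1.0 - r) - 0.5 * t2
    t3 = 1.0 - t1 - t2
    return t1, t2, t3

rng = np.random.default_rng(0)
t1, t2, t3 = sample_mix_and_match(0.6, rng)
```

This is one plausible parameterization of the solution space, not necessarily the exact sampler used in the experiments.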
Model Quality Results. The results, visualized in the figure below, highlight the difference between lucky guesses and learned optimization. The grey dots ("Mix-and-Match") reveal the high variance inherent in the search space. At a fixed average rank of 0.60, valid manual configurations resulted in accuracies ranging wildly from roughly 56% to 57.5%. This confirms that not all parameters are created equal; two models with the exact same size can have drastically different capabilities depending on where the capacity is allocated.
The blue stars represent the configurations discovered by SLICE, with the attached error bars illustrating the variance in accuracy across different hyperparameter settings. In every instance, SLICE rises above the noise, forming a distinct Pareto frontier. It consistently identifies structural combinations that human heuristics, and even random search, miss entirely.
The table below details the best accuracy and perplexity (PPL) achieved by SLICE compared to the best-found synthetic configurations at various sparsity ratios.
The "Perplexity Trap"
A critical observation from our data is the decoupling of Perplexity (PPL) and downstream accuracy. As seen in the 0.55 rank experiments, the "Mix-and-Match" baseline achieved a lower perplexity (18.02) than SLICE (20.63), yet SLICE outperformed it on reasoning tasks by over a full percentage point (59.27% vs 58.20%).
This underscores a known limitation in model compression: optimizing for next-token prediction (PPL) does not guarantee robust reasoning capabilities. A heuristic might preserve language fluency (low PPL) while accidentally pruning layers critical for logic and commonsense reasoning. SLICE, by learning the architecture dynamically, prioritizes the functional integrity of the model over raw statistical fluency.
Hyperparameters
The following table summarizes the hyperparameters that we used for our experiments with SLICE.
Conclusions and Limitations
The era of static model deployment is ending. We can no longer afford to treat LLMs as rigid monoliths that must be swapped out entirely whenever hardware constraints change. While architectures like Matryoshka provide the mechanical ability to scale, they do not provide the intelligence to do so effectively.
Our work with SLICE demonstrates that efficiency is not just about model size. It is about architectural precision. By replacing manual heuristics with a learned, differentiable search, we can extract significantly more reasoning capability from the same parameter budget. We found that the difference between a random cut and a learned slice can be the difference between a model that hallucinates and one that reasons.
However, this advantage is not uniform. We observe that the performance gap between SLICE and manual heuristics narrows as the average rank increases. At higher capacities (e.g., Rank 0.65 and above), the sub-models retain a majority of the original redundancy, making them inherently more robust to suboptimal configuration choices. Thus, SLICE is most critical in aggressive compression regimes, where the margin for error is slim and every parameter must be allocated strategically.
Acknowledgement
I would like to thank Dr. Amir Yazdanbakhsh for his invaluable insights and feedback on this work. His expertise was instrumental in shaping this blog post.
Citation
If you use SLICE in your research, please cite our work:
@misc{mozaffari2025slice,
author = {Mozaffari, Mohammad},
title = {SLICE: Selecting Layer-wise Configurations for Matryoshka-Style LLMs},
year = {2025},
month = {December},
day = {17},
howpublished = {\url{https://www.cs.toronto.edu/~mmozaffari/compression-trinity/slice/index.html}},
note = {Blog post}
}