Overview
The method shows clear gains on LLaMA-7B across many benchmarks and uses modest compute (4× V100 GPUs). Results are limited to the reported LLaMA variants (7B and 13B) and public instruction datasets, so expect some engineering work to adapt it to other models or data.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Compresso lets teams reduce LLM size and inference cost with modest training and public instruction data while keeping near-original accuracy; this lowers deployment cost on standard GPUs without specialized hardware.
Summary TLDR
Compresso is a training-based structured pruning method for large language models (LLMs). It uses LoRA (low-rank adapters) plus L0 regularization to learn binary masks that remove attention heads, FFN units, and hidden dimensions while the original weights stay frozen. A short, task-style "collaborative" prompt instructs the model during both pruning and inference, which improves adaptation. On LLaMA-7B, Compresso produces 5.4B, 5.0B, and 4.5B variants that largely retain zero-shot and few-shot performance and outperform a structured one-shot baseline (LLM-Pruner) on several benchmarks.
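The mask-learning idea can be illustrated with a hard-concrete gate, a common relaxation used for L0 regularization. This is a minimal sketch, not the paper's implementation: the summary does not say which relaxation Compresso uses, and the constants `GAMMA`, `ZETA`, `BETA` and the per-head `log_alpha` values below are illustrative assumptions.

```python
import math
import random

# Assumed hard-concrete constants (stretch limits and temperature); not from the source.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, rng):
    """Draw one stochastic gate in [0, 1], as used while training the mask."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    s = _sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def prob_active(log_alpha):
    """Probability that the gate is non-zero; the sum over gates is the expected L0 norm."""
    return _sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def deterministic_gate(log_alpha):
    """Gate value used after training, without sampling noise."""
    return min(1.0, max(0.0, _sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical learned gate parameters for three attention heads.
    log_alphas = {"head_0": 4.0, "head_1": -4.0, "head_2": 0.5}
    expected_l0 = sum(prob_active(a) for a in log_alphas.values())
    keep = {name: deterministic_gate(a) > 0.5 for name, a in log_alphas.items()}
    print("expected number of active heads:", round(expected_l0, 3))
    print("binary keep mask:", keep)
    print("one training-time sample:", sample_gate(0.5, rng))
```

In a real run the gate parameters would be trained jointly with the LoRA adapters while the base weights stay frozen, and the expected L0 term would be pushed toward the target sparsity.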
Problem Statement
Structured pruning can cut real inference cost but is hard for LLMs. One-shot (no-training) pruning is cheap but hurts quality; training-based pruning can do better but is extremely memory- and data-hungry. The paper asks whether a memory-efficient, training-based method plus an LLM-aware prompt can learn better layer-wise pruning and recover accuracy under resource constraints.
Main Contribution
A memory-efficient training-based structured pruning pipeline that freezes base weights and learns binary masks while updating only LoRA adapters (4.54M trainable params for LLaMA-7B).
A collaborative pruning prompt that instructs the LLM during pruning and inference to adapt to removed parameters, improving final accuracy.
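As a rough illustration of the prompt contribution, the same instruction text can be prepended to every example during pruning and to every query at inference. The wording of `PRUNING_PROMPT` below is a hypothetical placeholder, since this summary does not quote the paper's actual prompt, and `with_pruning_prompt` is an illustrative helper name.

```python
# Placeholder wording only; the real prompt text is not given in this summary.
PRUNING_PROMPT = (
    "Attention: some of your parameters have been pruned. "
    "Adapt to the remaining parameters and answer as well as you can.\n\n"
)


def with_pruning_prompt(text, prompt=PRUNING_PROMPT):
    """Prepend the pruning prompt; use the same prompt for training and inference."""
    return prompt + text


if __name__ == "__main__":
    print(with_pruning_prompt("Question: Which is heavier, a brick or a feather?"))
```

Keeping the prompt identical in both phases matters for this kind of setup, because the model is adapted with the prompt present and would otherwise see a different input distribution at inference.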
Key Findings
Compresso prunes LLaMA-7B to 5.4B while preserving most capabilities (60.09 vs. 62.19 average accuracy in Table 2).
Compresso outperforms the structured one-shot baseline LLM-Pruner on several benchmarks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Commonsense reasoning accuracy (zero-shot, avg) | Compresso 5.4B ~60.09 | LLaMA-7B 62.19 | retains ~96% (−2.1 pts) | StoryCloze, PIQA, HellaSwag, WinoGrande, ARC-e/c, OBQA | Table 2; Sec. 4.2 | Table 2 |
| Reading comprehension (zero-shot, avg) | Compresso 5.4B 60.35 | LLaMA-7B 57.73 | +2.62 pts vs LLaMA-7B (avg) | BoolQ, RACE-High | Table 3; Sec. 4.2 | Table 3 |
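The Delta column can be checked directly from the Value and Baseline columns. The short sketch below recomputes the absolute difference in accuracy points and the retention ratio for both rows; the helper name `compare` is illustrative.

```python
def compare(pruned, baseline):
    """Return (delta in accuracy points, retention in percent of baseline)."""
    delta = pruned - baseline
    retention = 100.0 * pruned / baseline
    return round(delta, 2), round(retention, 1)


# Values taken from the two table rows above.
commonsense = compare(60.09, 62.19)
reading = compare(60.35, 57.73)

if __name__ == "__main__":
    print("commonsense (delta pts, retention %):", commonsense)
    print("reading comprehension (delta pts, retention %):", reading)
```

The deltas are differences in absolute accuracy points, not relative percentages. Retention on the first row comes out slightly above the "~96%" quoted in the table.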
What To Try In 7 Days
Run LoRA-only fine-tuning + L0 mask training on a small instruction dataset to test mask learning for your model.
Add a short 'pruning instruction' prompt to training and inference to see if model adaptation improves.
Compare model size vs accuracy trade-offs at one or two target sparsities (e.g., 5.4B from 7B) and measure latency/memory on your hardware.
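For the size comparison, a quick parameter-count estimate helps pick candidate sparsities before any training. The sketch below assumes the public LLaMA-7B configuration (hidden size 4096, 32 layers, 32 heads of dimension 128, FFN size 11008, vocabulary 32000), which is not stated in this summary. It also assumes uniform pruning across layers, whereas Compresso learns a layer-wise distribution, so treat the result as a rough budget only.

```python
def llama_param_count(hidden=4096, layers=32, heads=32, head_dim=128,
                      ffn=11008, vocab=32000):
    """Approximate parameter count for a LLaMA-style decoder (uniform layers)."""
    attn = 4 * hidden * heads * head_dim  # q, k, v, o projections
    mlp = 3 * hidden * ffn                # gate, up, down projections
    norms = 2 * hidden                    # two RMSNorm weight vectors per layer
    embed = 2 * vocab * hidden            # input embeddings + output head
    return layers * (attn + mlp + norms) + embed + hidden  # + final norm


if __name__ == "__main__":
    dense = llama_param_count()
    # Hypothetical uniform pruning: keep 24 of 32 heads and 8192 of 11008 FFN units.
    pruned = llama_param_count(heads=24, ffn=8192)
    print(f"dense:  {dense / 1e9:.2f}B parameters")
    print(f"pruned: {pruned / 1e9:.2f}B parameters")
    print(f"sparsity: {100 * (1 - pruned / dense):.1f}%")
```

Pair an estimate like this with measured latency and peak memory on your own hardware, since parameter count alone does not determine inference cost.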
Risks & Boundaries
Limitations
Evaluations limited to LLaMA-7B/13B; behavior on other architectures is untested.
Requires instruction-tuning style data; generic calibration data performed worse.
When Not To Use
When you need ultra-fast one-shot pruning with zero training budget.
When you lack instruction-tuning style data or cannot run any adapter training.
Failure Modes
Higher sparsity can cause larger drops in few-shot tasks (MMLU declines more than zero-shot commonsense).
Post fine-tuning sometimes harms specific benchmarks (e.g., BBH at 4.5B).

