Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Compresso lets teams shrink an LLM and cut inference cost using modest training compute and public instruction data, while keeping accuracy close to the original model. Because the pruning is structured, the smaller model runs on standard GPUs without specialized hardware.
Summary TLDR
Compresso is a training-based structured pruning method for large language models (LLMs). It combines LoRA (low-rank adapters) with L0 regularization to learn binary masks that remove attention heads, FFN units, and hidden dimensions while the original weights stay frozen. A short "collaborative" prompt, given to the model during both pruning and inference, instructs it to adapt to the removed parameters. On LLaMA-7B, Compresso produces 5.4B / 5.0B / 4.5B variants that largely retain zero-shot and few-shot performance and outperform a structured one-shot baseline (LLM-Pruner) on several benchmarks.
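To make the combination of frozen weights, LoRA adapters, and binary masks concrete, here is a minimal pure-Python sketch of one masked linear layer. It is an illustration under assumptions, not the authors' implementation: the function names, the toy dimensions, the `alpha` scaling value, and the choice to gate output units are all hypothetical.

```python
# Toy sketch only: names, sizes, and mask placement are illustrative assumptions.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def masked_lora_linear(x, W, A, B, out_mask, alpha=16.0):
    """Compute out_mask * (W x + (alpha / r) * B (A x)).

    W is the frozen base weight (out x in). A (r x in) and B (out x r) are the
    only trainable matrices. out_mask holds one 0/1 gate per output unit.
    """
    r = len(A)                          # LoRA rank
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank trainable path
    scale = alpha / r
    return [m * (b + scale * u) for m, b, u in zip(out_mask, base, update)]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
A = [[0.1, 0.2, 0.3]]               # rank r = 1
B = [[0.0], [0.0], [0.0], [0.0]]    # zero init, so the adapter starts as a no-op
mask = [1, 0, 1, 1]                 # the second output unit is pruned

y = masked_lora_linear([1.0, 2.0, 3.0], W, A, B, mask)
print(y)  # [7.0, 0.0, 6.0, 4.0]
```

Only `A`, `B`, and the mask parameters would receive gradients in such a setup, which is why the trainable parameter count stays small relative to the base model.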
Problem Statement
Structured pruning can cut real inference cost but is hard for LLMs. One-shot (no-training) pruning is cheap but hurts quality; training-based pruning can do better but is extremely memory- and data-hungry. The paper asks whether a memory-efficient, training-based method plus an LLM-aware prompt can learn better layer-wise pruning and recover accuracy under resource constraints.
Main Contribution
A memory-efficient training-based structured pruning pipeline that freezes base weights and learns binary masks while updating only LoRA adapters (4.54M trainable params for LLaMA-7B).
A collaborative pruning prompt that instructs the LLM during pruning and inference to adapt to removed parameters, improving final accuracy.
Extensive evaluation of LLaMA-7B pruned to 5.4B/5.0B/4.5B parameters, showing retained generalization and consistent gains over a structured one-shot baseline across commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks.
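The collaborative prompt idea can be sketched as a small wrapper that prepends the same text to training examples and to inference queries. The prompt string below is a placeholder written for this example, not the prompt used in the paper, and `with_pruning_prompt` is a hypothetical helper name.

```python
# Placeholder text: NOT the prompt from the paper, only an illustrative stand-in.
PRUNING_PROMPT = (
    "Note: some of this model's parameters have been removed. "
    "Adapt to the remaining parameters and answer as accurately as possible."
)


def with_pruning_prompt(text, prompt=PRUNING_PROMPT):
    """Prepend the same pruning prompt to a training example or an inference query."""
    return f"{prompt}\n\n{text}"


# Use the identical wrapper in both phases so the model sees a consistent prefix.
train_example = with_pruning_prompt("Instruction: Summarize the article.")
query = with_pruning_prompt("Question: What is the capital of France?")
print(query)
```

Keeping the wrapper in one place makes it harder for the training and inference prompts to drift apart.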
Key Findings
Compresso prunes LLaMA-7B to 5.4B while preserving most capabilities.
Compresso beats a structured one-shot baseline across tasks.
A collaborative pruning prompt measurably helps.
Instruction-tuning data works better than generic calibration data for pruning.
LoRA + L0 masks reduce training memory footprint.
Results
Accuracy
Reading comprehension (zero-shot, avg)
MMLU (5-shot)
BBH (3-shot, avg)
Improvement vs LLM-Pruner (best reported)
Who Should Care
What To Try In 7 Days
Run LoRA-only fine-tuning + L0 mask training on a small instruction dataset to test mask learning for your model.
Add a short 'pruning instruction' prompt to training and inference to see if model adaptation improves.
Compare model size vs accuracy trade-offs at one or two target sparsities (e.g., 5.4B from 7B) and measure latency/memory on your hardware.
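When comparing target sizes, it helps to estimate how many parameters remain for a given set of kept heads, FFN units, and hidden dimensions. The sketch below counts parameters for a LLaMA-style decoder using the publicly documented LLaMA-7B shape (32 layers, hidden size 4096, 32 heads, FFN size 11008, vocabulary 32000). The `DecoderConfig` and `count_params` names are hypothetical, and the pruned example uses made-up uniform per-layer values, whereas Compresso learns a different mask per layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Shape of a LLaMA-style decoder; defaults follow the public LLaMA-7B config."""
    vocab_size: int = 32000
    hidden: int = 4096
    layers: int = 32
    heads: int = 32
    head_dim: int = 128
    ffn: int = 11008


def count_params(cfg, hidden_kept=None, heads_kept=None, ffn_kept=None):
    """Count parameters after structured pruning.

    hidden_kept: number of hidden dimensions kept (shared by all layers).
    heads_kept / ffn_kept: per-layer lists of kept heads and FFN units.
    """
    d = cfg.hidden if hidden_kept is None else hidden_kept
    heads_kept = heads_kept or [cfg.heads] * cfg.layers
    ffn_kept = ffn_kept or [cfg.ffn] * cfg.layers
    total = 2 * cfg.vocab_size * d + d        # input embedding, LM head, final norm
    for h, f in zip(heads_kept, ffn_kept):
        attn = 4 * d * h * cfg.head_dim       # q, k, v, o projections
        mlp = 3 * d * f                       # gate, up, down projections
        total += attn + mlp + 2 * d           # plus two norm weight vectors
    return total


cfg = DecoderConfig()
full = count_params(cfg)
# Hypothetical uniform pruning: 26 heads and 8800 FFN units kept in every layer.
pruned = count_params(cfg, heads_kept=[26] * 32, ffn_kept=[8800] * 32)
print(f"full:   {full / 1e9:.2f}B")
print(f"pruned: {pruned / 1e9:.2f}B")
```

An estimate like this only covers model size; latency and memory still need to be measured on the target hardware.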
Optimization Features
Infra Optimization
- training fits on 4x V100 GPUs as reported
Model Optimization
- structured pruning (attention heads, FFN units, hidden dims)
- layer-wise learned sparsity (automatic per-layer masks)
System Optimization
- compatible with standard GPU hardware (no special sparse kernels required)
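Structured pruning needs no sparse kernels because removing whole units leaves a smaller dense matrix. A minimal sketch, with a hypothetical `prune_linear` helper and a toy weight matrix:

```python
def prune_linear(W, keep_out, keep_in):
    """Return a smaller dense matrix with only the kept rows (outputs) and columns (inputs)."""
    rows = [i for i, keep in enumerate(keep_out) if keep]
    cols = [j for j, keep in enumerate(keep_in) if keep]
    return [[W[i][j] for j in cols] for i in rows]


W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

# Drop output unit 1 and input dimension 2; the result is an ordinary 2 x 3 matrix.
small = prune_linear(W, keep_out=[1, 0, 1], keep_in=[1, 1, 0, 1])
print(small)  # [[1, 2, 4], [9, 10, 12]]
```

The pruned matrix can be used with the same dense matrix-multiply routines as the original, only with smaller shapes.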
Training Optimization
- LoRA
- L0 (hard-concrete) regularization to control sparsity
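As a rough illustration of how hard-concrete gates make L0 regularization trainable, the sketch below follows the general hard-concrete formulation (Louizos et al.) with its commonly used constants. This is an assumption made for the example: the paper's exact parameterization and sparsity objective may differ, and all function names are hypothetical.

```python
import math
import random

# Commonly used hard-concrete constants (assumed here, not taken from the paper).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Sample a stochastic gate in [0, 1] for training."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)   # avoid log(0)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))  # stretch, then clamp


def deterministic_gate(log_alpha):
    """Gate value used once training is done (no noise)."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def prob_nonzero(log_alpha):
    """Probability that the gate is non-zero; summing these gives the expected L0 norm."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_sparsity(log_alphas):
    """Expected fraction of gates that are exactly zero."""
    return 1.0 - sum(prob_nonzero(a) for a in log_alphas) / len(log_alphas)


rng = random.Random(0)
log_alphas = [3.0, 0.5, -2.0, -6.0]   # one learnable parameter per prunable unit
print([round(sample_gate(a, rng), 3) for a in log_alphas])
print([deterministic_gate(a) for a in log_alphas])
print(round(expected_sparsity(log_alphas), 3))
```

During training, the expected sparsity can be compared against the target sparsity and the gap penalized, so the `log_alpha` values drift toward a mask that meets the size budget.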
Inference Optimization
- smaller model variants (5.4B/5.0B/4.5B) for lower memory and compute
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations limited to LLaMA-7B/13B; behavior on other architectures is untested.
- Requires instruction-tuning style data; generic calibration data performed worse.
- Needs modest training resources (authors used 4 V100 GPUs).
- Collaborative prompt was manually designed with GPT-4 help; prompt design may need tuning per model/task.
When Not To Use
- If you need ultra-fast one-shot pruning with zero training budget.
- When you lack instruction-tuning style data or cannot run any adapter training.
- If you must guarantee unchanged behavior on very different downstream tasks not covered by instruction data.
Failure Modes
- Higher sparsity can cause larger drops in few-shot tasks (MMLU declines more than zero-shot commonsense).
- Fine-tuning after pruning sometimes harms specific benchmarks (e.g., BBH at 4.5B).
- Layer-wise mask learning could prune critical units if pruning targets are extreme.
Core Entities
Models
- LLaMA-7B
- LLaMA-13B
- Compresso (pruned variants 5.4B, 5.0B, 4.5B)
- LLM-Pruner
- SparseGPT
- Wanda
Metrics
- Accuracy
- perplexity
- remaining parameters (model size)
- relative % change vs baseline
Datasets
- GPT4-Alpaca (instruction tuning, 52K)
- C4 subset
- LLM-QAT
Benchmarks
- StoryCloze
- PIQA
- HellaSwag
- WinoGrande
- ARC-e
- ARC-c
- OpenBookQA
- BoolQ
- RACE-High
- MMLU (5-shot)
- BBH (3-shot)

