Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
15
Why It Matters For Business
ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.
Summary TLDR
The paper finds substantial redundancy across LLM layers and offers a very simple pruning recipe: measure each layer's Block Influence (BI), then remove the lowest-BI layers. Across many open models and benchmarks, ShortGPT removes ~25% of layers/parameters while retaining roughly 85–92% of original performance on classification-style benchmarks. The method is not tied to transformers (it also applies to some non-transformer architectures), works alongside quantization, and the quality it loses can be partly recovered with lightweight post-training.
Problem Statement
LLMs are expensive to run. The authors argue that many layers change hidden states very little and can be removed safely. They ask: can we rank and delete redundant layers to shrink models and speed up inference without heavy retraining?
Main Contribution
Block Influence (BI): a simple metric that measures how much a layer changes hidden states.
ShortGPT: rank layers by BI and remove the lowest-impact layers as a one-shot structured prune.
Empirical evidence that depth redundancy is common (transformers and some non-transformers), that ShortGPT beats several structured baselines at ~25% pruning, and that pruning combines with quantization and light post-training.
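The BI-then-rank idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes BI is one minus the average cosine similarity between a layer's input and output hidden states (the summary above only says BI measures how much a layer changes hidden states, so the exact formula is an assumption). The function names and the toy hidden states are hypothetical.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def block_influence(h_in, h_out):
    """Assumed BI: 1 - mean cosine similarity over tokens.
    h_in / h_out: per-token hidden states before / after one layer."""
    sims = [cosine(u, v) for u, v in zip(h_in, h_out)]
    return 1.0 - sum(sims) / len(sims)

def rank_layers_by_bi(hidden_states):
    """hidden_states[i] is the input to layer i, hidden_states[i + 1] its output.
    Returns (scores, order), with order listing layer indices from lowest BI."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return scores, order

# Toy data (made up): 3 layers, 2 tokens, hidden size 2.
states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[1.0, 0.1], [0.1, 1.0]],  # after layer 0: small change
    [[0.0, 1.0], [1.0, 0.0]],  # after layer 1: large change
    [[0.0, 1.0], [1.0, 0.0]],  # after layer 2: no change
]
scores, order = rank_layers_by_bi(states)
print(order)  # layers listed from least to most influential
```

In a real setting the hidden states would come from running the model on a calibration corpus and collecting per-layer activations; the toy lists above only stand in for that step.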
Key Findings
Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.
ShortGPT (BI-based layer removal) preserves more benchmark performance than other structured pruning baselines at similar prune ratios.
BI correlates with layer importance: lower BI layers cause smaller perplexity/metric drops when removed.
ShortGPT works beyond transformers: Mamba (non-transformer) retained ~90% of its average benchmark performance at 25% layer removal, while RWKV was less robust.
Layer removal is orthogonal to quantization, so the two can be stacked, but their performance losses compound.
Generative tasks (summarization) can collapse after heavy layer removal on smaller models.
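For reference, the LLaMA2-13B MMLU numbers above can be turned into a prune ratio and a retention percentage. The helper name is hypothetical and the only inputs are the figures quoted in this list; note that a single-benchmark retention need not match the cross-benchmark average reported in the summary.

```python
def retention_pct(pruned_score, original_score):
    """Share of the original score kept after pruning, in percent."""
    return 100.0 * pruned_score / original_score

layers_removed, total_layers = 10, 40
prune_ratio = layers_removed / total_layers   # fraction of layers removed
mmlu_retention = retention_pct(52.2, 55.0)    # MMLU after vs. before pruning

print(f"prune ratio: {prune_ratio:.0%}")
print(f"MMLU retention: {mmlu_retention:.1f}%")
```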
Results
MMLU
Average performance retention
Average performance retention (LaCo baseline)
Throughput (tokens/s)
MMLU (quantization and ordering)
Who Should Care
What To Try In 7 Days
Run BI on a small unlabeled calibration set (PG19) to rank layers.
Remove a few layers (start with ~10–25%) and measure perplexity and key task scores.
Combine pruning with 4-bit quantization on a dev clone and compare orders (quant→prune vs prune→quant).
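The first two steps can be prototyped without touching a real model. The sketch below picks the lowest-BI layers for a given prune ratio and drops them from an ordered list of blocks. All names are hypothetical, the BI scores are made up, and the `protected` argument is an added safeguard (not something this summary attributes to the paper) for keeping chosen layers such as the final one.

```python
def select_layers_to_remove(bi_scores, prune_ratio, protected=()):
    """Return sorted indices of the lowest-BI layers, skipping protected ones."""
    n_remove = int(len(bi_scores) * prune_ratio)
    keep_safe = set(protected)
    candidates = [i for i in range(len(bi_scores)) if i not in keep_safe]
    candidates.sort(key=lambda i: bi_scores[i])
    return sorted(candidates[:n_remove])

def prune(layers, remove_idx):
    """Drop the selected layers while preserving the order of the rest."""
    drop = set(remove_idx)
    return [layer for i, layer in enumerate(layers) if i not in drop]

# Made-up BI scores for an 8-layer toy model.
bi = [0.40, 0.22, 0.05, 0.03, 0.04, 0.06, 0.09, 0.30]
layers = [f"block_{i}" for i in range(len(bi))]

to_remove = select_layers_to_remove(bi, prune_ratio=0.25, protected=(len(bi) - 1,))
kept = prune(layers, to_remove)
print(to_remove, kept)
```

With a real model, `layers` would be the framework's ordered container of decoder blocks, and perplexity plus task scores would be re-measured on the pruned copy before going further.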
Optimization Features
Token Efficiency
- Throughput gains (e.g., ~1.16x at 25% prune)
Model Optimization
- Pruning
- Layer removal (structured)
System Optimization
- Works on heterogeneous GPU clusters used in experiments
Training Optimization
- Accuracy
Inference Optimization
- Reduced depth -> faster inference
- Compatible with quantization
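As a rough sanity check on the throughput figure above, one can compare the reported ~1.16x gain with a depth-proportional upper bound. This back-of-envelope sketch assumes per-token latency scales linearly with the number of layers and ignores embeddings, the output head, and memory or batching effects, so the bound is optimistic; the function name is hypothetical.

```python
def ideal_speedup(prune_ratio):
    """Upper bound on speedup if latency were exactly proportional to depth."""
    return 1.0 / (1.0 - prune_ratio)

reported_speedup = 1.16             # figure quoted above for a 25% prune
upper_bound = ideal_speedup(0.25)   # depth-proportional bound
fraction_realized = (reported_speedup - 1.0) / (upper_bound - 1.0)

print(f"upper bound: {upper_bound:.2f}x")
print(f"share of ideal gain realized: {fraction_realized:.0%}")
```

The gap between the two numbers suggests that fixed costs outside the removed layers matter, so measured throughput on the target hardware is the number to trust.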
Reproducibility
Data URLs
- PG19
- MMLU
- CMMLU
- HellaSwag
- PIQA
- CHID
- WSC
- CoQA
- BoolQ
- RACE
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Generative tasks (summarization) can collapse after aggressive layer removal on smaller models.
- Last layer FFN is critical; deleting final components can heavily hurt perplexity.
- Effect depends on model size and architecture; RWKV showed weaker redundancy.
- Post-training is needed to recover some lost quality for sensitive tasks.
When Not To Use
- Generative-heavy applications without retraining
- Small models where depth redundancy is weaker
- When exact probability outputs / calibrated logits are required
Failure Modes
- Accumulated errors in autoregressive generation leading to near-zero quality on summarization after heavy pruning
- Order-dependent degradation when combining with quantization
- Over-pruning important shallow or final-layer components if BI is computed on an unrepresentative calibration set
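One inexpensive guard against the calibration-set failure mode is to compute BI on two different calibration sets and check whether they nominate the same layers for removal. The sketch below uses Jaccard overlap of the lowest-k sets; the scores are made up, the helper names are hypothetical, and any acceptance threshold is a judgment call rather than something taken from the paper.

```python
def lowest_k(scores, k):
    """Indices of the k layers with the lowest BI."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[:k])

def jaccard(a, b):
    """Overlap between two index sets: 1.0 means identical, 0.0 disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Made-up BI scores for the same 8-layer model on two calibration sets.
bi_calib_a = [0.40, 0.22, 0.05, 0.03, 0.04, 0.06, 0.09, 0.30]
bi_calib_b = [0.38, 0.25, 0.04, 0.07, 0.03, 0.05, 0.10, 0.28]

k = 2
picked_a = lowest_k(bi_calib_a, k)
picked_b = lowest_k(bi_calib_b, k)
overlap = jaccard(picked_a, picked_b)
print(picked_a, picked_b, round(overlap, 2))
```

A low overlap signals that the ranking is sensitive to the calibration data, in which case the removal list deserves a closer look before pruning.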
Core Entities
Models
- LLaMA2-7B
- LLaMA2-13B
- Baichuan2-7B
- Baichuan2-13B
- RWKV-7B
- Mamba-2.8B
Metrics
- Block Influence (BI)
- Perplexity
- MMLU score
- Throughput (tokens/s)
- Performance retention (%)
Datasets
- PG19
- MMLU
- CMMLU
- HellaSwag
- PIQA
- CHID
- WSC
- CoQA
- BoolQ
- RACE
- XSum
- C3
Benchmarks
- MMLU
- CMMLU
- HellaSwag
- PIQA
- BoolQ
- CoQA
- XSum
- C3
- RACE

