Overview
The method is simple and reproducible: BI is easy to compute and removal is one-shot. Evidence comes from multiple models and benchmarks, but heavy drops on generative tasks and per-model variability mean production use needs per-task validation.
Citations: 15
Evidence Strength: 0.70
Confidence: 0.82
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.
Summary TLDR
The paper finds substantial redundancy across LLM layers and offers a very simple pruning recipe: measure each layer's Block Influence (BI), then remove the lowest-BI layers. Across many open models and benchmarks, ShortGPT removes ~25% of layers/parameters while retaining roughly 85–92% of original performance on classification-style benchmarks. The method is model-agnostic (it also applies to some non-transformer architectures), is compatible with quantization, and the lost performance can be partly recovered with lightweight post-training.
Problem Statement
Large language models are expensive to run. The authors argue that many layers change hidden states very little and can be removed safely. They ask: can we rank and delete redundant layers to shrink models and speed up inference without heavy retraining?
Main Contribution
Block Influence (BI): a simple metric that measures how much a layer changes hidden states.
ShortGPT: rank layers by BI and remove the lowest-impact layers as a one-shot structured prune.
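As a rough illustration of the BI idea, the sketch below scores a layer as one minus the average cosine similarity between its input and output hidden states, so a layer that barely changes the direction of its hidden states scores near 0. This formulation, the function names, and the toy vectors are assumptions made for illustration; the summary above only states that BI measures how much a layer changes hidden states.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def block_influence(hidden_in, hidden_out):
    """Assumed BI formulation: 1 - mean cosine similarity over tokens.

    hidden_in / hidden_out are lists of per-token vectors taken at the
    input and output of one layer (hypothetical data layout).
    """
    sims = [cosine_similarity(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)

# Toy hidden states for two tokens (made-up numbers, not from the paper).
layer_in = [[1.0, 0.0], [0.0, 2.0]]
rescaled_only = [[2.0, 0.0], [0.0, 1.0]]  # same directions, different norms
rotated = [[0.0, 1.0], [3.0, 0.0]]        # orthogonal to the inputs

bi_low = block_influence(layer_in, rescaled_only)   # layer changes little
bi_high = block_influence(layer_in, rotated)        # layer changes a lot
print(bi_low, bi_high)
```

In a real model the hidden states would come from a forward pass over a calibration set, and the per-token similarities would be averaged across all calibration samples.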
Key Findings
Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.
ShortGPT (BI-based layer removal) preserves more benchmark performance than other structured pruning baselines at similar prune ratios.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MMLU | 55.0 -> 52.2 | 55.0 | -2.8 | LLaMA2-13B, 25% layers removed (10/40) | Intro example | Introduction |
| Average performance retention | ~86.3% | 100% | -13.7 pp | ShortGPT vs Dense at ~27% pruning (Table 2) | ShortGPT avg retention ≈86.31% at 27.1% prune | Table 2 |
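The Delta column can be reproduced from the reported values with simple arithmetic. The snippet below is a small worked check; the `retention` helper is a hypothetical name, and the single-benchmark retention it computes for MMLU is derived here from the two reported scores rather than quoted from the paper.

```python
def retention(pruned_score, baseline_score):
    """Pruned score as a percentage of the dense baseline."""
    return 100.0 * pruned_score / baseline_score

# Row 1: MMLU on LLaMA2-13B with 10 of 40 layers removed.
mmlu_delta = round(52.2 - 55.0, 1)
mmlu_retention = round(retention(52.2, 55.0), 1)

# Row 2: average retention is already a percentage of the dense model,
# so the delta is expressed in percentage points.
avg_delta_pp = round(86.3 - 100.0, 1)

print(mmlu_delta, mmlu_retention, avg_delta_pp)
```

The MMLU retention for this one model is higher than the cross-benchmark average in the second row, which is consistent with the per-model and per-task variability noted in the overview.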
What To Try In 7 Days
Compute BI scores on a small unlabeled calibration set (PG19) to rank layers.
Remove a few layers (start with ~10–25%) and measure perplexity and key task scores.
Combine pruning with 4-bit quantization on a dev clone and compare orders (quant→prune vs prune→quant).
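The rank-and-remove step in the list above can be sketched as follows. This is a minimal sketch under stated assumptions: the BI scores are made-up numbers, `select_layers_to_remove` and `prune_layers` are hypothetical helper names, and layers are represented by plain list entries rather than real model modules.

```python
def select_layers_to_remove(bi_scores, prune_ratio):
    """Return the indices of the lowest-BI layers, in ascending order."""
    n_remove = int(len(bi_scores) * prune_ratio)
    ranked = sorted(range(len(bi_scores)), key=lambda i: bi_scores[i])
    return sorted(ranked[:n_remove])

def prune_layers(layers, remove_idx):
    """Drop the selected layers in one shot, preserving the original order."""
    drop = set(remove_idx)
    return [layer for i, layer in enumerate(layers) if i not in drop]

# Hypothetical BI scores for an 8-layer model (not measured values).
bi_scores = [0.90, 0.40, 0.05, 0.30, 0.02, 0.10, 0.08, 0.60]

to_remove = select_layers_to_remove(bi_scores, prune_ratio=0.25)
kept = prune_layers(list(range(len(bi_scores))), to_remove)
print(to_remove, kept)
```

After pruning, perplexity and task scores should be re-measured on the shortened model, since the ranking alone does not guarantee that quality is preserved.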
Risks & Boundaries
Limitations
Generative tasks (summarization) can collapse after aggressive layer removal on smaller models.
The last layer's FFN is critical; deleting final components can severely hurt perplexity.
When Not To Use
Generative-heavy applications without retraining
Small models where depth redundancy is weaker
Failure Modes
Accumulated errors in autoregressive generation leading to near-zero quality on summarization after heavy pruning
Order-dependent degradation when combining with quantization

