ShortGPT: remove low-impact layers to cut ~25% size while keeping ≈90% of performance

March 6, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is simple and reproducible: BI is easy to compute and removal is one-shot. Evidence comes from multiple models and benchmarks, but heavy drops on generative tasks and per-model variability mean production use needs per-task validation.

Citations: 15

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen

Links

Abstract / PDF / Data

Why It Matters For Business

ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.

Who Should Care

Summary TLDR

The paper finds substantial redundancy across LLM layers and offers a very simple pruning recipe: measure each layer's Block Influence (BI), then remove the lowest-BI layers. Across many open models and benchmarks, ShortGPT removes ~25% of layers/parameters while retaining roughly 85–92% of original performance on classification-style benchmarks. The method is model-agnostic (it also applies to some non-transformer architectures) and works with quantization, and part of the lost performance can be recovered with lightweight post-training.

Problem Statement

LLMs are expensive to run. The authors argue that many layers change hidden states very little and can be removed safely. They ask: can we rank and delete redundant layers to shrink models and speed up inference without heavy retraining?

Main Contribution

Block Influence (BI): a simple metric that measures how much a layer changes hidden states.

ShortGPT: rank layers by BI and remove the lowest-impact layers as a one-shot structured prune.
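The summary above only says that BI measures how much a layer changes hidden states. A minimal sketch, assuming BI takes the cosine-similarity form (one minus the average cosine similarity between a layer's input and output hidden states, taken over token positions), could look like the following. The function names and toy vectors are illustrative and are not taken from the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def block_influence(hidden_in, hidden_out):
    """Assumed BI for one layer: 1 - mean cosine similarity over tokens.

    hidden_in / hidden_out are lists of per-token vectors: the layer's
    input and output hidden states on some calibration text.
    """
    sims = [cosine(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)

def rank_layers(hidden_states):
    """hidden_states[i] is the input to layer i; the last entry is the
    final output. Returns (layer indices from lowest to highest BI, scores)."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order, scores

# Toy data (made up): 3 layers, 2 tokens, 2-dimensional hidden states.
toy = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 0 rotates every vector by 90 degrees
    [[0.0, 2.0], [2.0, 0.0]],  # layer 1 only rescales, direction unchanged
    [[1.0, 1.0], [1.0, 1.0]],  # layer 2 rotates every vector by 45 degrees
]
order, scores = rank_layers(toy)
print(order)  # layers listed from most to least redundant under this metric
```

In this toy stack the rescaling-only layer gets the lowest score, so it would be the first removal candidate. On a real model the hidden states would come from a forward pass over the calibration set rather than hand-written vectors.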

Key Findings

Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.

Numbers: MMLU 55.0 -> 52.2 (25% of layers removed)

Practical Use: You can cut ~25% of layers with a modest drop on knowledge benchmarks; try this first when faster inference is needed.

Evidence Ref: Introduction (example) / Table 2

ShortGPT (BI-based layer removal) preserves more benchmark performance than other structured pruning baselines at similar prune ratios.

Numbers: Average performance retention ≈86.3% (ShortGPT) vs 80.4% (LaCo) at ~27% pruning

Practical Use: For coarse pruning (~20–30% reduction), prefer BI-based layer removal over width-reduction or PCA-style methods.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MMLU | 55.0 -> 52.2 | 55.0 | -2.8 | LLaMA2-13B, 25% layers removed (10/40) | Intro example | Introduction |
| Average performance retention | ~86.3% | 100% | -13.7 pp | ShortGPT vs Dense at ~27% pruning (Table 2) | ShortGPT avg retention ≈86.31% at 27.1% prune | Table 2 |
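The two rows report different quantities: an absolute score change on one benchmark and an average retention across benchmarks. A small arithmetic sketch, using only the numbers quoted above, makes the MMLU row comparable by expressing it as retention too. The helper name `summarize` is illustrative.

```python
def summarize(baseline, pruned):
    """Return (absolute delta, retention in percent), rounded to 1 decimal."""
    delta = pruned - baseline
    retention = 100.0 * pruned / baseline
    return round(delta, 1), round(retention, 1)

# LLaMA2-13B MMLU example from the table: 55.0 before, 52.2 after pruning.
delta, retention = summarize(55.0, 52.2)

# Share of layers removed in that example: 10 out of 40.
layers_removed_pct = 100.0 * 10 / 40

print(delta, retention, layers_removed_pct)
```

On MMLU alone the pruned LLaMA2-13B keeps about 94.9% of its score, which is higher than the ~86.3% average retention across all benchmarks, so the single-benchmark example should not be read as the typical outcome.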

What To Try In 7 Days

Run BI on a small unlabeled calibration set (PG19) to rank layers.

Remove a few layers (start with ~10–25%) and measure perplexity and key task scores.

Combine pruning with 4-bit quantization on a dev clone and compare orders (quant→prune vs prune→quant).
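The first two steps can be sketched as a small sweep harness. This is a sketch under stated assumptions, not the authors' procedure: the BI scores are made up, layers are represented by placeholder strings, and `evaluate` is a hypothetical callback where a real run would plug in perplexity or task scoring. The quantization-order comparison is not covered here.

```python
def lowest_bi_layers(bi_scores, ratio):
    """Indices of the lowest-BI layers that make up `ratio` of the stack."""
    n_remove = round(len(bi_scores) * ratio)
    ranked = sorted(range(len(bi_scores)), key=lambda i: bi_scores[i])
    return sorted(ranked[:n_remove])

def prune(layers, remove_idx):
    """One-shot removal: keep every layer whose index is not in remove_idx."""
    drop = set(remove_idx)
    return [layer for i, layer in enumerate(layers) if i not in drop]

def sweep(layers, bi_scores, ratios, evaluate):
    """For each prune ratio, record (layers kept, evaluation result)."""
    results = {}
    for ratio in ratios:
        kept = prune(layers, lowest_bi_layers(bi_scores, ratio))
        results[ratio] = (len(kept), evaluate(kept))
    return results

# Hypothetical stand-ins for a 40-layer model.
layers = [f"layer_{i}" for i in range(40)]
# Made-up BI scores with a minimum around layer 30 (not measured values).
bi_scores = [abs(i - 30) + 1 for i in range(40)]

# `evaluate` is a placeholder; here it just reports which layers remain.
report = sweep(layers, bi_scores, [0.10, 0.25], evaluate=lambda kept: kept[-1])
for ratio, (n_kept, last_layer) in report.items():
    print(ratio, n_kept, last_layer)
```

Keeping the ranking fixed and only varying the ratio makes the runs directly comparable: each larger ratio removes a superset of the layers removed at the smaller one.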

Optimization Features

Token Efficiency
Throughput gains (e.g., ~1.16x at 25% prune)
Model Optimization
Pruning: Layer removal (structured)
System Optimization
Works on heterogeneous GPU clusters used in experiments
Training Optimization
Accuracy
Inference Optimization
Reduced depth -> faster inference
Compatible with quantization

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

PG19, MMLU, CMMLU, HellaSwag, PIQA, CHID, WSC, CoQA, BoolQ, RACE

Risks & Boundaries

Limitations

Generative tasks (summarization) can collapse after aggressive layer removal on smaller models.

The last layer's FFN is critical; deleting final components can heavily hurt perplexity.

When Not To Use

Generative-heavy applications without retraining

Small models where depth redundancy is weaker

Failure Modes

Accumulated errors in autoregressive generation leading to near-zero quality on summarization after heavy pruning

Order-dependent degradation when combining with quantization
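One way to see why accumulated errors hit generation harder than classification is a toy model. It is an illustrative assumption, not an analysis from the paper: suppose each generated token is acceptable with probability p, independently of the others, and an output of T tokens is only usable if every token is acceptable.

```latex
P(\text{usable output}) = p^{T}, \qquad
0.99^{100} \approx 0.366, \qquad
0.95^{100} \approx 0.006
```

Under this simplification, a small drop in per-token quality becomes a near-total failure over a 100-token summary, while a classification-style benchmark that scores a single choice is affected only once.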

Core Entities

Models

LLaMA2-7B, LLaMA2-13B, Baichuan2-7B, Baichuan2-13B, RWKV-7B, Mamba-2.8B

Metrics

Block Influence (BI), Perplexity, MMLU score, Throughput (tokens/s), Performance retention (%)

Datasets

PG19, MMLU, CMMLU, HellaSwag, PIQA, CHID, WSC, CoQA, BoolQ, RACE, XSum, C3

Benchmarks

MMLU, CMMLU, HellaSwag, PIQA, BoolQ, CoQA, XSum, C3, RACE