ShortGPT: remove low-impact layers to cut ~25% size while keeping ≈90% of performance

March 6, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

15

Authors

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen

Links

Abstract / PDF

Why It Matters For Business

ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.

Summary TLDR

The paper finds substantial redundancy across LLM layers and offers a very simple pruning recipe: measure each layer's Block Influence (BI), then remove the lowest-BI layers. Across many open models and benchmarks, ShortGPT removes ~25% of layers/parameters while retaining roughly 85–92% of original performance on classification-style benchmarks. The method is not tied to transformers (it also applies to some non-transformer architectures) and combines with quantization, and part of the lost performance can be recovered with lightweight post-training.

Problem Statement

Large LLMs are expensive to run. The authors argue many layers change hidden states very little and can be removed safely. They ask: can we rank and delete redundant layers to shrink models and speed inference without heavy retraining?

Main Contribution

Block Influence (BI): a simple metric that measures how much a layer changes hidden states.

ShortGPT: rank layers by BI and remove the lowest-impact layers as a one-shot structured prune.

Empirical evidence that depth redundancy is common (transformers and some non-transformers), that ShortGPT beats several structured baselines at ~25% pruning, and that pruning combines with quantization and light post-training.
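A minimal sketch of how BI scoring and ranking might look, assuming BI is computed as one minus the average cosine similarity between each layer's input and output hidden states over a calibration set. The function names and the toy hidden states are illustrative and not taken from the authors' code; in practice the hidden states would come from a forward pass of the real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def block_influence(hidden_states):
    """hidden_states[i][t] is the vector entering layer i at token t;
    hidden_states[-1] is the output of the last layer.
    Returns one BI score per layer (assumed: 1 - mean cosine similarity)."""
    scores = []
    for layer_in, layer_out in zip(hidden_states, hidden_states[1:]):
        sims = [cosine(x, y) for x, y in zip(layer_in, layer_out)]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores

def layers_to_remove(scores, n):
    """Indices of the n layers with the lowest BI, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n])

# Toy data: 3 layers, 2 tokens, 2-dimensional hidden states (made up).
toy = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 0 rotates each vector by 90 degrees
    [[0.0, 2.0], [2.0, 0.0]],  # layer 1 only rescales, direction unchanged
    [[1.0, 1.0], [1.0, 1.0]],  # layer 2 rotates each vector by 45 degrees
]
bi = block_influence(toy)
candidates = layers_to_remove(bi, 1)
```

In this toy case the layer that only rescales its input gets the lowest score, so it is the first removal candidate. Cosine similarity ignores magnitude, which is a property of this assumed formulation worth keeping in mind when interpreting scores.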

Key Findings

Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.

Numbers: MMLU 55.0 -> 52.2 (25% of layers removed)
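To relate this single result to the retention percentages quoted elsewhere, here is a small sketch that assumes retention is simply the pruned score divided by the baseline score. The helper name is illustrative.

```python
def retention(pruned, baseline):
    """Performance retention in percent (assumed definition: pruned / baseline)."""
    return 100.0 * pruned / baseline

# MMLU for LLaMA2-13B before and after removing 10 of 40 layers.
mmlu_retention = retention(52.2, 55.0)
print(f"MMLU retention: {mmlu_retention:.1f}%")
```

On MMLU alone this gives a higher figure than the multi-benchmark averages, which also include tasks that degrade more under pruning.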

ShortGPT (BI-based layer removal) preserves more benchmark performance than other structured pruning baselines at similar prune ratios.

Numbers: average performance retention ≈86.3% (ShortGPT) vs 80.4% (LaCo) at ~27% pruning

BI correlates with layer importance: lower BI layers cause smaller perplexity/metric drops when removed.

ShortGPT works beyond transformers: Mamba (non-transformer) kept ~90% average at 25% layer removal, while RWKV was less robust.

Numbers: Mamba average retention 90.42% at 25% prune (Table 3)

Layer removal is orthogonal to quantization, so the two can be stacked, but their performance losses compound when combined.

Numbers: MMLU 45.4 (baseline); 44.9 (4-bit quantization); 44.0 (layer removal); 42.4 (quantization, then layer removal) (Table 5)
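A short arithmetic sketch of what "compounds" means for these four MMLU figures. It assumes that 44.9 and 44.0 are the results of applying quantization alone and layer removal alone, which is our reading of the numbers rather than something stated explicitly here.

```python
baseline = 45.4
quant_only = 44.9        # assumed: 4-bit quantization alone
prune_only = 44.0        # assumed: layer removal alone
quant_then_prune = 42.4  # both techniques combined

drop_quant = round(baseline - quant_only, 1)
drop_prune = round(baseline - prune_only, 1)
drop_both = round(baseline - quant_then_prune, 1)

# If the two losses were independent and simply added up:
additive_estimate = round(drop_quant + drop_prune, 1)

print(f"quant: -{drop_quant}, prune: -{drop_prune}, "
      f"combined: -{drop_both}, additive estimate: -{additive_estimate}")
```

Under that reading, the combined drop is larger than the sum of the two individual drops, so it seems prudent to evaluate the combined configuration directly instead of extrapolating from each technique in isolation.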

Generative tasks (summarization) can collapse after heavy layer removal on smaller models.

Numbers: XSum/C3 scores drop to near zero for some 7B models at 25% removal (Section 5, Table 2)

Results

MMLU

Value: 55.0 -> 52.2

Baseline: 55.0

Average performance retention

Value: ~86.3%

Baseline: 100%

Average performance retention (LaCo baseline)

Value: 80.39%

Baseline: 100%

Throughput (tokens/s)

Value: 4331 -> 5045

Baseline: 4331.23
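The throughput figures translate into a speedup factor as follows. This is a plain arithmetic sketch; the pruned value is used as reported here (rounded to 5045), so the last digits of the ratio are approximate.

```python
baseline_tps = 4331.23  # tokens/s, dense model
pruned_tps = 5045.0     # tokens/s after layer removal (rounded in the source)

speedup = pruned_tps / baseline_tps
gain_pct = 100.0 * (speedup - 1.0)
print(f"speedup: {speedup:.2f}x ({gain_pct:.0f}% more tokens/s)")
```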

MMLU (quantization and ordering)

Value: 45.4 -> 44.9 -> 42.4

Baseline: 45.4 (dense)

Who Should Care

What To Try In 7 Days

Compute BI on a small unlabeled calibration set (e.g., PG19) to rank layers.

Remove a few layers (start with ~10–25%) and measure perplexity and key task scores.

Combine pruning with 4-bit quantization on a dev clone and compare orders (quant→prune vs prune→quant).
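The first two steps above can be sketched as a small sweep over prune ratios. This is a hedged illustration: `keep_indices`, `sweep`, and the `evaluate` callback are hypothetical names, the BI scores are made up, and a real run would replace `evaluate` with a function that rebuilds the model from the kept layers and measures perplexity or task scores.

```python
def keep_indices(bi_scores, ratio):
    """Layer indices to keep after dropping the lowest-BI fraction `ratio`."""
    n_drop = round(len(bi_scores) * ratio)
    by_bi = sorted(range(len(bi_scores)), key=lambda i: bi_scores[i])
    dropped = set(by_bi[:n_drop])
    return [i for i in range(len(bi_scores)) if i not in dropped]

def sweep(bi_scores, ratios, evaluate):
    """Run a hypothetical evaluate(kept_indices) callback for each prune ratio."""
    return {r: evaluate(keep_indices(bi_scores, r)) for r in ratios}

# Made-up BI scores for an 8-layer toy model.
scores = [0.9, 0.5, 0.3, 0.1, 0.05, 0.08, 0.2, 0.7]
kept = keep_indices(scores, 0.25)

# Placeholder evaluation: just report how many layers remain.
results = sweep(scores, [0.125, 0.25], evaluate=len)
```

Starting at the low end of the ratio range and stopping when the metric of interest degrades past an acceptable threshold keeps the experiment cheap. The quantization comparison in the third step would need a real model and quantization library, so it is left out of this sketch.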

Optimization Features

Token Efficiency

  • Throughput gains (e.g., ~1.16x at 25% prune)

Model Optimization

  • Pruning
  • Layer removal (structured)

System Optimization

  • Works on heterogeneous GPU clusters used in experiments

Training Optimization

  • Accuracy

Inference Optimization

  • Reduced depth -> faster inference
  • Compatible with quantization

Reproducibility

Data Urls

  • PG19
  • MMLU
  • CMMLU
  • HellaSwag
  • PIQA
  • CHID
  • WSC
  • CoQA
  • BoolQ
  • RACE

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Generative tasks (summarization) can collapse after aggressive layer removal on smaller models.
  • The last layer's FFN is critical; deleting final-layer components can sharply degrade perplexity.
  • Effect depends on model size and architecture—RWKV showed weaker redundancy.
  • Post-training is needed to recover some lost quality for sensitive tasks.

When Not To Use

  • Generative-heavy applications without retraining
  • Small models where depth redundancy is weaker
  • When exact probability outputs / calibrated logits are required

Failure Modes

  • Accumulated errors in autoregressive generation leading to near-zero quality on summarization after heavy pruning
  • Order-dependent degradation when combining with quantization
  • Over-pruning important shallow or final-layer components if BI is computed on an unrepresentative calibration set

Core Entities

Models

  • LLaMA2-7B
  • LLaMA2-13B
  • Baichuan2-7B
  • Baichuan2-13B
  • RWKV-7B
  • Mamba-2.8B

Metrics

  • Block Influence (BI)
  • Perplexity
  • MMLU score
  • Throughput (tokens/s)
  • Performance retention (%)

Datasets

  • PG19
  • MMLU
  • CMMLU
  • HellaSwag
  • PIQA
  • CHID
  • WSC
  • CoQA
  • BoolQ
  • RACE
  • XSum
  • C3

Benchmarks

  • MMLU
  • CMMLU
  • HellaSwag
  • PIQA
  • BoolQ
  • CoQA
  • XSum
  • C3
  • RACE