ShortGPT: remove low-impact layers to cut ~25% size while keeping ≈90% of performance

Overview

Decision Snapshot: Needs Validation

The method is simple and reproducible: Block Influence (BI) is easy to compute and layer removal is one-shot. Evidence comes from multiple models and benchmarks, but heavy drops on generative tasks and per-model variability mean production use needs per-task validation.

Citations: 15

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen

Links

Abstract / PDF / Data

Why It Matters For Business

ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Founder, Data Scientist

Summary TLDR

The paper finds substantial redundancy across LLM layers and offers a very simple pruning recipe: measure each layer's Block Influence (BI), then remove the lowest-BI layers. Across many open models and benchmarks, ShortGPT removes ~25% of layers/parameters while retaining roughly 85–92% of original performance on classification-style benchmarks. The method is model-agnostic (it also applies to some non-transformer models), works with quantization, and the lost performance can be partly recovered with lightweight post-training.

Problem Statement

Large language models are expensive to run. The authors argue that many layers change hidden states very little and can be removed safely. They ask: can we rank and delete redundant layers to shrink models and speed up inference without heavy retraining?

Main Contribution

Block Influence (BI): a simple metric that measures how much a layer changes hidden states.

ShortGPT: rank layers by BI and remove the lowest-impact layers as a one-shot structured prune.
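The two contributions above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes BI is computed as one minus the average cosine similarity between each token's hidden state entering a layer and leaving it, which is one natural reading of "how much a layer changes hidden states". The function names and the toy hidden states are hypothetical.

```python
import math

def cosine_similarity(u, v):
    # Assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def block_influence(hidden_in, hidden_out):
    # Assumed definition: 1 - mean per-token cosine similarity between
    # the hidden states entering and leaving one layer.
    sims = [cosine_similarity(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)

def layers_to_remove(hidden_states, n_remove):
    # hidden_states[i] holds the token vectors entering layer i;
    # hidden_states[-1] holds the output of the last layer.
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_remove])

# Toy data (hypothetical): 3 layers, 2 tokens, 2-dimensional hidden states.
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0 changes nothing -> BI = 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 1 rotates by 90 degrees -> BI = 1
    [[1.0, 1.0], [1.0, 1.0]],  # layer 2 rotates by 45 degrees -> BI ~ 0.29
]

print(layers_to_remove(toy_states, 1))  # [0]
print(layers_to_remove(toy_states, 2))  # [0, 2]
```

In a real model the hidden states would come from a forward pass over a calibration set, and the returned indices would be the layers dropped in the one-shot prune.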

Key Findings

Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.

Numbers: MMLU 55.0 -> 52.2 (25% layers removed)

Practical Use: You can cut ~25% of layers with a modest drop on knowledge benchmarks; try this first when faster inference is needed.

Evidence Ref: Introduction (example) / Table 2

ShortGPT (BI-based layer removal) preserves more benchmark performance than other structured pruning baselines at similar prune ratios.

Numbers: Average performance retention ≈86.3% (ShortGPT) vs 80.4% (LaCo) at ~27% pruning

Practical Use: For coarse pruning (~20–30% reduction), prefer BI-based layer removal over width-reduction or PCA-style methods.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MMLU	55.0 -> 52.2	55.0	-2.8	LLaMA2-13B, 25% layers removed (10/40)	Intro example	Introduction
Average performance retention	~86.3%	100%	-13.7 pp	ShortGPT vs Dense at ~27% pruning (Table 2)	ShortGPT avg retention ≈86.31% at 27.1% prune	Table 2
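The two rows report different quantities: a point delta on a single benchmark and an average retention across benchmarks. A small sketch of the arithmetic, assuming retention means the pruned score divided by the dense score (the helper names are hypothetical):

```python
def delta_points(pruned_score, dense_score):
    # Absolute change in benchmark points.
    return pruned_score - dense_score

def retention_pct(pruned_score, dense_score):
    # Assumed definition: pruned score as a percentage of the dense score.
    return 100.0 * pruned_score / dense_score

# LLaMA2-13B on MMLU, 10 of 40 layers removed (figures from the table above).
print(round(delta_points(52.2, 55.0), 1))   # -2.8
print(round(retention_pct(52.2, 55.0), 1))  # 94.9
```

Under that assumption, the MMLU example alone retains about 94.9%, which is higher than the ≈86.3% average because the average covers other benchmarks and models as well.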

What To Try In 7 Days

Run BI on a small unlabeled calibration set (PG19) to rank layers.

Remove a few layers (start with ~10–25%) and measure perplexity and key task scores.

Combine pruning with 4-bit quantization on a dev clone and compare orders (quant→prune vs prune→quant).
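The first two steps can be organized as a small sweep over prune ratios. The sketch below is only a harness under stated assumptions: the BI scores are made-up toy values, `evaluate` is a hypothetical callback that you would replace with your own perplexity or task evaluation, and the function names are not from the paper.

```python
def select_layers_to_keep(bi_scores, prune_ratio):
    # Drop the lowest-BI layers; keep the rest in their original order.
    n_layers = len(bi_scores)
    n_remove = round(n_layers * prune_ratio)
    lowest = sorted(range(n_layers), key=lambda i: bi_scores[i])[:n_remove]
    removed = set(lowest)
    return [i for i in range(n_layers) if i not in removed]

def sweep_prune_ratios(bi_scores, ratios, evaluate):
    # Evaluate one pruned configuration per ratio.
    results = []
    for ratio in ratios:
        kept = select_layers_to_keep(bi_scores, ratio)
        results.append({"ratio": ratio, "kept_layers": kept, "score": evaluate(kept)})
    return results

# Toy BI scores for an 8-layer model (hypothetical values).
toy_bi = [0.9, 0.6, 0.4, 0.2, 0.1, 0.15, 0.3, 0.8]

def placeholder_evaluate(kept_layers):
    # Stand-in for a real evaluation: returns the fraction of layers kept.
    return len(kept_layers) / len(toy_bi)

results = sweep_prune_ratios(toy_bi, [0.125, 0.25], placeholder_evaluate)
for row in results:
    print(row["ratio"], row["kept_layers"], row["score"])
```

Keeping selection separate from evaluation makes it easy to reuse the same kept-layer lists when comparing quantize-then-prune against prune-then-quantize.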

Optimization Features

Token Efficiency

Throughput gains (e.g., ~1.16x at 25% prune)
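To put the throughput figure in context, the arithmetic below converts a throughput multiplier into per-token time saved and compares it with a naive upper bound. Both helpers are illustrative assumptions: the first treats per-token time as the inverse of throughput, and the second assumes runtime scales exactly with the number of remaining layers, which real systems do not fully achieve.

```python
def time_saved_fraction(speedup):
    # Assumes per-token time is the inverse of throughput.
    return 1.0 - 1.0 / speedup

def ideal_speedup(prune_ratio):
    # Naive bound: assumes all runtime is spent in the prunable layers.
    return 1.0 / (1.0 - prune_ratio)

print(round(time_saved_fraction(1.16) * 100, 1))  # 13.8
print(round(ideal_speedup(0.25), 2))              # 1.33
```

Under these assumptions, a ~1.16x gain corresponds to roughly 13.8% less time per token, below the 1.33x that a 25% prune would give if every other cost scaled away too.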

Model Optimization

Pruning: Layer removal (structured)

System Optimization

Works on heterogeneous GPU clusters used in experiments

Training Optimization

Accuracy

Inference Optimization

Reduced depth -> faster inference

Compatible with quantization

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

PG19, MMLU, CMMLU, HellaSwag, PIQA, CHID, WSC, CoQA, BoolQ, RACE

Risks & Boundaries

Limitations

Generative tasks (e.g., summarization) can collapse after aggressive layer removal on smaller models.

The last layer's FFN is critical; deleting final components can heavily hurt perplexity.

When Not To Use

Generative-heavy applications without retraining

Small models where depth redundancy is weaker

Failure Modes

Accumulated errors in autoregressive generation leading to near-zero quality on summarization after heavy pruning

Order-dependent degradation when combining with quantization

Core Entities

Models

LLaMA2-7B, LLaMA2-13B, Baichuan2-7B, Baichuan2-13B, RWKV-7B, Mamba-2.8B

Metrics

Block Influence (BI), Perplexity, MMLU score, Throughput (tokens/s), Performance retention (%)

Datasets

PG19, MMLU, CMMLU, HellaSwag, PIQA, CHID, WSC, CoQA, BoolQ, RACE, XSum, C3

Benchmarks

MMLU, CMMLU, HellaSwag, PIQA, BoolQ, CoQA, XSum, C3, RACE

