Many pre-trained transformers already contain a large "free" sparse subnetwork you can remove with little cost

Overview

Decision Snapshot: Ready For Pilot

The paper provides broad empirical evidence across multiple models and tasks that simple one-shot pruning finds large, usable sparse masks; results are reproducible on public checkpoints but limited to the evaluated model families and tasks.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 65%

Novelty: 60%

Authors

Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang

Why It Matters For Business

With one cheap pruning pass, you can remove 30–50% of the weights in many pre-trained transformers with little accuracy loss, cutting memory footprint and potentially inference cost without costly retraining.

Who Should Care

ML Engineer, Engineering Lead, CTO, Founder, Product Manager

Summary TLDR

The authors show that large pre-trained transformers (vision and language) contain "essential sparsity": a range of low-magnitude weights that can be removed from the checkpoint in one shot with almost no loss in downstream fine-tuning performance. Across many models and tasks, roughly 30–50% of weights can be dropped with a task drop of at most 1%. The pattern also holds for structured N:M pruning and for a 7B LLM (Vicuna). Self-supervised pretraining and larger pretraining data volumes tend to increase this free sparsity. Within the essential range, simple one-shot magnitude pruning performs on par with expensive Lottery Ticket pruning and yields very similar masks.

Problem Statement

Large pre-trained transformers are expensive to store and run. Current pruning methods often require repeated train-prune-retrain cycles, which are costly at large scale. The paper asks: do high-quality sparse subnetworks already exist in pre-trained checkpoints that can be found cheaply, and how do pretraining choices affect this property?

Main Contribution

Define and empirically document "essential sparsity": the sharp turning point in the sparsity-versus-performance curve beyond which removing more weights degrades performance rapidly.

Show essential sparsity across many pre-trained vision and language models (BERT, OPT, ViT, DiNO) and tasks; roughly 30–50% of weights removable with <=1% drop on evaluated benchmarks.
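One way to read the turning point off a pruning sweep is sketched below. It is a minimal sketch, assuming the "at most 1% drop" criterion quoted in this summary as the cut-off; the function name `essential_sparsity`, the `max_drop` parameter, and the sweep numbers are hypothetical and not taken from the paper.

```python
from typing import Sequence


def essential_sparsity(
    sparsities: Sequence[float],
    scores: Sequence[float],
    dense_score: float,
    max_drop: float = 1.0,
) -> float:
    """Return the largest swept sparsity reached before the drop first exceeds max_drop.

    `scores` and `dense_score` share one unit (for example accuracy in percent),
    and `max_drop` is an absolute drop in that unit. This threshold rule is an
    assumption used for illustration, not the paper's formal definition.
    """
    best = 0.0
    for sparsity, score in sorted(zip(sparsities, scores)):
        if dense_score - score > max_drop:
            break  # first violation: every later point is past the turning point
        best = sparsity
    return best


# Hypothetical sweep results (illustrative numbers only).
sweep_sparsity = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sweep_accuracy = [84.5, 84.4, 84.2, 83.9, 82.1, 74.0]
print(essential_sparsity(sweep_sparsity, sweep_accuracy, dense_score=84.6))
```

Stopping at the first violation, rather than taking the maximum over all passing points, keeps a noisy recovery at high sparsity from being mistaken for the threshold.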

Key Findings

Pre-trained transformers have an 'essential sparsity' range where many weights can be removed with minimal loss.

Numbers: About 30–50% of weights removable with <=1% downstream drop (evaluated tasks)

Practical Use: Before expensive pruning, try one-shot magnitude pruning at 30–40% to cut model size with almost no fine-tune cost

Evidence Ref: Figures 2–4 and text: '∼30-50% of weights can be removed at free without any significant drop'

Within the essential-sparsity range, masks from one-shot magnitude pruning and Lottery Ticket pruning are very similar.

Numbers: Cosine similarity >96% (often >98%) and negligible performance gap for sparsity ≤50%

Practical Use: Prefer cheap one-shot magnitude pruning (OMP) over costly iterative LTH for moderate sparsity (≤50%)

Evidence Ref: Figures 9–10 and text: 'high cosine similarity (>96%, >98%)'
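The mask-similarity figure can be reproduced in spirit with a few lines. This is a minimal sketch, assuming the similarity is computed on flattened 0/1 keep-masks (this summary does not state the exact procedure); `mask_cosine_similarity` and the toy masks are hypothetical.

```python
import math
from typing import Sequence


def mask_cosine_similarity(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Cosine similarity between two flattened binary (0/1) pruning masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    dot = sum(a * b for a, b in zip(mask_a, mask_b))
    norm_a = math.sqrt(sum(a * a for a in mask_a))
    norm_b = math.sqrt(sum(b * b for b in mask_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero mask has no direction; treat as no overlap
    return dot / (norm_a * norm_b)


# Toy masks standing in for an OMP mask and an LTH mask (illustrative only).
omp_mask = [1, 1, 0, 1, 0, 1, 1, 0]
lth_mask = [1, 1, 0, 1, 1, 0, 1, 0]
print(round(mask_cosine_similarity(omp_mask, lth_mask), 3))
```

For binary masks the dot product counts weights kept by both methods, so at equal sparsity the value is simply the fraction of kept weights the two masks share.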

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Free removable weights (essential sparsity)	≈30–50% weights removable with ≤1% downstream drop on evaluated tasks	dense checkpoint	30–50% fewer parameters	GLUE, SQuAD, CIFAR variants, SMC-Bench	Figures 2–4 and text reporting '∼30-50% of weights can be removed at free without any significant drop'	Figures 2–4
Mask similarity (OMP vs LTH)	Cosine similarity >96% (often >98%) within essential-sparsity range	LTH masks	high agreement in mask structure	BERT-base and ViT-base on multiple downstream tasks	Figures 9–10 and text: 'surprisingly high cosine similarity (>96% , >98%)'	Figures 9–10

What To Try In 7 Days

Apply one-shot magnitude pruning (OMP) at 30% on your pre-trained checkpoints and validate on key downstream tasks.

If you use SSL-pretrained models, test higher OMP rates (35–45%)—they often tolerate more removal.

For production LLM inference, test one-shot pruning + SparseGPT-style refinements to find a balance between speed and accuracy.
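The first step above can be sketched as follows. This is a minimal, framework-free sketch of one-shot magnitude pruning, assuming a single global magnitude threshold across all layers (layer-wise thresholds are an equally plausible reading); the function names and the toy weights are hypothetical, and a real checkpoint would use a tensor library instead of Python lists.

```python
from typing import Dict, List


def one_shot_magnitude_masks(
    weights: Dict[str, List[float]], sparsity: float
) -> Dict[str, List[int]]:
    """Build 0/1 keep-masks that drop the globally smallest-magnitude weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    magnitudes = sorted(abs(w) for layer in weights.values() for w in layer)
    n_prune = round(len(magnitudes) * sparsity)
    if n_prune == 0:
        return {name: [1] * len(layer) for name, layer in weights.items()}
    # Every weight at or below this magnitude is removed; ties at the
    # threshold can push the realized sparsity slightly above the target.
    threshold = magnitudes[n_prune - 1]
    return {
        name: [0 if abs(w) <= threshold else 1 for w in layer]
        for name, layer in weights.items()
    }


def apply_masks(
    weights: Dict[str, List[float]], masks: Dict[str, List[int]]
) -> Dict[str, List[float]]:
    """Zero out masked weights; no retraining happens here."""
    return {
        name: [w if m else 0.0 for w, m in zip(weights[name], masks[name])]
        for name in weights
    }


# Toy "checkpoint" with two layers (hypothetical names and values).
toy_weights = {
    "attn.q": [0.9, -0.05, 0.4, -0.7],
    "mlp.fc1": [0.02, -0.6, 0.1, 0.3, -0.01, 0.8],
}
toy_masks = one_shot_magnitude_masks(toy_weights, sparsity=0.3)
toy_pruned = apply_masks(toy_weights, toy_masks)
print(toy_masks)
```

After masking, fine-tune or evaluate on the downstream task with the mask held fixed and compare against the dense baseline.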

Optimization Features

Model Optimization

one-shot magnitude pruning (OMP), structured N:M sparsity, SparseGPT (refinement)

Training Optimization

none specific; observation: self-supervised pretraining induces more prune-friendly weights

Inference Optimization

potential hardware speedup via N:M patterns, reduced memory footprint from 30–50% fewer weights
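To make the N:M pattern concrete, here is a minimal sketch of a magnitude-based N:M mask. It assumes the convention in which `n` weights are kept in every group of `m` consecutive weights (as in the 2:4 pattern supported on recent NVIDIA GPUs); this summary does not state which convention the paper uses, and `n_m_mask` and the toy row are hypothetical.

```python
from typing import List


def n_m_mask(row: List[float], n: int = 2, m: int = 4) -> List[int]:
    """Keep the n largest-magnitude weights in each group of m consecutive weights."""
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask: List[int] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n largest magnitudes inside this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


# Toy weight row (illustrative values only).
toy_row = [0.9, -0.05, 0.4, -0.7, 0.02, -0.6, 0.1, 0.3]
print(n_m_mask(toy_row))
```

Unlike the unstructured case, the sparsity here is fixed by the pattern (2:4 always removes half the weights), so the essential-sparsity range determines which patterns are safe to use rather than how far to push them.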

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/VITA-Group/essential_sparsity

Risks & Boundaries

Limitations

Experiments cover many models, but none larger than Vicuna-7B.

Most downstream tests use fine-tuning; only Vicuna-7B was evaluated zero-shot.

When Not To Use

When you require extreme sparsity (>50%)—iterative LTH-style methods may be needed.

When exact latency/speed benchmarks on target hardware are required (this paper reports pruning trends but not runtime measurements).

Failure Modes

Abrupt performance collapse if pruning crosses the essential sparsity threshold.

Mask similarity between OMP and LTH decreases as sparsity increases, so OMP may fail at high sparsity.

Core Entities

Models

BERT-base, BERT-large, OPT-125M, OPT-350M, OPT-1.3B, ViT-Base, ViT-Large, DiNO-Base, Vicuna-7B

Metrics

sparsity ratio, accuracy, cosine similarity between masks

Datasets

GLUE (MNLI, QNLI, QQP, SST-2, RTE); SQuAD v1.1; CIFAR-10; CIFAR-100; Tiny-ImageNet; MAWPS; ASDiv-A; SVAMP; SMC-Bench; BookCorpus; MMLU

Benchmarks

GLUE, SMC-Bench, MMLU
