Many pre-trained transformers already contain a large "free" sparse subnetwork you can remove with little cost

June 6, 2023 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides broad empirical evidence across multiple models and tasks that simple one-shot pruning finds large, usable sparse masks; results are reproducible on public checkpoints but limited to the evaluated model families and tasks.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 65%

Novelty: 60%

Authors

Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang

Links

Abstract / PDF / Code

Why It Matters For Business

You can remove 30–50% of the weights from many pre-trained transformers with one cheap pruning pass and little accuracy loss, reducing memory and inference costs without costly retraining.

Who Should Care

Summary TLDR

The authors show that large pre-trained transformers (vision and language) contain "essential sparsity": a range of low-magnitude weights that can be removed from the checkpoint in one shot with almost no loss in downstream fine-tuning performance. Across many models and tasks, they find that roughly 30–50% of weights can be dropped with <=1% task drop. The pattern also holds for structured N:M pruning and for a 7B LLM (Vicuna). Self-supervised pretraining and larger pretraining data volumes tend to increase this free sparsity. Within the essential range, simple one-shot magnitude pruning matches expensive Lottery Ticket masks in performance and mask similarity.

Problem Statement

Large pre-trained transformers are expensive to store and run. Current pruning methods often require repeated train-prune-retrain cycles, which are costly at large scale. The paper asks: do high-quality sparse subnetworks already exist in pre-trained checkpoints that can be found cheaply, and how do pretraining choices affect this property?

Main Contribution

Define and empirically document "essential sparsity": a sharp turning point in the sparsity vs. performance curve beyond which removing even a few more weights degrades performance rapidly.

Show essential sparsity across many pre-trained vision and language models (BERT, OPT, ViT, DiNO) and tasks; roughly 30–50% of weights removable with <=1% drop on evaluated benchmarks.
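One way to make the "turning point" concrete is to read it off a measured sparsity vs. performance curve. The sketch below is an illustration, not the paper's procedure: it assumes the essential sparsity can be approximated as the largest sparsity level reached before the score first falls more than a fixed budget (here 1 point) below the dense checkpoint. The function name `essential_sparsity`, the `max_drop` parameter, and the example numbers are all hypothetical.

```python
from typing import Sequence


def essential_sparsity(
    sparsities: Sequence[float],
    scores: Sequence[float],
    dense_score: float,
    max_drop: float = 1.0,
) -> float:
    """Largest sparsity reached before the score first drops by more than max_drop.

    `sparsities` must be sorted in ascending order. Returns 0.0 when even the
    first pruned point already exceeds the allowed drop.
    """
    if len(sparsities) != len(scores):
        raise ValueError("sparsities and scores must have the same length")
    best = 0.0
    for sparsity, score in zip(sparsities, scores):
        if dense_score - score > max_drop:
            break  # first violation: we are past the turning point
        best = sparsity
    return best


# Made-up curve for illustration only (NOT results from the paper).
sparsities = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
scores = [84.1, 84.0, 83.9, 83.6, 82.4, 78.0, 65.0]
print(essential_sparsity(sparsities, scores, dense_score=84.2))
```

Stopping at the first violation, rather than taking the maximum over all points within budget, keeps a noisy rebound at high sparsity from being mistaken for a safe operating point.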

Key Findings

Pre-trained transformers have an 'essential sparsity' range where many weights can be removed with minimal loss.

Numbers: About 30–50% of weights removable with <=1% downstream drop (evaluated tasks)

Practical Use: Before expensive pruning, try one-shot magnitude pruning at 30–40% to cut model size with almost no fine-tune cost

Evidence Ref: Figures 2–4 and text: '∼30-50% of weights can be removed at free without any significant drop'
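A minimal NumPy sketch of one-shot magnitude pruning follows, for readers who want to try the 30–40% setting on their own weights. It assumes a global threshold computed across all tensors; whether the paper thresholds globally or per layer is not stated in this summary, so treat that choice, the function name `one_shot_magnitude_prune`, and the toy layer names as assumptions.

```python
from typing import Dict, Tuple

import numpy as np


def one_shot_magnitude_prune(
    weights: Dict[str, np.ndarray], sparsity: float
) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]]:
    """Zero out the globally smallest-magnitude fraction of weights in one pass.

    Returns (pruned_weights, masks); a mask holds 1 for kept weights, 0 for removed.
    If several weights tie at the threshold, slightly more than the requested
    fraction may be removed.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    magnitudes = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    n_prune = int(round(sparsity * magnitudes.size))
    if n_prune == 0:
        masks = {name: np.ones(w.shape, dtype=np.uint8) for name, w in weights.items()}
        return {name: w.copy() for name, w in weights.items()}, masks
    # Threshold is the n_prune-th smallest magnitude over all tensors.
    threshold = np.partition(magnitudes, n_prune - 1)[n_prune - 1]
    masks = {name: (np.abs(w) > threshold).astype(np.uint8) for name, w in weights.items()}
    pruned = {name: w * masks[name] for name, w in weights.items()}
    return pruned, masks


# Toy stand-in for a checkpoint: two random weight matrices (hypothetical names).
rng = np.random.default_rng(0)
weights = {"layer1": rng.normal(size=(20, 30)), "layer2": rng.normal(size=(40, 10))}
pruned, masks = one_shot_magnitude_prune(weights, sparsity=0.3)
kept = sum(int(m.sum()) for m in masks.values())
total = sum(m.size for m in masks.values())
print(f"kept {kept} of {total} weights")
```

The returned masks can be held fixed during fine-tuning so that pruned weights stay at zero; validating on downstream tasks after that step is still required.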

Within the essential-sparsity range, masks from one-shot magnitude pruning and Lottery Ticket pruning are very similar.

Numbers: Cosine similarity >96% (often >98%) and negligible performance gap for sparsity ≤50%

Practical Use: Prefer cheap one-shot magnitude pruning (OMP) over costly iterative LTH for moderate sparsity (≤50%)

Evidence Ref: Figures 9–10 and text: 'high cosine similarity (>96% , >98%)'
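To compare two pruning masks yourself, a plausible reading of "cosine similarity between masks" is to flatten each binary mask into a vector and take the usual cosine. That reading is an assumption here, as is the helper name `mask_cosine_similarity`; the summary does not spell out how the paper computes the figure.

```python
import numpy as np


def mask_cosine_similarity(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Cosine similarity between two binary masks of identical shape.

    For 0/1 masks this equals |A and B| / sqrt(|A| * |B|), where |X| counts kept weights.
    """
    if np.shape(mask_a) != np.shape(mask_b):
        raise ValueError("masks must have the same shape")
    a = np.asarray(mask_a, dtype=np.float64).ravel()
    b = np.asarray(mask_b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero mask")
    return float(a @ b / denom)


# Two toy masks that each keep 3 of 4 weights and agree on 2 of the kept positions.
omp_mask = np.array([1, 1, 1, 0])
lth_mask = np.array([1, 1, 0, 1])
print(mask_cosine_similarity(omp_mask, lth_mask))  # 2 / 3
```

Note that two masks at the same moderate sparsity overlap substantially even by chance, so the raw value is best read alongside a random-mask baseline at the same sparsity.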

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Free removable weights (essential sparsity) | ≈30–50% weights removable with ≤1% downstream drop on evaluated tasks | dense checkpoint | 30–50% fewer parameters | GLUE, SQuAD, CIFAR variants, SMC-Bench | Figures 2–4 and text reporting '∼30-50% of weights can be removed at free without any significant drop' | Figures 2–4 |
| Mask similarity (OMP vs LTH) | Cosine similarity >96% (often >98%) within essential-sparsity range | LTH masks | high agreement in mask structure | BERT-base and ViT-base on multiple downstream tasks | Figures 9–10 and text: 'surprisingly high cosine similarity (>96% , >98%)' | Figures 9–10 |

What To Try In 7 Days

Apply one-shot magnitude pruning (OMP) at 30% on your pre-trained checkpoints and validate on key downstream tasks.

If you use SSL-pretrained models, test higher OMP rates (35–45%)—they often tolerate more removal.

For production LLM inference, test one-shot pruning + SparseGPT-style refinements to find a balance between speed and accuracy.

Optimization Features

Model Optimization
one-shot magnitude pruning (OMP), structured N:M sparsity, SparseGPT (refinement)
Training Optimization
none specific; observation: self-supervised pretraining induces more prune-friendly weights
Inference Optimization
potential hardware speedup via N:M patterns, reduced memory footprint from 30–50% fewer weights
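For the structured N:M pattern mentioned above, a small sketch of magnitude-based mask construction may help. It assumes the convention that N is the number of weights kept in every group of M consecutive weights along the last axis (for 2:4 the kept and pruned counts coincide, so the convention does not matter there). The function name `n_m_mask` is hypothetical, and this is not claimed to be the paper's implementation.

```python
import numpy as np


def n_m_mask(weight: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Binary mask keeping the n largest-magnitude weights in each group of m.

    Groups are formed from consecutive elements along the last axis, so the
    size of that axis must be divisible by m.
    """
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if weight.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = np.abs(weight).reshape(-1, m)
    # argsort is ascending, so the last n columns index the largest magnitudes.
    keep_idx = np.argsort(groups, axis=1)[:, m - n:]
    mask = np.zeros(groups.shape, dtype=np.uint8)
    np.put_along_axis(mask, keep_idx, 1, axis=1)
    return mask.reshape(weight.shape)


# One row of eight weights gives two groups of four; 2:4 keeps two per group.
w = np.array([[0.1, -0.9, 0.5, 0.05, 2.0, -0.3, 0.2, 1.0]])
print(n_m_mask(w, n=2, m=4))  # [[0 1 1 0 1 0 0 1]]
```

A 2:4 mask always yields exactly 50% sparsity, which sits at the upper edge of the 30–50% range reported here, so accuracy should be checked before relying on it. Any speedup depends on hardware and kernels that exploit the pattern.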

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments cover many models but not the largest foundation models beyond Vicuna-7B.

Most downstream tests use fine-tuning; only Vicuna-7B was evaluated zero-shot.

When Not To Use

When you require extreme sparsity (>50%)—iterative LTH-style methods may be needed.

When exact latency/speed benchmarks on target hardware are required (this paper reports pruning trends but not runtime measurements).

Failure Modes

Abrupt performance collapse if pruning crosses the essential sparsity threshold.

Mask similarity between OMP and LTH decreases as sparsity increases, so OMP may fail at high sparsity.

Core Entities

Models

BERT-base, BERT-large, OPT-125M, OPT-350M, OPT-1.3B, ViT-Base, ViT-Large, DiNO-Base, Vicuna-7B

Metrics

sparsity ratio, accuracy, cosine similarity between masks

Datasets

GLUE (MNLI, QNLI, QQP, SST-2, RTE), SQuAD v1.1, CIFAR-10, CIFAR-100, Tiny-ImageNet, MAWPS, ASDiv-A, SVAMP, SMC-Bench, BookCorpus, MMLU

Benchmarks

GLUE, SMC-Bench, MMLU