Many pre-trained transformers already contain a large "free" sparse subnetwork you can remove with little cost

June 6, 2023 · 9 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

6

Authors

Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang

Why It Matters For Business

You can remove 30–50% of the weights from many pre-trained transformers with one cheap pruning pass and little accuracy loss, reducing memory and inference costs without expensive retraining.

Summary TLDR

The authors show that large pre-trained transformers (vision and language) contain "essential sparsity": a range of low-magnitude weights that can be removed from the checkpoint in one shot with almost no loss in downstream performance. Across many models and tasks, roughly 30–50% of weights can be dropped with a ≤1% task drop. The pattern also holds for structured N:M pruning and for a 7B LLM (Vicuna). Self-supervised pretraining and larger pretraining data volumes tend to increase this free sparsity. Within the essential range, simple one-shot magnitude pruning matches expensive Lottery Ticket masks in both performance and mask similarity.

Problem Statement

Large pre-trained transformers are expensive to store and run. Current pruning methods often require repeated train-prune-retrain cycles, which are costly at large scale. The paper asks: do high-quality sparse subnetworks already exist in pre-trained checkpoints that can be found cheaply, and how do pretraining choices affect this property?

Main Contribution

Define and empirically document "essential sparsity": a sharp turning point in the sparsity-versus-performance curve, beyond which removing even a few more weights degrades performance rapidly.

Show essential sparsity across many pre-trained vision and language models (BERT, OPT, ViT, DINO) and tasks; roughly 30–50% of weights are removable with a ≤1% drop on the evaluated benchmarks.

Demonstrate essential sparsity also applies to hardware-friendly N:M structured sparsity and to a modern LLM (Vicuna-7B).

Show self-supervised pretraining (e.g., DINO) and larger pretraining data volumes lead to stronger emergent sparsity and better prunability.

Compare one-shot magnitude pruning (free) to Lottery Ticket (expensive) and find near-equal downstream performance and very high mask cosine similarity in the essential-sparsity range.
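As a rough illustration of the one-shot magnitude pruning (OMP) baseline named above, the sketch below ranks weights by absolute value and zeroes the smallest fraction in a single pass, with no retraining. It is a toy pure-Python version on flat lists, not the authors' code: the function name `one_shot_magnitude_prune`, the layer names, the weight values, and the choice to rank globally across layers rather than per layer are all assumptions made for illustration. A real checkpoint would be handled with a tensor library.

```python
def one_shot_magnitude_prune(layers, sparsity):
    """Zero the smallest-magnitude fraction of weights in a single pass.

    layers:   dict mapping a (hypothetical) layer name to a flat list of floats.
    sparsity: fraction of all weights to remove, in [0, 1).
    Returns (pruned_layers, masks); a mask entry is 1 for kept, 0 for pruned.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")

    # Assumption: rank every weight globally, across all layers, by |w|.
    ranked = sorted(
        (abs(w), name, i)
        for name, ws in layers.items()
        for i, w in enumerate(ws)
    )
    n_prune = int(len(ranked) * sparsity)

    masks = {name: [1] * len(ws) for name, ws in layers.items()}
    for _, name, i in ranked[:n_prune]:
        masks[name][i] = 0

    pruned = {
        name: [w * m for w, m in zip(ws, masks[name])]
        for name, ws in layers.items()
    }
    return pruned, masks


# Toy weights, invented for this example.
layers = {
    "attn": [0.9, -0.02, 0.4, 0.01],
    "mlp": [-0.7, 0.03, 0.05, -0.6],
}
pruned, masks = one_shot_magnitude_prune(layers, sparsity=0.5)
print(masks)  # {'attn': [1, 0, 1, 0], 'mlp': [1, 0, 0, 1]}
```

The mask is computed once from the pre-trained weights and is independent of any downstream task, which is why the summary describes it as "free" relative to iterative Lottery Ticket pruning.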

Key Findings

Pre-trained transformers have an 'essential sparsity' range where many weights can be removed with minimal loss.

Numbers: About 30–50% of weights removable with ≤1% downstream drop (evaluated tasks)

Within the essential-sparsity range, masks from one-shot magnitude pruning and Lottery Ticket pruning are very similar.

Numbers: Cosine similarity >96% (often >98%) and negligible performance gap for sparsity ≤50%

Self-supervised pretraining produces checkpoints that are more prune-friendly than supervised pretraining.

Numbers: DINO-base has ~14.37% more zero weights than supervised ViT and is more robust to pruning

Sparsification can emerge abruptly during pretraining.

Numbers: Abrupt zero-weight growth around 22–25k iterations for large-data runs; ~40k for smaller-data runs (BERT experiments)

Essential sparsity generalizes to structured N:M patterns and modern LLMs.

Numbers: N:M masks on OPT-350M and one-shot pruning on Vicuna-7B show similar sharp-turn behavior and free removable mass
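To make the N:M pattern in the last finding concrete, here is a minimal sketch that builds a magnitude-based N:M mask. It assumes the common convention that N is the number of weights kept in each consecutive group of M (so 2:4 means 50% sparsity); the function name `nm_mask` and the example weights are hypothetical and are not taken from the paper.

```python
def nm_mask(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each consecutive group of m.

    Returns a list of 0/1 flags with the same length as `weights`.
    """
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be divisible by m")

    mask = [0] * len(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Indices of the n largest |w| inside this group.
        keep = sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:n]
        for i in keep:
            mask[i] = 1
    return mask


# Toy row of eight weights, invented for this example.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
mask_2_4 = nm_mask(row, n=2, m=4)
print(mask_2_4)  # [1, 0, 0, 1, 0, 1, 1, 0]
```

Unlike global magnitude pruning, the ranking here is local to each group, so a weight that would survive a global threshold can still be dropped if its group already has N larger neighbours.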

Results

Free removable weights (essential sparsity)

Value: ≈30–50% of weights removable with ≤1% downstream drop on evaluated tasks

Baseline: dense checkpoint

Mask similarity (OMP vs LTH)

Value: Cosine similarity >96% (often >98%) within essential-sparsity range

Baseline: LTH masks

SSL vs SL zero weights

Value: DINO-base has ~14.37% more zero weights than supervised ViT

Baseline: supervised ViT-base

Abrupt sparsification timing

Value: Sharp growth in zero weights around 22–25k iterations (full data) and ~40k iterations (reduced data)

Baseline: pretraining iteration counts

LLM one-shot prune behavior

Value: Vicuna-7B shows similar essential-sparsity trend; OMP and SparseGPT both yield usable sparsity

Baseline: dense Vicuna-7B
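One way to read the "≤1% drop" criterion in the results above is as a threshold on a measured sparsity sweep. The sketch below picks the highest sparsity level at which the score, and the score at every lower level, stays within a tolerance of the dense baseline. The function name `essential_sparsity`, the interpretation of the drop as absolute points, and every number in the example curve are assumptions invented for illustration; none of them are results from the paper.

```python
def essential_sparsity(curve, dense_score, max_drop=1.0):
    """Largest sparsity whose score is within `max_drop` points of dense.

    curve: dict mapping sparsity (0..1) to a measured downstream score.
    Scanning stops at the first level that exceeds the tolerance, so a
    later accidental recovery does not count.
    """
    best = 0.0
    for sparsity, score in sorted(curve.items()):
        if dense_score - score > max_drop:
            break
        best = sparsity
    return best


# Made-up sweep for illustration only (not data from the paper).
dense_score = 84.5
sweep = {0.1: 84.4, 0.2: 84.3, 0.3: 84.0, 0.4: 83.6, 0.5: 82.1, 0.6: 75.0}
knee = essential_sparsity(sweep, dense_score)
print(knee)  # 0.4
```

In practice the sweep itself comes from pruning the checkpoint at each level and evaluating on your own downstream tasks.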

Who Should Care

What To Try In 7 Days

Apply one-shot magnitude pruning (OMP) at 30% on your pre-trained checkpoints and validate on key downstream tasks.

If you use self-supervised (SSL) pre-trained models, test higher OMP rates (35–45%); they often tolerate more removal.

For production LLM inference, test one-shot pruning + SparseGPT-style refinements to find a balance between speed and accuracy.

Optimization Features

Model Optimization

  • one-shot magnitude pruning (OMP)
  • structured N:M sparsity
  • SparseGPT (refinement)

Training Optimization

  • none specific; observation: self-supervised pretraining induces more prune-friendly weights

Inference Optimization

  • potential hardware speedup via N:M patterns
  • reduced memory footprint from 30–50% fewer weights
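A back-of-the-envelope estimate helps put the memory claim in perspective. The sketch below compares a dense 16-bit checkpoint with an N:M-compressed one, assuming each kept weight is stored together with a small position index (2 bits is enough to address a slot within a group of four). The helper names, the nominal 7-billion parameter count, and the storage layout are assumptions for illustration; actual savings depend on the sparse format and runtime you use, and zeroed weights stored in a dense tensor save nothing by themselves.

```python
def dense_bytes(n_params, bits_per_weight=16):
    """Storage for a dense checkpoint at the given precision."""
    return n_params * bits_per_weight // 8


def nm_compressed_bytes(n_params, n=2, m=4, bits_per_weight=16, index_bits=2):
    """Storage when only n of every m weights are kept.

    Assumption: each kept weight carries `index_bits` of position metadata.
    """
    kept = n_params * n // m
    return kept * (bits_per_weight + index_bits) // 8


n_params = 7_000_000_000  # nominal size of a "7B" model
dense = dense_bytes(n_params)
sparse = nm_compressed_bytes(n_params)
print(dense, sparse, sparse / dense)  # 14000000000 7875000000 0.5625
```

Under these assumptions a 2:4 pattern stores about 56% of the dense bytes rather than exactly 50%, because the position metadata has a cost of its own.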

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments cover many models, but none larger than Vicuna-7B.
  • Most downstream tests use fine-tuning; only Vicuna-7B was evaluated zero-shot.
  • Hardware speedups for N:M patterns are asserted but not measured end-to-end.
  • One-shot magnitude pruning fails beyond the essential range; LTH remains useful at very high sparsity.

When Not To Use

  • When you require extreme sparsity (>50%): iterative LTH-style methods may be needed.
  • When exact latency/speed benchmarks on target hardware are required (this paper reports pruning trends but not runtime measurements).
  • When per-task bespoke masks are essential: OMP masks are task-agnostic and may underperform in extreme cases.

Failure Modes

  • Abrupt performance collapse if pruning crosses the essential sparsity threshold.
  • Mask similarity between OMP and LTH decreases as sparsity increases, so OMP may fail at high sparsity.
  • Structured pruning constraints (N:M) can change which weights are removable without extra tuning.
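Because OMP and LTH masks drift apart as sparsity grows, it can be useful to track their similarity alongside accuracy. The sketch below computes the cosine similarity between two flattened binary masks. How the paper flattens or aggregates masks across layers is not stated in this summary, so treating each mask as one flat 0/1 vector is an assumption, and the function name and example masks are hypothetical.

```python
import math


def mask_cosine_similarity(mask_a, mask_b):
    """Cosine similarity between two equal-length 0/1 masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    dot = sum(a * b for a, b in zip(mask_a, mask_b))
    norm_a = math.sqrt(sum(a * a for a in mask_a))
    norm_b = math.sqrt(sum(b * b for b in mask_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a mask that keeps no weights has no direction")
    return dot / (norm_a * norm_b)


# Two toy masks that agree on three of their four kept positions.
omp_mask = [1, 0, 1, 1, 0, 1]
lth_mask = [1, 0, 1, 0, 1, 1]
similarity = mask_cosine_similarity(omp_mask, lth_mask)
print(similarity)  # 0.75
```

For binary masks with the same number of kept weights, this value equals the fraction of kept positions the two masks share, which makes it easy to interpret when comparing masks at a fixed sparsity.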

Core Entities

Models

  • BERT-base
  • BERT-large
  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • ViT-Base
  • ViT-Large
  • DINO-Base
  • Vicuna-7B

Metrics

  • sparsity ratio
  • accuracy
  • cosine similarity between masks

Datasets

  • GLUE (MNLI, QNLI, QQP, SST-2, RTE)
  • SQuAD v1.1
  • CIFAR-10
  • CIFAR-100
  • Tiny-ImageNet
  • MAWPS
  • ASDiv-A
  • SVAMP
  • SMC-Bench
  • BookCorpus
  • MMLU

Benchmarks

  • GLUE
  • SMC-Bench
  • MMLU