Overview
Production Readiness
0.8
Novelty Score
0.7
Cost Impact Score
0.9
Citation Count
69
Why It Matters For Business
SparseGPT can cut weight memory and inference compute roughly in half for very large GPT models, enabling cheaper hosting and faster inference without retraining. Joint sparsity and quantization can match the storage footprint of a lower-bit quantized model while achieving better accuracy than pure quantization.
Summary TLDR
SparseGPT is a fast post-training pruning method that can make very large GPT-family models (e.g., OPT-175B, BLOOM-176B) 50–60% sparse in one shot, without finetuning, while keeping perplexity and zero-shot accuracy nearly intact. The method reuses layer Hessian information to perform efficient weight reconstruction, supports hardware-friendly 2:4 and 4:8 patterns, and can be combined with weight quantization (e.g., joint 50% sparsity + 4-bit) for further memory savings. The authors provide code and report runs on a single A100 GPU (≈4 hours for 175B).
Problem Statement
Large GPT-family models are expensive to store and serve. Existing accurate pruning methods need expensive retraining or do not scale to 10–100B+ parameters. We need a fast, accurate post-training pruning method that works at GPT scale without retraining.
Main Contribution
SparseGPT: a one-shot, post-training pruning algorithm that scales to GPT models with 10–100B+ parameters without finetuning.
An efficient approximate reconstruction solver that reuses a sequence of inverse Hessians across columns to reduce computation and memory.
Support for unstructured sparsity and semi-structured n:m patterns (2:4 and 4:8), plus joint sparsification+quantization in one pass.
Empirical results on OPT and BLOOM families showing 50–60% sparsity with small accuracy loss and runs on a single A100 (≈4h for 175B).
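The reconstruction idea behind the solver can be illustrated with a single Optimal Brain Surgeon (OBS) style update, the classic building block that this kind of Hessian-based pruning approximates. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `obs_prune_one`, the toy sizes, and the dampening constant are all hypothetical, and the sketch does not include the paper's reuse of inverse Hessians across columns or its blocked updates.

```python
import numpy as np

def obs_prune_one(w, H_inv, p):
    """Zero weight p of row w and adjust the remaining weights (one OBS step).

    Returns the updated row and the predicted increase in squared output error,
    assuming H = X @ X.T for the layer inputs X.
    """
    scale = w[p] / H_inv[p, p]
    w_new = w - scale * H_inv[p, :]
    w_new[p] = 0.0  # force an exact zero, removing floating-point residue
    predicted_error = w[p] ** 2 / H_inv[p, p]
    return w_new, predicted_error

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))   # hypothetical calibration inputs: (features, samples)
w = rng.standard_normal(8)          # one row of a layer's weight matrix
H = X @ X.T + 1e-6 * np.eye(8)      # layer Hessian with light dampening (assumed value)
H_inv = np.linalg.inv(H)

w_pruned, predicted_error = obs_prune_one(w, H_inv, p=3)
actual_error = np.sum((w @ X - w_pruned @ X) ** 2)

# Baseline for comparison: zero the same weight without adjusting the others.
w_naive = w.copy()
w_naive[3] = 0.0
naive_error = np.sum((w @ X - w_naive @ X) ** 2)
```

In this toy setting `actual_error` closely tracks `predicted_error`, and `naive_error` is never smaller, which is the motivation for compensating the surviving weights instead of simply zeroing entries.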
Key Findings
Large GPT models can be pruned to 50–60% unstructured sparsity in one shot with little accuracy loss.
SparseGPT runs quickly on a single GPU for the largest open models.
Joint sparsity + quantization outperforms storage-equivalent pure quantization on evaluated models.
Larger models are easier to sparsify: accuracy drop at fixed sparsity decreases with model size.
Results
Perplexity (OPT-175B dense)
Perplexity (OPT-175B, 50% sparsity + 4-bit joint)
Accuracy
Sparsification runtime (OPT-175B)
CPU end-to-end inference speedup (OPT-2.7B)
Who Should Care
What To Try In 7 Days
Run SparseGPT on a large model you already use and profile memory and latency (use 128 calibration segments).
Try joint 50% sparsity + 4-bit quantization and compare to your current quantized model for accuracy vs storage.
If targeting GPU speedups, test semi-structured 2:4/4:8 patterns and measure real end-to-end latency with your inference stack.
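To make the storage comparison in the second step concrete, the arithmetic can be sketched as below. The sketch assumes that sparse weights are stored with a 1-bit-per-weight position mask and ignores quantization metadata such as scales and zero points; the helper names and the 175e9 parameter count are illustrative, not taken from any tool.

```python
def bits_per_weight(density, weight_bits, mask_bits=1.0):
    """Average bits per original weight for a sparse, quantized tensor.

    density: fraction of weights kept (0.5 for 50% sparsity).
    mask_bits: assumed cost of a bitmask marking which weights are kept.
    """
    return density * weight_bits + mask_bits

def weight_storage_gb(num_params, bits):
    """Weight storage in gigabytes (10^9 bytes) at a given bits-per-weight."""
    return num_params * bits / 8 / 1e9

num_params = 175e9  # illustrative parameter count for a 175B model

joint_bits = bits_per_weight(density=0.5, weight_bits=4)  # 0.5 * 4 + 1 = 3.0
print(f"50% sparse + 4-bit: {joint_bits} bits/weight, "
      f"{weight_storage_gb(num_params, joint_bits):.1f} GB")
print(f"dense 3-bit:        3 bits/weight, {weight_storage_gb(num_params, 3):.1f} GB")
print(f"dense fp16:         16 bits/weight, {weight_storage_gb(num_params, 16):.1f} GB")
```

Under these assumptions, 50% sparsity with 4-bit weights costs about the same storage as dense 3-bit quantization, so accuracy is the deciding comparison between the two.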
Optimization Features
Infra Optimization
- Single-GPU (A100 80GB) runnable for 175B models in hours
Model Optimization
- Post-training pruning (one-shot)
- Unstructured pruning
- Semi-structured n:m pruning (2:4, 4:8)
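The n:m constraint itself is easy to show in code: in every group of m consecutive weights, n are set to zero. The sketch below picks which weights to drop by magnitude, which is a deliberately simple stand-in and not how SparseGPT selects weights; the function name `nm_magnitude_mask` is hypothetical.

```python
import numpy as np

def nm_magnitude_mask(weights, n=2, m=4):
    """Boolean keep-mask with n zeros in every group of m consecutive weights per row.

    Selection is by smallest magnitude (illustrative only).
    """
    rows, cols = weights.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # Indices of the n smallest-magnitude entries in each group.
    drop = np.argsort(groups, axis=-1)[..., :n]
    keep = np.ones_like(groups, dtype=bool)
    np.put_along_axis(keep, drop, False, axis=-1)
    return keep.reshape(rows, cols)

W = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]])
mask = nm_magnitude_mask(W, n=2, m=4)
W_sparse = W * mask
# Kept entries: 0.9 and -0.7 in the first group, 0.3 and -0.4 in the second.
```

Every group of four ends up with exactly two nonzero entries, which is the regular layout that n:m GPU kernels rely on.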
System Optimization
- Column-block lazy updates to improve compute-to-memory ratio
Training Optimization
- No retraining / no finetuning (post-training)
Inference Optimization
- Reduced weight memory footprint
- Potential inference speedups on CPU (DeepSparse) and GPUs supporting n:m kernels
Reproducibility
Data Urls
- C4 (first shard) used for calibration; public C4 dataset referenced in paper
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Semi-structured patterns (2:4, 4:8) reduce accuracy more on smaller models than on the largest ones.
- Real end-to-end speedups depend on hardware and inference software; reported layer speedups may not equal full-system gains.
- SparseGPT excludes embeddings and the head from pruning in experiments.
- Method is post-training and does not replace retraining-based pipelines when fine-grained structured pruning is required.
When Not To Use
- If you need strictly lossless accuracy for your task (no tolerance for any PPL or task drop).
- If your deployment stack lacks optimized sparse kernels or does not support 2:4-style acceleration.
- If you require structured pruning at the row/column level (different toolset needed).
Failure Modes
- Simple magnitude pruning collapses accuracy at moderate sparsities for GPT-scale models (collapse observed at around 30% sparsity).
- Joint semi-structured sparsity may harm smaller models more than large ones; poor layer selection for partial n:m leads to accuracy loss.
- A mismatch between the calibration data and the deployment data distribution could reduce pruning quality, though the authors report robustness to the random seed used to sample calibration data.
Core Entities
Models
- OPT-175B
- BLOOM-176B
- OPT family (2.7B, 6.7B, 13B, 30B, 66B, 175B)
Metrics
- Perplexity
- Accuracy
- End-to-end inference speedup
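For reference, perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch, assuming the per-token losses (natural log) have already been collected from a model; the function name is illustrative:

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_nlls = [math.log(4)] * 10
ppl = perplexity(uniform_nlls)
```

Lower is better, and comparing the dense and pruned model on the same evaluation text isolates the effect of pruning.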
Datasets
- raw-WikiText2
- PTB
- C4 subset
- Lambada
- ARC (easy/challenge)
- PIQA
- StoryCloze
Benchmarks
- Perplexity (language modeling)
- Accuracy
Context Entities
Models
- AdaPrune (baseline)
- Magnitude pruning (baseline)
- GPTQ (joint quantization baseline)
Metrics
- Perplexity (HuggingFace procedure)
- Accuracy
Datasets
- C4 (used for calibration: 128 segments × 2048 tokens)
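The calibration budget above amounts to 128 × 2048 = 262,144 tokens. A minimal sketch of drawing such segments from an already tokenized stream follows; the sampling scheme (uniform random start offsets with a fixed seed) and the function name are assumptions for illustration, not the authors' exact procedure.

```python
import random

def sample_calibration_segments(token_ids, num_segments=128, segment_len=2048, seed=0):
    """Draw fixed-length segments at random start offsets from a token stream."""
    if len(token_ids) < segment_len:
        raise ValueError("token stream is shorter than one segment")
    rng = random.Random(seed)
    segments = []
    for _ in range(num_segments):
        start = rng.randint(0, len(token_ids) - segment_len)  # inclusive bounds
        segments.append(token_ids[start:start + segment_len])
    return segments

token_ids = list(range(1_000_000))  # stand-in for real tokenized C4 text
segments = sample_calibration_segments(token_ids)
total_tokens = sum(len(s) for s in segments)  # 128 * 2048 = 262144
```

Fixing the seed makes a pruning run repeatable, and varying it is a cheap way to check how sensitive the result is to the calibration sample.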
Benchmarks
- raw-WikiText2, PTB, C4 subset, Lambada, ARC, PIQA, StoryCloze

