Prune 50–60% of GPT-scale weights in one pass, no retraining, with minor accuracy loss

January 2, 2023 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.9

Citation Count

69

Authors

Elias Frantar, Dan Alistarh

Links

Abstract / PDF

Why It Matters For Business

SparseGPT can cut weight memory for massive GPT models roughly in half, and inference compute along with it where sparse kernels are available, enabling cheaper hosting and faster inference without retraining. Joint sparsity + quantization (e.g., 50% sparsity + 4-bit) can give better accuracy than pure quantization at an equivalent storage budget (e.g., 3-bit).

Summary TLDR

SparseGPT is a fast post-training pruning method that can make very large GPT-family models (e.g., OPT-175B, BLOOM-176B) 50–60% sparse in one shot, without finetuning, while keeping perplexity and zero-shot accuracy nearly intact. The method reuses layer Hessian information to perform efficient weight reconstruction, supports hardware-friendly 2:4 and 4:8 patterns, and can be combined with weight quantization (e.g., joint 50% sparsity + 4-bit) for further memory savings. The authors provide code and report runs on a single A100 GPU (≈4 hours for 175B).

Problem Statement

Large GPT-family models are expensive to store and serve. Existing accurate pruning methods need expensive retraining or do not scale to 10–100B+ parameters. We need a fast, accurate post-training pruning method that works at GPT scale without retraining.

Main Contribution

SparseGPT: a one-shot, post-training pruning algorithm that scales to GPT models with 10–100B+ parameters without finetuning.

An efficient approximate reconstruction solver that reuses a sequence of inverse Hessians across columns to reduce computation and memory.

Support for unstructured sparsity and semi-structured n:m patterns (2:4 and 4:8), plus joint sparsification+quantization in one pass.

Empirical results on OPT and BLOOM families showing 50–60% sparsity with small accuracy loss and runs on a single A100 (≈4h for 175B).
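The layer-wise reconstruction problem that the solver above approximates can be sketched as follows. This uses standard notation for layer-wise pruning: W is a layer's dense weight matrix, X its calibration inputs, M a binary pruning mask, Ŵ the updated weights, and λ a small dampening constant. The symbol names are assumptions chosen for illustration and do not appear in this summary.

```latex
\min_{M,\,\widehat{W}} \;\bigl\lVert W X - (M \odot \widehat{W})\, X \bigr\rVert_2^2,
\qquad
H = X X^{\top} + \lambda I
```

Here H plays the role of the layer Hessian whose inverse the method reuses across columns.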

Key Findings

Large GPT models can be pruned to 50–60% unstructured sparsity in one shot with little accuracy loss.

Numbers: 50–60% sparsity; removes ≈100B weights from OPT-175B/BLOOM-176B

SparseGPT runs quickly on a single GPU for the largest open models.

Numbers: OPT-175B / BLOOM-176B sparsification ≲4.5 hours on one A100 (80GB)

Joint sparsity + quantization outperforms storage-equivalent pure quantization on evaluated models.

Numbers: OPT-175B: 50% + 4-bit PPL 8.29 vs GPTQ 3-bit PPL 8.68 (reported)

Larger models are easier to sparsify: accuracy drop at fixed sparsity decreases with model size.

Numbers: At 66B params there is essentially zero perplexity loss at 50% sparsity on evaluated datasets
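The "storage-equivalent" comparison between 50% sparsity + 4-bit and 3-bit quantization can be checked with a little arithmetic. The sketch below assumes, for illustration, that kept weights are stored at their quantized width and that their positions are recorded with a dense bitmask costing one bit per original weight; the function name and parameters are hypothetical.

```python
def bits_per_weight(sparsity: float, weight_bits: int, mask_bits: float = 1.0) -> float:
    """Average storage bits per original weight for a sparse + quantized layer.

    Assumption: kept weights cost `weight_bits` each, and a bitmask costing
    `mask_bits` per original weight records which positions are kept.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    return (1.0 - sparsity) * weight_bits + mask_bits


# Joint 50% sparsity + 4-bit weights, with a 1-bit-per-weight mask.
joint = bits_per_weight(sparsity=0.5, weight_bits=4)
# Pure 3-bit quantization needs no mask.
pure_3bit = bits_per_weight(sparsity=0.0, weight_bits=3, mask_bits=0.0)
print(joint, pure_3bit)
```

Under these assumptions both schemes come out to the same average cost per weight, which is why the perplexity comparison is made at that pairing.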

Results

Perplexity (OPT-175B dense)

Value: 8.35

Perplexity (OPT-175B, 50% sparsity + 4-bit joint)

Value: 8.29

Baseline: Dense (8.35)

Accuracy

Value: 70.52

Baseline: Dense (70.29)

Sparsification runtime (OPT-175B)

Value: ≈4 hours

CPU end-to-end inference speedup (OPT-2.7B)

Value: 1.82× at 50% sparsity

Baseline: Dense

Who Should Care

What To Try In 7 Days

Run SparseGPT on a large model you already use and profile memory and latency (use 128 calibration segments).

Try joint 50% sparsity + 4-bit quantization and compare to your current quantized model for accuracy vs storage.

If targeting GPU speedups, test semi-structured 2:4/4:8 patterns and measure real end-to-end latency with your inference stack.
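For the latency measurements suggested above, a small timing harness is usually enough to compare a dense and a pruned model on the same inputs. This is a minimal sketch: `measure_latency` and `speedup` are hypothetical helpers, and the callable you pass in stands for whatever runs one inference request in your stack.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> Dict[str, float]:
    """Time repeated calls of `fn` and report median and p90 wall-clock latency.

    Caveat: for asynchronous GPU work, `fn` itself must block until the result
    is ready (e.g. by calling torch.cuda.synchronize() inside it), otherwise
    only the kernel launch time is measured.
    """
    for _ in range(warmup):  # warm caches and lazy initialisation
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "median_s": statistics.median(samples),
        "p90_s": samples[int(0.9 * (len(samples) - 1))],
        "iters": iters,
    }


def speedup(dense: Dict[str, float], sparse: Dict[str, float]) -> float:
    """Ratio of dense to sparse median latency (values above 1 mean sparse is faster)."""
    return dense["median_s"] / sparse["median_s"]


# Stand-in workload; replace with one real inference call per model.
stats = measure_latency(lambda: sum(range(20000)), warmup=1, iters=5)
print(stats["iters"])
```

Comparing medians rather than means keeps a single slow outlier from dominating the result.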

Optimization Features

Infra Optimization

  • Runs on a single GPU (A100 80GB) for 175B models, in hours

Model Optimization

  • Post-training pruning (one-shot)
  • Unstructured pruning
  • Semi-structured n:m pruning (2:4, 4:8)
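To make the n:m patterns in the list above concrete, the sketch below builds a 2:4 or 4:8 mask over one row of weights. It picks which weights to keep by magnitude, which is a simplification for illustration and not SparseGPT's selection criterion; `nm_mask` and `apply_mask` are hypothetical names. For 2:4 and 4:8, exactly half of each group is zeroed.

```python
from typing import List, Sequence


def nm_mask(row: Sequence[float], n: int = 2, m: int = 4) -> List[int]:
    """Return a 0/1 mask keeping the n largest-magnitude weights in each group of m.

    Assumption: magnitude-based selection, used here only to show the pattern.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask: List[int] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices (within the group) of the n largest-magnitude weights.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def apply_mask(row: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Zero out the weights whose mask entry is 0."""
    return [w * k for w, k in zip(row, mask)]


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
mask = nm_mask(row, n=2, m=4)
print(mask)
print(apply_mask(row, mask))
```

The regular group structure is what lets hardware with n:m kernels skip the zeroed positions.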

System Optimization

  • Column-block lazy updates to improve compute-to-memory ratio

Training Optimization

  • No retraining / no finetuning (post-training)

Inference Optimization

  • Reduced weight memory footprint
  • Potential inference speedups on CPU (DeepSparse) and GPUs supporting n:m kernels

Reproducibility

Data Urls

  • C4 (first shard) used for calibration; public C4 dataset referenced in paper

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Semi-structured patterns (2:4, 4:8) reduce accuracy more on smaller models than on the largest ones.
  • Real end-to-end speedups depend on hardware and inference software; reported layer speedups may not equal full-system gains.
  • SparseGPT excludes embeddings and the head from pruning in experiments.
  • Method is post-training and does not replace retraining-based pipelines when fine-grained structured pruning is required.

When Not To Use

  • If you need strictly lossless accuracy for your task (no tolerance for any PPL or task drop).
  • If your deployment stack lacks optimized sparse kernels or does not support 2:4-style acceleration.
  • If you require structured pruning at the row/column level (different toolset needed).

Failure Modes

  • Simple magnitude pruning collapses accuracy at moderate sparsities for GPT-scale models (collapse observed beyond roughly 30% sparsity).
  • Joint semi-structured sparsity may harm smaller models more than large ones; poor layer selection for partial n:m leads to accuracy loss.
  • A mismatch between the calibration data and the deployment distribution could reduce pruning quality, though the authors report robustness to the random seed used for sampling calibration data.
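One way to see why magnitude pruning, the first failure mode above, can go wrong is to measure how much a layer's output changes on calibration inputs after pruning. The sketch below is a toy illustration with hypothetical helper names and made-up numbers: magnitude pruning ignores the inputs, so a small weight that multiplies a large activation can still cause a large output error.

```python
from typing import List, Sequence


def magnitude_prune(weights: Sequence[float], sparsity: float) -> List[float]:
    """Zero the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def output_error(dense: Sequence[float], pruned: Sequence[float],
                 inputs: Sequence[Sequence[float]]) -> float:
    """Sum of squared differences between dense and pruned outputs over the inputs."""
    err = 0.0
    for x in inputs:
        d = sum(w * xi for w, xi in zip(dense, x))
        p = sum(w * xi for w, xi in zip(pruned, x))
        err += (d - p) ** 2
    return err


dense = [0.5, -0.1, 0.3, -0.9]
pruned = magnitude_prune(dense, sparsity=0.5)
# Toy calibration inputs; the second one has a large activation on a small weight.
inputs = [[1.0, 1.0, 1.0, 1.0], [0.0, 10.0, 0.0, 0.0]]
print(pruned)
print(output_error(dense, pruned, inputs))
```

A reconstruction-based method instead chooses and adjusts the remaining weights to keep this output error small.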

Core Entities

Models

  • OPT-175B
  • BLOOM-176B
  • OPT family (2.7B, 6.7B, 13B, 30B, 66B, 175B)

Metrics

  • Perplexity
  • Accuracy
  • End-to-end inference speedup
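Perplexity, the first metric in the list above, is the exponential of the mean per-token negative log-likelihood. The sketch below shows that calculation on its own; it assumes natural-log likelihoods and a hypothetical `perplexity` helper, and it does not reproduce any particular evaluation procedure.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# If every token has probability 1/8, the perplexity is (up to rounding) 8.
print(perplexity([math.log(8.0)] * 4))
```

Lower is better, so a sparse model that matches the dense model's perplexity has lost little language-modeling quality.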

Datasets

  • raw-WikiText2
  • PTB
  • C4 subset
  • Lambada
  • ARC (easy/challenge)
  • PIQA
  • StoryCloze

Benchmarks

  • Perplexity (language modeling)
  • Accuracy

Context Entities

Models

  • AdaPrune (baseline)
  • Magnitude pruning (baseline)
  • GPTQ (joint quantization baseline)

Metrics

  • Perplexity (HuggingFace procedure)
  • Accuracy

Datasets

  • C4 (used for calibration: 128 segments × 2048 tokens)
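The calibration set described above (128 segments of 2048 tokens) can be drawn from a tokenized corpus with a few lines of code. This is a minimal sketch under stated assumptions: `token_ids` is a flat list of token ids you have already produced with your own tokenizer, and `sample_calibration_segments` is a hypothetical helper, not part of the authors' code.

```python
import random
from typing import List, Sequence


def sample_calibration_segments(token_ids: Sequence[int], n_segments: int = 128,
                                seq_len: int = 2048, seed: int = 0) -> List[List[int]]:
    """Draw `n_segments` random contiguous windows of `seq_len` tokens."""
    if len(token_ids) < seq_len:
        raise ValueError("corpus is shorter than one segment")
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    segments = []
    for _ in range(n_segments):
        start = rng.randint(0, len(token_ids) - seq_len)  # inclusive bounds
        segments.append(list(token_ids[start:start + seq_len]))
    return segments


# Stand-in corpus of fake token ids; replace with real tokenized text.
corpus = list(range(10_000))
segments = sample_calibration_segments(corpus, n_segments=4, seq_len=2048, seed=1)
print(len(segments), len(segments[0]))
```

Changing `seed` gives a different sample, which is a simple way to check how sensitive the pruned model is to the calibration draw.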

Benchmarks

  • raw-WikiText2, PTB, C4 subset, Lambada, ARC, PIQA, StoryCloze