Wanda: prune LLM weights by weight magnitude × input-activation norm — no retraining, much faster than prior LLM pruning

June 20, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

52

Authors

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

Why It Matters For Business

Wanda makes one-shot LLM pruning cheap and practical: you can cut ~50% of parameters without retraining and with minimal calibration data, saving memory and possibly inference cost while keeping most model quality.

Summary TLDR

Wanda is a one-shot pruning method for pretrained large language models. It scores each weight by |weight| times the ℓ2 norm of the corresponding input activation and prunes the lowest-scoring weights separately for each output neuron. It needs no weight updates or retraining, stays robust with very few calibration sequences, and matches or nearly matches SparseGPT, the prior best one-shot sparse method, on many LLaMA/LLaMA-2 benchmarks while computing importance scores hundreds of times faster. Structured N:M variants (e.g., 2:4) yield practical inference speedups but can lose more accuracy without fine-tuning.

Problem Statement

Pruning billion-parameter LLMs is hard: classic magnitude pruning fails, and recent accurate methods require costly per-layer weight updates or second-order computations that are too slow or memory-heavy for large models. We need a simple, cheap way to find useful sparse sub-networks in pretrained LLMs without retraining.

Main Contribution

A simple importance score: per-weight |W_ij| × ℓ2-norm(input feature j) estimated from a small calibration set.

A per-output (per-row) comparison group: prune the lowest-scoring weights for each output neuron to keep balance across outputs.

A no-retraining, one-shot pruning pipeline that matches SparseGPT on many tasks while avoiding weight updates and heavy matrix inverses.

Empirical evaluation across LLaMA/LLaMA-2 and other LLM families showing robustness to few calibration samples and extensions to structured N:M sparsity.
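The score and per-output pruning described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names `input_norms` and `wanda_prune`, the toy matrices, and the use of nested lists instead of a tensor library are all assumptions made for readability.

```python
import math

def input_norms(X):
    """L2 norm of each input feature (column of X) over all calibration tokens."""
    n_features = len(X[0])
    return [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_features)]

def wanda_prune(W, X, sparsity=0.5):
    """Return a pruned copy of W (out_features x in_features).

    Each weight is scored by |W[i][j]| * ||X[:, j]||_2, and the
    lowest-scoring fraction is zeroed within each row (output neuron).
    No remaining weight is updated.
    """
    norms = input_norms(X)
    pruned = []
    for row in W:
        scores = [abs(w) * n for w, n in zip(row, norms)]
        n_prune = int(len(row) * sparsity)
        # Indices of the n_prune smallest scores in this row.
        drop = set(sorted(range(len(row)), key=scores.__getitem__)[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy layer (hypothetical values): 2 outputs, 4 inputs, 3 calibration tokens.
W = [[0.9, -0.1, 0.5, 0.2],
     [0.1, 0.8, -0.3, 0.05]]
X = [[0.1, 3.0, 1.0, 0.5],
     [0.2, 4.0, 1.0, 0.5],
     [0.2, 0.0, 1.0, 0.5]]
print(wanda_prune(W, X))
```

In this toy layer the largest-magnitude weight in the first row (0.9) is pruned because its input feature has a very small norm, while the small weight -0.1 survives because its input feature is large. Plain magnitude pruning would make the opposite choice.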

Key Findings

Wanda greatly reduces language-modeling loss vs magnitude pruning on LLaMA-7B at 50% sparsity

Numbers: Perplexity 7.26 (Wanda) vs 17.29 (magnitude) on WikiText (LLaMA-7B, 50%)

Wanda matches or nearly matches SparseGPT without any weight updates

Numbers: Perplexity 7.26 (Wanda) vs 7.22 (SparseGPT); zero-shot 54.21% vs 54.94% (LLaMA-7B, 50%)

Computing Wanda's pruning metric is orders of magnitude faster than SparseGPT's metric computation

Numbers: Metric compute 0.54s (Wanda) vs 203.1s (SparseGPT) for LLaMA-7B

Wanda is robust with very few calibration samples

Numbers: With 1 calibration sequence, PPL 7.46 (Wanda) vs 10.22 (SparseGPT) on LLaMA-7B
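The only statistic the metric takes from calibration data is one norm per input feature, so it can be accumulated as a running sum of squares over however many calibration tokens are available. The sketch below illustrates that bookkeeping; the class name `ActivationNormTracker` and its methods are hypothetical, and a real pipeline would collect these values from each layer's inputs during a forward pass.

```python
import math

class ActivationNormTracker:
    """Accumulates per-feature sums of squares across calibration batches."""

    def __init__(self, n_features):
        self.sq_sums = [0.0] * n_features
        self.n_tokens = 0

    def update(self, batch):
        """batch: list of token vectors, each of length n_features."""
        for token in batch:
            for j, x in enumerate(token):
                self.sq_sums[j] += x * x
        self.n_tokens += len(batch)

    def norms(self):
        """L2 norm of each input feature over all tokens seen so far."""
        return [math.sqrt(s) for s in self.sq_sums]

# Two calibration batches of one token each (toy values).
tracker = ActivationNormTracker(n_features=2)
tracker.update([[3.0, 0.0]])
tracker.update([[4.0, 1.0]])
print(tracker.norms())  # [5.0, 1.0]
```

Dividing every norm by the same constant, such as the token count, leaves the ranking of weights within a row unchanged, so the pruning decision depends only on the relative sizes of the feature norms.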

Per-output comparison groups improve pruning quality for LLMs

Numbers: Per-output grouping yields PPL 7.26 vs layer-wise 7.95 (LLaMA-7B, Wanda, 50%)
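The difference between the two comparison groups can be made concrete with a small sketch. The function `prune_mask` and the score matrix are hypothetical, and the example is deliberately exaggerated: it shows one way layer-wise ranking can unbalance outputs, not the cause of the measured perplexity gap.

```python
def prune_mask(scores, sparsity=0.5, per_output=True):
    """Return a keep-mask (True = keep) with the same shape as scores.

    per_output=True ranks weights within each row (output neuron);
    per_output=False ranks all weights of the layer together.
    """
    if per_output:
        groups = [[(i, j) for j in range(len(row))]
                  for i, row in enumerate(scores)]
    else:
        groups = [[(i, j) for i, row in enumerate(scores)
                   for j in range(len(row))]]
    mask = [[True] * len(row) for row in scores]
    for group in groups:
        n_prune = int(len(group) * sparsity)
        ranked = sorted(group, key=lambda ij: scores[ij[0]][ij[1]])
        for i, j in ranked[:n_prune]:
            mask[i][j] = False
    return mask

# Row 1 has uniformly smaller scores than row 0 (toy values).
scores = [[4.0, 3.0, 2.5, 2.0],
          [0.4, 0.3, 0.2, 0.1]]
print(prune_mask(scores, per_output=False))  # second output loses every weight
print(prune_mask(scores, per_output=True))   # each output keeps its top half
```

With layer-wise ranking the second output neuron is removed entirely, whereas per-output ranking keeps the same sparsity in every row.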

Results

WikiText perplexity (lower is better)

Value: 7.26 (Wanda, LLaMA-7B, 50% sparsity)

Baseline: 5.68 (dense LLaMA-7B)

Mean zero-shot accuracy (higher is better)

Value: 66.67% (Wanda, LLaMA-65B, unstructured 50%)

Baseline: 66.97% (dense LLaMA-65B)

Pruning metric compute time (per full model pass, seconds)

Value: 0.54s (Wanda, LLaMA-7B)

Baseline: 203.1s (SparseGPT, LLaMA-7B)

Inference speedup (structured 2:4)

Value: ≈1.63× linear-layer GEMM speedup; end-to-end 1.24× on LLaMA-7B

Baseline: dense

Who Should Care

What To Try In 7 Days

Run Wanda on a single LLaMA-family model to 50% unstructured sparsity and compare perplexity and zero-shot accuracy to dense and magnitude-pruned baselines.

Test calibration sensitivity: prune with 1, 16, 128 sequences from your domain to see robustness and choose 128 if available.

If you need runtime speedups, apply structured 2:4 masks and measure end-to-end latency using CUTLASS or your hardware GEMM kernels.
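For the structured step, an N:M mask keeps the N highest-scoring weights in every group of M consecutive inputs of a row. The sketch below builds such a mask from one row of importance scores. The function name `nm_mask` and the score values are hypothetical, and the mask alone gives no speedup: that requires sparse kernels on supporting hardware.

```python
def nm_mask(scores_row, n=2, m=4):
    """Keep the n highest-scoring weights in every group of m consecutive inputs.

    Returns a list of booleans (True = keep) the same length as scores_row.
    """
    if len(scores_row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(scores_row), m):
        group = scores_row[start:start + m]
        # Positions (within the group) of the n largest scores.
        keep = set(sorted(range(m), key=group.__getitem__, reverse=True)[:n])
        mask.extend(j in keep for j in range(m))
    return mask

# One row of importance scores, two groups of four (toy values).
row_scores = [0.27, 0.50, 0.87, 0.17, 4.00, 0.03, 0.04, 0.52]
print(nm_mask(row_scores))
# [False, True, True, False, True, False, False, True]
```

Because every group of four must drop exactly two weights, a group with three important weights loses one of them, which is one reason structured masks can cost more accuracy than unstructured pruning at the same 50% sparsity.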

Optimization Features

Infra Optimization

  • Metric compute avoids matrix inverses; much lower CPU/GPU time vs SparseGPT

Model Optimization

  • Unstructured weight pruning (per-output)
  • Structured N:M pruning (2:4, 4:8)
  • Importance metric: |W_ij| × ||X_j||_2
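
Written out, the importance metric listed above reads as follows. Here X_j is taken to be the column of activations for input feature j stacked over all calibration tokens; the token count T and the token index t are notation introduced only for this sketch.

```latex
S_{ij} = |W_{ij}| \cdot \lVert X_j \rVert_2,
\qquad
\lVert X_j \rVert_2 = \sqrt{\sum_{t=1}^{T} X_{tj}^{2}}
```

For each output i, the weights with the smallest S_{ij} across j are the ones removed.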

System Optimization

  • N:M masks compatible with NVIDIA sparse tensor cores and CUTLASS kernels

Training Optimization

  • No retraining required to produce usable sparse model
  • Optional LoRA fine-tuning after pruning to recover accuracy

Inference Optimization

  • 2:4 structured masks enable ~1.6× linear-layer GEMM speedup
  • End-to-end speedup observed (e.g., 1.24× on LLaMA-7B)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • At extreme sparsity (e.g., ≥80%) performance degrades substantially; at that point, training a smaller dense model is likely the better option.
  • Structured sparsity can hurt accuracy more than unstructured and sometimes benefits from fine-tuning.
  • Per-output grouping advantage appears specific to LLMs; not always helpful for image models.
  • Wanda relies on calibration data distribution matching inference; domain mismatch may reduce effectiveness.

When Not To Use

  • When targeting very high sparsity (≥70–80%): SparseGPT with weight updates, or retraining, may work better.
  • When you have abundant compute and plan to do heavy post-pruning retraining: weight-update methods may yield marginally better results.
  • For non-LLM models where per-output grouping is not beneficial (e.g., some image classifiers).

Failure Modes

  • Poor calibration data leads to wrong activation norms and worse pruning decisions.
  • Very high sparsity can produce unusable models without iterative weight updates.
  • Structured N:M masks may need careful placement and fine-tuning to avoid large accuracy drops.

Core Entities

Models

  • LLaMA
  • LLaMA-2
  • OPT
  • BLOOM
  • Pythia

Metrics

  • Perplexity
  • Accuracy
  • Inference latency (ms)

Datasets

  • WikiText (validation)
  • C4 (calibration)

Benchmarks

  • EleutherAI LM Harness zero-shot tasks
  • MMLU (5-shot)