Wanda: prune LLM weights by weight magnitude × input-activation norm — no retraining, much faster than prior LLM pruning

June 20, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Wanda is ready for production experiments: it reliably finds useful 50% sparse models without retraining, is robust to small calibration sets, and dramatically reduces pruning compute cost; extreme sparsity or strict latency targets may still need fine-tuning or more careful engineering.

Citations: 52

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

Why It Matters For Business

Wanda makes one-shot LLM pruning cheap and practical: you can cut ~50% of parameters without retraining and with minimal calibration data, saving memory and possibly inference cost while keeping most model quality.

Who Should Care

Summary TLDR

Wanda is a one-shot pruning method for pretrained large language models. It scores each weight by its magnitude |W_ij| times the ℓ2 norm of the corresponding input activation, then prunes the lowest-scoring weights separately for each output neuron. It needs no weight updates or retraining, is robust with very few calibration sequences, and matches or nearly matches SparseGPT, the prior best one-shot sparse method, on many LLaMA/LLaMA-2 benchmarks while computing importance scores hundreds of times faster. Structured N:M variants (e.g., 2:4) yield practical inference speedups but can lose more accuracy without fine-tuning.

Problem Statement

Pruning billion-parameter LLMs is hard: classic magnitude pruning fails, and recent accurate methods require costly per-layer weight updates or second-order computations that are too slow or memory-heavy for large models. We need a simple, cheap way to find useful sparse sub-networks in pretrained LLMs without retraining.

Main Contribution

A simple per-weight importance score: |W_ij| × ||X_j||_2, where ||X_j||_2 is the ℓ2 norm of input feature j, estimated from a small calibration set.

A per-output (per-row) comparison group: prune the lowest-scoring weights for each output neuron to keep balance across outputs.
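
The two points above can be sketched in a few lines of NumPy. This is a minimal illustration for a single linear layer, not the authors' released code: the function name `wanda_prune`, the toy shapes, and the random data are assumptions made for the example.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Return a pruned copy of W and the boolean keep-mask.

    W: (out_features, in_features) weight matrix of one linear layer.
    X: (num_tokens, in_features) calibration inputs to that layer.
    """
    # l2 norm of each input feature j across all calibration tokens.
    feat_norm = np.linalg.norm(X, axis=0)          # shape: (in_features,)
    # Importance score S_ij = |W_ij| * ||X_j||_2, broadcast over rows.
    score = np.abs(W) * feat_norm[None, :]
    # Per-output comparison group: rank weights within each row only.
    n_prune = int(W.shape[1] * sparsity)
    prune_idx = np.argsort(score, axis=1)[:, :n_prune]
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    # No weight update: surviving weights keep their original values.
    return W * mask, mask

# Toy example (hypothetical sizes): 4 output neurons, 8 input features.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(32, 8))
W_sparse, mask = wanda_prune(W, X, sparsity=0.5)
```

Because the ranking is done per row, every output neuron keeps the same number of weights, which is the balance across outputs that the second point describes.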

Key Findings

Wanda greatly reduces language-modeling loss vs magnitude pruning on LLaMA-7B at 50% sparsity

Numbers: Perplexity 7.26 (Wanda) vs 17.29 (magnitude) on WikiText (LLaMA-7B, 50%)

Practical Use: If you must prune an LLM without retraining, use Wanda rather than naive magnitude pruning; it keeps perplexity far lower.

Evidence Ref: Table 3

Wanda matches or nearly matches SparseGPT without any weight updates

Numbers: Perplexity 7.26 (Wanda) vs 7.22 (SparseGPT); zero-shot 54.21% vs 54.94% (LLaMA-7B, 50%)

Practical Use: You can avoid SparseGPT’s heavy weight-update step and still get similar quality for common sparsity targets (e.g., 50%).

Evidence Ref: Table 2 and Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
WikiText perplexity (lower is better) | 7.26 (Wanda, LLaMA-7B, 50% sparsity) | 5.68 (dense LLaMA-7B) | +1.58 | WikiText validation | Table 3, LLaMA-7B 50% row | Table 3
Accuracy | 66.67% (Wanda, LLaMA-65B, unstructured 50%) | 66.97% (dense LLaMA-65B) | -0.30 pp | EleutherAI LM Harness (7 zero-shot tasks) | Table 2, LLaMA-65B 50% row | Table 2

What To Try In 7 Days

Run Wanda on a single LLaMA-family model to 50% unstructured sparsity and compare perplexity and zero-shot accuracy to dense and magnitude-pruned baselines.

Test calibration sensitivity: prune with 1, 16, and 128 sequences from your domain to gauge robustness; use 128 if available.

If you need runtime speedups, apply structured 2:4 masks and measure end-to-end latency using CUTLASS or your hardware GEMM kernels.
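
For the structured 2:4 step, the mask itself is simple to build: within every group of 4 consecutive input weights, keep the 2 with the highest scores. The sketch below assumes the scores are the |W_ij| × ||X_j||_2 values described earlier. The helper name `nm_mask` and the toy scores are hypothetical, and the sketch only builds the mask. Turning it into a latency gain still requires sparse kernels on hardware that supports them.

```python
import numpy as np

def nm_mask(score, n=2, m=4):
    """Keep the n highest-scoring weights in every group of m consecutive inputs."""
    rows, cols = score.shape
    assert cols % m == 0, "in_features must be divisible by m"
    groups = score.reshape(rows, cols // m, m)
    # Indices of the (m - n) lowest scores inside each group.
    drop = np.argsort(groups, axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return mask.reshape(rows, cols)

# One output row, two groups of four (toy importance scores).
score = np.array([[0.9, 0.1, 0.5, 0.3, 0.2, 0.8, 0.7, 0.4]])
mask = nm_mask(score)   # keeps 0.9 and 0.5 in group 1, 0.8 and 0.7 in group 2
```

The 2:4 constraint is stricter than per-row 50% pruning because the two survivors must come from each group of four, which is one reason structured masks can cost more accuracy.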

Optimization Features

Infra Optimization
Computing the metric avoids matrix inverses, so it takes much less CPU/GPU time than SparseGPT
Model Optimization
Unstructured weight pruning (per-output)
Structured N:M pruning (2:4, 4:8)
Importance metric: |W_ij| × ||X_j||_2
System Optimization
N:M masks compatible with NVIDIA sparse tensor cores and CUTLASS kernels
Training Optimization
No retraining required to produce usable sparse model
LoRA
Inference Optimization
2:4 structured masks enable ~1.6× linear-layer GEMM speedup
End-to-end speedup observed (e.g., 1.24× on LLaMA-7B)
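
The gap between the ~1.6× kernel speedup and the 1.24× end-to-end figure is what Amdahl's law would predict when only part of the runtime is spent in the accelerated linear layers. The sketch below is back-of-the-envelope arithmetic on those two quoted numbers, not a measurement from the paper. It assumes a single uniform speedup applies to all linear-layer time, and both function names are hypothetical.

```python
def end_to_end_speedup(linear_fraction, linear_speedup):
    """Amdahl's law: only the linear-layer share of runtime gets faster."""
    return 1.0 / ((1.0 - linear_fraction) + linear_fraction / linear_speedup)

def implied_linear_fraction(total_speedup, linear_speedup):
    """Invert Amdahl's law for the share of runtime spent in the sped-up part."""
    return (1.0 - 1.0 / total_speedup) / (1.0 - 1.0 / linear_speedup)

# Figures quoted above: ~1.6x on linear-layer GEMMs, 1.24x end to end.
f = implied_linear_fraction(1.24, 1.6)
```

Under that assumption, `f` comes out at roughly half of the dense runtime. For a latency target, measure this fraction on your own serving stack before assuming the kernel-level speedup will carry through.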

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

At extreme sparsity (e.g., ≥80%) performance degrades substantially; at that point, training a smaller dense model is the better option.

Structured sparsity can hurt accuracy more than unstructured and sometimes benefits from fine-tuning.

When Not To Use

When targeting very high sparsity (≥70–80%) — SparseGPT with weight updates or retraining may help.

When you have abundant compute and plan to do heavy post-pruning retraining — weight-update methods may yield marginally better results.

Failure Modes

Poor calibration data leads to wrong activation norms and worse pruning decisions.

Very high sparsity can produce unusable models without iterative weight updates.
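
The first failure mode is easier to reason about once it is clear how activation norms come out of calibration data. One way to obtain them is to accumulate per-feature squared sums batch by batch, as sketched below. The class name `FeatureNormAccumulator` and the random batches are assumptions for illustration. In practice the inputs would be captured from each linear layer during a forward pass over the calibration sequences.

```python
import numpy as np

class FeatureNormAccumulator:
    """Running l2 norm of each input feature over calibration batches."""

    def __init__(self, in_features):
        self.sq_sum = np.zeros(in_features)
        self.tokens = 0

    def update(self, X):
        # X: (num_tokens, in_features) inputs seen by one linear layer.
        self.sq_sum += np.square(X).sum(axis=0)
        self.tokens += X.shape[0]

    def norms(self):
        # sqrt of the summed squares equals the l2 norm over all tokens seen.
        return np.sqrt(self.sq_sum)

# Stand-in calibration data: 4 batches of 16 tokens, 8 features each.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(16, 8)) for _ in range(4)]
acc = FeatureNormAccumulator(8)
for X in batches:
    acc.update(X)
```

Dividing every norm by the same token count would not change which weights are pruned, since it rescales all features equally. What does change the outcome is calibration text whose activations differ from the deployment domain, which is why sampling sequences from your own data is worth testing.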

Core Entities

Models

LLaMA, LLaMA-2, OPT, BLOOM, Pythia

Metrics

Perplexity, Accuracy, Inference latency (ms)

Datasets

WikiText (validation), C4 (calibration)

Benchmarks

EleutherAI LM Harness zero-shot tasks, MMLU (5-shot)