Wanda: prune LLM weights by weight magnitude × input-activation norm — no retraining, much faster than prior LLM pruning

Overview

Decision Snapshot: Ready For Pilot

Wanda is ready for production experiments: it reliably finds useful 50% sparse models without retraining, is robust to small calibration sets, and dramatically reduces pruning compute cost; extreme sparsity or strict latency targets may still need fine-tuning or more careful engineering.

Citations: 52

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

Why It Matters For Business

Wanda makes one-shot LLM pruning cheap and practical: you can cut ~50% of parameters without retraining and with minimal calibration data, saving memory and possibly inference cost while keeping most model quality.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist, Product Manager

Summary TLDR

Wanda is a one-shot pruning method for pretrained large language models. It ranks each weight by |weight| times the ℓ2 norm of its input activation and prunes within each output neuron. It needs no weight updates or retraining, stays robust with very few calibration sequences, and matches or nearly matches SparseGPT, the prior best one-shot pruning method, on many LLaMA/LLaMA-2 benchmarks while computing importance scores hundreds of times faster. Structured N:M variants (e.g., 2:4) yield practical inference speedups but can lose more accuracy without fine-tuning.

Problem Statement

Pruning billion-parameter LLMs is hard: classic magnitude pruning fails, and recent accurate methods require costly per-layer weight updates or second-order computations that are too slow or memory-heavy for large models. We need a simple, cheap way to find useful sparse sub-networks in pretrained LLMs without retraining.

Main Contribution

A simple importance score: per-weight |W_ij| × ℓ2-norm(input feature j) estimated from a small calibration set.

A per-output (per-row) comparison group: prune the lowest-scoring weights for each output neuron to keep balance across outputs.
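The two ideas above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `feature_norms` and `wanda_prune` and the toy matrices are made up for this example, and a real run would operate on layer weights and activations collected from calibration data.

```python
import math

def feature_norms(X):
    """l2 norm of each input feature (column) over all calibration tokens (rows)."""
    n_features = len(X[0])
    return [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_features)]

def wanda_prune(W, X, sparsity=0.5):
    """Zero the lowest-scoring weights in each row of W (one row per output neuron)."""
    norms = feature_norms(X)
    pruned = []
    for row in W:
        # Importance score: |W_ij| times the l2 norm of input feature j.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        n_prune = int(len(row) * sparsity)
        # Rank within this row only, so every output neuron loses the same share.
        drop = set(sorted(range(len(row)), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy data (made up): 2 output neurons, 4 input features, 3 calibration tokens.
W = [[0.9, -0.1, 0.4, 0.3],
     [0.05, 0.8, -0.3, 0.6]]
X = [[0.1, 4.0, 0.5, 1.0],
     [0.2, 3.0, 0.5, 1.0],
     [0.1, 5.0, 0.5, 1.0]]

W_sparse = wanda_prune(W, X, sparsity=0.5)
print(W_sparse)  # [[0.0, -0.1, 0.0, 0.3], [0.0, 0.8, 0.0, 0.6]]
```

In the first row the largest weight (0.9) is pruned because its input feature is nearly always small, while the small weight (-0.1) survives because its input has a large norm. Plain magnitude pruning would make the opposite choice.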

Key Findings

Wanda greatly reduces language-modeling loss vs magnitude pruning on LLaMA-7B at 50% sparsity

Numbers: Perplexity 7.26 (Wanda) vs 17.29 (magnitude) on WikiText (LLaMA-7B, 50%)

Practical Use: If you must prune an LLM without retraining, use Wanda rather than naive magnitude pruning; perplexity stays far lower.

Evidence Ref: Table 3

Wanda matches or nearly matches SparseGPT without any weight updates

Numbers: Perplexity 7.26 (Wanda) vs 7.22 (SparseGPT); zero-shot 54.21% vs 54.94% (LLaMA-7B, 50%)

Practical Use: You can avoid SparseGPT’s heavy weight-update step and still get similar quality for common sparsity targets (e.g., 50%).

Evidence Ref: Table 2 and Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WikiText perplexity (lower is better)	7.26 (Wanda, LLaMA-7B, 50% sparsity)	5.68 (dense LLaMA-7B)	+1.58	WikiText validation	Table 3 LLaMA-7B 50% row	Table 3
Zero-shot accuracy (higher is better)	66.67% (Wanda, LLaMA-65B, unstructured 50%)	66.97% (dense LLaMA-65B)	-0.30 points	EleutherAI LM Harness (7 zero-shot tasks)	Table 2 LLaMA-65B 50% row	Table 2
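As a quick check on how the Delta column is derived, the arithmetic can be reproduced directly from the Value and Baseline columns. The helper name `delta` is made up for this sketch, and the relative increase is a derived figure, not one reported in the table.

```python
def delta(value, baseline, ndigits=2):
    """Absolute change of a pruned-model metric relative to its dense baseline."""
    return round(value - baseline, ndigits)

# Values copied from the Results table above.
ppl_delta = delta(7.26, 5.68)     # WikiText perplexity, LLaMA-7B, 50% sparsity
acc_delta = delta(66.67, 66.97)   # zero-shot accuracy, LLaMA-65B, 50% sparsity

# Derived here, not reported in the table: perplexity increase relative to dense.
ppl_rel_increase_pct = round(100 * (7.26 - 5.68) / 5.68, 1)

print(ppl_delta, acc_delta, ppl_rel_increase_pct)  # 1.58 -0.3 27.8
```

The two rows use different model sizes, so the deltas are not directly comparable to each other.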

What To Try In 7 Days

Run Wanda on a single LLaMA-family model to 50% unstructured sparsity and compare perplexity and zero-shot accuracy to dense and magnitude-pruned baselines.

Test calibration sensitivity: prune with 1, 16, 128 sequences from your domain to see robustness and choose 128 if available.

If you need runtime speedups, apply structured 2:4 masks and measure end-to-end latency using CUTLASS or your hardware GEMM kernels.
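Before running full evaluations for the calibration test above, one cheap proxy is to check how much the pruning mask itself changes with the number of calibration sequences. The sketch below uses synthetic activations and a mask-agreement measure; both are assumptions of this example, not something the source reports, and the helper names are hypothetical. Perplexity and zero-shot accuracy remain the measurements that matter.

```python
import math
import random

def feature_norms(tokens):
    """l2 norm of each input feature over a list of token activation vectors."""
    return [math.sqrt(sum(t[j] ** 2 for t in tokens)) for j in range(len(tokens[0]))]

def keep_sets(W, norms, sparsity=0.5):
    """Per output row, the set of column indices that survive pruning."""
    kept = []
    for row in W:
        order = sorted(range(len(row)), key=lambda j: abs(row[j]) * norms[j])
        kept.append(set(order[int(len(row) * sparsity):]))
    return kept

def agreement(a, b):
    """Fraction of positions kept in mask a that are also kept in mask b."""
    shared = sum(len(ra & rb) for ra, rb in zip(a, b))
    return shared / sum(len(ra) for ra in a)

random.seed(0)
n_in, n_out, seq_len = 16, 8, 32
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
# Synthetic per-feature scale standing in for real activation statistics.
scale = [random.uniform(0.5, 3.0) for _ in range(n_in)]

def make_sequence():
    return [[random.gauss(0, s) for s in scale] for _ in range(seq_len)]

sequences = [make_sequence() for _ in range(128)]

def mask_for(n_seq):
    """Pruning mask computed from the first n_seq calibration sequences."""
    tokens = [tok for seq in sequences[:n_seq] for tok in seq]
    return keep_sets(W, feature_norms(tokens))

reference = mask_for(128)
results = {n: agreement(mask_for(n), reference) for n in (1, 16, 128)}
for n, frac in results.items():
    print(f"{n:>3} sequences: {frac:.2%} of kept weights match the 128-sequence mask")
```

If the mask is already stable at a small number of sequences on your own activations, that is a hint (not proof) that larger calibration sets will change the pruned model little.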

Optimization Features

Infra Optimization

Computing the metric avoids matrix inverses, so it takes much less CPU/GPU time than SparseGPT

Model Optimization

Unstructured weight pruning (per-output)

Structured N:M pruning (2:4, 4:8)

Importance metric: |W_ij| × ||X_j||_2

System Optimization

N:M masks compatible with NVIDIA sparse tensor cores and CUTLASS kernels
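For N:M sparsity, the comparison group shrinks from a whole output row to each consecutive block of M weights, and only the N highest-scoring weights in each block are kept. The sketch below shows mask selection only, assuming the same |weight| × activation-norm score; the function name `nm_prune` and the toy numbers are made up, and any speedup depends on running the result through sparse kernels, which this example does not do.

```python
def nm_prune(row, scores, n=2, m=4):
    """Keep the n highest-scoring weights in each consecutive group of m."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = list(row)
    for start in range(0, len(row), m):
        group = range(start, start + m)
        keep = set(sorted(group, key=lambda j: scores[j], reverse=True)[:n])
        for j in group:
            if j not in keep:
                out[j] = 0.0
    return out

# Toy row of 8 weights and made-up activation norms for its 8 input features.
row = [0.5, -0.2, 0.1, 0.9, 0.3, 0.3, -0.7, 0.05]
norms = [1.0, 4.0, 1.0, 0.5, 2.0, 1.0, 1.0, 10.0]
scores = [abs(w) * x for w, x in zip(row, norms)]

pruned = nm_prune(row, scores, n=2, m=4)
print(pruned)  # [0.5, -0.2, 0.0, 0.0, 0.3, 0.0, -0.7, 0.0]
```

Because every block of four must give up exactly two weights, a block can be forced to drop a weight that unstructured pruning would have kept, which is one reason structured masks tend to cost more accuracy.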

Training Optimization

No retraining required to produce usable sparse model

LoRA

Inference Optimization

2:4 structured masks enable ~1.6× linear-layer GEMM speedup

End-to-end speedup observed (e.g., 1.24× on LLaMA-7B)
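The gap between the ~1.6× GEMM speedup and the 1.24× end-to-end figure is what Amdahl's law predicts when only part of the runtime is accelerated. The sketch below is a back-of-the-envelope estimate under a strong assumption made for this example: that linear-layer GEMMs are the only part that gets faster, all by the same 1.6×. The resulting fraction is an inference, not a number from the source.

```python
def end_to_end_speedup(linear_fraction, linear_speedup):
    """Amdahl's law: overall speedup when only a fraction of runtime is accelerated."""
    return 1.0 / ((1.0 - linear_fraction) + linear_fraction / linear_speedup)

def implied_linear_fraction(end_to_end, linear_speedup):
    """Invert Amdahl's law to find the accelerated fraction of runtime."""
    return (1.0 - 1.0 / end_to_end) / (1.0 - 1.0 / linear_speedup)

f = implied_linear_fraction(1.24, 1.6)
print(f"implied share of runtime in accelerated GEMMs: {f:.1%}")
print(f"round-trip check: {end_to_end_speedup(f, 1.6):.2f}x")
```

Under that assumption, roughly half of the dense model's latency sits in the accelerated GEMMs, so the remaining operations cap how much 2:4 sparsity alone can help.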

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/locuslab/wanda

Data URLs

https://huggingface.co/datasets/wikitext

https://www.tensorflow.org/datasets/catalog/c4

Risks & Boundaries

Limitations

At extreme sparsity (e.g., ≥80%), performance degrades substantially; training a smaller dense model is likely the better option at that point.

Structured sparsity can hurt accuracy more than unstructured and sometimes benefits from fine-tuning.

When Not To Use

When targeting very high sparsity (≥70–80%) — SparseGPT with weight updates or retraining may help.

When you have abundant compute and plan to do heavy post-pruning retraining — weight-update methods may yield marginally better results.

Failure Modes

Poor calibration data leads to wrong activation norms and worse pruning decisions.

Very high sparsity can produce unusable models without iterative weight updates.

Core Entities

Models

LLaMA, LLaMA-2, OPT, BLOOM, Pythia

Metrics

Perplexity, Accuracy, Inference latency (ms)

Datasets

WikiText (validation), C4 (calibration)

Benchmarks

EleutherAI LM Harness zero-shot tasks, MMLU (5-shot)

