Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
73
Why It Matters For Business
You can cut ~20% of an LLM's parameters to reduce memory use and latency while keeping most zero-shot ability, then re-tune the pruned model in about 3 hours on one GPU with modest public data. That is much cheaper than retraining or distillation.
Summary TLDR
LLM-Pruner automatically finds dependent groups of weights in transformer LLMs and prunes groups (not individual weights). It uses gradient-based importance estimates and recovers performance quickly with LoRA (low-rank tuning) on only ~50k public samples in ~3 hours on one GPU. On LLaMA/Vicuna/ChatGLM, a 20% parameter cut keeps most zero-shot accuracy (≈90–95% of original after fast tuning) and reduces memory and latency materially. High pruning rates (≈50%) still cause large quality drops.
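The idea of "dependent groups" can be illustrated with a small graph sketch. The coupling rule used here (an edge couples two structures when the target has no other input or the source has no other output) is an assumption made for illustration and is not stated in this summary. The node names and the `dependency_groups` helper are hypothetical.

```python
from collections import defaultdict

def dependency_groups(edges):
    """Group nodes of a directed graph into coupled sets.

    Assumed rule (illustrative only): an edge u -> v couples u and v when
    v has no other input (in-degree 1) or u has no other output (out-degree 1).
    """
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    nodes = set()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        nodes.update((u, v))

    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        if in_deg[v] == 1 or out_deg[u] == 1:
            parent[find(u)] = find(v)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())

# Toy MLP with two hidden channels: row k of the up/gate projections feeds
# hidden unit k, which is read by column k of the down projection.
toy_edges = [
    ("up0", "h0"), ("gate0", "h0"), ("h0", "down0"),
    ("up1", "h1"), ("gate1", "h1"), ("h1", "down1"),
]
groups = dependency_groups(toy_edges)
```

In this toy, each hidden channel forms one group, so removing `h0` means removing `up0`, `gate0`, and `down0` together rather than leaving mismatched shapes behind.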
Problem Statement
Large LLMs are expensive to run. Prior compression either needs the original huge pretraining corpus or long post-training. We need a task-agnostic pruning method that (1) preserves multi-task/general capabilities, (2) minimizes dependence on original training data, and (3) recovers quickly with little compute and data.
Main Contribution
Automatic discovery of dependency groups in LLMs so coupled structures are pruned together.
Grouped importance estimation that uses first- and approximated second-order (gradient/Fisher) terms to rank groups for pruning.
Fast recovery using LoRA (low-rank adapters) on limited external data (≈50k samples) in ≈3 hours on one GPU.
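A minimal sketch of grouped importance ranking, assuming only the first-order term (|gradient × weight|) and assuming that per-weight scores are summed within a group. Both choices are simplifications for illustration. The group names, toy numbers, and helper functions are hypothetical.

```python
def group_importance(weights, grads):
    """First-order Taylor score for one group: sum of |grad * weight|."""
    return sum(abs(g * w) for w, g in zip(weights, grads))

def select_groups_to_prune(groups, prune_ratio):
    """Return the lowest-scoring group names, about prune_ratio of all groups.

    groups: dict mapping name -> (weights, grads), each a flat list of floats.
    """
    scores = {name: group_importance(w, g) for name, (w, g) in groups.items()}
    n_prune = int(len(scores) * prune_ratio)
    ranked = sorted(scores, key=scores.get)  # ascending importance
    return ranked[:n_prune]

# Made-up weights and gradients for five attention-head groups.
toy = {
    "head_0": ([0.5, -0.2], [0.1, 0.3]),
    "head_1": ([1.0, 0.8], [0.01, 0.02]),
    "head_2": ([0.3, 0.3], [0.5, 0.5]),
    "head_3": ([-0.9, 0.1], [0.2, 0.1]),
    "head_4": ([0.05, 0.05], [0.1, 0.1]),
}
to_prune = select_groups_to_prune(toy, prune_ratio=0.2)
```

Note that `head_1` has the largest weights but a low score because its gradients are small, which is the difference between this criterion and plain magnitude pruning.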
Key Findings
20% parameter pruning can be recovered to near-original zero-shot accuracy with limited tuning
Fast recovery needs only a small public tuning set (≈50k samples) and a short run (≈3 hours on one GPU)
Dependency-aware grouped pruning greatly outperforms naive (independent) pruning
Pruning reduces parameter count, memory, and latency (measured on LLaMA-7B)
Very high pruning rates still cause large quality drops even after tuning
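As a rough back-of-the-envelope check on the memory finding, weight memory can be estimated from parameter count. This sketch assumes fp16 storage (2 bytes per parameter) and counts weights only, not activations or KV cache. The parameter counts are the nominal figures from the model names (7B and 5.4B), not measured values.

```python
def weight_memory_mib(n_params, bytes_per_param=2):
    """Estimate weight-only memory in MiB (fp16 by default, an assumption)."""
    return n_params * bytes_per_param / 2**20

dense_mib = weight_memory_mib(7e9)     # nominal LLaMA-7B
pruned_mib = weight_memory_mib(5.4e9)  # nominal pruned LLaMA-5.4B
saving = 1 - pruned_mib / dense_mib    # fraction of weight memory removed
```

Measured memory will differ from this estimate because runtime buffers do not shrink in step with the weights.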
Results
Accuracy
Params / Memory / Latency (LLaMA-7B)
Recovery cost
Accuracy
Large-data recovery result
Who Should Care
What To Try In 7 Days
Run LLM-Pruner on a dev copy of your LLaMA/Vicuna model and prune 15–25%; measure memory and latency.
Recover with LoRA using ~50k instruction-like samples (Alpaca) for 2 epochs on one GPU and compare zero-shot metrics.
Compare dependency-grouped pruning against simple channel pruning, and leave the first and last layers unpruned because they are the most sensitive.
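For the latency measurements in the steps above, a small timing harness is enough. This is a generic sketch using only the standard library. `median_latency` and `relative_speedup` are hypothetical helpers, and `fn` stands in for whatever generation call you benchmark.

```python
import statistics
import time

def median_latency(fn, repeats=5, warmup=1):
    """Median wall-clock seconds of fn() over `repeats` runs after warmup."""
    for _ in range(warmup):
        fn()  # warm caches before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def relative_speedup(baseline_s, pruned_s):
    """Fraction of latency removed by pruning (0.2 means 20% faster)."""
    return 1 - pruned_s / baseline_s

# Placeholder workload; replace with the dense and pruned model calls.
example_latency = median_latency(lambda: sum(range(10_000)))
```

Use the same prompt, output length, and hardware for both models so the comparison reflects the pruning rather than the workload.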
Optimization Features
Infra Optimization
- enables single-GPU post-tuning instead of multi-GPU retraining
Model Optimization
- structured pruning (grouped/dependency-aware)
- group importance estimation using gradients and approximated Hessian/Fisher
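The gradient and approximated Hessian/Fisher terms can be written out as follows. This is a sketch of the standard second-order Taylor argument, not a transcription of the paper's equations. The symbols (W for a group's weights, g for the gradient, H for the Hessian, g_j for the gradient on sample j of N) are notation assumed here.

```latex
% Loss change when a group W is zeroed, from a second-order Taylor expansion
\[
I_W = \left| \mathcal{L}(W) - \mathcal{L}(W = 0) \right|
    \approx \left| g^\top W - \tfrac{1}{2}\, W^\top H W \right|,
\qquad g = \frac{\partial \mathcal{L}}{\partial W}
\]

% Fisher-style approximation of H from per-sample gradients g_j
\[
H \approx \frac{1}{N} \sum_{j=1}^{N} g_j g_j^\top
\quad\Longrightarrow\quad
W^\top H W \approx \frac{1}{N} \sum_{j=1}^{N} \left( g_j^\top W \right)^2
\]
```

The second line avoids forming H explicitly: only per-sample gradients are needed, which keeps the estimate affordable for billion-parameter models.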
System Optimization
- pruned models show lower latency on single-GPU inference
Training Optimization
- LoRA
- two-epoch recovery with small batches
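A minimal sketch of what a LoRA adapter computes, written with plain lists so it runs without dependencies. It assumes the usual formulation y = Wx + (alpha/r)·B(Ax) with W frozen. The `alpha` default and the helper names are illustrative, not settings from the paper.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, alpha=16):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B train.

    Shapes: w is d_out x d_in, a is r x d_in, b is d_out x r.
    """
    r = len(a)  # adapter rank = number of rows in A
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / r) * d for y, d in zip(base, delta)]

def lora_trainable_params(d_out, d_in, r):
    """Trainable parameters in one adapter pair (A and B)."""
    return r * (d_in + d_out)

w = [[1, 2], [3, 4]]
a = [[1, 1]]        # rank-1 adapter
b_zero = [[0], [0]]  # B starts at zero, so the adapter is a no-op at first
y_start = lora_forward(w, a, b_zero, [1, 1])
```

For a 4096 × 4096 projection at rank 8, `lora_trainable_params` gives 65,536 trainable values against roughly 16.8 million frozen ones, which is why recovery fits on one GPU.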
Inference Optimization
- reduced parameter count and memory
- combine pruning with int8 quantization for further memory savings
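To show what int8 quantization adds on top of pruning, here is a sketch of symmetric per-tensor quantization. It is a generic textbook scheme, assumed for illustration. The summary does not say which int8 method was used.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [x * scale for x in q]

weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Worst-case rounding error per weight is about scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight then takes 1 byte instead of 2 in fp16, so the saving stacks with the reduction in parameter count.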
Reproducibility
Data Urls
- https://github.com/tatsu-lab/stanford_alpaca
- WikiText2 and PTB (public)
- Dataset names listed in paper (BoolQ, PIQA, HellaSwag, WinoGrande, ARC, OBQA)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High pruning rates (≈50%) cause large quality loss even after recovery.
- Recovery quality depends on external tuning corpus; domain mismatch and overfitting are risks.
- Layer sensitivity: pruning uniformly across all layers (the channel strategy) also cuts into the sensitive first and last layers and hurts accuracy.
When Not To Use
- When you need maximum generation fidelity or long-form coherent text (avoid >40–50% pruning).
- If you lack any external tuning data and cannot run LoRA recovery.
- When strict guarantees on downstream task performance are required without validation.
Failure Modes
- Catastrophic drop in zero-shot performance if dependencies are ignored.
- Incoherent or repetitive long generations after pruning at high ratios.
- Overfitting to recovery dataset if tuned too long (observed after ~300–1000 steps).
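One way to guard against the overfitting failure mode is to keep the checkpoint with the lowest held-out loss and stop once it stops improving. This is a generic early-stopping sketch, not a procedure described in the summary. The loss values below are made up.

```python
def best_checkpoint(val_losses, patience=2):
    """Pick the step with the lowest validation loss.

    val_losses: list of (step, loss) in training order.
    Stops scanning after `patience` evaluations without improvement.
    """
    best_step, best_loss, stale = None, float("inf"), 0
    for step, loss in val_losses:
        if loss < best_loss:
            best_step, best_loss, stale = step, loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_step, best_loss

# Hypothetical held-out losses logged every 100 recovery steps.
history = [(100, 2.1), (200, 1.9), (300, 1.8), (400, 1.85), (500, 1.9), (600, 1.7)]
step, loss = best_checkpoint(history)
```

The held-out set should differ from the recovery corpus, for example a slice of the zero-shot benchmarks, so the check reflects general ability rather than fit to the tuning data.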
Core Entities
Models
- LLaMA-7B
- LLaMA-13B
- Vicuna-7B
- ChatGLM-6B
- LLaMA-5.4B (pruned)
- LLaMA-3B (pruned)
Metrics
- accuracy
- perplexity
- memory (MiB)
- latency (s)
- parameter count
Datasets
- Alpaca (≈50k)
- BookCorpus (calibration)
- DailyDialog (calibration for ChatGLM)
- WikiText2
- PTB
- BoolQ
- PIQA
- HellaSwag
- WinoGrande
- ARC-easy
- ARC-challenge
- OpenbookQA
Benchmarks
- Zero-shot classification (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OpenbookQA)
- Perplexity (WikiText2, PTB)
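Both benchmark types reduce to simple operations on token log-probabilities. The sketch below assumes natural-log probabilities. It scores zero-shot choices by summed log-likelihood, which is a common convention assumed here and not a detail taken from the summary. The helper names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-probability over a token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_choice(choice_logprobs):
    """Index of the answer whose tokens have the highest total log-prob.

    choice_logprobs: one list of token log-probs per candidate answer.
    """
    totals = [sum(lp) for lp in choice_logprobs]
    return totals.index(max(totals))

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Two candidate answers with made-up per-token log-probs.
chosen = pick_choice([[-2.0, -1.5], [-0.5, -0.7]])
```

Lower perplexity and higher zero-shot accuracy are better, so comparing pruned and dense models means tracking the two metrics in opposite directions.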

