Task‑agnostic structured pruning for LLMs that cuts params and memory, then recovers in hours with 50K samples

May 19, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Method is practical: automatic group discovery + gradient-based ranking plus LoRA recovery makes pruning usable in low-resource settings; evidence covers multiple open models and ablations.

Citations: 73

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Xinyin Ma, Gongfan Fang, Xinchao Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cut ~20% of an LLM's parameters to reduce memory use and latency while keeping most zero-shot ability, then re-tune it in hours on one GPU with modest public data, which is much cheaper than retraining or distillation.

Summary TLDR

LLM-Pruner automatically finds dependent groups of weights in transformer LLMs and prunes groups (not individual weights). It uses gradient-based importance estimates and recovers performance quickly with LoRA (low-rank tuning) on only ~50k public samples in ~3 hours on one GPU. On LLaMA/Vicuna/ChatGLM, a 20% parameter cut keeps most zero-shot accuracy (≈90–95% of original after fast tuning) and reduces memory and latency materially. High pruning rates (≈50%) still cause large quality drops.

Problem Statement

Large LLMs are expensive to run. Prior compression either needs the original huge pretraining corpus or long post-training. We need a task-agnostic pruning method that (1) preserves multi-task/general capabilities, (2) minimizes dependence on original training data, and (3) recovers quickly with little compute and data.

Main Contribution

Automatic discovery of dependency groups in LLMs so coupled structures are pruned together.

Grouped importance estimation that uses first- and approximated second-order (gradient/Fisher) terms to rank groups for pruning.
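
The two contributions above can be sketched on a toy MLP block. This is a minimal illustration, not the LLM-Pruner implementation: it assumes a hidden neuron's dependency group is row j of the up-projection plus column j of the down-projection, and it assumes a single-sample Fisher approximation for the second-order term. The function names and toy weights are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def element_importance(w: float, g: float) -> float:
    """Taylor-style score for one weight: first-order term g*w minus an
    approximated second-order term 0.5*(g*w)**2 (assumed Fisher approximation)."""
    gw = g * w
    return abs(gw - 0.5 * gw * gw)

def group_importance(w_up: Matrix, g_up: Matrix,
                     w_down: Matrix, g_down: Matrix) -> List[float]:
    """Sum element scores over each coupled group: row j of w_up + column j of w_down."""
    scores = []
    for j in range(len(w_up)):
        s = sum(element_importance(w, g) for w, g in zip(w_up[j], g_up[j]))
        s += sum(element_importance(w_down[i][j], g_down[i][j])
                 for i in range(len(w_down)))
        scores.append(s)
    return scores

def prune_groups(w_up: Matrix, w_down: Matrix, scores: List[float],
                 n_prune: int) -> Tuple[Matrix, Matrix, List[int]]:
    """Drop the n_prune lowest-scoring groups, removing coupled rows and columns together."""
    drop = set(sorted(range(len(scores)), key=lambda j: scores[j])[:n_prune])
    keep = [j for j in range(len(scores)) if j not in drop]
    new_up = [w_up[j] for j in keep]
    new_down = [[row[j] for j in keep] for row in w_down]
    return new_up, new_down, keep

# Toy example: 3 hidden neurons, input and output width 2 (made-up values).
w_up = [[1.0, 2.0], [0.1, 0.1], [0.5, -0.5]]
g_up = [[0.1, 0.1], [0.1, 0.1], [0.2, 0.2]]
w_down = [[1.0, 0.1, 0.5], [1.0, 0.1, 0.5]]
g_down = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]

scores = group_importance(w_up, g_up, w_down, g_down)
new_up, new_down, keep = prune_groups(w_up, w_down, scores, n_prune=1)
print("kept hidden neurons:", keep)
```

Pruning a whole group keeps the two matrices shape-compatible, which is the reason coupled structures have to be removed together rather than ranked independently.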

Key Findings

20% parameter pruning can be recovered to near-original zero-shot accuracy with limited tuning

Numbers: Pruned 20% → average accuracy 60.07; baseline 63.25; 94.97% retained

Practical Use: Prune ~20% to save memory/compute and use LoRA + 50k samples to recover most task-agnostic abilities in hours.

Evidence Ref: Table 1; main text (LLaMA results)

Fast recovery requires small public tuning data and short time

Numbers: Recovery: 50k Alpaca samples, 2 epochs, ≈3 hours on one GPU

Practical Use: You can compress and re-tune an LLM on a single GPU with modest public data instead of multi‑day retraining.

Evidence Ref: Implementation Details (Sec.4.1) and B.2
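
Why LoRA recovery is cheap can be seen from a small sketch of the low-rank update itself. It assumes the standard LoRA form W + (alpha / r) · B·A with B initialised to zero. The rank, alpha, and matrix sizes below are illustrative assumptions, not the paper's settings, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_fraction(d_out: int, d_in: int, r: int) -> float:
    """Share of a d_out x d_in layer's parameter count that LoRA actually trains."""
    return r * (d_in + d_out) / (d_out * d_in)

# Frozen (pruned) weight, 2 x 3, with a rank-1 adapter; B starts at zero.
w = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
a = [[0.3, -0.2, 0.1]]   # r x d_in
b_zero = [[0.0], [0.0]]  # d_out x r

# With B = 0 the adapted layer starts out identical to the pruned layer.
w_start = lora_effective_weight(w, a, b_zero, alpha=16.0)
print("unchanged at init:", w_start == w)

# Hypothetical 4096 x 4096 projection with rank 8.
print("trainable fraction:", trainable_fraction(4096, 4096, 8))
```

Only A and B receive gradients, so the optimizer state covers a small fraction of each layer, which is consistent with recovery fitting on a single GPU.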

Results

Metric: Accuracy
Value: 60.07 (post-tune after 20% prune)
Baseline: 63.25 (unpruned under eval prompt)
Delta: -3.18 (≈-5.03%)
Split / Dataset: 7 classification datasets (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA)
Evidence: Element 2 importance, 20% prune + LoRA recovery
Evidence Ref: Table 1; main text

Metric: Params / Memory / Latency (LLaMA-7B)
Value: 6.74B → 5.39B params; 12884.5MiB → 10363.6MiB; 69.32s → 61.50s
Baseline: 6.74B params, 12884.5MiB, 69.32s
Delta: ≈20% params; memory -19.6%; latency -11.3%
Split / Dataset: single-sentence inference, WikiText2 test (64 tokens)
Evidence: Measured inference stats for 20% pruning
Evidence Ref: Table 3
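
The deltas in these results follow directly from the reported values and baselines. A short check, using only the numbers quoted above (the helper name is hypothetical):

```python
def relative_change(new: float, base: float) -> float:
    """Signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0

# (pruned value, baseline) pairs as reported for the 20% pruning setting.
reported = {
    "accuracy": (60.07, 63.25),
    "params (B)": (5.39, 6.74),
    "memory (MiB)": (10363.6, 12884.5),
    "latency (s)": (61.50, 69.32),
}

for name, (pruned, base) in reported.items():
    print(f"{name:>13}: {relative_change(pruned, base):+.2f}%")

# Share of baseline accuracy retained after pruning and recovery.
retained = 60.07 / 63.25 * 100.0
print(f"accuracy retained: {retained:.2f}%")
```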

What To Try In 7 Days

Run LLM-Pruner on a dev copy of your LLaMA/Vicuna model and prune 15–25%; measure memory and latency.

Recover with LoRA using ~50k instruction-like samples (Alpaca) for 2 epochs on one GPU and compare zero-shot metrics.

Compare dependency-grouped pruning against simple channel pruning, and leave fragile layers alone (don't prune the first or last layers).
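
For the latency part of these steps, a minimal timing harness might look like the following. It is a sketch under assumptions: `time_call` is a hypothetical helper, the workload passed in stands for your own generation call, and on a GPU you would need to synchronize the device inside the callable before the clock is read, since kernels launch asynchronously.

```python
import statistics
import time
from typing import Callable, List

def time_call(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> float:
    """Return the median wall-clock seconds of fn() over `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload; replace with a fixed-prompt, fixed-length generation call
# for both the original and the pruned model.
baseline_s = time_call(lambda: sum(range(100_000)))
print(f"median latency: {baseline_s:.6f} s")
```

Using the same prompt and output length for both models keeps the comparison aligned with the single-sentence, 64-token setting the results report.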

Optimization Features

Infra Optimization
enables single-GPU post-tuning instead of multi-GPU retraining
Model Optimization
structured pruning (grouped/dependency-aware); group importance estimation using gradients and approximated Hessian/Fisher
System Optimization
pruned models show lower latency on single-GPU inference
Training Optimization
LoRA; two-epoch recovery with small batches
Inference Optimization
reduced parameter count and memory; combine pruning with int8 quantization for further memory savings
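
How pruning and int8 quantization stack can be estimated with back-of-the-envelope arithmetic. The sketch below counts weight storage only and assumes fixed bytes per parameter; it ignores activations, KV cache, and framework overhead, so measured figures will differ. The helper name is hypothetical.

```python
# Assumed storage cost per parameter for each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_mib(n_params: float, dtype: str) -> float:
    """Rough weight-only memory in MiB for a model with n_params parameters."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**20

# Parameter counts as reported for LLaMA-7B before and after 20% pruning.
for label, n_params in [("LLaMA-7B", 6.74e9), ("20% pruned", 5.39e9)]:
    for dtype in ("fp16", "int8"):
        print(f"{label:>10} {dtype}: {weight_memory_mib(n_params, dtype):,.1f} MiB")
```

The fp16 estimate for 5.39B parameters lands close to the measured 10363.6MiB, which suggests weights dominate memory in that measurement. Under the same assumption, int8 storage would roughly halve the weight footprint again.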

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/tatsu-lab/stanford_alpaca
WikiText2 and PTB (public)
Dataset names listed in paper (BoolQ, PIQA, HellaSwag, WinoGrande, ARC, OBQA)

Risks & Boundaries

Limitations

High pruning rates (≈50%) cause large quality loss even after recovery.

Recovery quality depends on external tuning corpus; domain mismatch and overfitting are risks.

When Not To Use

When you need maximum generation fidelity or long-form coherent text (avoid >40–50% pruning).

If you lack any external tuning data and cannot run LoRA recovery.

Failure Modes

Catastrophic drop in zero-shot performance if dependencies are ignored.

Incoherent or repetitive long generations after pruning at high ratios.

Core Entities

Models

LLaMA-7B, LLaMA-13B, Vicuna-7B, ChatGLM-6B, LLaMA-5.4B (pruned), LLaMA-3B (pruned)

Metrics

Accuracy, perplexity, memory (MiB), latency (s), parameter count

Datasets

Alpaca (≈50k), BookCorpus (calibration), DailyDialog (calibration for ChatGLM), WikiText2, PTB, BoolQ, PIQA, HellaSwag, WinoGrande, ARC-easy, ARC-challenge, OpenbookQA

Benchmarks

Zero-shot classification (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OpenbookQA); Perplexity (WikiText2, PTB)