Task‑agnostic structured pruning for LLMs that cuts params and memory, then recovers in hours with 50K samples

May 19, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

73

Authors

Xinyin Ma, Gongfan Fang, Xinchao Wang

Links

Abstract / PDF

Why It Matters For Business

You can cut ~20% of an LLM's parameters to reduce memory use and latency while keeping most zero‑shot ability, then re-tune it in hours on one GPU with modest public data. That is much cheaper than retraining or distillation.

Summary TLDR

LLM-Pruner automatically finds dependent groups of weights in transformer LLMs and prunes whole groups rather than individual weights. It ranks groups with gradient-based importance estimates and recovers performance quickly with LoRA (low-rank tuning) on only ~50k public samples in ~3 hours on one GPU. On LLaMA/Vicuna/ChatGLM, a 20% parameter cut keeps most zero-shot accuracy (≈90–95% of the original after fast tuning); on LLaMA-7B it lowers memory by about 20% and latency by about 11%. High pruning rates (≈50%) still cause large quality drops.

Problem Statement

LLMs are expensive to run. Prior compression methods need either the original, very large pretraining corpus or a long post-training phase. We need a task-agnostic pruning method that (1) preserves multi-task/general capabilities, (2) minimizes dependence on original training data, and (3) recovers quickly with little compute and data.

Main Contribution

Automatic discovery of dependency groups in LLMs so coupled structures are pruned together.

Grouped importance estimation that uses first- and approximated second-order (gradient/Fisher) terms to rank groups for pruning.

Fast recovery using LoRA (low-rank adapters) on limited external data (≈50k samples) in ≈3 hours on one GPU.
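The first two contributions can be illustrated on a toy feed-forward block. The sketch below is a minimal illustration, not the authors' implementation. It assumes a two-matrix MLP in which hidden neuron j couples row j of the up-projection with column j of the down-projection, scores each group with a first-order term only (|weight × gradient| summed over the group), and removes the lowest-scoring groups from both matrices together. The function names, the sum aggregation, and the toy values are assumptions made for illustration; the approximated second-order (Fisher) term is omitted.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def group_importance(w_up: Matrix, g_up: Matrix,
                     w_down: Matrix, g_down: Matrix) -> List[float]:
    """Score hidden neuron j by summing |w * grad| over its coupled weights:
    row j of w_up (hidden x d_in) and column j of w_down (d_out x hidden)."""
    scores = []
    for j in range(len(w_up)):
        up_term = sum(abs(w * g) for w, g in zip(w_up[j], g_up[j]))
        down_term = sum(abs(w_row[j] * g_row[j])
                        for w_row, g_row in zip(w_down, g_down))
        scores.append(up_term + down_term)
    return scores


def prune_groups(w_up: Matrix, w_down: Matrix, scores: List[float],
                 ratio: float) -> Tuple[Matrix, Matrix, List[int]]:
    """Drop the lowest-scoring fraction of hidden neurons from BOTH matrices,
    so the two layers keep matching shapes."""
    hidden = len(w_up)
    n_drop = int(hidden * ratio)
    drop = set(sorted(range(hidden), key=lambda j: scores[j])[:n_drop])
    keep = [j for j in range(hidden) if j not in drop]
    new_up = [w_up[j] for j in keep]                      # remove rows
    new_down = [[row[j] for j in keep] for row in w_down]  # remove columns
    return new_up, new_down, keep


# Toy values (hypothetical): 4 hidden neurons, d_in = d_out = 2.
w_up = [[1.0, -2.0], [0.1, 0.1], [0.5, 0.5], [3.0, 0.0]]
g_up = [[0.5, 0.5], [0.1, 0.1], [1.0, 1.0], [0.2, 0.2]]
w_down = [[1.0, 0.1, -1.0, 2.0], [0.0, 0.1, 1.0, 1.0]]
g_down = [[0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1]]

scores = group_importance(w_up, g_up, w_down, g_down)
new_up, new_down, keep = prune_groups(w_up, w_down, scores, ratio=0.25)
print("kept hidden neurons:", keep)
```

Pruning only the row of `w_up` without the matching column of `w_down` would leave the two layers with incompatible shapes, which is why coupled structures have to be removed as one group.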

Key Findings

20% parameter pruning can be recovered to near-original zero-shot accuracy with limited tuning

Numbers: Pruned 20% → average accuracy 60.07; baseline 63.25 → 94.97% retained

Fast recovery requires small public tuning data and short time

Numbers: 50k Alpaca samples, 2 epochs, ≈3 hours on one GPU

Dependency-aware grouped pruning greatly outperforms naive (independent) pruning

Numbers: Average accuracy w/o dependency 38.32 vs with dependency 56.69 (w/o tuning)

Pruning cuts memory roughly in proportion to the parameter cut; latency falls by a smaller share

Numbers: LLaMA-7B: 6.74B→5.39B params (20%); memory 12884.5→10363.6 MiB; latency 69.32s→61.50s
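The relative reductions implied by these LLaMA-7B figures can be checked with a few lines of arithmetic. The helper name below is an illustrative choice; the inputs are the numbers quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# (before, after) pairs taken from the LLaMA-7B figures quoted in the text.
figures = {
    "params (B)": (6.74, 5.39),
    "memory (MiB)": (12884.5, 10363.6),
    "latency (s)": (69.32, 61.50),
}

for name, (before, after) in figures.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% lower")
```

Parameters and memory both drop by roughly 20%, while latency drops by roughly 11%, so the latency gain is smaller than the parameter cut.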

Very high pruning rates still cause large quality drops even after tuning

Numbers: 50% pruning shows large accuracy and PPL degradation; tuning improves but gaps remain

Results

Accuracy

Value: 60.07 (post-tune after 20% prune)

Baseline: 63.25 (unpruned under eval prompt)

Params / Memory / Latency (LLaMA-7B)

Value: 6.74B → 5.39B params; 12884.5 MiB → 10363.6 MiB; 69.32s → 61.50s

Baseline: 6.74B params, 12884.5 MiB, 69.32s

Recovery cost

Value: ≈3 hours on one GPU, 2 epochs, LoRA rank=8, batch 64–128, AdamW

Baseline: full post-training (days)

Accuracy (dependency ablation)

Value: With dependency grouping: 56.69 (w/o tune); 59.23 (w/ tune)

Baseline: Without dependency grouping: 38.32 (w/o tune); 38.10 (w/ tune)

Large-data recovery result

Value: Recovery with 2.59M samples narrows the average accuracy gap to the unpruned model to ~0.89%

Baseline: 50k-sample recovery average 59.23

Who Should Care

What To Try In 7 Days

Run LLM-Pruner on a dev copy of your LLaMA/Vicuna model and prune 15–25%; measure memory and latency.

Recover with LoRA using ~50k instruction-like samples (Alpaca) for 2 epochs on one GPU and compare zero-shot metrics.

Compare dependency-grouped pruning against simple channel pruning, and leave the fragile first and last layers unpruned.
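For the LoRA recovery step, the sketch below shows the low-rank update itself in plain Python. It is a minimal illustration under stated assumptions, not the authors' training code: the frozen weight W is augmented with a trainable product B·A scaled by alpha/rank, B starts at zero so the model is unchanged before tuning, and only A and B would be trained. Rank 8 matches the setting reported above; `alpha=16`, the helper names, and the 4096×4096 layer size are illustrative assumptions.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def init_lora(d_out: int, d_in: int, rank: int,
              seed: int = 0) -> Tuple[Matrix, Matrix]:
    """A (rank x d_in) gets small random values; B (d_out x rank) starts at
    zero, so the initial update B @ A is exactly zero."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def effective_weight(w: Matrix, a: Matrix, b: Matrix,
                     alpha: float, rank: int) -> Matrix:
    """Frozen weight plus the scaled low-rank update: W + (alpha / rank) * B @ A."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of parameters trained by LoRA relative to the full matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)


# Illustrative layer size (assumption): a 4096 x 4096 projection, rank 8.
print(f"trainable share per layer: {trainable_fraction(4096, 4096, 8):.4%}")
```

Because only the two small matrices receive gradients, optimizer state and gradient memory shrink accordingly, which is what makes single-GPU recovery plausible.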

Optimization Features

Infra Optimization

  • enables single-GPU post-tuning instead of multi-GPU retraining

Model Optimization

  • structured pruning (grouped/dependency-aware)
  • group importance estimation using gradients and approximated Hessian/Fisher

System Optimization

  • pruned models show lower latency on single-GPU inference

Training Optimization

  • LoRA
  • two-epoch recovery (batch size 64–128)

Inference Optimization

  • reduced parameter count and memory
  • combine pruning with int8 quantization for further memory savings
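A back-of-the-envelope estimate shows how pruning and int8 quantization stack. The sketch below counts weight storage only and assumes 2 bytes per parameter for fp16 and 1 byte for int8; it ignores activations, the KV cache, and quantization metadata, so real measurements will differ. The function name and the byte table are illustrative assumptions, and the int8 figures it prints are estimates, not results from the paper.

```python
# Bytes per parameter for each storage format (weights only, an assumption).
BYTES_PER_PARAM = {"fp16": 2, "int8": 1}


def weight_memory_mib(num_params: float, dtype: str) -> float:
    """Approximate weight storage in MiB for a given parameter count."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


# Parameter counts quoted for LLaMA-7B before and after 20% pruning.
for label, params in [("unpruned", 6.74e9), ("20% pruned", 5.39e9)]:
    for dtype in ("fp16", "int8"):
        print(f"{label:>10} {dtype}: ~{weight_memory_mib(params, dtype):,.0f} MiB")
```

The fp16 estimate for the unpruned model lands close to the 12884.5 MiB reported above, which suggests the reported memory is dominated by weights.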

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High pruning rates (≈50%) cause large quality loss even after recovery.
  • Recovery quality depends on external tuning corpus; domain mismatch and overfitting are risks.
  • Layer sensitivity: pruning uniformly (channel strategy) may remove critical first/last layers and hurt accuracy.

When Not To Use

  • When you need maximum generation fidelity or long-form coherent text (avoid >40–50% pruning).
  • If you lack any external tuning data and cannot run LoRA recovery.
  • When strict guarantees on downstream task performance are required without validation.

Failure Modes

  • Catastrophic drop in zero-shot performance if dependencies are ignored.
  • Incoherent or repetitive long generations after pruning at high ratios.
  • Overfitting to recovery dataset if tuned too long (observed after ~300–1000 steps).
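One common guard against the overfitting failure mode is to evaluate on held-out data during recovery and keep the checkpoint with the lowest validation loss. The sketch below is a generic early-stopping helper, not something the paper prescribes; the function name, the `patience` value, and the loss values in the example are made up for illustration.

```python
from typing import List, Optional, Tuple


def best_checkpoint_step(eval_losses: List[Tuple[int, float]],
                         patience: int = 2) -> Optional[int]:
    """Return the step with the lowest validation loss seen before the loss
    fails to improve for `patience` consecutive evaluations."""
    best_step: Optional[int] = None
    best_loss = float("inf")
    bad_evals = 0
    for step, loss in eval_losses:
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # stop tuning; later steps are not examined
    return best_step


# Hypothetical (step, validation loss) pairs, NOT measurements from the paper.
history = [(100, 2.10), (200, 1.90), (300, 1.80), (400, 1.85), (500, 1.90)]
print("keep checkpoint from step:", best_checkpoint_step(history))
```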

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • Vicuna-7B
  • ChatGLM-6B
  • LLaMA-5.4B (pruned)
  • LLaMA-3B (pruned)

Metrics

  • Accuracy
  • perplexity
  • memory (MiB)
  • latency (s)
  • parameter count

Datasets

  • Alpaca (≈50k)
  • BookCorpus (calibration)
  • DailyDialog (calibration for ChatGLM)
  • WikiText2
  • PTB
  • BoolQ
  • PIQA
  • HellaSwag
  • WinoGrande
  • ARC-easy
  • ARC-challenge
  • OpenbookQA

Benchmarks

  • Zero-shot classification (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OpenbookQA)
  • Perplexity (WikiText2, PTB)