Prune LLMs with LoRA-aware dependency graphs, progressive pruning, and dynamic recovery to cut footprint with limited GPUs

October 24, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

4

Authors

Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang

Links

Abstract / PDF

Why It Matters For Business

LoRAShear aims to cut LLM parameter footprint with a workflow that fits a single A100 GPU, enabling smaller teams to try structured pruning without large clusters while keeping practical accuracy on some benchmarks.

Summary TLDR

LoRAShear is a practical pipeline to structurally prune large language models (LLMs) that were adapted with LoRA (a low-rank adapter). It finds minimal removable groups via a dependency graph, prunes them progressively with a new optimizer (LHSPG) that transfers knowledge into remaining parts, then recovers lost capability via an adaptive two-stage fine-tuning on subsets of pretraining and instruction data. Authors report running the pipeline on LLAMA v1 on one A100 in a few GPU-days and claim a 20% parameter reduction with ~1% performance loss and stronger results than several prior methods on common benchmarks.

Problem Statement

Large LLMs cost a lot to run and are hard to prune safely when you only have limited compute and no access to the full original training runs. Existing structured pruners either need heavy compute or do not integrate LoRA adapters, leading to large accuracy drops. The paper aims to enable hardware-friendly structured pruning for LoRA-adapted LLMs under limited GPU resources while recovering lost knowledge.

Main Contribution

Dependency-graph discovery that includes LoRA modules to find minimally removable structures before pruning.

LHSPG: a progressive structured-sparsity optimizer that uses LoRA variables to transfer knowledge from pruned groups into survivors.

Dynamic knowledge recovery: iterative fine-tuning on adaptively sampled subsets of pretraining and instruction data to regain general and instruction knowledge.

Practical demonstration on LLAMA v1 showing a limited-GPU workflow (one A100, a few GPU-days) and comparisons to recent pruning methods.
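The dependency-graph idea can be illustrated with a small sketch. It assumes the standard LoRA layout (an adapter pair B, A added to a base weight W), so pruning an output row of W also removes the matching row of B, and pruning an input column of W removes the matching column of A. The layer names and coupling list are hypothetical, and the union-find grouping is a stand-in for illustration, not the paper's graph algorithm.

```python
from collections import defaultdict

def find(parent, x):
    # Follow parent pointers to the set representative, compressing the path.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def removable_groups(nodes, couplings):
    """Merge tensors that must be pruned together into groups (union-find)."""
    parent = {n: n for n in nodes}
    for a, b in couplings:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = defaultdict(list)
    for n in nodes:
        groups[find(parent, n)].append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical toy tensor names; not taken from the paper.
nodes = ["mlp.up.W", "mlp.up.lora_B", "mlp.down.W", "mlp.down.lora_A", "attn.q.W"]
couplings = [
    ("mlp.up.W", "mlp.up.lora_B"),      # LoRA B shares output rows with the base weight
    ("mlp.up.W", "mlp.down.W"),         # up-projection outputs feed down-projection inputs
    ("mlp.down.W", "mlp.down.lora_A"),  # LoRA A shares input columns with the base weight
]
groups = removable_groups(nodes, couplings)
for g in groups:
    print(g)
```

In this toy case the four MLP tensors collapse into one group because they share the same hidden dimension, while the attention weight stays in a group of its own.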

Key Findings

The authors report that pruning 20% of parameters causes only a ~1.0% performance regression relative to the full LLAMA v1.

Numbers: 20% prune → 1.0% regression (paper claim)

Reported benchmark averages in Table 1 show LoRAShear at 20% pruning scored 62.22 average vs full LLAMA v1 68.59 average

Numbers: 20% prune: 62.22 vs 68.59 baseline (−6.37 abs)

Authors show a 50% pruning result with LoRAShear that yields average 51.63 on the same benchmarks

Numbers: 50% prune: 51.63 average (Table 1)

The pipeline runs on limited hardware: one A100 for 'a couple of GPU days' according to the authors

Numbers: 1 A100 GPU, a couple of GPU-days (paper statement)
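To put the reported averages on a common footing, the short calculation below derives absolute and relative drops from the figures quoted above. The helper name `drops` is only illustrative; the scores are the ones reported in this summary.

```python
BASELINE = 68.59  # full LLAMA v1 average, as reported above

def drops(score, baseline=BASELINE):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = baseline - score
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 2)

drop_20 = drops(62.22)  # LoRAShear at 20% pruning
drop_50 = drops(51.63)  # LoRAShear at 50% pruning
print("20% prune:", drop_20)
print("50% prune:", drop_50)
```

On these averages the 20% model is roughly 9.3% below the baseline in relative terms and the 50% model roughly 24.7% below, which is why the ~1% headline figure should be read as a separate claim from the table averages.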

Results

Average benchmark score (lm-evaluation-harness), full LLAMA v1

Value: 68.59

Baseline: LLAMA v1 full model average

Average benchmark score, LoRAShear at 20% pruning

Value: 62.22

Baseline: 68.59 (full LLAMA v1)

Average benchmark score, LoRAShear at 50% pruning

Value: 51.63

Baseline: 68.59 (full LLAMA v1)

BoolQ (example task)

Value: 72.78

Baseline: 76.5 (full LLAMA v1)

Compute cost

Value: 1 A100, a couple of GPU-days

Who Should Care

What To Try In 7 Days

Train LoRA adapters on your target LLM and measure baseline task scores.

Construct a simple trace-based dependency graph, then temporarily prune each node group in turn and measure the dev-set drop to flag the most sensitive groups.

Try progressive pruning with small target ratios (10–20%) and validate recovery by fine-tuning on a focused pretraining subset plus your instruction data.
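The temporary-pruning check in the second step can be prototyped on a toy model before touching a real LLM. The sketch below assumes a linear model, a mean-squared-error dev metric, and hypothetical group names and data; in practice the weights would be LLM tensors and the metric your task score.

```python
def mse(weights, data):
    # Mean squared error of a linear model y = sum_i w_i * x_i.
    total = 0.0
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(data)

def group_sensitivity(weights, groups, data):
    """Loss increase when each group is zeroed on a copy of the weights."""
    base = mse(weights, data)
    scores = {}
    for name, idxs in groups.items():
        trial = list(weights)   # copy so the original weights stay intact
        for i in idxs:
            trial[i] = 0.0      # temporarily prune this group
        scores[name] = mse(trial, data) - base
    return scores

# Hypothetical toy weights, groups, and dev set.
weights = [2.0, 0.0, -1.0, 0.01]
groups = {"g0": [0], "g1": [1], "g2": [2, 3]}
dev = [
    ((1.0, 1.0, 1.0, 1.0), 1.01),
    ((1.0, 0.0, 2.0, 0.0), 0.0),
    ((0.0, 1.0, 0.0, 1.0), 0.01),
]
scores = group_sensitivity(weights, groups, dev)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Groups at the top of `ranked` are the ones to protect; groups with a near-zero score are the first pruning candidates.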

Optimization Features

Infra Optimization

  • designed to run on one A100 GPU in a few GPU-days

Model Optimization

  • structured pruning of minimally removable groups
  • LoRA
  • LHSPG optimizer for progressive group projection to zero
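"Progressive group projection to zero" can be pictured as shrinking a group's norm step by step instead of deleting it at once. The sketch below is a deliberately simplified stand-in, not LHSPG itself: it applies only a linear norm decay and omits the gradient updates and the LoRA-based knowledge transfer described in the paper. The function names are hypothetical.

```python
import math

def group_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def progressive_shrink(vec, total_steps):
    """Yield the group after each step; its norm decays linearly to zero."""
    current = list(vec)
    for step in range(total_steps):
        remaining = total_steps - step
        scale = (remaining - 1) / remaining  # the final step multiplies by 0
        current = [v * scale for v in current]
        yield list(current)

trajectory = list(progressive_shrink([3.0, 4.0], total_steps=5))
norms = [group_norm(v) for v in trajectory]
print(norms)
```

Spreading the shrinkage over several steps leaves room, in a real training loop, for the surviving weights to adapt between steps.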

System Optimization

  • two-pass dependency traversal to construct compressed model

Training Optimization

  • LoRA
  • progressive multi-period pruning schedule
  • dynamic data subset selection for recovery fine-tuning
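One way to picture dynamic subset selection is to sample more recovery data from the sources where the pruned model degraded most. The sketch below assumes that rule (weights proportional to the perplexity increase per source, mixed with a uniform floor so no source vanishes); the paper's actual selection criterion may differ, and every perplexity value here is made up for illustration.

```python
import random

def sampling_weights(ppl_full, ppl_pruned, floor=0.05):
    """Per-source sampling probabilities that sum to 1."""
    # Deviation = how much perplexity rose on each source after pruning.
    deviation = {s: max(ppl_pruned[s] - ppl_full[s], 0.0) for s in ppl_full}
    total = sum(deviation.values())
    n = len(deviation)
    if total == 0.0:
        return {s: 1.0 / n for s in deviation}
    # Mix with a uniform floor so no source is dropped entirely.
    return {s: floor / n + (1.0 - floor) * d / total for s, d in deviation.items()}

# Hypothetical perplexities per data source (not from the paper).
ppl_full = {"OpenWebText": 10.0, "Wikipedia": 8.0, "Gutenberg": 12.0, "BookCorpus": 11.0}
ppl_pruned = {"OpenWebText": 12.0, "Wikipedia": 8.5, "Gutenberg": 13.0, "BookCorpus": 11.0}

weights = sampling_weights(ppl_full, ppl_pruned)
rng = random.Random(0)  # fixed seed for a repeatable draw
batch_sources = rng.choices(list(weights), weights=list(weights.values()), k=8)
print(weights)
```

The uniform floor is a design choice aimed at the balance concern: a source with no measured degradation still contributes a small share of each recovery batch.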

Inference Optimization

  • reduced parameter count: structured pruning removes whole groups, leaving smaller dense models that are hardware-friendly
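The difference between structured pruning and masking is that removed groups are physically sliced out, leaving smaller dense matrices. The sketch below shows this on a hypothetical two-layer block with a ReLU and no bias terms; it is a toy illustration of the slicing step, not the paper's two-pass traversal.

```python
def prune_hidden_units(w_up, w_down, keep):
    """Drop hidden units from a two-layer block.

    w_up:   hidden x in  (one row per hidden unit)
    w_down: out x hidden (one column per hidden unit)
    """
    new_up = [w_up[i] for i in keep]
    new_down = [[row[i] for i in keep] for row in w_down]
    return new_up, new_down

def forward(w_up, w_down, x):
    # ReLU MLP without biases (an assumption of this toy example).
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]

# Hidden unit 1 has already been driven to zero, so it can be removed.
w_up = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
w_down = [[1.0, 5.0, 1.0]]
x = [1.0, 1.0]

small_up, small_down = prune_hidden_units(w_up, w_down, keep=[0, 2])
before = forward(w_up, w_down, x)
after = forward(small_up, small_down, x)
print(before, after)
```

Because the zeroed unit contributes nothing to the output, the smaller block computes the same result with fewer parameters and no sparse kernels.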

Reproducibility

Data Urls

  • OpenWebText, Wikipedia dump, Gutenberg, BookCorpus, Alpaca (public datasets referenced in the paper)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments reported only on open-source LLAMA v1; generality to other families not shown.
  • The headline claim (~1% loss at 20% pruning) does not match the table averages (62.22 vs 68.59), and per-task effects vary.
  • Code is not yet public at time of writing, hindering immediate reproduction.

When Not To Use

  • If you need a guarantee of no regression on critical tasks; pruning can still cause significant per-task loss.
  • If you have access to large pretraining compute and prefer full retrain-based methods like Sheared-LLaMA.

Failure Modes

  • Pruning sensitive node groups causes large, hard-to-recover drops (first/last node groups were most sensitive).
  • Saliency proxies can mis-rank groups, leading to removal of important minimally-removable structures.
  • Dynamic recovery may overfit if selected pretraining subsets are not balanced across sources.

Core Entities

Models

  • LLAMAv1
  • LoRA
  • LHSPG

Metrics

  • Accuracy
  • perplexity deviation (knowledge analysis)
  • average benchmark score

Datasets

  • OpenWebText
  • Processed Wikipedia (2022)
  • Gutenberg
  • BookCorpus
  • Alpaca (instruction dataset)

Benchmarks

  • BoolQ
  • PIQA
  • HellaSwag
  • WinoGrande
  • ARC-e
  • ARC-c
  • OBQA
  • Average (lm-evaluation-harness)

Context Entities

Models

  • LLM-Pruner
  • LoRA
  • WANDA
  • Sheared-LLaMA

Datasets

  • CommonCrawl (not used directly)
  • C4 (not used directly)