Trainable structured pruning plus a 'collaborative' prompt compresses LLaMA-7B to 5.4B while retaining most of its accuracy

October 8, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang

Why It Matters For Business

Compresso lets teams reduce LLM size and inference cost with modest training and public instruction data while keeping near-original accuracy; this lowers deployment cost on standard GPUs without specialized hardware.

Summary TLDR

Compresso is a training-based structured pruning method for large language models (LLMs). It uses LoRA (low-rank adapters) plus L0 regularization to learn binary masks that remove attention heads, FFN units, and hidden dimensions while the original weights stay frozen. A short, task-style "collaborative" prompt instructs the model during both pruning and inference, which improves adaptation. On LLaMA-7B, Compresso produces 5.4B / 5.0B / 4.5B variants that largely retain zero-shot and few-shot performance and outperform a structured one-shot baseline (LLM-Pruner) on several benchmarks.
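
To make the mechanism above concrete, here is a minimal sketch of a single linear layer with a frozen weight, a trainable low-rank (LoRA) update, and a binary mask over output units. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tiny matrices, and the choice to mask output units are all hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
# Illustrative only: frozen weight W, trainable low-rank update B @ A,
# and a binary mask that zeroes pruned output units.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def masked_lora_forward(W, A, B, mask, x, scale=1.0):
    """Compute mask * ((W + scale * B @ A) @ x).

    W is frozen; in training only A, B and the mask parameters would change.
    """
    delta = matmul(B, A)  # (out x r) @ (r x in) -> (out x in)
    effective = [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]
    y = matvec(effective, x)
    return [m * v for m, v in zip(mask, y)]

# Hypothetical toy sizes: 3 output units, 2 inputs, LoRA rank 1.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0.5, 0.5]]
B = [[1.0], [0.0], [2.0]]
mask = [1, 0, 1]  # the second output unit is pruned
x = [2.0, 4.0]

out = masked_lora_forward(W, A, B, mask, x)
print(out)
```

Once the mask is final, the rows of W (and of B) whose mask entry is 0 can be physically removed, which is what makes the speed-up available on ordinary dense GPU kernels.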

Problem Statement

Structured pruning can cut real inference cost but is hard for LLMs. One-shot (no-training) pruning is cheap but hurts quality; training-based pruning can do better but is extremely memory- and data-hungry. The paper asks whether a memory-efficient, training-based method plus an LLM-aware prompt can learn better layer-wise pruning and recover accuracy under resource constraints.

Main Contribution

A memory-efficient training-based structured pruning pipeline that freezes base weights and learns binary masks while updating only LoRA adapters (4.54M trainable params for LLaMA-7B).

A collaborative pruning prompt that instructs the LLM during pruning and inference to adapt to removed parameters, improving final accuracy.

Extensive evaluation on LLaMA-7B pruning to 5.4B/5.0B/4.5B showing retained generalization and consistent gains over a structured one-shot baseline across commonsense, reading, MMLU and BBH benchmarks.
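
The collaborative prompt listed among the contributions is applied at both pruning time and inference time. A minimal sketch of that consistency requirement follows. The prompt string is a hypothetical placeholder (the paper's actual wording is not reproduced here), and the helper name is invented for illustration.

```python
# HYPOTHETICAL placeholder: substitute the real pruning prompt you train with.
PRUNING_PROMPT = "<collaborative pruning prompt goes here>"

def with_pruning_prompt(text, prompt=PRUNING_PROMPT):
    """Prepend the same pruning prompt to a training example or a user query."""
    return f"{prompt}\n\n{text}"

# The same wrapper is used on both sides so the model sees a consistent prefix.
train_example = with_pruning_prompt("Instruction: summarize the paragraph.")
inference_query = with_pruning_prompt("Question: what is structured pruning?")
print(train_example)
```

Routing both paths through one helper avoids the easy mistake of training with the prompt and then serving without it, which is the ablation the paper reports as costing several accuracy points.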

Key Findings

Compresso prunes LLaMA-7B to 5.4B while preserving most capabilities.

Numbers: 5.4B retains ~96% of LLaMA-7B commonsense accuracy

Compresso beats a structured one-shot baseline across tasks.

Numbers: Up to +2.21% commonsense, +11.43% reading, +7.04% MMLU, +4.81% BBH

A collaborative pruning prompt measurably helps.

Numbers: Removing prompt: commonsense −3.11 to −5.68 pts, reading −3.83 to −7.64 pts

Instruction-tuning data works better than generic calibration data for pruning.

Numbers: GPT4-Alpaca yields 60.09 vs C4 56.41 on commonsense; reading 60.35 vs 52.78

LoRA + L0 masks reduce training memory footprint.

Numbers: Trainable params: masks 0.35M + LoRA 4.19M = 4.54M for LLaMA-7B
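
The trainable-parameter figures above can be checked with a few lines of arithmetic. The mask and LoRA counts come from the summary; the base-model size of roughly 6,738M parameters is an assumption taken from the public LLaMA-7B configuration, not from this page.

```python
# Figures from the summary (in millions of parameters).
MASK_PARAMS_M = 0.35
LORA_PARAMS_M = 4.19

# ASSUMPTION: approximate LLaMA-7B parameter count, in millions.
BASE_PARAMS_M = 6738.0

trainable_m = MASK_PARAMS_M + LORA_PARAMS_M
trainable_share = trainable_m / BASE_PARAMS_M

print(f"trainable: {trainable_m:.2f}M")
print(f"share of base model: {trainable_share:.4%}")
```

Under that assumption the trainable set is well below 0.1% of the base model, which is why optimizer state and gradients stay small during pruning.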

Results

Commonsense reasoning accuracy (zero-shot)

Value: Compresso 5.4B ~60.09

Baseline: LLaMA-7B 62.19

Reading comprehension (zero-shot, avg)

Value: Compresso 5.4B 60.35

Baseline: LLaMA-7B 57.73

MMLU (5-shot)

Value: Compresso 5.4B 31.90

Baseline: LLaMA-7B 36.80

BBH (3-shot, avg)

Value: Compresso 5.4B 31.47

Baseline: LLaMA-7B 32.34

Improvement vs LLM-Pruner (best reported)

Value: Compresso > LLM-Pruner

Baseline: LLM-Pruner (one-shot structured)
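
To compare the results above on a common scale, the sketch below computes how much of the dense LLaMA-7B score the 5.4B variant retains on each benchmark. The scores are copied from this section; the dictionary layout and function name are illustrative choices.

```python
# (Compresso 5.4B score, LLaMA-7B score), as listed in the Results section.
RESULTS = {
    "commonsense": (60.09, 62.19),
    "reading": (60.35, 57.73),
    "mmlu_5shot": (31.90, 36.80),
    "bbh_3shot": (31.47, 32.34),
}

def retention(pruned, dense):
    """Fraction of the dense model's score kept by the pruned model."""
    return pruned / dense

retained = {name: retention(p, d) for name, (p, d) in RESULTS.items()}

for name, ratio in retained.items():
    print(f"{name:12s} {ratio:.1%}")
```

Read this way, the table shows uneven retention: reading comprehension ends up above the dense baseline, while 5-shot MMLU loses the most.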

Who Should Care

What To Try In 7 Days

Run LoRA-only fine-tuning + L0 mask training on a small instruction dataset to test mask learning for your model.

Add a short 'pruning instruction' prompt to training and inference to see if model adaptation improves.

Compare model size vs accuracy trade-offs at one or two target sparsities (e.g., 5.4B from 7B) and measure latency/memory on your hardware.
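
For the latency comparison suggested above, a small timing harness is usually enough for a first pass. The sketch below uses only the standard library. The `fake_generate` workloads are stand-ins for real model calls, and the candidate names are hypothetical labels, so the numbers it prints say nothing about actual LLM latency.

```python
import statistics
import time

def median_latency(fn, warmup=2, runs=5):
    """Return the median wall-clock time of fn() in seconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def fake_generate(work):
    """Stand-in for a model's generate call; replace with your own inference function."""
    return lambda: sum(i * i for i in range(work))

# Hypothetical candidates; swap in real callables for the dense and pruned models.
candidates = {
    "dense-7B": fake_generate(200_000),
    "pruned-5.4B": fake_generate(150_000),
}

report = {name: median_latency(fn) for name, fn in candidates.items()}
for name, seconds in report.items():
    print(f"{name:12s} {seconds * 1000:.2f} ms")
```

When timing GPU inference, remember to synchronize the device before reading the clock; otherwise the harness measures kernel launch time rather than completion time.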

Optimization Features

Infra Optimization

  • training fits on 4x V100 GPUs as reported

Model Optimization

  • structured pruning (attention heads, FFN units, hidden dims)
  • layer-wise learned sparsity (automatic per-layer masks)
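
To see how per-layer head and FFN masks translate into a parameter budget, the sketch below counts the parameters of a LLaMA-style decoder given how many heads and FFN units each layer keeps. The dimensions are assumptions taken from the public LLaMA-7B configuration (not from this page), the per-layer pattern is invented for illustration, and hidden-dimension pruning is modeled only as a single global width.

```python
# ASSUMPTION: public LLaMA-7B configuration values.
VOCAB, HIDDEN, LAYERS, HEADS, HEAD_DIM, FFN = 32000, 4096, 32, 32, 128, 11008

def count_params(heads_per_layer, ffn_per_layer, hidden=HIDDEN):
    """Count parameters when layer i keeps heads_per_layer[i] heads and
    ffn_per_layer[i] FFN units, with a shared hidden width."""
    total = 2 * VOCAB * hidden + hidden  # input embedding, LM head, final norm
    for heads, ffn in zip(heads_per_layer, ffn_per_layer):
        total += 4 * hidden * heads * HEAD_DIM  # Q, K, V and output projections
        total += 3 * hidden * ffn               # gate, up and down projections
        total += 2 * hidden                     # two normalization layers
    return total

dense = count_params([HEADS] * LAYERS, [FFN] * LAYERS)

# HYPOTHETICAL non-uniform pattern: middle layers pruned harder than the ends.
kept_heads = [32] * 8 + [24] * 16 + [28] * 8
kept_ffn = [11008] * 8 + [8192] * 16 + [9600] * 8
pruned = count_params(kept_heads, kept_ffn)

print(f"dense:  {dense / 1e9:.2f}B")
print(f"pruned: {pruned / 1e9:.2f}B")
```

A counter like this is handy for checking that a learned mask actually hits the intended size target before any weights are physically removed.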

System Optimization

  • compatible with standard GPU hardware (no special sparse kernels required)

Training Optimization

  • LoRA
  • L0 (hard-concrete) regularization to control sparsity
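
The hard-concrete relaxation named above is what lets a binary mask be trained with gradients. The sketch below follows the standard hard-concrete formulation from the L0 regularization literature. The stretch and temperature constants are common defaults and are an assumption here, since this summary does not state the values Compresso uses.

```python
import math
import random

# ASSUMPTION: commonly used hard-concrete constants (stretch limits, temperature).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_gate(log_alpha, rng):
    """Draw one gate value in [0, 1]; exact 0s and 1s occur with nonzero probability."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))

def prob_nonzero(log_alpha):
    """Probability that the gate is nonzero; summing these gives the expected L0 norm."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

def expected_l0(log_alphas):
    return sum(prob_nonzero(a) for a in log_alphas)

rng = random.Random(0)
log_alphas = [-6.0, 0.0, 6.0]  # hypothetical learned mask parameters
gates = [sample_gate(a, rng) for a in log_alphas]
print(gates, expected_l0(log_alphas))
```

During training, the expected L0 term is what gets pushed toward the target sparsity; at the end, each gate is typically rounded to 0 or 1 to obtain the final binary mask.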

Inference Optimization

  • smaller model variants (5.4B/5.0B/4.5B) for lower memory and compute

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations limited to LLaMA-7B/13B; behavior on other architectures is untested.
  • Requires instruction-tuning style data; generic calibration data performed worse.
  • Needs modest training resources (authors used 4 V100 GPUs).
  • Collaborative prompt was manually designed with GPT-4 help; prompt design may need tuning per model/task.

When Not To Use

  • If you need ultra-fast one-shot pruning with zero training budget.
  • When you lack instruction-tuning style data or cannot run any adapter training.
  • If you must guarantee unchanged behavior on very different downstream tasks not covered by instruction data.

Failure Modes

  • Higher sparsity can cause larger drops in few-shot tasks (MMLU declines more than zero-shot commonsense).
  • Post fine-tuning sometimes harms specific benchmarks (e.g., BBH at 4.5B).
  • Layer-wise mask learning could prune critical units if pruning targets are extreme.

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • Compresso (pruned variants 5.4B, 5.0B, 4.5B)
  • LLM-Pruner
  • SparseGPT
  • Wanda

Metrics

  • accuracy
  • perplexity
  • remaining parameters (model size)
  • relative % change vs baseline

Datasets

  • GPT4-Alpaca (instruction tuning, 52K)
  • C4 subset
  • LLM-QAT

Benchmarks

  • StoryCloze
  • PIQA
  • HellaSwag
  • WinoGrande
  • ARC-e
  • ARC-c
  • OpenBookQA
  • BoolQ
  • RACE-High
  • MMLU (5-shot)
  • BBH (3-shot)