Trainable structured pruning + a 'collaborative' prompt compresses LLaMA-7B to 5.4B while keeping accuracy

October 8, 2023, 8 min

Overview

Decision Snapshot: Needs Validation

The method shows clear gains on LLaMA-7B across many benchmarks and uses modest compute (4x V100). Results are limited to reported LLaMA variants and public instruction datasets, so expect some engineering to adapt to other models or data.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang

Why It Matters For Business

Compresso lets teams reduce LLM size and inference cost with modest training and public instruction data while keeping near-original accuracy; this lowers deployment cost on standard GPUs without specialized hardware.

Summary TLDR

Compresso is a training-based structured pruning method for large language models (LLMs). It freezes the original weights and combines LoRA (low-rank adapters) with L0 regularization to learn binary masks that remove attention heads, FFN units, and hidden dimensions. A short, task-style "collaborative" prompt instructs the model during both pruning and inference, which improves adaptation. On LLaMA-7B, Compresso produces 5.4B, 5.0B, and 4.5B variants that largely retain zero-shot and few-shot performance and outperform a structured one-shot baseline (LLM-Pruner) on several benchmarks.
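
The L0 mask learning mentioned above is usually implemented with hard-concrete gates. The following is a minimal sketch of one such gate in plain Python, not the authors' code. The stretch interval (-0.1, 1.1), the temperature 2/3, the `head_logits` values, and the rule that a head is kept when its deterministic gate is above zero are all assumptions chosen for illustration.

```python
import math
import random

# Stretch interval and temperature: widely used hard-concrete defaults,
# assumed here rather than taken from the paper summarized above.
LEFT, RIGHT, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha: float, rng: random.Random) -> float:
    """Draw one stochastic gate value in [0, 1] (training-time behaviour)."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep both logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (RIGHT - LEFT) + LEFT
    return min(1.0, max(0.0, stretched))  # clamp so exact 0 and 1 are reachable


def prob_nonzero(log_alpha: float) -> float:
    """Probability that the gate is non-zero; the sum over gates is the expected L0 norm."""
    return sigmoid(log_alpha - BETA * math.log(-LEFT / RIGHT))


def deterministic_gate(log_alpha: float) -> float:
    """Noise-free gate used after training: same stretch and clamp, no sampling."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (RIGHT - LEFT) + LEFT))


rng = random.Random(0)
head_logits = [3.0, 0.5, -0.5, -4.0]  # hypothetical learned log_alpha, one per head
train_gates = [sample_gate(a, rng) for a in head_logits]
expected_kept = sum(prob_nonzero(a) for a in head_logits)
keep_mask = [deterministic_gate(a) > 0.0 for a in head_logits]
```

During training, each gate would multiply the output of the unit it controls, and a penalty on `expected_kept` would push the model toward the target size. Only the gate logits and the LoRA adapters receive gradients in the setup described here.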

Problem Statement

Structured pruning can cut real inference cost but is hard for LLMs. One-shot (no-training) pruning is cheap but hurts quality; training-based pruning can do better but is extremely memory- and data-hungry. The paper asks whether a memory-efficient, training-based method plus an LLM-aware prompt can learn better layer-wise pruning and recover accuracy under resource constraints.

Main Contribution

A memory-efficient training-based structured pruning pipeline that freezes base weights and learns binary masks while updating only LoRA adapters (4.54M trainable params for LLaMA-7B).

A collaborative pruning prompt that instructs the LLM during pruning and inference to adapt to removed parameters, improving final accuracy.
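
The prompt mechanism can be sketched as a plain string transformation applied identically to training examples and inference queries. The prompt text below is a hypothetical placeholder, since the paper's wording is not reproduced in this summary, and prepending (rather than appending) the prompt is also an assumption.

```python
# Hypothetical placeholder text: substitute the paper's actual prompt before use.
PRUNING_PROMPT = (
    "Note: some parameters of this model have been pruned. "
    "Use the remaining parameters to answer as accurately as possible."
)


def with_pruning_prompt(text: str, prompt: str = PRUNING_PROMPT) -> str:
    """Prepend the same instruction to a training example or an inference query."""
    return f"{prompt}\n\n{text}"


# The same wrapper is used on both sides so the model sees a consistent format.
train_example = with_pruning_prompt("Instruction: Summarize the passage.\nResponse: ...")
query = with_pruning_prompt("Question: What is the capital of France?")
```

Keeping the wrapper in one function makes it harder for the training and inference formats to drift apart, which matters if the pruned model has adapted to seeing the prompt.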

Key Findings

Compresso prunes LLaMA-7B to 5.4B while preserving most capabilities.

Numbers: 5.4B retains ~96% of LLaMA-7B commonsense accuracy

Practical Use: You can reduce model size by ~23% (7B→5.4B) with small loss on commonsense tasks; good option when you need lower memory and near-original accuracy.

Evidence Ref: Table 2; Sec. 4.2

Compresso beats a structured one-shot baseline across tasks.

Numbers: Up to +2.21% commonsense, +11.43% reading, +7.04% MMLU, +4.81% BBH

Practical Use: Training-based structured pruning with learned masks plus a prompt outperforms one-shot methods; expect better retained accuracy if you can afford modest training.

Evidence Ref: Abstract; Sec. 4.2 (Tables 2–4)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Commonsense accuracy (avg) | Compresso 5.4B: ~60.09 | LLaMA-7B: 62.19 | −2.1 pts (retains ~96%) | StoryCloze, PIQA, HellaSwag, WinoGrande, ARC-e/c, OBQA | Table 2; Sec. 4.2 | Table 2
Reading comprehension (zero-shot, avg) | Compresso 5.4B: 60.35 | LLaMA-7B: 57.73 | +2.62 pts vs LLaMA-7B (avg) | BoolQ, RACE-High | Table 3; Sec. 4.2 | Table 3
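
The deltas in these results mix two conventions: absolute point differences and retained fractions. The short calculation below derives both from the reported averages so they can be compared on the same footing. The helper names are illustrative, and the only inputs are the numbers quoted in this summary.

```python
def point_delta(pruned: float, baseline: float) -> float:
    """Absolute difference in accuracy points."""
    return pruned - baseline


def retention_pct(pruned: float, baseline: float) -> float:
    """Pruned score as a percentage of the baseline score."""
    return 100.0 * pruned / baseline


# (Compresso 5.4B, LLaMA-7B) averages as reported above.
commonsense = (60.09, 62.19)
reading = (60.35, 57.73)

commonsense_points = point_delta(*commonsense)       # about -2.1 points
commonsense_retained = retention_pct(*commonsense)   # about 96.6 percent
reading_points = point_delta(*reading)               # about +2.62 points
reading_relative = retention_pct(*reading) - 100.0   # about +4.5 percent relative

# Nominal size reduction from 7B to 5.4B parameters, in percent.
size_reduction = 100.0 * (1.0 - 5.4 / 7.0)           # about 22.9 percent
```

The reading-comprehension gain is 2.62 accuracy points, which corresponds to roughly 4.5% in relative terms, so it helps to state which convention a given delta uses.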

What To Try In 7 Days

Run LoRA-only fine-tuning + L0 mask training on a small instruction dataset to test mask learning for your model.

Add a short 'pruning instruction' prompt to training and inference to see if model adaptation improves.

Compare model size vs accuracy trade-offs at one or two target sparsities (e.g., 5.4B from 7B) and measure latency/memory on your hardware.
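
For the size, latency, and memory comparison, a small harness is often enough to get first numbers. The sketch below estimates weights-only memory and times an arbitrary callable. The 2 bytes per parameter (fp16 or bf16 weights) and the `run_once` callable are assumptions, and the estimate ignores activations and the KV cache.

```python
import statistics
import time
from typing import Callable


def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in GB (1e9 bytes); 2 bytes assumes fp16 or bf16."""
    return n_params * bytes_per_param / 1e9


def median_latency_ms(run_once: Callable[[], object], warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock time of run_once in milliseconds.

    If run_once launches asynchronous GPU work, it should block until that
    work has finished, otherwise only the launch overhead is measured.
    """
    for _ in range(warmup):
        run_once()  # warm caches and lazy initialisation before timing
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Nominal sizes from the summary; real checkpoints may differ slightly.
baseline_gb = weight_memory_gb(7.0e9)
pruned_gb = weight_memory_gb(5.4e9)

# Stand-in workload; replace with a call that runs one generation step.
latency = median_latency_ms(lambda: sum(range(10_000)), warmup=1, iters=5)
```

Running the same harness on the baseline and on each pruned variant, with identical batch size and sequence length, gives a like-for-like comparison on your own hardware.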

Optimization Features

Infra Optimization
training fits on 4x V100 GPUs as reported
Model Optimization
structured pruning (attention heads, FFN units, hidden dims)
layer-wise learned sparsity (automatic per-layer masks)
System Optimization
compatible with standard GPU hardware (no special sparse kernels required)
Training Optimization
LoRA
L0 (hard-concrete) regularization to control sparsity
Inference Optimization
smaller model variants (5.4B/5.0B/4.5B) for lower memory and compute
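
To see why removing heads, FFN units, and hidden dimensions yields a smaller dense model that runs on standard kernels, it helps to count parameters from the layer shapes. The sketch below does this for a LLaMA-style decoder. The default shape values come from the public LLaMA-7B configuration rather than from this summary, and the pruned shape is a hypothetical uniform example, whereas Compresso learns a different sparsity per layer.

```python
from dataclasses import dataclass


@dataclass
class LlamaShape:
    # Defaults assume the public LLaMA-7B configuration.
    n_layers: int = 32
    hidden: int = 4096
    n_heads: int = 32
    head_dim: int = 128
    ffn: int = 11008
    vocab: int = 32000


def count_params(s: LlamaShape) -> int:
    """Parameter count of a LLaMA-style decoder with untied embeddings."""
    attn_width = s.n_heads * s.head_dim
    attention = 4 * s.hidden * attn_width   # q, k, v, o projections
    feed_forward = 3 * s.hidden * s.ffn     # gate, up, down projections
    norms = 2 * s.hidden                    # two RMSNorm weight vectors per layer
    per_layer = attention + feed_forward + norms
    embeddings = 2 * s.vocab * s.hidden     # input embedding and output head
    return s.n_layers * per_layer + embeddings + s.hidden  # plus final norm


full = count_params(LlamaShape())

# Hypothetical uniform pruning: fewer heads, FFN units, and hidden dims per layer.
pruned = count_params(LlamaShape(hidden=3840, n_heads=28, ffn=9216))
remaining_fraction = pruned / full
```

Because every pruned matrix is simply a smaller dense matrix, the resulting model needs no sparse kernels, which is the property listed under System Optimization.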

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluations limited to LLaMA-7B/13B; behavior on other architectures is untested.

Requires instruction-tuning style data; generic calibration data performed worse.

When Not To Use

When you need ultra-fast one-shot pruning with zero training budget.

When you lack instruction-tuning style data or cannot run any adapter training.

Failure Modes

Higher sparsity can cause larger drops in few-shot tasks (MMLU declines more than zero-shot commonsense).

Post fine-tuning sometimes harms specific benchmarks (e.g., BBH at 4.5B).

Core Entities

Models

LLaMA-7B, LLaMA-13B, Compresso (pruned variants 5.4B, 5.0B, 4.5B), LLM-Pruner, SparseGPT, Wanda

Metrics

Accuracy, perplexity, remaining parameters (model size), relative % change vs baseline

Datasets

GPT4-Alpaca (instruction tuning, 52K), C4 subset, LLM-QAT

Benchmarks

StoryCloze, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OpenBookQA, BoolQ, RACE-High, MMLU (5-shot), BBH (3-shot)