Trainable structured pruning + a 'collaborative' prompt compresses LLaMA-7B to 5.4B while keeping accuracy

Overview

Decision Snapshot: Needs Validation

The method shows clear gains on LLaMA-7B across many benchmarks and uses modest compute (4x V100). Results are limited to reported LLaMA variants and public instruction datasets, so expect some engineering to adapt to other models or data.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang

Why It Matters For Business

Compresso lets teams reduce LLM size and inference cost with modest training and public instruction data while keeping near-original accuracy; this lowers deployment cost on standard GPUs without specialized hardware.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

Compresso is a training-based structured pruning method for large language models (LLMs). It uses LoRA (low-rank adapters) plus L0 regularization to learn binary masks that remove attention heads, FFN units, and hidden dimensions while the original weights stay frozen. A short, task-style "collaborative" prompt instructs the model during both pruning and inference, which improves adaptation. On LLaMA-7B, Compresso produces 5.4B, 5.0B, and 4.5B variants that largely retain zero-shot and few-shot performance and outperform a structured one-shot baseline (LLM-Pruner) on several benchmarks.

Problem Statement

Structured pruning can cut real inference cost but is hard for LLMs. One-shot (no-training) pruning is cheap but hurts quality; training-based pruning can do better but is extremely memory- and data-hungry. The paper asks whether a memory-efficient, training-based method plus an LLM-aware prompt can learn better layer-wise pruning and recover accuracy under resource constraints.

Main Contribution

A memory-efficient training-based structured pruning pipeline that freezes base weights and learns binary masks while updating only LoRA adapters (4.54M trainable params for LLaMA-7B).

A collaborative pruning prompt that instructs the LLM during pruning and inference to adapt to removed parameters, improving final accuracy.
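To see why updating only LoRA adapters keeps the trainable parameter count in the low millions, the sketch below counts adapter parameters for a LLaMA-7B-sized model. The rank and the choice of adapted projections are hypothetical and not taken from the paper, so the result is not expected to match the reported 4.54M; it only illustrates the order of magnitude.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# LLaMA-7B dimensions: hidden size 4096, 32 transformer layers.
HIDDEN, LAYERS = 4096, 32

# Hypothetical adapter setup (an assumption, not the paper's configuration):
# rank 8 on two square (hidden x hidden) projections per layer.
RANK, ADAPTED_PER_LAYER = 8, 2

total = LAYERS * ADAPTED_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)
print(f"{total:,} trainable adapter parameters")  # 4,194,304
```

The learned mask parameters add to this total, and the actual count depends on the rank and target modules the authors chose.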

Key Findings

Compresso prunes LLaMA-7B to 5.4B while preserving most capabilities.

Numbers: 5.4B retains ~96% of LLaMA-7B commonsense accuracy

Practical Use: You can reduce model size by ~23% (7B→5.4B) with a small loss on commonsense tasks; a good option when you need lower memory and near-original accuracy.

Evidence Ref: Table 2; Sec. 4.2

Compresso beats a structured one-shot baseline across tasks.

Numbers: Up to +2.21% commonsense, +11.43% reading, +7.04% MMLU, +4.81% BBH

Practical Use: Training-based structured pruning with learned masks plus a prompt outperforms one-shot methods; expect better retained accuracy if you can afford modest training.

Evidence Ref: Abstract; Sec. 4.2 (Tables 2–4)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Commonsense accuracy (zero-shot, avg)	Compresso 5.4B ~60.09	LLaMA-7B 62.19	retains ~96% (−2.1 pts)	StoryCloze, PIQA, HellaSwag, WinoGrande, ARC-e/c, OBQA	Table 2; Sec. 4.2	Table 2
Reading comprehension (zero-shot, avg)	Compresso 5.4B 60.35	LLaMA-7B 57.73	+2.62 pts vs LLaMA-7B (avg)	BoolQ, RACE-High	Table 3; Sec. 4.2	Table 3
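The deltas and retention figures in the table can be rechecked from the reported averages. The short calculation below uses only the numbers quoted in this summary; the helper name is illustrative.

```python
def retention(pruned: float, baseline: float) -> float:
    """Fraction of the baseline score that the pruned model keeps."""
    return pruned / baseline


# Averages as quoted in the Results table of this summary.
commonsense_pruned, commonsense_base = 60.09, 62.19
reading_pruned, reading_base = 60.35, 57.73

commonsense_delta = round(commonsense_pruned - commonsense_base, 2)  # in points
reading_delta = round(reading_pruned - reading_base, 2)              # in points
size_reduction = 1 - 5.4 / 7.0                                       # 7B -> 5.4B

print(f"commonsense retention: {retention(commonsense_pruned, commonsense_base):.1%}")
print(f"commonsense delta: {commonsense_delta:+.2f} pts")
print(f"reading delta: {reading_delta:+.2f} pts")
print(f"size reduction: {size_reduction:.1%}")
```

Note that the deltas are differences in accuracy points, not relative percentages.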

What To Try In 7 Days

Run LoRA-only fine-tuning + L0 mask training on a small instruction dataset to test mask learning for your model.

Add a short 'pruning instruction' prompt to training and inference to see if model adaptation improves.

Compare model size vs accuracy trade-offs at one or two target sparsities (e.g., 5.4B from 7B) and measure latency/memory on your hardware.
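For the prompt experiment in the second step above, a minimal sketch of prepending the same pruning instruction to both training examples and inference requests might look like this. The prompt wording and the function name are hypothetical; the paper's actual prompt text is not reproduced in this summary.

```python
# Hypothetical wording (an assumption): substitute the prompt you want to test.
PRUNING_PROMPT = (
    "Note: some of this model's parameters have been pruned. "
    "Use the remaining parameters to answer as accurately as possible."
)


def with_pruning_prompt(instruction: str, prompt: str = PRUNING_PROMPT) -> str:
    """Prepend the pruning instruction so training and inference see the same prefix."""
    return f"{prompt}\n\n{instruction.strip()}"


# The same wrapper is applied to a training example and to an inference request.
train_text = with_pruning_prompt("Explain why the sky appears blue.")
infer_text = with_pruning_prompt("  List three uses of structured pruning.  ")
print(train_text)
```

Running the same evaluation with and without the wrapper gives a direct A/B comparison of the prompt's effect.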

Optimization Features

Infra Optimization

training fits on 4x V100 GPUs as reported

Model Optimization

structured pruning (attention heads, FFN units, hidden dims)

layer-wise learned sparsity (automatic per-layer masks)

System Optimization

compatible with standard GPU hardware (no special sparse kernels required)
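The reason structured pruning needs no sparse kernels is that a masked unit can be physically deleted, leaving smaller dense matrices. The toy sketch below shows this for FFN units, using plain Python lists and a ReLU feed-forward block as a simplification (LLaMA's actual FFN is gated); all names and values are illustrative.

```python
def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def ffn(W_up, W_down, x, mask=None):
    """Toy FFN: W_down @ (mask * relu(W_up @ x))."""
    h = relu(matvec(W_up, x))
    if mask is not None:
        h = [m * a for m, a in zip(mask, h)]
    return matvec(W_down, h)


def prune_ffn(W_up, W_down, mask):
    """Drop masked units: rows of W_up and the matching columns of W_down."""
    keep = [i for i, m in enumerate(mask) if m == 1]
    return [W_up[i] for i in keep], [[row[i] for i in keep] for row in W_down]


W_up = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0], [2.0, 1.0]]  # 4 units x 2 inputs
W_down = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -0.5]]    # 2 outputs x 4 units
mask = [1, 0, 1, 0]                                        # keep units 0 and 2
x = [1.0, 2.0]

masked_out = ffn(W_up, W_down, x, mask)       # full-size matrices with mask applied
small_up, small_down = prune_ffn(W_up, W_down, mask)
pruned_out = ffn(small_up, small_down, x)     # smaller dense matrices, no mask
print(masked_out, pruned_out)                 # both are [-5.0, 5.0]
```

The pruned block produces the same output with half the FFN width, which is where the memory and compute savings come from on ordinary dense hardware.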

Training Optimization

LoRA

L0 (hard-concrete) regularization to control sparsity
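A hard-concrete gate is a differentiable stand-in for a binary mask entry. The sketch below follows the standard hard-concrete formulation from the L0-regularization literature; the stretch and temperature constants are common defaults and an assumption here, not values confirmed for Compresso, and the function names are illustrative.

```python
import math
import random

# Stretch interval (GAMMA, ZETA) and temperature BETA: common defaults, assumed here.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def sample_gate(log_alpha: float, rng: random.Random) -> float:
    """Training-time gate: noisy, stretched, then clipped to [0, 1]."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def prob_nonzero(log_alpha: float) -> float:
    """Probability that the gate is non-zero; summed over gates it gives the L0 penalty."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def deterministic_gate(log_alpha: float) -> float:
    """Inference-time gate: no noise; gates at exactly 0 mark units to remove."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


rng = random.Random(0)
log_alphas = [5.0, 0.0, -5.0]  # one learnable parameter per prunable unit
samples = [sample_gate(a, rng) for a in log_alphas]
expected_l0 = sum(prob_nonzero(a) for a in log_alphas)
final_mask = [deterministic_gate(a) for a in log_alphas]
print(samples, expected_l0, final_mask)
```

During training, the expected L0 term is pushed toward a target sparsity; afterwards, units whose deterministic gate is 0 can be removed outright.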

Inference Optimization

smaller model variants (5.4B/5.0B/4.5B) for lower memory and compute

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluations limited to LLaMA-7B/13B; behavior on other architectures is untested.

Requires instruction-tuning style data; generic calibration data performed worse.

When Not To Use

If you need ultra-fast one-shot pruning with zero training budget.

When you lack instruction-tuning style data or cannot run any adapter training.

Failure Modes

Higher sparsity can cause larger drops in few-shot tasks (MMLU declines more than zero-shot commonsense).

Post fine-tuning sometimes harms specific benchmarks (e.g., BBH at 4.5B).

Core Entities

Models

LLaMA-7B, LLaMA-13B, Compresso (pruned variants 5.4B, 5.0B, 4.5B), LLM-Pruner, SparseGPT, Wanda

Metrics

Accuracy, perplexity, remaining parameters (model size), relative % change vs baseline

Datasets

GPT4-Alpaca (instruction tuning, 52K), C4 subset, LLM-QAT

Benchmarks

StoryCloze, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OpenBookQA, BoolQ, RACE-High, MMLU (5-shot), BBH (3-shot)
