BESA: differentiable block-wise pruning that learns layer sparsity — prunes 7B–70B models on one A100 in hours

Overview

Decision Snapshot: Ready For Pilot

BESA is practically usable: it runs on one A100 for 7B–70B models, uses small calibration sets, and improves perplexity versus existing one-shot methods; actual runtime gains depend on hardware that supports irregular sparsity.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

Why It Matters For Business

BESA makes aggressive pruning of large LLMs practical on a single A100 GPU, preserving model quality and enabling lower-cost deployment or faster inference when paired with quantization.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

BESA is a practical pruning method for transformer LLMs that (1) prunes by minimizing reconstruction error at the transformer-block level instead of per layer, and (2) learns layer-specific sparsity via a small set of differentiable parameters. It has been tested on LLaMA-family models (7B–70B), runs on a single A100 GPU in hours, and can be jointly optimized with weight-only quantization. On perplexity and zero-shot tasks, BESA consistently outperforms prior one-shot pruning methods (SparseGPT, Wanda) at similar sparsity. Runtime speedups were measured on a ViTCoD accelerator simulator and materialize only when the hardware supports irregular sparsity.

Problem Statement

Current one-shot LLM pruning methods prune per layer with a hand-chosen uniform sparsity. That causes error to accumulate across layers and forces manual tuning of per-layer sparsity. The paper targets a fast, memory-efficient method that finds layer-wise sparsity automatically and reduces output degradation after pruning.

Main Contribution

A block-wise pruning framework that minimizes reconstruction error at the transformer-block level, reducing error accumulation.

A parameter-efficient, differentiable method to learn layer-specific sparsity using a small set of combination coefficients and straight-through gradients.
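The second contribution can be illustrated with a minimal forward-only sketch. It assumes that each layer holds D learnable logits, that a softmax turns them into combination coefficients, and that candidate d stands for pruning a fraction d/D of the weights; this parameterization, and the function names `expected_sparsity` and `row_mask`, are illustrative assumptions rather than the paper's exact formulation. The hard mask is not differentiable, and the straight-through gradient step mentioned above is left out here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_sparsity(logits):
    """Layer sparsity as a convex combination of D candidate rates.

    Assumption: candidate d means "prune a fraction d/D of the weights".
    """
    D = len(logits)
    betas = softmax(logits)                 # combination coefficients
    rates = [d / D for d in range(D)]       # 0, 1/D, ..., (D-1)/D
    return sum(b * p for b, p in zip(betas, rates))

def row_mask(importance, sparsity):
    """Keep-mask for one weight row: drop the least important entries."""
    n_prune = int(round(sparsity * len(importance)))
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(importance))]

# Uniform coefficients over D=4 candidates give the mean candidate rate.
alpha = expected_sparsity([0.0, 0.0, 0.0, 0.0])
# At 50% sparsity the two least important weights in the row are dropped.
mask = row_mask([0.9, 0.1, 0.5, 0.3], sparsity=0.5)
```

Because only the D logits per layer are trained, the number of extra parameters stays small compared with learning one mask value per weight.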

Key Findings

Lower perplexity than prior one-shot methods at 50% unstructured sparsity on LLaMA models.

Numbers: Example: LLaMA2-70B Wikitext2 ppl BESA 4.09 vs SparseGPT 4.25 (Table 1)

Practical Use: BESA loses less perplexity after aggressive pruning than SparseGPT or Wanda, so prefer it when post-pruning quality matters most.

Evidence Ref: Table 1

Prunes large LLMs quickly on a single GPU.

Numbers: 50% pruning of LLaMA2-70B in ~5 hours on a single A100-80GB (paper claim)

Practical Use: You can prune 7B–70B models without multi-GPU clusters, which makes pruning feasible for small teams or single-GPU pipelines.

Evidence Ref: Abstract; Sec. 4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (Wikitext2)	4.09	SparseGPT	-0.16	LLaMA2-70B, 50% unstructured sparsity	BESA 4.09 vs SparseGPT 4.25 on Wikitext2	Table 1
Perplexity (Wikitext2)	6.86	Wanda	-0.40	LLaMA-7B, 50% unstructured sparsity	BESA 6.86 vs Wanda 7.26 on Wikitext2	Table 1
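The Delta column can be re-derived from the two perplexity values in each row. The short check below recomputes the absolute deltas and adds the relative change in percent, which the table does not report; the helper name `ppl_delta` is illustrative.

```python
def ppl_delta(method_ppl, baseline_ppl):
    """Return (absolute delta, relative delta in percent); negative is better."""
    delta = method_ppl - baseline_ppl
    return round(delta, 2), round(100.0 * delta / baseline_ppl, 1)

rows = {
    "LLaMA2-70B vs SparseGPT": ppl_delta(4.09, 4.25),
    "LLaMA-7B vs Wanda": ppl_delta(6.86, 7.26),
}

for name, (abs_delta, rel_delta) in rows.items():
    print(f"{name}: {abs_delta:+.2f} ppl ({rel_delta:+.1f}%)")
```

In relative terms the gap is a few percent of the baseline perplexity in both rows.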

What To Try In 7 Days

Run BESA on a smaller LLaMA (7B) with 128 calibration sequences and D=100 to reproduce baseline gains.

Try 50% unstructured pruning + 4-bit quantization to test memory and latency trade-offs.

Simulate runtime on your target accelerator or use ViTCoD-like simulator to estimate real speedups.
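Before running the pruning plus quantization experiment, a back-of-the-envelope storage estimate helps set expectations. The sketch below assumes surviving weights are stored densely at the quantized bit width and that positions are tracked with a 1-bit-per-weight bitmask; real sparse formats, quantization scales, and activations are not counted, and the 7e9 parameter count is nominal.

```python
def weight_memory_gb(n_params, bits_per_weight, sparsity=0.0, mask_bits=0):
    """Rough weight-storage estimate in GB (10**9 bytes).

    Assumption: kept weights cost bits_per_weight each, and a position mask
    costs mask_bits per original weight (a plain bitmask uses 1).
    """
    kept = n_params * (1.0 - sparsity)
    total_bits = kept * bits_per_weight + n_params * mask_bits
    return total_bits / 8 / 1e9

N = 7e9  # nominal parameter count for a 7B model

dense_fp16 = weight_memory_gb(N, 16)
dense_int4 = weight_memory_gb(N, 4)
sparse_int4 = weight_memory_gb(N, 4, sparsity=0.5, mask_bits=1)

print(f"fp16 dense:       {dense_fp16:.3f} GB")
print(f"4-bit dense:      {dense_int4:.3f} GB")
print(f"4-bit + 50% mask: {sparse_int4:.3f} GB")
```

Under these assumptions the bitmask takes back part of the saving from pruning, so most of the memory reduction comes from quantization, while pruning mainly targets compute on hardware that can skip zeros.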

Optimization Features

Infra Optimization

Single A100-80GB pruning for 7B–70B models

ViTCoD simulator used to estimate accelerator-level speedups

Model Optimization

Unstructured weight pruning

Learnable per-layer sparsity

Joint weight-only quantization

System Optimization

Block-wise pipeline lowers peak GPU memory (prune one block at a time)

Custom CUDA operator for row-wise mask generation (paper implementation)

Training Optimization

Differentiable sparsity allocation via combination coefficients

Straight-through estimator for mask gradients

Block-wise reconstruction loss to reduce error accumulation
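To make the block-wise reconstruction loss concrete, here is a toy sketch in which a "block" is just two linear layers with a ReLU between them. That stand-in, the tiny weights, and the function names are assumptions for illustration; a real transformer block has attention and MLP sublayers, and the paper's loss may include additional terms. The point is that the error is measured once at the block output, not separately after each layer.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def apply_mask(W, M):
    """Zero out pruned weights (mask entry 0) elementwise."""
    return [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(W, M)]

def block_forward(x, W1, W2):
    """Toy stand-in for a transformer block: linear, ReLU, linear."""
    return matvec(W2, relu(matvec(W1, x)))

def block_reconstruction_error(xs, W1, W2, M1, M2):
    """Mean squared error between dense and pruned block outputs."""
    total, count = 0.0, 0
    for x in xs:
        dense = block_forward(x, W1, W2)
        pruned = block_forward(x, apply_mask(W1, M1), apply_mask(W2, M2))
        total += sum((d - p) ** 2 for d, p in zip(dense, pruned))
        count += len(dense)
    return total / count

W1 = [[1.0, 0.0], [0.5, 2.0]]
W2 = [[1.0, 1.0]]
xs = [[1.0, 1.0], [2.0, 0.0]]      # two calibration inputs
keep_all = [[1, 1], [1, 1]]
drop_small = [[1, 1], [0, 1]]      # prune the 0.5 entry in W1

err_none = block_reconstruction_error(xs, W1, W2, keep_all, [[1, 1]])
err_pruned = block_reconstruction_error(xs, W1, W2, drop_small, [[1, 1]])
```

In a training loop this scalar would be the quantity minimized with respect to the sparsity parameters of every layer in the block at once.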

Inference Optimization

Layer-adaptive sparsity to reduce unnecessary compute

Demonstrated simulated speedups on sparsity-aware accelerator

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Unknown

License: Unknown

Code URLs

Code link referenced in the paper (URL not provided in the text)

Risks & Boundaries

Limitations

Unstructured sparsity may not map directly to standard n:m GPU speedups; requires accelerator support or specialized kernels.

Performance depends on calibration data (but gains plateau after ≈64 samples).
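The first limitation can be seen with a small check. The sketch below tests whether a keep-mask fits an n:m pattern (at most n kept weights in every group of m consecutive entries, with 2:4 as the default); the function name and the example masks are illustrative. Two masks with the same 50% sparsity can differ in whether they fit the pattern.

```python
def satisfies_n_m(mask_row, n=2, m=4):
    """True if every group of m consecutive entries keeps at most n weights.

    mask_row holds 1 for kept and 0 for pruned weights; its length is
    assumed to be a multiple of m.
    """
    groups = [mask_row[i:i + m] for i in range(0, len(mask_row), m)]
    return all(sum(g) <= n for g in groups)

structured = [1, 0, 1, 0, 0, 1, 1, 0]     # 2 kept in each group of 4
unstructured = [1, 1, 1, 0, 0, 0, 0, 1]   # also 50% sparse, 3 kept in group 1

ok_structured = satisfies_n_m(structured)
ok_unstructured = satisfies_n_m(unstructured)
print(ok_structured, ok_unstructured)
```

A mask that fails this check cannot be handed directly to kernels that expect the 2:4 layout, which is why unstructured results need accelerator support or specialized kernels to turn into speedups.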

When Not To Use

If you need standard n:m structured sparsity for cuSPARSELt acceleration on NVIDIA GPUs.

If you cannot provide a small calibration set or run the block-wise pipeline.

Failure Modes

Insufficient calibration samples can increase perplexity degradation.

Incorrect choice of sparsity-step or D may cause convergence issues.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-30B, LLaMA-65B, LLaMA2-7B, LLaMA2-13B, LLaMA2-70B

Metrics

Perplexity, accuracy, simulated runtime cycles / speedup

Datasets

WikiText2, C4, PTB, calibration set (128 sequences from C4)

Benchmarks

PIQA, BoolQ, HellaSwag, WinoGrande, ARC Easy, ARC Challenge
