BESA: differentiable block-wise pruning that learns layer sparsity — prunes 7B–70B models on one A100 in hours

February 18, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

BESA is practically usable: it runs on one A100 for 7B–70B models, uses small calibration sets, and improves perplexity versus existing one-shot methods; actual runtime gains depend on hardware that supports irregular sparsity.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo


Why It Matters For Business

BESA makes aggressive pruning of large LLMs practical on a single A100 GPU, preserving model quality and enabling lower-cost deployment or faster inference when paired with quantization.


Summary TLDR

BESA is a practical pruning method for transformer LLMs that (1) prunes by minimizing reconstruction error at the transformer-block level instead of per-layer, and (2) learns layer-specific sparsity via a small set of differentiable parameters. It works on LLaMA-family models (7B–70B tested) on a single A100 GPU in hours and can be jointly optimized with weight-only quantization. Across perplexity and zero-shot tasks, BESA consistently outperforms prior one-shot pruning methods (SparseGPT, Wanda) at similar sparsity. Simulation with the ViTCoD accelerator simulator shows speedups, but realizing them requires hardware that supports irregular sparsity.

Problem Statement

Current one-shot LLM pruning methods prune per layer with a hand-chosen uniform sparsity. That causes error to accumulate across layers and forces manual tuning of per-layer sparsity. The paper targets a fast, memory-efficient method that finds layer-wise sparsity automatically and reduces output degradation after pruning.
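The contrast between the two objectives can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: W_l are the weights of linear layer l, M_l is its binary mask, X_l are that layer's calibration inputs, and F is the full transformer block applied to block inputs X. The paper's actual objective may contain additional terms, for example a penalty that steers the block toward a target sparsity.

```latex
% Layer-wise reconstruction: one independent objective per linear layer l
\min_{M_l} \; \bigl\| W_l X_l - (M_l \odot W_l)\, X_l \bigr\|_F^2

% Block-wise reconstruction: one objective per transformer block,
% optimized jointly over the masks of all layers inside the block
\min_{\{M_l\}_{l \in \mathrm{block}}} \;
  \bigl\| \mathcal{F}\bigl(X; \{W_l\}\bigr) - \mathcal{F}\bigl(X; \{M_l \odot W_l\}\bigr) \bigr\|_F^2
```

Because the second objective measures error at the block output, errors introduced in one layer can be compensated by the mask choices of the other layers in the same block.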

Main Contribution

A block-wise pruning framework that minimizes reconstruction error at the transformer-block level, reducing error accumulation.

A parameter-efficient, differentiable method to learn layer-specific sparsity using a small set of combination coefficients and straight-through gradients.
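A minimal sketch of the second contribution, showing the forward computation only. It assumes sparsity is expressed as a softmax-weighted combination of D candidate rates d/D, and that weights are pruned in order of an importance score. The function names, the candidate grid, and the importance scores are hypothetical; the paper's exact parameterization and its straight-through gradient step are not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_sparsity(logits):
    """Layer sparsity as a convex combination of candidate rates.

    Assumption: candidates are d / D for d = 0 .. D-1, where D = len(logits).
    The logits play the role of the learnable combination coefficients.
    """
    D = len(logits)
    beta = softmax(logits)
    return sum(b * (d / D) for d, b in enumerate(beta))

def prune_mask(importance, sparsity):
    """Binary mask that zeroes the lowest-importance fraction of weights."""
    n = len(importance)
    k = int(round(sparsity * n))
    order = sorted(range(n), key=lambda i: importance[i])
    mask = [1] * n
    for i in order[:k]:
        mask[i] = 0
    return mask

# Usage: uniform coefficients over 4 candidates (0, 0.25, 0.5, 0.75)
rate = expected_sparsity([0.0, 0.0, 0.0, 0.0])   # 0.375
mask = prune_mask([0.9, 0.1, 0.5, 0.3], 0.5)     # [1, 0, 1, 0]
```

In a trainable version, the hard mask would be used in the forward pass while gradients flow to the coefficients through a straight-through estimator; that step needs an autodiff framework and is omitted from this sketch.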

Key Findings

Lower perplexity than prior one-shot methods at 50% unstructured sparsity on LLaMA models.

Numbers: Example: LLaMA2-70B WikiText2 perplexity, BESA 4.09 vs SparseGPT 4.25 (Table 1)

Practical Use: BESA loses less perplexity after aggressive pruning than SparseGPT or Wanda, so use BESA when you need higher post-prune quality.

Evidence Ref: Table 1

Prunes large LLMs quickly on a single GPU.

Numbers: 50% prune of LLaMA2-70B in ~5 hours on a single A100-80GB (paper claim)

Practical Use: You can prune 7B–70B models without multi-GPU clusters, which makes it feasible for small teams or single-GPU pipelines.

Evidence Ref: Abstract; Sec. 4.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (Wikitext2) | 4.09 | SparseGPT | -0.16 | LLaMA2-70B, 50% unstructured sparsity | BESA 4.09 vs SparseGPT 4.25 on Wikitext2 | Table 1
Perplexity (Wikitext2) | 6.86 | Wanda | -0.40 | LLaMA-7B, 50% unstructured sparsity | BESA 6.86 vs Wanda 7.26 on Wikitext2 | Table 1
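The deltas in the results are BESA's perplexity minus the baseline's, so negative is better. The short sketch below recomputes them and adds the relative change, which is not reported in the summary. The helper name `delta_report` is hypothetical.

```python
def delta_report(value, baseline):
    """Return (absolute delta, relative delta in percent), rounded to 2 places."""
    delta = value - baseline
    relative_pct = 100.0 * delta / baseline
    return round(delta, 2), round(relative_pct, 2)

# Perplexity pairs quoted above: (BESA, baseline)
rows = {
    "LLaMA2-70B vs SparseGPT": (4.09, 4.25),
    "LLaMA-7B vs Wanda": (6.86, 7.26),
}

for name, (besa, baseline) in rows.items():
    print(name, delta_report(besa, baseline))
```

On these figures the relative perplexity reduction is a few percent in both rows, slightly larger on the 7B model than on the 70B model.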

What To Try In 7 Days

Run BESA on a smaller LLaMA (7B) with 128 calibration sequences and D=100 to reproduce the reported gains over SparseGPT and Wanda.

Try 50% unstructured pruning + 4-bit quantization to test memory and latency trade-offs.

Simulate runtime on your target accelerator, or use a ViTCoD-like simulator, to estimate real speedups.
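Before running the pruning plus quantization experiment, a back-of-the-envelope estimate of weight memory helps set expectations. The sketch below is an assumption-laden estimate, not a measurement: it uses a round 7e9 parameter count, counts weights only (no activations or KV cache), and assumes a 1-bit-per-weight bitmap to store the unstructured mask. The function name and the storage format are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight, sparsity=0.0, mask_bits=0):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    sparsity:  fraction of weights removed (only kept weights store values)
    mask_bits: per-weight index overhead, e.g. 1 for a dense bitmap (assumed)
    """
    kept = n_params * (1.0 - sparsity)
    total_bits = kept * bits_per_weight + n_params * mask_bits
    return total_bits / 8 / 1e9

n = 7e9  # assumed round parameter count for a 7B model

dense_fp16 = weight_memory_gb(n, 16)                             # 14.0
dense_int4 = weight_memory_gb(n, 4)                              # 3.5
sparse_int4 = weight_memory_gb(n, 4, sparsity=0.5, mask_bits=1)  # 2.625
```

The mask overhead is why 50% sparsity does not halve the 4-bit footprint in this estimate; a different sparse storage format would change the numbers.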

Optimization Features

Infra Optimization
Single A100-80GB pruning for 7B–70B models; ViTCoD simulator used to estimate accelerator-level speedups
Model Optimization
Unstructured weight pruning; Learnable per-layer sparsity; Joint weight-only quantization
System Optimization
Block-wise pipeline lowers peak GPU memory (prune one block at a time); Custom CUDA operator for row-wise mask generation (paper implementation)
Training Optimization
Differentiable sparsity allocation via combination coefficients; Straight-through estimator for mask gradients; Block-wise reconstruction loss to reduce error accumulation
Inference Optimization
Layer-adaptive sparsity to reduce unnecessary compute; Demonstrated simulated speedups on sparsity-aware accelerator
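The block-at-a-time pipeline can be illustrated with a toy loop. This is a sketch under strong simplifications: each "block" is a single small matrix rather than a real transformer block, magnitude pruning at a fixed rate stands in for BESA's learned masks, and the pruned block's outputs are assumed to feed the next block. All names are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def magnitude_prune(W, sparsity):
    """Zero the smallest-magnitude fraction of entries (stand-in for a learned mask)."""
    cells = sorted((abs(w), r, c) for r, row in enumerate(W) for c, w in enumerate(row))
    pruned = [row[:] for row in W]
    for _, r, c in cells[: int(sparsity * len(cells))]:
        pruned[r][c] = 0.0
    return pruned

def prune_blockwise(blocks, calib_inputs, sparsity):
    """Prune blocks sequentially; only one block's weights are handled at a time."""
    xs = calib_inputs
    pruned_blocks, errors = [], []
    for W in blocks:
        dense_out = [matvec(W, x) for x in xs]      # reference outputs
        W_pruned = magnitude_prune(W, sparsity)
        sparse_out = [matvec(W_pruned, x) for x in xs]
        # Squared reconstruction error at the block output
        err = sum((a - b) ** 2
                  for d, s in zip(dense_out, sparse_out)
                  for a, b in zip(d, s))
        pruned_blocks.append(W_pruned)
        errors.append(err)
        xs = sparse_out                              # feed pruned outputs forward
    return pruned_blocks, errors

blocks = [[[1.0, 0.1], [0.2, 2.0]], [[3.0, 0.5], [0.4, 1.0]]]
pruned, errors = prune_blockwise(blocks, [[1.0, 1.0]], sparsity=0.5)
```

Peak memory in a loop of this shape is governed by one block plus the cached calibration activations, rather than by the whole model.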

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Unknown
License: Unknown

Code URLs

Code link referenced in the paper (URL not provided in the text)

Risks & Boundaries

Limitations

Unstructured sparsity may not map directly to standard n:m GPU speedups; requires accelerator support or specialized kernels.

Performance depends on calibration data (but gains plateau after ≈64 samples).
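To test the sensitivity to calibration data, it helps to draw calibration sets of different sizes reproducibly. The sketch below samples fixed-length windows from an already tokenized stream. The function name, the 2048-token window length, and the seeding scheme are assumptions for illustration, not the paper's procedure.

```python
import random

def sample_calibration(tokens, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random contiguous windows of seq_len tokens."""
    if len(tokens) < seq_len:
        raise ValueError("token stream is shorter than seq_len")
    rng = random.Random(seed)  # local RNG so results are reproducible
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, len(tokens) - seq_len)  # inclusive bounds
        windows.append(tokens[start:start + seq_len])
    return windows

# Usage with a dummy token stream; real use would tokenize C4 text first.
stream = list(range(10_000))
calib_64 = sample_calibration(stream, n_samples=64, seq_len=128)
calib_128 = sample_calibration(stream, n_samples=128, seq_len=128)
```

Running the pruning pipeline with 16, 32, 64, and 128 samples drawn this way would show whether the plateau near 64 samples holds on your data.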

When Not To Use

If you need standard n:m structured sparsity for cuSPARSELt acceleration on NVIDIA GPUs.

If you cannot provide a small calibration set or run the block-wise pipeline.
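A quick way to see why an unstructured mask does not qualify for n:m acceleration is to check the pattern directly. The sketch below assumes the common 2:4 convention, in which every aligned group of 4 weights keeps at most 2. The function name is hypothetical.

```python
def satisfies_n_m(mask_row, n=2, m=4):
    """True if every aligned group of m mask entries keeps at most n weights."""
    if len(mask_row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    return all(sum(mask_row[i:i + m]) <= n
               for i in range(0, len(mask_row), m))

# Both masks are 50% sparse overall, but only the first is 2:4 compliant.
structured = [1, 1, 0, 0, 0, 1, 0, 1]
unstructured = [1, 1, 1, 0, 0, 0, 0, 1]

print(satisfies_n_m(structured))    # True
print(satisfies_n_m(unstructured))  # False: the first group keeps 3 weights
```

A mask learned without this constraint will generally fail the check in some groups, which is why it needs accelerator support or specialized kernels instead.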

Failure Modes

Insufficient calibration samples can increase perplexity degradation.

Incorrect choice of sparsity-step or D may cause convergence issues.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-30B, LLaMA-65B, LLaMA2-7B, LLaMA2-13B, LLaMA2-70B

Metrics

Perplexity, Accuracy, Simulated runtime cycles / speedup

Datasets

WikiText2, C4, PTB, Calibration set (128 sequences from C4)

Benchmarks

PIQA, BoolQ, HellaSwag, WinoGrande, ARC Easy, ARC Challenge