BESA: differentiable block-wise pruning that learns layer sparsity — prunes 7B–70B models on one A100 in hours

February 18, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

3

Authors

Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

Links

Abstract / PDF

Why It Matters For Business

BESA makes aggressive pruning of large LLMs practical on a single A100 GPU, preserving model quality and enabling lower-cost deployment or faster inference when paired with quantization.

Summary TLDR

BESA is a practical pruning method for transformer LLMs that (1) prunes by minimizing reconstruction error at the transformer-block level instead of per layer, and (2) learns layer-specific sparsity via a small set of differentiable parameters. It prunes LLaMA-family models (7B–70B tested) on a single A100 GPU in hours and can be jointly optimized with weight-only quantization. Across perplexity and zero-shot tasks, BESA consistently outperforms prior one-shot pruning methods (SparseGPT, Wanda) at similar sparsity. Simulations with the ViTCoD accelerator simulator show per-layer speedups, provided the hardware supports irregular (unstructured) sparsity.

Problem Statement

Current one-shot LLM pruning methods prune per layer with a hand-chosen uniform sparsity. That causes error to accumulate across layers and forces manual tuning of per-layer sparsity. The paper targets a fast, memory-efficient method that finds layer-wise sparsity automatically and reduces output degradation after pruning.

Main Contribution

A block-wise pruning framework that minimizes reconstruction error at the transformer-block level, reducing error accumulation.

A parameter-efficient, differentiable method to learn layer-specific sparsity using a small set of combination coefficients and straight-through gradients.

A joint pruning+quantization pipeline and experiments showing state-of-the-art perplexity and zero-shot accuracy versus SparseGPT and Wanda, plus simulated speedup analysis.
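The learned layer sparsity can be pictured as a weighted mix of candidate pruning rates. The sketch below is illustrative only: it assumes the combination coefficients are a softmax over per-layer logits and that the candidate rates are evenly spaced at d / D; the function names and the toy logits are hypothetical, and the paper's exact parameterization may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_sparsity(logits):
    """Mix D candidate pruning rates with softmax coefficients.

    Assumption: candidate rates are 0, 1/D, ..., (D-1)/D.
    """
    D = len(logits)
    betas = softmax(logits)            # combination coefficients, sum to 1
    rates = [d / D for d in range(D)]  # candidate pruning rates
    return sum(b * p for b, p in zip(betas, rates))

# Hypothetical logits for D = 4 candidates.
print(expected_sparsity([0.0, 0.0, 0.0, 0.0]))   # equal weights on 0, 0.25, 0.5, 0.75
print(expected_sparsity([0.0, 0.0, 10.0, 0.0]))  # mass concentrated on the 0.5 candidate
```

Because the sparsity is a smooth function of the logits, a sparsity target can be added to the loss and optimized by gradient descent; only D numbers per layer are learned.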

Key Findings

Lower perplexity than prior one-shot methods at 50% unstructured sparsity on LLaMA models.

Numbers: LLaMA2-70B Wikitext2 perplexity, BESA 4.09 vs SparseGPT 4.25 (Table 1)

Prunes large LLMs quickly on a single GPU.

Numbers: 50% pruning of LLaMA2-70B in ~5 hours on a single A100-80GB (paper claim)

Joint pruning + 4-bit weight-only quantization with BESA outperforms the Wanda-based joint baseline.

Numbers: LLaMA-7B Wikitext2 perplexity, Joint (BESA) 7.00 vs Joint-Wanda 7.44 (Table 3)

Simulated runtime speedups on a sparsity-aware accelerator.

Numbers: Reported per-layer speedups of 1.48×–1.98×, e.g., gate proj 1.94–1.98× (Table 4)

Small calibration sets suffice.

Numbers: Calibration benefit plateaus after ~64 samples; 128 used by default (Appendix A)

Results

Perplexity (Wikitext2)
Value: 4.09
Baseline: SparseGPT

Perplexity (Wikitext2)
Value: 6.86
Baseline: Wanda

Joint quant+prune perplexity (Wikitext2)
Value: 7.00
Baseline: Joint-Wanda

Simulated speedup (ViTCoD)
Value: 1.83×–1.98×
Baseline: Dense runtime

Pruning time
Value: ≈5 hours
Baseline: n/a
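To put the perplexity gaps in relative terms, the short calculation below uses the two BESA-vs-baseline pairs quoted under Key Findings (4.09 vs 4.25 and 7.00 vs 7.44). The helper name is hypothetical; this is plain arithmetic on the reported numbers, not a reproduction of the experiments.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0

# (baseline perplexity, BESA perplexity) as quoted in this summary
results = {
    "LLaMA2-70B, 50% sparsity, vs SparseGPT": (4.25, 4.09),
    "LLaMA-7B, joint prune+quant, vs Joint-Wanda": (7.44, 7.00),
}

for name, (baseline, besa) in results.items():
    print(f"{name}: {relative_reduction(baseline, besa):.1f}% lower perplexity")
# LLaMA2-70B, 50% sparsity, vs SparseGPT: 3.8% lower perplexity
# LLaMA-7B, joint prune+quant, vs Joint-Wanda: 5.9% lower perplexity
```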

Who Should Care

What To Try In 7 Days

Run BESA on a smaller LLaMA (7B) with 128 calibration sequences and D=100 (the number of candidate sparsity rates) to reproduce baseline gains.

Try 50% unstructured pruning + 4-bit quantization to test memory and latency trade-offs.

Simulate runtime on your target accelerator, or use a ViTCoD-like simulator, to estimate real speedups.
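Before running the 50% pruning + 4-bit experiment, a back-of-envelope estimate of weight memory helps set expectations. The sketch below is an assumption-laden illustration: it takes a nominal 7e9 parameters, ignores quantization scales, activations, and the KV cache, and models sparse storage with a hypothetical 1-bit-per-weight bitmap plus 4-bit values for the kept weights. Real sparse formats and kernels will differ.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes) at a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9   # nominal size of a "7B" model (assumption)
sparsity = 0.5   # fraction of weights pruned

dense_fp16 = weight_memory_gb(n_params, 16)
dense_int4 = weight_memory_gb(n_params, 4)

# Hypothetical bitmap format: 1 mask bit per weight + 4 bits per kept weight.
sparse_bits = 1 + (1 - sparsity) * 4
sparse_int4 = weight_memory_gb(n_params, sparse_bits)

print(f"dense fp16:        {dense_fp16:.3f} GB")
print(f"dense 4-bit:       {dense_int4:.3f} GB")
print(f"50% sparse 4-bit:  {sparse_int4:.3f} GB")
```

Under these assumptions the mask overhead eats a good part of the pruning savings at 4 bits, which is one reason to measure latency and memory on the actual target stack.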

Optimization Features

Infra Optimization

  • Single A100-80GB pruning for 7B–70B models
  • ViTCoD simulator used to estimate accelerator-level speedups

Model Optimization

  • Unstructured weight pruning
  • Learnable per-layer sparsity
  • Joint weight-only quantization

System Optimization

  • Block-wise pipeline lowers peak GPU memory (prune one block at a time)
  • Custom CUDA operator for row-wise mask generation (paper implementation)
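Row-wise mask generation, which the paper implements as a custom CUDA operator, can be sketched on the CPU as follows. This is a minimal pure-Python illustration, not the paper's kernel: the importance scores are a hypothetical input (for example, a magnitude-based metric), and the function name is made up for this example.

```python
def row_wise_mask(importance, sparsity):
    """Return a 0/1 mask that drops the lowest-importance entries in each row."""
    mask = []
    for row in importance:
        n_prune = int(round(sparsity * len(row)))
        # Column indices ordered from least to most important.
        order = sorted(range(len(row)), key=lambda j: row[j])
        pruned = set(order[:n_prune])
        mask.append([0 if j in pruned else 1 for j in range(len(row))])
    return mask

# Toy importance scores for a 2x4 weight matrix (hypothetical values).
importance = [[0.9, 0.1, 0.5, 0.3],
              [0.2, 0.8, 0.4, 0.6]]
mask = row_wise_mask(importance, 0.5)
print(mask)  # [[1, 0, 1, 0], [0, 1, 0, 1]]
```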

Training Optimization

  • Differentiable sparsity allocation via combination coefficients
  • Straight-through estimator for mask gradients
  • Block-wise reconstruction loss to reduce error accumulation
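The block-wise reconstruction loss compares the output of a whole block before and after pruning, so the masks of all layers in the block are judged together rather than one layer at a time. The sketch below illustrates that objective on a toy stand-in for a transformer block (two linear layers with a ReLU); the block structure, weights, and masks are hypothetical, and no gradient or straight-through step is shown.

```python
def matvec(W, x):
    """Matrix-vector product for nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def apply_mask(W, M):
    """Element-wise product of a weight matrix and a 0/1 mask."""
    return [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(W, M)]

def relu(v):
    return [max(0.0, a) for a in v]

def block(x, W1, W2):
    """Toy stand-in for a transformer block: linear, ReLU, linear."""
    return matvec(W2, relu(matvec(W1, x)))

def block_reconstruction_error(xs, W1, W2, M1, M2):
    """Mean squared error between dense and pruned block outputs."""
    total, count = 0.0, 0
    for x in xs:
        dense = block(x, W1, W2)
        pruned = block(x, apply_mask(W1, M1), apply_mask(W2, M2))
        total += sum((d - p) ** 2 for d, p in zip(dense, pruned))
        count += len(dense)
    return total / count

W1 = [[1.0, 0.5], [-0.5, 2.0]]
W2 = [[1.0, 1.0]]
calibration = [[1.0, 2.0]]  # one toy calibration input

keep_all = block_reconstruction_error(calibration, W1, W2, [[1, 1], [1, 1]], [[1, 1]])
drop_one = block_reconstruction_error(calibration, W1, W2, [[1, 0], [1, 1]], [[1, 1]])
print(keep_all, drop_one)  # 0.0 1.0
```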

Inference Optimization

  • Layer-adaptive sparsity to reduce unnecessary compute
  • Demonstrated simulated speedups on sparsity-aware accelerator

Reproducibility

Code Urls

  • code link referenced in paper (not provided in text)

Code Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Unstructured sparsity may not map directly to standard n:m GPU speedups; requires accelerator support or specialized kernels.
  • Performance depends on calibration data (but gains plateau after ≈64 samples).
  • Code link referenced but not provided in text; reproduction may need authors' repo.
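The first limitation can be made concrete with a small check of whether a mask fits an n:m pattern (at most n kept weights in every group of m consecutive weights, as in 2:4 sparsity). The helper below is a hypothetical illustration; it shows that a mask with 50% overall sparsity need not satisfy 2:4.

```python
def satisfies_n_m(mask_row, n=2, m=4):
    """True if every group of m consecutive entries keeps at most n weights.

    mask_row uses 1 for a kept weight and 0 for a pruned weight.
    """
    if len(mask_row) % m != 0:
        return False
    return all(sum(mask_row[i:i + m]) <= n for i in range(0, len(mask_row), m))

unstructured = [1, 1, 1, 0, 0, 0, 1, 0]  # 50% sparse overall, first group keeps 3
structured = [1, 0, 1, 0, 0, 1, 0, 1]    # 50% sparse, 2 kept in each group of 4

print(satisfies_n_m(unstructured))  # False
print(satisfies_n_m(structured))    # True
```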

When Not To Use

  • If you need standard n:m structured sparsity for cuSPARSELt acceleration on NVIDIA GPUs.
  • If you cannot provide a small calibration set or run the block-wise pipeline.
  • When you require formal guarantees from full retraining rather than one-shot methods.

Failure Modes

  • Insufficient calibration samples can increase perplexity degradation.
  • A poorly chosen sparsity step or D may cause convergence issues.
  • Joint quant+prune might amplify errors if quantization clipping is not well tuned.

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-30B
  • LLaMA-65B
  • LLaMA2-7B
  • LLaMA2-13B
  • LLaMA2-70B

Metrics

  • Perplexity
  • Accuracy
  • Simulated runtime cycles / speedup

Datasets

  • WikiText2
  • C4
  • PTB
  • Calibration set (128 sequences from C4)

Benchmarks

  • PIQA
  • BoolQ
  • HellaSwag
  • WinoGrande
  • ARC Easy
  • ARC Challenge