Sparsify then quantize — the proven best order; combining them still adds nontrivial error

May 31, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh


Why It Matters For Business

If you compress models with pruning and block-wise quantization, both the order of the two steps and the methods you pick change accuracy, and therefore service quality. Applying sparsity before quantization (S → Q) is an easy rule that avoids unnecessary accuracy loss.

Summary TLDR

The paper proves mathematically and shows empirically that sparsity (magnitude-based pruning) and max‑scaled block-wise quantization are not orthogonal: the order matters and combining them introduces extra error. Applying sparsity before quantization (S → Q) is optimal under the studied settings. Even with optimal ordering and sparsity-aware fine‑tuning, combined compression can raise perplexity or loss substantially for LLMs and vision models. The work gives layer-level error analysis, an orthogonality threshold to predict compounded error, and practical rules-of-thumb for deployers.
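
The ordering claim can be illustrated on a single weight tensor. The following is a minimal sketch, not the paper's implementation: it assumes a symmetric integer grid per block as a stand-in for max-scaled block formats, prunes by magnitude inside each block (so pruning groups and quantization blocks coincide), and uses random Gaussian weights. The helper names `prune_per_block` and `quantize_max_scaled` are hypothetical.

```python
import numpy as np

def prune_per_block(w, sparsity=0.5, block=64):
    """Zero the smallest-magnitude entries inside every block of `block` values."""
    blocks = w.reshape(-1, block).copy()
    k = int(sparsity * block)  # entries to zero per block
    if k > 0:
        # argpartition places the k smallest magnitudes in the first k slots
        idx = np.argpartition(np.abs(blocks), k - 1, axis=1)[:, :k]
        np.put_along_axis(blocks, idx, 0.0, axis=1)
    return blocks.reshape(w.shape)

def quantize_max_scaled(w, bits=4, block=64):
    """Symmetric integer grid per block, scaled by the block's max magnitude.
    Assumption: a simplified stand-in for formats such as HBFP, MXFP, or INT8."""
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero blocks
    return (np.round(blocks / scale) * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64))  # toy weights, not a real checkpoint

s_then_q = quantize_max_scaled(prune_per_block(w))  # S -> Q
q_then_s = prune_per_block(quantize_max_scaled(w))  # Q -> S

err_sq = np.linalg.norm(w - s_then_q)
err_qs = np.linalg.norm(w - q_then_s)
print(f"S -> Q error: {err_sq:.4f}")
print(f"Q -> S error: {err_qs:.4f}")
```

In this toy setup, pruning never removes a block's largest entry, so both orders use the same scale and differ only in which of several tied quantized values get pruned. Under those assumptions S → Q cannot produce a larger L2 error than Q → S; real models, other pruning scopes, and other formats need to be measured rather than assumed.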

Problem Statement

Practitioners commonly combine sparsity and quantization to shrink models. Many assume the two effects add independently (are orthogonal). This paper asks: do they interact? If yes, which order is best and how large is the extra error when you combine them?

Main Contribution

Mathematical proof that magnitude-based sparsity and max-scaled block-wise quantization are non-orthogonal and can compound errors.

A formal argument and proof that sparsity before quantization (S → Q) is optimal under studied assumptions.

Extensive experiments on LLMs (OPT, LLaMA), ViT and ResNet showing S → Q consistently beats Q → S and that combined compression can add substantial error.

A practical metric (orthogonality threshold) to estimate when combined compression will exceed the sum of individual errors.

Layer-wise analysis showing error accumulates through transformer layers and quantization-before-sparsity accelerates accumulation.
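
A per-layer output error of this kind can be measured with a few lines of NumPy. This is only a sketch under loud assumptions: a toy stack of random linear layers with ReLU stands in for a transformer, the quantizer is a simplified symmetric integer grid per block, and pruning is done per block. It shows how to collect the measurement and does not reproduce the paper's figures. All helper names are hypothetical.

```python
import numpy as np

def prune_per_block(w, sparsity=0.5, block=64):
    """Zero the smallest-magnitude entries inside every block."""
    blocks = w.reshape(-1, block).copy()
    k = int(sparsity * block)
    if k > 0:
        idx = np.argpartition(np.abs(blocks), k - 1, axis=1)[:, :k]
        np.put_along_axis(blocks, idx, 0.0, axis=1)
    return blocks.reshape(w.shape)

def quantize_max_scaled(w, bits=4, block=64):
    """Symmetric integer grid per block, scaled by the block's max magnitude."""
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return (np.round(blocks / scale) * scale).reshape(w.shape)

def layer_errors(weights, compress, x):
    """Relative L2 error of each layer's output versus the dense network."""
    dense, comp, errors = x, x, []
    for w in weights:
        dense = np.maximum(dense @ w, 0.0)          # dense layer + ReLU
        comp = np.maximum(comp @ compress(w), 0.0)  # same layer, compressed weights
        errors.append(np.linalg.norm(dense - comp) / np.linalg.norm(dense))
    return errors

rng = np.random.default_rng(0)
dim, depth = 64, 8  # toy sizes chosen for illustration
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
x = rng.standard_normal((32, dim))

s_then_q = layer_errors(weights, lambda w: quantize_max_scaled(prune_per_block(w)), x)
q_then_s = layer_errors(weights, lambda w: prune_per_block(quantize_max_scaled(w)), x)

for i, (a, b) in enumerate(zip(s_then_q, q_then_s), start=1):
    print(f"layer {i}: S->Q {a:.3f}   Q->S {b:.3f}")
```

Because the compressed activations feed the next layer, each entry reflects the cumulative error up to that depth rather than the error of one layer in isolation.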

Key Findings

Sparsity and max-scaled block-wise quantization are non-orthogonal.

Applying sparsity before quantization (S → Q) is provably optimal for the studied transforms.

Combined compression can meaningfully hurt model outputs; worst-case extra error is sizable.

Numbers: Up to 13% larger perplexity error reported

Order Q → S (quantize then prune) can cause quantization-induced collisions that make pruning remove previously important weights.

Numbers: Block-level unique-value reductions of up to 64% observed (example block size 64)

Error accumulates across layers; S → Q yields lower per-layer and cumulative errors than Q → S.

Numbers: Layer errors grow with depth; S → Q shows consistently lower per-layer L2 error (Figure 1 of the paper)

Hardware-friendly 8-bit quantization combined with 50% sparsity gives large memory/bandwidth gains while often keeping accuracy acceptable.

Numbers: At 50% sparsity, 8-bit and 6-bit quantization reduce memory and bandwidth by 8× and 10.7× respectively, relative to dense FP32
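
The memory and bandwidth figures follow from simple arithmetic against a dense FP32 baseline. The sketch below assumes an idealized count that ignores sparsity index metadata and shared per-block scales or exponents, so real savings would be somewhat lower. The function name `compression_ratio` is hypothetical.

```python
def compression_ratio(bits, sparsity, baseline_bits=32):
    """Dense-baseline bits divided by the bits kept after pruning and quantization.
    Idealized: no cost for sparsity indices or per-block scales."""
    kept_fraction = 1.0 - sparsity
    return baseline_bits / (bits * kept_fraction)

for bits in (8, 6):
    print(f"{bits}-bit at 50% sparsity: {compression_ratio(bits, 0.5):.1f}x")
# prints:
# 8-bit at 50% sparsity: 8.0x
# 6-bit at 50% sparsity: 10.7x
```

Going from 32 to 8 bits gives 4×, and keeping half the weights doubles that to 8×; 32 to 6 bits gives about 5.33×, which doubles to about 10.7×.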

Results

Perplexity increase (combined vs dense)

Value: up to +13% (relative) on evaluated perplexity benchmarks

Baseline: dense FP32

Memory & bandwidth reduction

Value: 8× (8-bit) and 10.7× (6-bit) at 50% sparsity

Baseline: dense FP32

Order gap (S → Q vs Q → S)

Value: S → Q consistently yields lower perplexity; example checkpoints show up to 7% relative gap

Baseline: Q → S

Who Should Care

What To Try In 7 Days

Apply magnitude-based pruning (sparsity) first, then apply post-training block-wise quantization to the sparse model (S → Q).

Run the orthogonality threshold check: compare the combined error against the sum of the single-method errors. If the combined error is larger, the two methods are compounding rather than acting independently.

Start with 8-bit block formats + 50% sparsity as a baseline; profile memory and accuracy trade-offs.
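
The threshold check in the steps above can be sketched as a small helper. This is an interpretation of the rule as summarized here, not the paper's exact definition: it assumes a lower-is-better metric such as perplexity or loss, measured on the same evaluation set for four model variants. The names `OrthogonalityCheck` and `orthogonality_check` are hypothetical, and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
from typing import NamedTuple

class OrthogonalityCheck(NamedTuple):
    additive_error: float   # sparse-only error + quant-only error
    combined_error: float   # error of the sparse-and-quantized model
    compounded: bool        # True if combined error exceeds the additive sum

def orthogonality_check(dense, sparse_only, quant_only, combined):
    """Compare combined degradation with the sum of single-method degradations.
    All arguments are values of one lower-is-better metric (e.g. perplexity)."""
    err_sparse = sparse_only - dense
    err_quant = quant_only - dense
    err_combined = combined - dense
    additive = err_sparse + err_quant
    return OrthogonalityCheck(additive, err_combined, err_combined > additive)

# Placeholder metric values for illustration only.
result = orthogonality_check(dense=10.0, sparse_only=10.4, quant_only=10.3, combined=11.0)
print(result)
```

If `compounded` comes back true for a given model and format, treat the sparsity and bitwidth settings as coupled and tune them together rather than separately.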

Optimization Features

Infra Optimization

  • Formats optimized for hardware (HBFP, MXFP, INT8)

Model Optimization

  • Quantization
  • Sparsity
  • N:M structured sparsity
  • Unstructured sparsity
  • Block-wise (max-scaled) quantization

System Optimization

  • Memory and bandwidth reduction via sparsity+quantization

Training Optimization

  • Sparsity-aware fine-tuning
  • One-shot pruning (SparseGPT, Wanda)
  • GPTQ-style compensation (post-training quant)

Inference Optimization

  • Post-training quantization of sparse models
  • 8-bit block formats as FP32 replacement

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Analysis assumes max-scaled block-wise quantization and magnitude-based pruning; other quantizers or pruning rules may behave differently
  • Heterogeneous layer-wise sparsity/bitwidth schemes are out of scope and may change trade-offs
  • Many experiments rely on sparsity-aware fine-tuning and keep master weights in FP32; results without fine-tuning are worse

When Not To Use

  • When you cannot fine-tune after pruning — one-shot magnitude pruning without retraining causes large degradation
  • When using quantizers that are not max-scaled, or pruning policies that rely on activation-aware scores, unless you first verify that the ordering result still holds for them

Failure Modes

  • Quantize-then-prune (Q → S) can create value collisions so pruning removes important weights, increasing error
  • Errors introduced early in deep models can amplify through layers and break downstream outputs
  • Combining aggressive sub-8-bit formats with structured sparsity can cause large accuracy drops even in S → Q order
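
The value-collision failure mode can be observed by counting distinct magnitudes per block before and after quantization. The sketch below assumes a simplified symmetric integer grid per block and random Gaussian weights, so the percentages it prints are not comparable to the paper's reported figures. The helper names are hypothetical.

```python
import numpy as np

def quantize_max_scaled(w, bits=6, block=64):
    """Symmetric integer grid per block, scaled by the block's max magnitude."""
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero blocks
    return (np.round(blocks / scale) * scale).reshape(w.shape)

def unique_magnitudes(x):
    """Number of distinct |value| entries in each row (one row = one block)."""
    return np.array([np.unique(np.abs(row)).size for row in x])

rng = np.random.default_rng(0)
block = 64
w = rng.standard_normal((256, block))  # toy weights, one block per row
wq = quantize_max_scaled(w, bits=6, block=block)

before = unique_magnitudes(w)
after = unique_magnitudes(wq)
reduction = 1.0 - after / before

print(f"mean unique magnitudes per block: {before.mean():.1f} -> {after.mean():.1f}")
print(f"largest per-block reduction: {reduction.max():.0%}")
```

Every pair of weights that collapses onto the same quantized magnitude becomes indistinguishable to a magnitude-based pruner, which is why pruning after quantization can drop a weight that was originally the larger of the two.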

Core Entities

Models

  • OPT (125M, 350M, 1.3B, 6.7B variants)
  • LLaMA (LLaMA-2-7B, LLaMA-3-8B)
  • ViT-B/16
  • ResNet-50

Metrics

  • Perplexity
  • Cross-entropy loss
  • Accuracy
  • L2 per-layer output error

Datasets

  • WikiText2
  • ImageNet-1k

Benchmarks

  • Perplexity
  • Cross-entropy loss
  • Zero-shot tasks (ARC, HellaSwag, WinoGrande)