Sparsify then quantize — the proven best order; combining them still adds nontrivial error

May 31, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper pairs formal proofs with multi-model experiments; the theoretical claims match the empirical patterns but apply only to max-scaled block-wise quantization and magnitude-based pruning.

Citations: 2

Evidence Strength: 0.90

Confidence: 0.87

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh

Why It Matters For Business

If you compress models with pruning and block-wise quantization, the order and the choice of methods change accuracy and therefore service quality; applying sparsity before quantization (S → Q) is an easy rule that avoids unnecessary accuracy loss.

Summary TLDR

The paper proves mathematically and shows empirically that sparsity (magnitude-based pruning) and max‑scaled block-wise quantization are not orthogonal: the order matters and combining them introduces extra error. Applying sparsity before quantization (S → Q) is optimal under the studied settings. Even with optimal ordering and sparsity-aware fine‑tuning, combined compression can raise perplexity or loss substantially for LLMs and vision models. The work gives layer-level error analysis, an orthogonality threshold to predict compounded error, and practical rules-of-thumb for deployers.

Problem Statement

Practitioners commonly combine sparsity and quantization to shrink models. Many assume the two effects add independently (are orthogonal). This paper asks: do they interact? If yes, which order is best and how large is the extra error when you combine them?
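One way to make "orthogonal" precise is sketched below. The symbols are assumptions introduced here for illustration and may differ from the paper's notation: $x$ is a tensor, $S$ the sparsity transform, $Q$ the quantizer, and $\|\cdot\|$ a vector norm such as L2.

```latex
\begin{aligned}
\varepsilon_S(x) &= x - S(x), &\qquad \varepsilon_Q(x) &= x - Q(x), \\
\varepsilon_{S \to Q}(x) &= x - Q(S(x)), &\qquad \varepsilon_{Q \to S}(x) &= x - S(Q(x)), \\
\text{orthogonal if}\quad \|\varepsilon_{\mathrm{combined}}(x)\| &\le \|\varepsilon_S(x)\| + \|\varepsilon_Q(x)\|.
\end{aligned}
```

Under this reading, the right-hand side is the orthogonality threshold: a combined error above it means the two methods produce more error together than their individual errors would predict.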

Main Contribution

Mathematical proof that magnitude-based sparsity and max-scaled block-wise quantization are non-orthogonal and can compound errors.

A formal argument and proof that sparsity before quantization (S → Q) is optimal under studied assumptions.

Key Findings

Sparsity and max-scaled block-wise quantization are non-orthogonal.

Practical Use: Do not assume pruning+quantization errors just add; expect extra, order-dependent error and test combined configs before deployment.

Evidence Ref: Theorems 3.5–3.10 and Section 3

Applying sparsity before quantization (S → Q) is provably optimal for the studied transforms.

Practical Use: When using magnitude-based pruning and max-scaled block quantization, apply pruning first, then quantize the sparse model.

Evidence Ref: Theorem 3.5 and proof in Appendix J
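The two orders can be compared on a toy weight tensor. The following is a minimal sketch, not the paper's code: the helper names `quantize_blockwise` and `prune_magnitude`, the symmetric integer grid, the 4-bit width, the block size of 32, and the random Gaussian weights are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_blockwise(x, bits=4, block=32):
    """Max-scaled symmetric block quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4 bits
    blocks = x.reshape(-1, block)                    # x.size must be divisible by block
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # all-zero block: scale is irrelevant
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)

def prune_magnitude(x, sparsity=0.5):
    """Unstructured magnitude pruning: zero the smallest-|x| fraction of entries."""
    flat = x.ravel().copy()
    k = int(sparsity * flat.size)
    drop = np.argsort(np.abs(flat), kind="stable")[:k]
    flat[drop] = 0.0
    return flat.reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)                        # hypothetical weight tensor

err_s = np.linalg.norm(w - prune_magnitude(w))                        # sparsity only
err_q = np.linalg.norm(w - quantize_blockwise(w))                     # quantization only
err_sq = np.linalg.norm(w - quantize_blockwise(prune_magnitude(w)))   # S -> Q
err_qs = np.linalg.norm(w - prune_magnitude(quantize_blockwise(w)))   # Q -> S

print(f"S only: {err_s:.3f}  Q only: {err_q:.3f}")
print(f"S -> Q: {err_sq:.3f}  Q -> S: {err_qs:.3f}")
```

In this sketch, pruning first leaves every block maximum untouched, so the quantizer uses the same scales as it would on the dense tensor and the zeros stay exactly zero. A single random tensor does not stand in for the theorems; it only shows how to measure both orders side by side on your own weights.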

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Perplexity increase (combined vs dense) | up to +13% (relative) on evaluated perplexity benchmarks | dense FP32 | up to +13% relative | WikiText2 and reported LLMs | Abstract; Section 4 | Abstract; Table 2 |
| Memory & bandwidth reduction | … (8-bit) and 10.7× (6-bit) at 50% sparsity | dense FP32 | … and 10.7× reductions | deployment resource estimate | Discussion Section 5 | Section 5 |
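The memory figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes an idealized model in which only nonzero values are stored and the storage for per-block scales and sparsity indices is ignored; the function name `ideal_compression_ratio` is hypothetical. The 10.7× figure for 6-bit at 50% sparsity is consistent with this formula.

```python
def ideal_compression_ratio(bits, sparsity, base_bits=32):
    """Idealized size reduction versus a dense tensor of base_bits-wide values.

    Assumes only nonzero values are stored and ignores per-block scale
    storage and sparsity index metadata, so real savings will be lower.
    """
    dense_to_quant = base_bits / bits            # fewer bits per stored value
    dense_to_sparse = 1.0 / (1.0 - sparsity)     # fewer stored values
    return dense_to_quant * dense_to_sparse

for bits in (8, 6):
    print(f"{bits}-bit at 50% sparsity: {ideal_compression_ratio(bits, 0.5):.1f}x")
# 8-bit at 50% sparsity: 8.0x
# 6-bit at 50% sparsity: 10.7x
```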

What To Try In 7 Days

Apply magnitude-based pruning (sparsity) first, then post‑training block-wise quantize the sparse model (S → Q).

Check the orthogonality threshold: compare the combined error against the sum of the single-method errors; a combined error above that sum indicates non-orthogonality.

Start with 8-bit block formats + 50% sparsity as a baseline; profile memory and accuracy trade-offs.
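The threshold check from the list above can be sketched for a single linear layer, measuring the L2 error on the layer output. Everything here is an assumption for illustration: the function names, the symmetric 8-bit grid, the block size of 32, and the random weights and calibration batch. In practice you would substitute a real layer and real activations.

```python
import numpy as np

def quantize_blockwise(w, bits=8, block=32):
    """Max-scaled symmetric block quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block)                    # w.size must be divisible by block
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # all-zero block: scale is irrelevant
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)

def prune_magnitude(w, sparsity=0.5):
    """Zero the smallest-magnitude fraction of the weights."""
    flat = w.ravel().copy()
    k = int(sparsity * flat.size)
    flat[np.argsort(np.abs(flat), kind="stable")[:k]] = 0.0
    return flat.reshape(w.shape)

def orthogonality_report(w, x, bits=8, sparsity=0.5):
    """Compare the combined output error with the sum of single-method errors."""
    def out_err(w_hat):
        return float(np.linalg.norm(x @ w.T - x @ w_hat.T))

    err_s = out_err(prune_magnitude(w, sparsity))
    err_q = out_err(quantize_blockwise(w, bits))
    err_sq = out_err(quantize_blockwise(prune_magnitude(w, sparsity), bits))
    threshold = err_s + err_q
    return {
        "sparsity": err_s,
        "quantization": err_q,
        "threshold": threshold,
        "combined_s_then_q": err_sq,
        "exceeds_threshold": err_sq > threshold,
    }

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))    # hypothetical layer weights, shape (out, in)
x = rng.standard_normal((16, 128))    # hypothetical calibration batch
report = orthogonality_report(w, x)
for name, value in report.items():
    print(name, value)
```

Running the report per layer gives a quick map of where the combined configuration deviates from the additive assumption, which is where a full accuracy evaluation is most worth the time.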

Optimization Features

Infra Optimization: Formats optimized for hardware (HBFP, MXFP, INT8)
Model Optimization: Quantization; Sparsity; N:M structured sparsity; Unstructured sparsity; Block-wise (max-scaled) quantization
System Optimization: Memory and bandwidth reduction via sparsity+quantization
Training Optimization: Sparsity-aware fine-tuning; One-shot pruning (SparseGPT, Wanda); GPTQ-style compensation (post-training quant)
Inference Optimization: Post-training quantization of sparse models; 8-bit block formats as FP32 replacement
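Of the sparsity variants listed under Model Optimization, N:M structured sparsity is the less familiar one: it keeps the N largest-magnitude values in every group of M consecutive values, instead of ranking all weights globally. A minimal sketch, assuming a 2:4 pattern and a hypothetical helper named `prune_n_m`:

```python
import numpy as np

def prune_n_m(x, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m consecutive values."""
    groups = x.reshape(-1, m)                        # x.size must be divisible by m
    # Column indices of the (m - n) smallest magnitudes within each group.
    drop = np.argsort(np.abs(groups), axis=1, kind="stable")[:, : m - n]
    out = groups.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(x.shape)

w = np.array([0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5])
pruned = prune_n_m(w)
print(pruned.tolist())
# [0.0, -0.9, 0.4, 0.0, 2.0, 0.0, 0.0, 1.5]
```

Because each group always retains exactly N values, the zero pattern is regular, which is what makes this layout easier for hardware to exploit than unstructured sparsity.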

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Analysis assumes max-scaled block-wise quantization and magnitude-based pruning; other quantizers or pruning rules may behave differently

Heterogeneous layer-wise sparsity/bitwidth schemes are out of scope and may change trade-offs

When Not To Use

When you cannot fine-tune after pruning — one-shot magnitude pruning without retraining causes large degradation

When using quantizers that are not max-scaled, or pruning policies based on activation-aware scores, without first verifying the results for that setup

Failure Modes

Quantize-then-prune (Q → S) can create value collisions so pruning removes important weights, increasing error

Errors introduced early in deep models can amplify through layers and break downstream outputs
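The value-collision failure mode can be made concrete by counting distinct magnitudes in one block before and after quantization. This is a sketch under assumed settings (a symmetric 4-bit grid, a block of 32 random Gaussian weights), not a reproduction of the paper's analysis.

```python
import numpy as np

bits, block = 4, 32
qmax = 2 ** (bits - 1) - 1                      # 7 positive levels for 4 bits

rng = np.random.default_rng(0)
w = rng.standard_normal(block)                  # one hypothetical block of weights

scale = np.abs(w).max() / qmax                  # max-scaled: largest |w| maps to qmax
levels = np.clip(np.round(w / scale), -qmax, qmax)

distinct_before = np.unique(np.abs(w)).size
distinct_after = np.unique(np.abs(levels)).size

print(f"distinct magnitudes before quantization: {distinct_before}")
print(f"distinct magnitudes after quantization:  {distinct_after} (at most {qmax + 1})")
```

After quantization, many weights share the same magnitude, so a magnitude-based pruner applied afterwards has to break ties arbitrarily. It can then drop a weight that was larger than one it keeps, which is the extra error Q → S risks and S → Q avoids.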

Core Entities

Models

OPT (125M, 350M, 1.3B, 6.7B variants), LLaMA (LLaMA-2-7B, LLaMA-3-8B), ViT-B/16, ResNet-50

Metrics

Perplexity, Cross-entropy loss, Accuracy, L2 per-layer output error

Datasets

WikiText2, ImageNet-1k

Benchmarks

Perplexity, Cross-entropy loss, Zero-shot tasks (ARC, HellaSwag, WinoGrande)