Sparsify then quantize — the proven best order; combining them still adds nontrivial error

Overview

Decision Snapshot: Ready For Pilot

The paper combines formal theorems with multi-model experiments. Its theoretical claims match the empirical patterns, but they apply only to max-scaled block-wise quantization and magnitude-based pruning.

Citations: 2

Evidence Strength: 0.90

Confidence: 0.87

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh

Links

Abstract / PDF

Why It Matters For Business

If you compress models with pruning and block-wise quantization, order and method choice change accuracy and thus service quality; using sparsity before quantization (S → Q) is an easy rule to reduce avoidable accuracy loss.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

The paper proves mathematically and shows empirically that sparsity (magnitude-based pruning) and max‑scaled block-wise quantization are not orthogonal: the order matters and combining them introduces extra error. Applying sparsity before quantization (S → Q) is optimal under the studied settings. Even with optimal ordering and sparsity-aware fine‑tuning, combined compression can raise perplexity or loss substantially for LLMs and vision models. The work gives layer-level error analysis, an orthogonality threshold to predict compounded error, and practical rules-of-thumb for deployers.

Problem Statement

Practitioners commonly combine sparsity and quantization to shrink models. Many assume the two effects add independently (are orthogonal). This paper asks: do they interact? If yes, which order is best and how large is the extra error when you combine them?

Main Contribution

Mathematical proof that magnitude-based sparsity and max-scaled block-wise quantization are non-orthogonal and can compound errors.

A formal argument and proof that sparsity before quantization (S → Q) is optimal under studied assumptions.

Key Findings

Sparsity and max-scaled block-wise quantization are non-orthogonal.

Practical Use: Do not assume pruning and quantization errors simply add; expect extra, order-dependent error and test combined configurations before deployment.

Evidence Ref: Theorems 3.5–3.10 and Section 3

Applying sparsity before quantization (S → Q) is provably optimal for the studied transforms.

Practical Use: When using magnitude-based pruning and max-scaled block quantization, apply pruning first, then quantize the sparse model.

Evidence Ref: Theorem 3.5 and proof in Appendix J
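The two orderings can be compared on a toy tensor. The sketch below is only an illustration under stated assumptions, not the paper's implementation: `block_quantize` and `magnitude_prune` are hypothetical helpers, the weights are synthetic Gaussian values, and the 4-bit width and block size of 16 are arbitrary choices.

```python
import numpy as np

def block_quantize(w, bits=4, block=16):
    """Illustrative max-scaled block-wise quantizer: each block of
    `block` weights shares one scale set by its largest magnitude."""
    blocks = w.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1                       # symmetric integer range
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # guard against all-zero blocks
    return (np.round(blocks / scale) * scale).reshape(w.shape)

def magnitude_prune(w, sparsity=0.5):
    """Unstructured magnitude pruning: zero the smallest-|w| fraction."""
    k = int(sparsity * w.size)
    out = w.copy()
    out[np.argsort(np.abs(w), kind="stable")[:k]] = 0.0
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                            # synthetic stand-in for a weight tensor

s_then_q = block_quantize(magnitude_prune(w))        # S -> Q
q_then_s = magnitude_prune(block_quantize(w))        # Q -> S

err_sq = np.linalg.norm(w - s_then_q)                # L2 error against the dense weights
err_qs = np.linalg.norm(w - q_then_s)
print(f"S->Q error: {err_sq:.4f}   Q->S error: {err_qs:.4f}")
```

A toy comparison like this can show whether the two orderings diverge on a given tensor, but it says nothing about end-task perplexity or accuracy; those still need to be measured on the real model.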

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity increase (combined vs dense)	up to +13% (relative) on evaluated perplexity benchmarks	dense FP32	up to +13% relative	WikiText2 and reported LLMs	Abstract; Section 4	Abstract; Table 2
Memory & bandwidth reduction	8× (8-bit) and 10.7× (6-bit) at 50% sparsity	dense FP32	8× and 10.7× reductions	deployment resource estimate	Discussion Section 5	Section 5
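The reported reduction factors are consistent with simple arithmetic on bit-width and density. The sketch below assumes a 32-bit dense baseline and ignores per-block scale storage and sparsity index metadata, so it is an upper-bound estimate; `compression_ratio` is a hypothetical helper, not something taken from the paper.

```python
def compression_ratio(bits, sparsity, baseline_bits=32):
    """Idealized footprint reduction versus a dense baseline.

    Assumes every kept weight costs `bits` bits and pruned weights cost
    nothing (no scale or index overhead is counted).
    """
    density = 1.0 - sparsity
    return (baseline_bits / bits) / density

for bits in (8, 6):
    print(f"{bits}-bit at 50% sparsity: {compression_ratio(bits, 0.5):.1f}x")
```

In a real deployment the effective ratio will be somewhat lower once block scales and sparsity metadata are stored alongside the weights.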

What To Try In 7 Days

Apply magnitude-based pruning (sparsity) first, then post‑training block-wise quantize the sparse model (S → Q).

Check the orthogonality threshold: compare the combined error against the sum of the sparsity-only and quantization-only errors to detect non-orthogonality quickly.

Start with 8-bit block formats + 50% sparsity as a baseline; profile memory and accuracy trade-offs.
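The orthogonality check in the list above can be sketched as a small helper. This is a minimal illustration that assumes errors are measured as increases of a single metric (for example perplexity) over the dense baseline; `orthogonality_gap` is a hypothetical name and the numbers are made-up placeholders, not results from the paper.

```python
def orthogonality_gap(dense, sparse_only, quant_only, combined):
    """Compare the combined error with the sum of single-method errors.

    All arguments are values of the same lower-is-better metric.
    Returns (threshold, combined_error, gap); a positive gap means the
    combined configuration is worse than the two errors simply added.
    """
    threshold = (sparse_only - dense) + (quant_only - dense)
    combined_error = combined - dense
    return threshold, combined_error, combined_error - threshold

# Placeholder measurements for illustration only.
threshold, combined_error, gap = orthogonality_gap(
    dense=10.0, sparse_only=10.5, quant_only=10.25, combined=11.5
)
print(f"threshold={threshold}, combined={combined_error}, gap={gap}")
if gap > 0:
    print("Combined error exceeds the threshold: the methods compound.")
```

Running this once per candidate configuration needs only four evaluations: dense, sparse-only, quantized-only, and combined.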

Optimization Features

Infra Optimization

Formats optimized for hardware (HBFP, MXFP, INT8)

Model Optimization

Quantization, Sparsity, N:M structured sparsity, Unstructured sparsity, Block-wise (max-scaled) quantization

System Optimization

Memory and bandwidth reduction via sparsity+quantization

Training Optimization

Sparsity-aware fine-tuning, One-shot pruning (SparseGPT, Wanda), GPTQ-style compensation (post-training quantization)

Inference Optimization

Post-training quantization of sparse models, 8-bit block formats as FP32 replacement

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Analysis assumes max-scaled block-wise quantization and magnitude-based pruning; other quantizers or pruning rules may behave differently

Heterogeneous layer-wise sparsity/bitwidth schemes are out of scope and may change trade-offs

When Not To Use

When you cannot fine-tune after pruning — one-shot magnitude pruning without retraining causes large degradation

When using quantizers that are not max-scaled, or pruning policies based on activation-aware scores, unless you verify the combined result empirically

Failure Modes

Quantize-then-prune (Q → S) can create value collisions so pruning removes important weights, increasing error

Errors introduced early in deep models can amplify through layers and break downstream outputs
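The value-collision failure mode can be made concrete with a toy count of distinct magnitudes per block before and after quantization. This is a sketch under assumptions (synthetic Gaussian weights, 4-bit symmetric codes, block size 16), not the paper's analysis: with only 8 possible code magnitudes for 16 weights, ties are unavoidable, so a magnitude-based pruner applied afterwards can no longer distinguish the tied weights by size.

```python
import numpy as np

bits, block = 4, 16
qmax = 2 ** (bits - 1) - 1                       # largest code magnitude (7 for 4 bits)

rng = np.random.default_rng(0)
blocks = rng.normal(size=(64, block))            # synthetic weights, one row per block

# Max-scaled quantization: each block's largest magnitude maps to qmax.
scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
codes = np.round(blocks / scale).astype(int)

# Number of distinct magnitudes available to a magnitude-based pruner.
distinct_before = np.array([np.unique(np.abs(row)).size for row in blocks])
distinct_after = np.array([np.unique(np.abs(row)).size for row in codes])

print("mean distinct magnitudes per block before:", distinct_before.mean())
print("mean distinct magnitudes per block after: ", distinct_after.mean())
```

Pruning before quantization avoids this because the ranking is computed on the original, unquantized magnitudes.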

Core Entities

Models

OPT (125M, 350M, 1.3B, 6.7B variants), LLaMA (LLaMA-2-7B, LLaMA-3-8B), ViT-B/16, ResNet-50

Metrics

Perplexity, Cross-entropy loss, Accuracy, L2 per-layer output error

Datasets

WikiText2, ImageNet-1k

Benchmarks

Perplexity, Cross-entropy loss, Zero-shot tasks (ARC, HellaSwag, WinoGrande)
