Make one quantized LLM run at many precisions by routing tokens to bit-slices

Overview

Decision Snapshot: Needs Validation

The method is practical: it uses standard PTQ calibration, trains only a small router, and adds custom GPU kernels for speed. The gains depend on hardware that can exploit binary matrix multiplication (BMMA) over bit-planes and on careful per-layer threshold calibration.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang

Links

Abstract / PDF

Why It Matters For Business

MoBiQuant lets one quantized model adapt accuracy vs cost at runtime, avoiding multiple checkpoints and saving memory, calibration time, and potentially cloud inference dollars when hardware supports bit-plane kernels.

Who Should Care

ML Engineer, Engineering Lead, CTO, Data Scientist, Founder

Summary TLDR

MoBiQuant is a post-training quantization (PTQ) framework that lets a single LLM checkpoint run at many effective bit-widths. It slices weights into residual bit-slices and routes each token to the slices it should activate. This mitigates outlier migration, where different tokens become outliers at different bit-widths, and lets the model change its average precision smoothly at runtime. MoBiQuant matches static 4-bit PTQ performance on several LLaMA models, and its GPU kernel achieves up to 2.7× speedup on an A100 for long contexts.

Problem Statement

Static PTQ calibrations overfit to a single bit-width because token-level quantization errors shift when precision changes (outlier migration). That makes switching precisions at runtime brittle and costly. The paper asks: can one checkpoint support many precisions and adapt per token to avoid re-calibration and improve elastic inference?
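One way to check for outlier migration on your own layers is to rank tokens by quantization error at two bit-widths and compare the highest-error sets. The sketch below is a toy diagnostic, not the paper's procedure: the symmetric uniform quantizer, the single linear unit, the random data, and all function names are assumptions made for illustration.

```python
import random

def quantize(w, bits, w_max=1.0):
    """Symmetric uniform quantizer (an assumed stand-in, not the paper's)."""
    levels = 2 ** (bits - 1) - 1
    step = w_max / levels
    q = max(-levels, min(levels, round(w / step)))
    return q * step

def token_errors(tokens, weights, bits):
    """Absolute output error per token for one linear unit."""
    wq = [quantize(w, bits) for w in weights]
    errs = []
    for x in tokens:
        exact = sum(a * b for a, b in zip(x, weights))
        approx = sum(a * b for a, b in zip(x, wq))
        errs.append(abs(exact - approx))
    return errs

def top_k(errs, k):
    """Indices of the k tokens with the largest error."""
    order = sorted(range(len(errs)), key=lambda i: errs[i], reverse=True)
    return set(order[:k])

def outlier_overlap(tokens, weights, bits_a, bits_b, k):
    """Jaccard overlap of the k highest-error tokens at two bit-widths."""
    a = top_k(token_errors(tokens, weights, bits_a), k)
    b = top_k(token_errors(tokens, weights, bits_b), k)
    return len(a & b) / len(a | b)

# Toy data: 64 random "tokens" through a 16-input unit.
rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(16)]
tokens = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(64)]
overlap = outlier_overlap(tokens, weights, bits_a=3, bits_b=4, k=8)
```

An overlap well below 1.0 on real activations would suggest that the tokens dominating the error change with precision, which is the situation a single-bit-width calibration cannot anticipate.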

Main Contribution

MoBiSlice: recursive residual bit-slicing that builds many precisions from one checkpoint by summing MSB plus residual 2-bit slices.

MoBiRoute: a lightweight token router that learns binary slice activation per token and is trained with a budget-aware schedule.
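The slicing idea can be sketched on plain integer codes. This is a minimal illustration that assumes a 6-bit unsigned weight code split into a 2-bit MSB slice plus two 2-bit residual slices; the paper's actual base width, quantizer, and scaling are not reproduced here, and the function names are hypothetical.

```python
def slice_code(code, num_slices=3, slice_bits=2):
    """Split an unsigned integer code into slices, most significant first."""
    mask = (1 << slice_bits) - 1
    return [(code >> (i * slice_bits)) & mask
            for i in reversed(range(num_slices))]

def reconstruct(slices, active, slice_bits=2):
    """Sum the first `active` slices at their original bit positions."""
    value = 0
    for i in range(active):
        shift = (len(slices) - 1 - i) * slice_bits
        value += slices[i] << shift
    return value

code = 0b101101                      # 45, a 6-bit weight code
slices = slice_code(code)            # [2, 3, 1]
approx_2bit = reconstruct(slices, 1) # 32: MSB slice only
approx_4bit = reconstruct(slices, 2) # 44: MSB + one residual slice
approx_6bit = reconstruct(slices, 3) # 45: all slices, exact
```

A router in this picture only has to pick `active` per token; the stored slices never change, which is why one checkpoint can serve several precisions.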

Key Findings

Calibration tuned to one bit-width fails when inference bit-width changes because token outliers migrate.

Numbers: Evaluating a 3-bit-calibrated model at 4-bit raised perplexity by 2.65 relative to a 4-bit-calibrated model.

Practical Use: Don’t rely on a PTQ calibration tuned to a single bit-width if you plan to change precision at runtime; consider token-adaptive schemes.

Evidence Ref: Section 3, Fig. 1

A single MoBiQuant checkpoint matches or slightly improves static 4-bit PTQ on evaluated LLaMA models.

Numbers: LLaMA-3-8B, WikiText2 perplexity at 4-bit: MoBiQuant 7.31 vs. OmniQuant 7.36

Practical Use: You can deploy one elastic checkpoint instead of maintaining a separate checkpoint per bit-width, with comparable accuracy.

Evidence Ref: Table 1 (main results)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity generalization	MoBiQuant retains low PPL across unseen bit-widths; example: LLaMA-3-8B 4-bit PPL 7.31	OmniQuant 4-bit PPL 7.36	-0.05 PPL	WikiText2	Table 1; Fig.4	Table 1, Fig.4
Precision mismatch cost	Calibration-inference mismatch can raise PPL	3-bit calibrated model evaluated at 4-bit	+2.65 PPL	LLaMA3-8B, WikiText2 scenario in Section 3	Section 3, Fig.1	Section 3, Fig.1

What To Try In 7 Days

Run MoBiQuant PTQ on a 7B LLaMA model with 128 calibration sequences from WikiText2 and compare PPL to your current 3/4-bit PTQ.

Measure real latency on your GPU: try bit-major packing and simple token grouping to approximate the kernel gains.

Test router thresholds to trade average bits vs latency and set a live policy for peak vs off-peak load.
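A threshold sweep for the last step could look like the sketch below. It assumes the router emits one score in [0, 1] per residual slice and per token, that a slice is active when its score reaches the threshold, and that each token costs a 2-bit base plus 2 bits per active slice. These conventions and the function names are illustrative assumptions; the sketch also does not enforce that slices activate in order.

```python
def average_bits(scores, threshold, base_bits=2, slice_bits=2):
    """Mean effective bits over tokens for a given gating threshold."""
    total = 0
    for token_scores in scores:
        active = sum(1 for s in token_scores if s >= threshold)
        total += base_bits + slice_bits * active
    return total / len(scores)

def pick_threshold(scores, target_bits, candidates):
    """Lowest candidate threshold whose average bits fit the budget.

    Lower thresholds activate more slices, so the first fit keeps the
    most precision the budget allows. Returns None if nothing fits.
    """
    for th in sorted(candidates):
        if average_bits(scores, th) <= target_bits:
            return th
    return None

# Toy router scores: 4 tokens, 2 residual slices each.
scores = [[0.9, 0.2], [0.6, 0.55], [0.3, 0.1], [0.8, 0.7]]
off_peak = pick_threshold(scores, 4.5, [0.25, 0.5, 0.65])  # 0.5
peak = pick_threshold(scores, 4.0, [0.25, 0.5, 0.65])      # 0.65
```

Pairing each candidate threshold with a measured latency on your own GPU gives the table you need for a peak vs. off-peak policy.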

Optimization Features

Token Efficiency

per-token bit assignment and average-bit budgeting

supports fine-grained 2–6 bit trade-offs

Model Optimization

weight-only post-training quantization with residual bit-slices

many-in-one slice composition to build multiple precisions from one checkpoint

token-adaptive routing to choose slices per token

System Optimization

binary matrix multiplication (BMMA) on packed bit-planes

parallel CUDA stream overlap for slice computation

Training Optimization

layer-wise calibration

router temperature annealing

budget-aware regularization to reach average bit targets
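Temperature annealing and a budget term are commonly combined as in the sketch below: soft sigmoid gates sharpen toward binary decisions as the temperature drops, and a penalty pulls the expected average bits toward a target. The exponential schedule, the squared-gap penalty, the 2-bit base, and the function names are all assumptions here; the paper's exact loss and schedule are not given in this summary.

```python
import math

def temperature(step, total_steps, t_start=1.0, t_end=0.1):
    """Exponential decay from t_start to t_end (an assumed schedule)."""
    frac = step / max(1, total_steps - 1)
    return t_start * (t_end / t_start) ** frac

def soft_gate(logit, temp):
    """Numerically stable sigmoid(logit / temp)."""
    z = logit / temp
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def budget_penalty(logits, temp, target_bits, base_bits=2, slice_bits=2):
    """Squared gap between expected average bits and the target."""
    expected = 0.0
    for token_logits in logits:
        gates = sum(soft_gate(l, temp) for l in token_logits)
        expected += base_bits + slice_bits * gates
    expected /= len(logits)
    return (expected - target_bits) ** 2

# Router logits for 2 tokens and 2 residual slices (toy values).
logits = [[2.0, -1.0], [0.5, -3.0]]
for step in range(5):
    t = temperature(step, 5)
    loss_term = budget_penalty(logits, t, target_bits=4.0)
```

In training, this term would be added to the calibration loss with a weighting coefficient, which is another hyperparameter this sketch leaves out.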

Inference Optimization

binary gating per token to control effective bits

bit-major packing and on-demand bit-plane loading

bit-slice-major token permutation for coalesced memory
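Bit-major packing and on-demand plane loading can be illustrated in plain Python: a dot product over integer codes decomposes into AND plus popcount over pairs of bit-planes, and skipping the low weight planes yields a lower-precision result from the same packed data. This is a toy under stated assumptions (unsigned codes, integer activations used only to keep the arithmetic exact, hypothetical function names); it is not the paper's CUDA kernel.

```python
def pack_planes(values, num_bits):
    """Bit-major packing: one packed word per bit position, MSB first.

    Bit i of each packed word holds that bit position of values[i].
    """
    planes = []
    for b in reversed(range(num_bits)):
        word = 0
        for i, v in enumerate(values):
            word |= ((v >> b) & 1) << i
        planes.append((b, word))
    return planes

def bitplane_dot(w_planes, x_planes, active_w_planes=None):
    """Dot product via AND + popcount over pairs of bit-planes.

    Only the first `active_w_planes` weight planes are read, which
    mimics loading planes on demand for a lower effective precision.
    """
    if active_w_planes is None:
        active_w_planes = len(w_planes)
    total = 0
    for wb, w_word in w_planes[:active_w_planes]:
        for xb, x_word in x_planes:
            total += (1 << (wb + xb)) * bin(w_word & x_word).count("1")
    return total

weights = [5, 3, 6, 1]   # 3-bit unsigned weight codes
acts = [2, 1, 3, 0]      # 2-bit unsigned activations (toy values)
w_planes = pack_planes(weights, 3)
x_planes = pack_planes(acts, 2)
full = bitplane_dot(w_planes, x_planes)                     # 31
top2 = bitplane_dot(w_planes, x_planes, active_w_planes=2)  # 28
```

Here `full` equals the ordinary dot product, and `top2` equals the dot product with each weight truncated to its top two bits, without touching the lowest plane.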

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Requires a calibration phase (128 sequences used in experiments), so fully zero-cost deployment is not shown

Kernel speedups rely on bit-plane BMMA and memory layout changes not available on all stacks

When Not To Use

When your runtime GPU/stack cannot support bit-plane packed BMMA or custom CUDA streams

When you cannot afford any calibration budget or cannot run layer-wise calibration

Failure Modes

Router misrouting under distribution shift causing under-allocation of bits to important tokens

Interaction with LET (learnable transforms) can destabilize router inputs unless the transforms are undone before routing

Core Entities

Models

LLaMA2-7B, LLaMA2-13B, LLaMA3-8B, LLaMA3.2-1B, LLaMA3.2-3B

Metrics

perplexity (PPL), accuracy, inference latency / speedup

Datasets

WikiText2, C4, PTB

Benchmarks

WikiText2 perplexity

zero-shot commonsense tasks (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge)
