Make one quantized LLM run at many precisions by routing tokens to bit-slices

February 21, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it uses standard PTQ calibration, trains only a small router, and adds custom kernels for speed. Gains depend on hardware that can exploit bit-plane BMMA and on careful per-layer threshold calibration.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang

Why It Matters For Business

MoBiQuant lets one quantized model adapt accuracy vs cost at runtime, avoiding multiple checkpoints and saving memory, calibration time, and potentially cloud inference dollars when hardware supports bit-plane kernels.

Summary TLDR

MoBiQuant is a post-training quantization (PTQ) framework that lets a single LLM checkpoint run at many effective bit-widths. It slices weights into residual bit-slices and routes each token to a dynamically chosen subset of slices. This mitigates precision-dependent outlier migration, where the tokens that behave as outliers change with the bit-width, and lets the model change its average precision smoothly at runtime. The single checkpoint matches static 4-bit PTQ performance on several LLaMA models, and a custom GPU kernel reaches up to 2.7× speedup on an A100 for long contexts.

Problem Statement

Static PTQ calibrations overfit to a single bit-width because token-level quantization errors shift when precision changes (outlier migration). That makes switching precisions at runtime brittle and costly. The paper asks: can one checkpoint support many precisions and adapt per token to avoid re-calibration and improve elastic inference?

Main Contribution

MoBiSlice: recursive residual bit-slicing that builds many precisions from one checkpoint by summing MSB plus residual 2-bit slices.

MoBiRoute: a lightweight token router that learns binary slice activation per token and is trained with a budget-aware schedule.
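
The slicing idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a symmetric uniform quantizer with a single per-tensor step, a 2-bit MSB slice, and two 2-bit residual slices whose step shrinks by 4× each time. The helper names (quantize, residual_bit_slices, reconstruct) are hypothetical.

```python
import random


def quantize(values, bits, step):
    """Round each value to the nearest signed `bits`-bit integer code at `step`."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(lo, min(hi, round(v / step))) for v in values]


def residual_bit_slices(weights, msb_bits=2, residual_slices=2, slice_bits=2):
    """Split weights into one MSB slice plus residual slices (assumed layout)."""
    step = max(abs(w) for w in weights) / 2 ** (msb_bits - 1)
    residual = list(weights)
    slices = []
    for i in range(1 + residual_slices):
        bits = msb_bits if i == 0 else slice_bits
        codes = quantize(residual, bits, step)
        slices.append((codes, step))
        # Whatever this slice failed to capture is handed to the next slice.
        residual = [r - c * step for r, c in zip(residual, codes)]
        step /= 2 ** slice_bits  # each extra slice refines the grid by slice_bits
    return slices


def reconstruct(slices, num_active):
    """Sum the first `num_active` slices to rebuild weights at that precision."""
    out = [0.0] * len(slices[0][0])
    for codes, step in slices[:num_active]:
        out = [o + c * step for o, c in zip(out, codes)]
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]
slices = residual_bit_slices(weights)
errors = [mean_abs_error(weights, reconstruct(slices, k)) for k in (1, 2, 3)]
for k, err in zip((1, 2, 3), errors):
    print(f"{k} slice(s), ~{2 * k} bits: mean abs error {err:.4f}")
```

Summing one, two, or three slices gives roughly 2-, 4-, and 6-bit reconstructions from the same stored codes, and the mean absolute error shrinks as slices are added. A token router would then decide, per token, how many of these slices to activate.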

Key Findings

Calibration tuned to one bit-width fails when inference bit-width changes because token outliers migrate.

Numbers: a model calibrated at 3-bit and evaluated at 4-bit scored +2.65 perplexity vs. a model calibrated at 4-bit

Practical Use: Don't rely on a PTQ calibration tuned to a single bit-width if you plan to change precision; consider token-adaptive schemes.

Evidence Ref: Section 3, Fig. 1

A single MoBiQuant checkpoint matches or slightly improves static 4-bit PTQ on evaluated LLaMA models.

Numbers: LLaMA-3-8B, WikiText2 PPL at 4-bit: MoBiQuant 7.31 vs. OmniQuant 7.36

Practical Use: You can deploy one elastic checkpoint instead of maintaining separate per-bit checkpoints and retain comparable accuracy.

Evidence Ref: Table 1 (main results)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence Ref
Perplexity generalization | MoBiQuant retains low PPL across unseen bit-widths; example: LLaMA-3-8B 4-bit PPL 7.31 | OmniQuant 4-bit PPL 7.36 | -0.05 PPL | WikiText2 | Table 1, Fig. 4
Precision mismatch cost | Calibration-inference mismatch can raise PPL | 3-bit calibrated model evaluated at 4-bit | +2.65 PPL | LLaMA3-8B, WikiText2 (scenario in Section 3) | Section 3, Fig. 1

What To Try In 7 Days

Run MoBiQuant PTQ on a 7B LLaMA model with 128 calibration sequences from WikiText2 and compare PPL to your current 3/4-bit PTQ.

Measure real latency on your GPU: try bit-major packing and simple token grouping to approximate the kernel gains.

Test router thresholds to trade average bits vs latency and set a live policy for peak vs off-peak load.
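
A threshold policy of this kind can be prototyped offline before touching the serving stack. The sketch below assumes a hypothetical router that emits one score in [0, 1) per residual slice per token, that slices are gated independently, and that a 2-bit MSB slice is always active. The scores are random placeholders, not real router outputs, and the function names are illustrative.

```python
import random

MSB_BITS, SLICE_BITS = 2, 2  # assumed slice widths, not taken from the paper


def average_bits(scores, threshold):
    """Average bits per token when residual slices scoring >= threshold are active."""
    total = 0
    for token_scores in scores:
        active = sum(1 for s in token_scores if s >= threshold)
        total += MSB_BITS + SLICE_BITS * active
    return total / len(scores)


def pick_threshold(scores, budget_bits, candidates):
    """Return the lowest candidate threshold whose average bits fit the budget."""
    for t in sorted(candidates):
        if average_bits(scores, t) <= budget_bits:
            return t
    return None  # no candidate meets the budget


random.seed(0)
# Placeholder scores: 1000 tokens, two residual slices each.
scores = [[random.random(), random.random()] for _ in range(1000)]
candidates = [i / 20 for i in range(21)]

peak_threshold = pick_threshold(scores, budget_bits=3.0, candidates=candidates)
off_peak_threshold = pick_threshold(scores, budget_bits=5.0, candidates=candidates)
print("peak:", peak_threshold, average_bits(scores, peak_threshold))
print("off-peak:", off_peak_threshold, average_bits(scores, off_peak_threshold))
```

Because raising the threshold can only deactivate slices, average bits fall monotonically with the threshold, so a tighter peak-load budget always maps to a threshold at least as high as the off-peak one. Latency still has to be measured on the target GPU; average bits are only a proxy.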

Optimization Features

Token Efficiency: per-token bit assignment and average-bit budgeting; supports fine-grained 2–6 bit trade-offs
Model Optimization: weight-only post-training quantization with residual bit-slices; many-in-one slice composition to build multiple precisions from one checkpoint; token-adaptive routing to choose slices per token
System Optimization: binary matrix multiplication (BMMA) on packed bit-planes; parallel CUDA stream overlap for slice computation
Training Optimization: layer-wise calibration; router temperature annealing; budget-aware regularization to reach average bit targets
Inference Optimization: binary gating per token to control effective bits; bit-major packing and on-demand bit-plane loading; bit-slice-major token permutation for coalesced memory
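
Bit-major packing and on-demand loading rest on a simple identity: an integer weight code is a weighted sum of its binary bit-planes, so a dot product can be accumulated plane by plane and stopped early for lower precision. The sketch below illustrates that identity in plain Python with unsigned 4-bit codes. It is not the GPU kernel, and the helper names are hypothetical.

```python
def to_bit_planes(codes, bits):
    """Pack unsigned integer codes into one bitmask per bit position, MSB first."""
    planes = []
    for k in reversed(range(bits)):
        mask = 0
        for i, c in enumerate(codes):
            mask |= ((c >> k) & 1) << i  # bit i of the mask holds bit k of code i
        planes.append((k, mask))
    return planes


def plane_dot(mask, x):
    """Dot product of one packed binary plane with an activation vector."""
    return sum(xi for i, xi in enumerate(x) if (mask >> i) & 1)


def dot_from_planes(planes, x, num_planes):
    """Accumulate only the first `num_planes` (most significant) planes."""
    return sum((1 << k) * plane_dot(mask, x) for k, mask in planes[:num_planes])


codes = [13, 2, 7, 10]        # 4-bit unsigned weight codes for one output row
x = [1.0, -2.0, 0.5, 3.0]     # activations
planes = to_bit_planes(codes, bits=4)

full = dot_from_planes(planes, x, num_planes=4)    # all four planes loaded
coarse = dot_from_planes(planes, x, num_planes=2)  # only the top two planes
reference = sum(c * xi for c, xi in zip(codes, x))
print(full, coarse, reference)
```

With all planes loaded the result equals the ordinary integer dot product. Loading only the top two planes is equivalent to zeroing the two low bits of every code, which is why storing weights plane-by-plane lets a kernel skip memory traffic for planes a token does not need.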

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Requires a calibration phase (128 sequences used in experiments), so fully zero-cost deployment is not shown

Kernel speedups rely on bit-plane BMMA and memory layout changes not available on all stacks

When Not To Use

If your runtime GPU/stack cannot support bit-plane packed BMMA or custom CUDA streams

When you cannot afford any calibration budget or cannot run layer-wise calibration

Failure Modes

Router misrouting under distribution shift causing under-allocation of bits to important tokens

Interaction with LET (learnable transforms) can destabilize router inputs unless the transform is undone before routing

Core Entities

Models

LLaMA2-7B, LLaMA2-13B, LLaMA3-8B, LLaMA3.2-1B, LLaMA3.2-3B

Metrics

perplexity (PPL), accuracy, inference latency / speedup

Datasets

WikiText2, C4, PTB

Benchmarks

WikiText2 perplexity; zero-shot commonsense tasks (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge)