Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
5
Why It Matters For Business
Choosing INT or FP per layer at a fixed bit-width can preserve model accuracy while cutting model size and quantization time. It fits current hardware that supports both low-bit INT and FP operations, which reduces deployment cost.
Summary TLDR
The paper compares low-bit integer (INT) and floating-point (FP) formats for LLM quantization and finds that no single format is best across layers. It proposes Mixture of Formats Quantization (MoFQ), which picks INT or FP for each layer at the same bit-width. MoFQ achieves state-of-the-art post-training quantization (PTQ) results on LLaMA/OPT benchmarks: accuracy similar to or better than prior methods for 4-bit weight-only quantization, much faster quantization than GPTQ, and near full-precision accuracy for 8-bit weight+activation quantization. The method is simple, hardware-friendly, and leaves the bit-width unchanged.
Problem Statement
Low-bit quantization is needed to shrink LLM size and cost, but it is unclear whether integer (INT) or low-bit floating point (FP) formats work better. Tensors and layers vary in distribution, so a one-format-fits-all choice may be suboptimal. Practitioners need a fast, practical rule to pick formats that keeps accuracy and reduces inference cost.
Main Contribution
Comparative analysis of INT vs FP formats across bit widths, hardware cost, and quantization error on LLaMA/OPT tensors
MoFQ: a simple layer-wise format selector that picks INT or FP (same bit-width) using a chosen error metric
Practical tricks: redesign FP4 by reclaiming NaN/Inf encodings to improve 4-bit FP accuracy
Empirical results: MoFQ yields SOTA post-training quantization (W-only 4-bit and WA 8-bit) and much faster quantization than GPTQ
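The layer-wise selector described above can be illustrated with a minimal sketch. It is not the authors' implementation: the grids, the per-tensor max-abs scaling, and the names `FORMATS`, `quantize`, and `select_formats` are assumptions made for illustration, and tensor MSE stands in for "a chosen error metric". The paper's per-channel scaling is omitted to keep the example short.

```python
# Hypothetical sketch of per-layer INT4 vs FP4 selection by tensor MSE.
# Assumed FP4 grid: E2M1 layout with every encoding used for a number.
FP4_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FORMATS = {
    "INT4": [float(v) for v in range(-7, 8)],  # symmetric integer grid (assumption)
    "FP4": sorted({s * v for v in FP4_POS for s in (1.0, -1.0)}),
}


def quantize(values, grid):
    """Scale so max |x| maps to the grid maximum, then round to the nearest grid point."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / max(grid)
    return [scale * min(grid, key=lambda g: abs(v / scale - g)) for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_formats(layers):
    """Return {layer_name: format_name}, choosing the lowest-MSE format per layer."""
    choice = {}
    for name, weights in layers.items():
        errors = {fmt: mse(weights, quantize(weights, grid))
                  for fmt, grid in FORMATS.items()}
        choice[name] = min(errors, key=errors.get)
    return choice


# Toy "layers": one evenly spaced tensor, one with a few large outliers.
layers = {
    "evenly_spaced": [i / 7 for i in range(-7, 8)],
    "heavy_tailed": [0.5, -0.5, 1.0, 1.5, -2.0, 3.0, 4.0, -6.0],
}
print(select_formats(layers))
```

On these toy tensors the evenly spaced layer is assigned INT4 and the heavy-tailed layer FP4, which mirrors the finding that neither format wins on every layer.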
Key Findings
No single format (INT or FP) dominates across layers and bit widths.
MoFQ (per-layer INT/FP choice) matches or improves accuracy vs prior PTQ on 4-bit W-only quantization.
MoFQ and FP-based methods are much faster than GPTQ for quantization runtime.
For 8-bit weight+activation (WA) quantization, FP8 often outperforms INT8; MoFQ8 gets close to full precision.
Reallocating the NaN/Inf encodings in FP4 to ordinary values increases the number of representable numbers and reduces quantization error.
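To make the FP4 reallocation concrete, the sketch below enumerates the values of a 4-bit float with and without reserved NaN/Inf encodings. The layout is an assumption (1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1, IEEE-style subnormals); this summary does not state the exact FP4 layout the paper uses, and `e2m1_values` is a hypothetical helper.

```python
def e2m1_values(reserve_top_exponent):
    """List the distinct values of an assumed E2M1 4-bit float.

    If reserve_top_exponent is True, the largest exponent field is kept for
    Inf/NaN (IEEE-style). If False, those encodings hold ordinary numbers.
    """
    bias = 1
    values = set()
    for exp in range(4):        # 2 exponent bits
        if reserve_top_exponent and exp == 3:
            continue            # encodings spent on Inf/NaN
        for man in range(2):    # 1 mantissa bit
            if exp == 0:        # subnormal: no implicit leading 1
                mag = (man / 2) * 2.0 ** (1 - bias)
            else:               # normal: implicit leading 1
                mag = (1 + man / 2) * 2.0 ** (exp - bias)
            values.add(mag)
            values.add(-mag)    # -0.0 and 0.0 collapse into one set entry
    return sorted(values)


ieee_style = e2m1_values(reserve_top_exponent=True)
reclaimed = e2m1_values(reserve_top_exponent=False)
print(len(ieee_style), max(ieee_style))
print(len(reclaimed), max(reclaimed))
```

Under these assumptions the IEEE-style variant has 11 distinct values with a maximum magnitude of 3, while the reclaimed variant has 15 with a maximum magnitude of 6. At 4 bits, reserving one exponent code costs a quarter of all encodings, which is why reclaiming them matters.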
Results
WikiText-2 perplexity (LLaMA-65B, W-only 4-bit)
Quantization time (LLaMA-65B, W-only 4-bit)
WikiText-2 perplexity (LLaMA-13B, WA 8-bit)
FP4 redesign quant error
Who Should Care
What To Try In 7 Days
Run per-layer MSE-based format selection (MoFQ) on one LLM checkpoint using existing PPQ/GPTQ tools
Benchmark FP4 weight-only quantization vs INT4(GPTQ) on a single downstream task to compare runtime and accuracy
If using 4-bit FP weights in software, use the FP4 NaN/Inf reallocation trick to reduce tensor error
Optimization Features
Infra Optimization
- targets hardware that supports FP8/INT8 (e.g., NVIDIA H100)
Model Optimization
- post-training quantization (PTQ)
- per-channel weight quantization
- layer-wise format selection (INT vs FP)
System Optimization
- keeps the same bit-width across layers (only the format varies) to avoid hardware changes
- FP4 redesign reclaims NaN/Inf for better representable range
Inference Optimization
- W8A8 (8-bit weights and activations) enables low-bit matrix multiplication
- W-only 4-bit for memory footprint reduction
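As a rough illustration of why W8A8 enables low-bit matrix multiplies, the sketch below quantizes both weights and activations to INT8, accumulates in integers, and rescales once per output. It covers only the INT8 case, not FP8 or the MoFQ8 selection, and the symmetric max-abs scaling and the helper names `quantize_int8` and `w8a8_matvec` are assumptions.

```python
def quantize_int8(values):
    """Symmetric max-abs quantization to integers in [-127, 127]; returns (ints, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    ints = [max(-127, min(127, round(v / scale))) for v in values]
    return ints, scale


def w8a8_matvec(weight_rows, activation):
    """Matrix-vector product with INT8 weights (one scale per row) and INT8 activations."""
    a_q, a_scale = quantize_int8(activation)          # per-tensor activation scale
    outputs = []
    for row in weight_rows:
        w_q, w_scale = quantize_int8(row)             # per-output-channel weight scale
        acc = sum(w * a for w, a in zip(w_q, a_q))    # integer multiply-accumulate
        outputs.append(acc * w_scale * a_scale)       # single rescale back to float
    return outputs


weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
activation = [1.0, 2.0, -1.5]
reference = [sum(w * a for w, a in zip(row, activation)) for row in weights]
approx = w8a8_matvec(weights, activation)
print(reference, approx)
```

The inner loop touches only integers, which is the part that low-bit hardware units accelerate. Weight-only 4-bit quantization, by contrast, reduces memory footprint but still computes in higher precision after dequantization.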
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Analysis and results are empirical; no theoretical guarantees provided
- MoFQ does not always outperform uniform FP4 or INT4; the per-layer selection can be imperfect
- WA MoFQ needs hardware support for both low-bit INT and FP operations
- FP4 NaN/Inf reallocation is a software trick and may break hardware-standard compatibility
When Not To Use
- Target hardware supports only low-bit INT operations, not FP (relevant to the WA case)
- You need strict IEEE FP behavior or hardware-validated FP formats
- You require proven theoretical guarantees rather than empirical improvements
Failure Modes
- Per-layer metric (e.g., tensor MSE) may mispredict the format that best preserves end-task accuracy
- FP4 redesign might be incompatible with hardware inference paths, producing incorrect runtime behavior
- Mixed-format inference requires hardware/driver support; otherwise latency or correctness issues may arise
Core Entities
Models
- LLaMA-7B
- LLaMA-13B
- LLaMA-33B
- LLaMA-65B
- OPT-350M
- OPT-1.3B
- OPT-2.7B
- OPT-6.7B
- OPT-13B
- OPT-30B
Metrics
- MSE (tensor/layer)
- perplexity
- accuracy
- quantization runtime (s)
- noise-signal power ratio
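For reference, two of the tensor-level metrics listed above can be computed as follows. The definitions are the common ones (mean squared error, and error power divided by signal power) and are assumed rather than taken from the paper; the function names are hypothetical.

```python
def tensor_mse(original, quantized):
    """Mean squared error between a tensor and its quantized version."""
    return sum((x - q) ** 2 for x, q in zip(original, quantized)) / len(original)


def noise_signal_power_ratio(original, quantized):
    """Quantization-noise power divided by signal power (lower is better)."""
    noise = sum((x - q) ** 2 for x, q in zip(original, quantized))
    signal = sum(x * x for x in original)
    if signal == 0:
        raise ValueError("signal power is zero; ratio is undefined")
    return noise / signal


original = [1.0, -2.0, 2.0]
quantized = [1.0, -1.0, 2.0]
print(tensor_mse(original, quantized))                # 1/3
print(noise_signal_power_ratio(original, quantized))  # 1/9
```

Unlike raw MSE, the ratio is normalized by tensor magnitude, so it can be compared across layers with different weight scales.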
Datasets
- WikiText-2
- LAMBADA
- PIQA
- HellaSwag
Benchmarks
- perplexity (WikiText-2)
- accuracy (LAMBADA, PIQA, HellaSwag)

