Pick INT or FP per layer: mixing low-bit formats (MoFQ) improves LLM quantization and speed

May 21, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

5

Authors

Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu

Links

Abstract / PDF

Why It Matters For Business

Choosing INT or FP per layer at a fixed bit-width can preserve model accuracy while cutting model size and quantization time. It fits current hardware that already supports both low-bit INT and FP operations, which reduces deployment cost.

Summary TLDR

The paper compares low-bit integer (INT) and floating-point (FP) formats for LLM quantization and finds no single best format across layers. It proposes Mixture of Formats Quantization (MoFQ): pick INT or FP per layer at the same bit-width. MoFQ gives SOTA PTQ results: similar or better accuracy than prior methods for 4-bit weight-only quantization, much faster quantization time than GPTQ, and near full‑precision accuracy for 8-bit weight+activation quantization on LLaMA/OPT benchmarks. The method is simple, hardware-friendly, and works without changing bit-width.

Problem Statement

Low-bit quantization is needed to shrink LLM size and cost, but it is unclear whether integer (INT) or low-bit floating point (FP) formats work better. Tensors and layers vary in distribution, so a one-format-fits-all choice may be suboptimal. Practitioners need a fast, practical rule to pick formats that keeps accuracy and reduces inference cost.

Main Contribution

Comparative analysis of INT vs FP formats across bit widths, hardware cost, and quantization error on LLaMA/OPT tensors

MoFQ: a simple layer-wise format selector that picks INT or FP (same bit-width) using a chosen error metric

Practical tricks: redesign FP4 by reclaiming NaN/Inf encodings to improve 4-bit FP accuracy

Empirical results: MoFQ yields SOTA post-training quantization (W-only 4-bit and WA 8-bit) and much faster quantization than GPTQ
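The layer-wise selector described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the grids, the max-abs scaling rule, the per-tensor (rather than per-channel) scale, and the names `quantize`, `select_format`, and `mofq_plan` are all assumptions made for the example. Tensor MSE stands in for the "chosen error metric".

```python
# Illustrative sketch only: grids, scaling rule, and function names are assumptions.

# Magnitudes of an assumed 4-bit FP grid (1 sign, 2 exponent, 1 mantissa bits)
# with every magnitude code usable as a number.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Symmetric INT4 magnitudes 0..7 (the -8 code is left unused for simplicity).
INT4_GRID = [float(i) for i in range(8)]


def quantize(values, grid):
    """Round each value to the nearest signed grid point after max-abs scaling."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / grid[-1]  # one scale per tensor (a simplification)
    out = []
    for v in values:
        # Nearest grid magnitude; ties go to the smaller magnitude.
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_format(weights):
    """Return (format_name, error) for the format with the lower tensor MSE."""
    errors = {
        "INT4": mse(weights, quantize(weights, INT4_GRID)),
        "FP4": mse(weights, quantize(weights, FP4_GRID)),
    }
    best = min(errors, key=errors.get)
    return best, errors[best]


def mofq_plan(layers):
    """Map each layer name to its chosen format; bit-width stays the same."""
    return {name: select_format(w)[0] for name, w in layers.items()}


if __name__ == "__main__":
    layers = {
        "evenly_spread": [float(i) for i in range(-7, 8)],
        "dense_near_zero": [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, -0.5, -6.0],
    }
    print(mofq_plan(layers))
```

In this toy setup, evenly spaced weights favor the uniform INT grid, while weights that cluster near zero with a few large values favor the FP grid. That is the intuition behind choosing a format per layer.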

Key Findings

No single format (INT or FP) dominates across layers and bit widths.

Numbers: Weight tensors: INT8 gives lower MSE than FP8; at 4-bit there is no consistent winner (Figures 4 and 6).

MoFQ (per-layer INT/FP choice) matches or improves accuracy vs prior PTQ on 4-bit W-only quantization.

Numbers: LLaMA-65B WikiText-2 perplexity: INT4 (GPTQ) = 3.85, MoFQ4 = 3.78 (lower is better).

MoFQ and FP-based methods are much faster than GPTQ for quantization runtime.

Numbers: LLaMA-65B quantization time: GPTQ (INT4) = 4684 s, FP4 = 36 s (≈130x faster), MoFQ4 = 319 s (≈14.7x faster).

For 8-bit weight+activation (WA) quantization, FP8 often outperforms INT8; MoFQ8 gets close to full precision.

Numbers: LLaMA-13B WikiText-2 perplexity: FP16 = 5.09, INT8 = 637.95, FP8 = 5.64, MoFQ8 = 5.41.

Reallocating NaN/Inf in FP4 improves representable numbers and reduces quantization error.

Numbers: Redesigned FP4 gives ~35% lower tensor quantization error vs IEEE-aligned FP4.
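To see why reclaiming the NaN/Inf encodings helps, it is useful to enumerate the values a 4-bit float can represent. The sketch below assumes a layout of 1 sign bit, 2 exponent bits, and 1 mantissa bit with an exponent bias of 1. This summary does not state the paper's exact layout, so treat those parameters and the function name `fp4_magnitudes` as assumptions.

```python
def fp4_magnitudes(reclaim_special, exp_bits=2, man_bits=1, bias=1):
    """List distinct non-negative values of a tiny sign/exponent/mantissa format.

    Assumed layout (hypothetical, for illustration): IEEE-style subnormals at
    exponent 0, and the all-ones exponent reserved for Inf/NaN unless
    reclaim_special is True.
    """
    top_exp = (1 << exp_bits) - 1
    values = set()
    for exp in range(top_exp + 1):
        for man in range(1 << man_bits):
            frac = man / (1 << man_bits)
            if exp == 0:
                # Subnormal: no implicit leading 1.
                values.add(frac * 2.0 ** (1 - bias))
            elif exp == top_exp and not reclaim_special:
                # IEEE-aligned variant: these codes encode Inf/NaN, not numbers.
                continue
            else:
                # Normal number with an implicit leading 1.
                values.add((1.0 + frac) * 2.0 ** (exp - bias))
    return sorted(values)


if __name__ == "__main__":
    print("IEEE-aligned:", fp4_magnitudes(reclaim_special=False))
    print("Reclaimed:   ", fp4_magnitudes(reclaim_special=True))
```

Under these assumptions, the IEEE-aligned variant spends 4 of its 16 codes on Inf/NaN, and the reclaimed variant turns them into two extra magnitudes per sign. The sketch only shows the change in representable values; it does not reproduce the reported ~35% error reduction.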

Results

WikiText-2 perplexity (LLaMA-65B, W-only 4-bit)

Value: MoFQ4 = 3.78

Baseline: INT4 (GPTQ) = 3.85; FP16 = 3.53

Quantization time (LLaMA-65B, W-only 4-bit)

Value: FP4 = 36 s; MoFQ4 = 319 s

Baseline: INT4 (GPTQ) = 4684 s

WikiText-2 perplexity (LLaMA-13B, WA 8-bit)

Value: MoFQ8 = 5.41

Baseline: FP16 = 5.09; FP8 = 5.64; INT8 = 637.95

FP4 redesign quant error

Value: ≈35% lower error

Baseline: IEEE-aligned FP4
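The speedup factors and the perplexity gap to FP16 follow directly from the reported LLaMA-65B numbers. The short script below recomputes them; the variable names are illustrative and the inputs are only the figures listed above.

```python
# Reported LLaMA-65B, W-only 4-bit figures from the results above.
quant_time_s = {"GPTQ (INT4)": 4684, "FP4": 36, "MoFQ4": 319}
perplexity = {"FP16": 3.53, "GPTQ (INT4)": 3.85, "MoFQ4": 3.78}

# Speedup relative to GPTQ quantization time.
baseline_time = quant_time_s["GPTQ (INT4)"]
speedup = {name: baseline_time / t for name, t in quant_time_s.items()}

# Relative perplexity increase over the FP16 model, in percent.
fp16_ppl = perplexity["FP16"]
gap_pct = {name: 100.0 * (p - fp16_ppl) / fp16_ppl for name, p in perplexity.items()}

if __name__ == "__main__":
    for name, s in speedup.items():
        print(f"{name}: {s:.1f}x vs GPTQ")
    for name, g in gap_pct.items():
        print(f"{name}: +{g:.1f}% perplexity vs FP16")
```

On these numbers, MoFQ4 gives up roughly an order of magnitude of FP4's speed advantage in exchange for a perplexity slightly below GPTQ's.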

Who Should Care

What To Try In 7 Days

Run per-layer MSE-based format selection (MoFQ) on one LLM checkpoint using existing PPQ/GPTQ tools

Benchmark FP4 weight-only quantization vs INT4(GPTQ) on a single downstream task to compare runtime and accuracy

If using 4-bit FP weights in software, use the FP4 NaN/Inf reallocation trick to reduce tensor error

Optimization Features

Infra Optimization

  • targets hardware that supports FP8/INT8 (e.g., NVIDIA H100)

Model Optimization

  • post-training quantization (PTQ)
  • per-channel weight quantization
  • layer-wise format selection (INT vs FP)

System Optimization

  • keeps uniform bit-width per layer to avoid hardware changes
  • FP4 redesign reclaims NaN/Inf encodings so more values are representable

Inference Optimization

  • W8A8 enabling low-bit matrix multiplies
  • W-only 4-bit for memory footprint reduction
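As a rough sense of scale for the weight-only 4-bit case, the estimate below computes raw weight storage from parameter count and bit-width. It assumes the nominal parameter count implied by the model name (65 billion for LLaMA-65B) and ignores scale factors, any tensors kept in higher precision, and runtime buffers, so real footprints will be somewhat larger. The function name `weight_gb` is hypothetical.

```python
def weight_gb(num_params, bits_per_weight):
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes).

    Ignores quantization metadata (scales) and non-quantized tensors.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 65e9  # nominal count taken from the model name; an assumption
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_gb(params, bits):.1f} GB")
```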

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Analysis and results are empirical; no theoretical guarantees provided
  • MoFQ does not always outperform pure FP4 or pure INT4; per-layer selection can be imperfect
  • WA MoFQ needs hardware support for both low-bit INT and FP operations
  • FP4 NaN/Inf reallocation is a software trick and may break hardware-standard compatibility

When Not To Use

  • Target hardware only supports INT low-bit and not FP (WA case)
  • You need strict IEEE FP behavior or hardware-validated FP formats
  • You require proven theoretical guarantees rather than empirical improvements

Failure Modes

  • Per-layer metric (e.g., tensor MSE) may mispredict the format that best preserves end-task accuracy
  • FP4 redesign might be incompatible with hardware inference paths, producing incorrect runtime behavior
  • Mixed-format inference requires hardware/driver support; otherwise latency or correctness issues may arise

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-33B
  • LLaMA-65B
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B
  • OPT-30B

Metrics

  • MSE (tensor/layer)
  • perplexity
  • Accuracy
  • quantization runtime (s)
  • noise-signal power ratio

Datasets

  • WikiText-2
  • LAMBADA
  • PIQA
  • HellaSwag

Benchmarks

  • perplexity (WikiText-2)
  • Accuracy