Microscaling (MX): block-level scales let you run and train models at sub-8-bit with minimal accuracy loss

October 16, 2023 | 7 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

8

Authors

Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, Eric Chung


Why It Matters For Business

Microscaling cuts memory and compute by moving to narrow, block-scaled formats while keeping model quality close to FP32, enabling cheaper inference and denser training without reengineering training recipes.

Summary TLDR

Microscaling (MX) is a family of block-scaled narrow data formats that store one shared scale per small block of elements plus low-bit elements. Across many vision, language, speech, and recommendation benchmarks, the authors show that MXINT8 can replace FP32 for direct-cast inference with almost no accuracy loss. MXFP6 enables the first demonstrations of training generative language models with sub-8-bit weights, activations, and gradients to near-FP32 parity using the same training recipe. Training with MXFP4 weights and MXFP6 activations adds only a small further loss. A PyTorch/CUDA library and an OCP specification are provided.

Problem Statement

Modern large models are costly to run and store. Tensor-level scaling for sub-8-bit formats has limited dynamic range and hurts accuracy. The paper tests a block-level (micro) scaling scheme that aims to preserve model quality while lowering bit-width, and to do so with low integration friction.
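
To make the dynamic-range argument concrete, here is a minimal sketch in plain Python that compares one scale for the whole tensor against one scale per block. It is an illustration under simplifying assumptions, not the paper's conversion algorithm: the helper names (`pow2_scale`, `fake_quant`, `max_abs_error`) are hypothetical, the elements are plain signed 8-bit integers, the scale is restricted to a power of two, and the block size of 32 is taken from the OCP MX specification rather than from this summary.

```python
import math

def pow2_scale(amax, qmax):
    """Smallest power of two s such that amax / s <= qmax (1.0 if amax is 0)."""
    if amax == 0.0:
        return 1.0
    m, e = math.frexp(amax / qmax)      # amax / qmax == m * 2**e, with 0.5 <= m < 1
    return 2.0 ** (e - 1 if m == 0.5 else e)

def fake_quant(values, block_size, bits=8):
    """Quantize to signed ints with one shared scale per block, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = pow2_scale(max(abs(v) for v in block), qmax)
        out.extend(max(-qmax, min(qmax, round(v / scale))) * scale for v in block)
    return out

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# 64 values: the first block holds small values, the second holds one large outlier.
values = [0.001 * (i + 1) for i in range(32)] + [500.0] + [0.5] * 31

per_tensor = fake_quant(values, block_size=len(values))   # one scale for everything
per_block = fake_quant(values, block_size=32)             # one scale per 32 elements

err_tensor = max_abs_error(values[:32], per_tensor[:32])
err_block = max_abs_error(values[:32], per_block[:32])
print(f"error on small values: per-tensor {err_tensor:.6f}, per-block {err_block:.6f}")
```

With a single tensor-wide scale, the outlier forces a coarse step and every small value in the first block rounds to zero. With per-block scales, the first block gets its own fine-grained scale and the outlier only affects the block it lives in.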

Main Contribution

Define and evaluate MX: block-level shared scale plus narrow element types (FP8/FP6/FP4/INT8).

Show MXINT8 is a low-friction, drop-in substitute for FP32 inference on many tasks.

Demonstrate training generative language models with sub-8-bit weights, activations, and gradients (MXFP6) matching FP32 without changing training recipes.

Show mixed precision with 4-bit weights (MXFP4) and 6-bit activations yields only minor loss.

Provide an open PyTorch/CUDA library and reference OCP MX specification for reproduction.

Key Findings

MXINT8 closely matches FP32 for direct-cast inference across many models.

Numbers: GPT3 ARC easy: FP32 0.744 → MXINT8 0.740 (∆ −0.004)

6-bit MX (MXFP6) can train generative language models to near-FP32 loss using the same training recipe.

Numbers: GPT-1.5B training loss: FP32 2.74 → MXFP6 2.75 (∆ +0.01)

Mixed 4-bit weights + 6-bit activations causes only minor extra loss.

Numbers: GPT-1.5B training loss: FP32 2.74 → MXFP4_wt + MXFP6_act 2.76 (∆ +0.02)

Post-training quantization (PTQ) with error diffusion brings 6-bit MX to within a fraction of a point of FP32 accuracy.

Numbers: ResNet-50 Top-1: FP32 77.40 → MXFP6 (PTQ) 77.15 (∆ −0.25)

Results

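Accuracy (GPT3 ARC easy, MXINT8)

Value: 0.740

Baseline: FP32 0.744

Accuracy (ResNet-50 Top-1, MXFP6 with PTQ)

Value: 77.15

Baseline: FP32 77.40

GPT-1.5B training loss (MXFP6)

Value: 2.75

Baseline: FP32 2.74

GPT-1.5B mixed-precision training loss (MXFP4 weights + MXFP6 activations)

Value: 2.76

Baseline: FP32 2.74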

Who Should Care

What To Try In 7 Days

Swap FP32 → MXINT8 for inference on a representative model to measure latency/memory gains.

Run MXFP6 PTQ or short finetune on a vision or translation model to confirm accuracy parity.

Clone the Microscaling PyTorch library and run a direct-cast experiment on a small GPT model.

Optimization Features

Infra Optimization

  • emulation via CUDA extension for current GPUs
  • formats built around a hardware-friendly shared scale stored as an 8-bit power-of-two exponent

Model Optimization

  • per-block (micro) scaling
  • sub-8-bit element formats (FP6/FP4)
  • INT8 block-scaled format (MXINT8)
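
As a rough illustration of how a shared scale combines with a very narrow element type, the sketch below quantizes one block to an MXFP4-like format. It assumes the 4-bit element is an E2M1 float (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and that the shared exponent is chosen as floor(log2(max |v|)) minus the element's largest exponent, which is my reading of the conversion suggested in the OCP MX specification. The function name `quantize_mxfp4_block` is hypothetical, and tie-breaking is simplified compared with a real implementation.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2   # exponent of the largest value: 6.0 == 1.5 * 2**2

def quantize_mxfp4_block(block):
    """Return (shared_exponent, dequantized_values) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    _, e = math.frexp(amax)              # amax == m * 2**e, with 0.5 <= m < 1
    shared_exp = (e - 1) - E2M1_EMAX     # floor(log2(amax)) - emax
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        mag = abs(v) / scale
        # Pick the nearest grid point; magnitudes above 6.0 saturate to 6.0.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))
    return shared_exp, out

exp, deq = quantize_mxfp4_block([6.0, 1.0, 0.3, -3.2])
print(exp, deq)   # 0 [6.0, 1.0, 0.5, -3.0]
```

Only eight magnitudes are available per element, so the shared exponent does most of the work of placing the block's values in range. This also shows why the largest scaled magnitudes can saturate: a block whose maximum scales to between 6 and 8 has that value clamped to 6.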

System Optimization

  • reduced memory footprint via block-level scales
  • reuse of FP32 master weights with quantized compute flow
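
The memory saving can be estimated with simple arithmetic: each element costs its own bits plus an amortized share of the block's scale. The sketch below assumes a block size of 32 and an 8-bit shared scale, which are the values in the OCP MX specification rather than figures stated in this summary. The helper names are hypothetical.

```python
def bits_per_element(elem_bits, block_size=32, scale_bits=8):
    """Average storage cost per element, including the amortized shared scale."""
    return elem_bits + scale_bits / block_size

def tensor_bytes(num_elements, elem_bits, block_size=32, scale_bits=8):
    """Total bytes for a tensor: element payload plus one scale per block."""
    num_blocks = -(-num_elements // block_size)   # ceiling division
    return (num_elements * elem_bits + num_blocks * scale_bits) / 8

formats = {"MXINT8": 8, "MXFP6": 6, "MXFP4": 4}
for name, bits in formats.items():
    ratio = 32 / bits_per_element(bits)
    print(f"{name}: {bits_per_element(bits):.2f} bits/element, {ratio:.2f}x smaller than FP32")
# MXINT8: 8.25 bits/element, 3.88x smaller than FP32
# MXFP6: 6.25 bits/element, 5.12x smaller than FP32
# MXFP4: 4.25 bits/element, 7.53x smaller than FP32
```

Under these assumptions the scale adds only a quarter of a bit per element, which is why block-level scaling is cheap compared with the elements themselves.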

Training Optimization

  • training with quantized weights/activations/gradients (MXFP6)
  • mixed-precision training (MXFP4 weights + MXFP6 activations)
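
The following toy loop sketches the general pattern of quantized training with a high-precision master copy: weights are quantized for the forward computation, the gradient is quantized too, and the update is applied to the unquantized master weights. It is not the paper's training setup. The 6-bit integer quantizer stands in for MXFP6, the loss is a simple quadratic, and all names (`fake_quant`, `master`, `target`) are hypothetical.

```python
import math

def fake_quant(values, bits=6):
    """Round one block to signed ints with a shared power-of-two scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    m, e = math.frexp(amax / qmax)       # amax / qmax == m * 2**e, with 0.5 <= m < 1
    scale = 2.0 ** (e - 1 if m == 0.5 else e)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

target = [0.8, -0.3, 0.05, 0.4]   # weights the toy loss pulls toward
master = [0.0] * len(target)      # high-precision master copy of the weights
lr = 0.1

for _ in range(300):
    w_q = fake_quant(master)                        # quantized weights used for compute
    grad = [w - t for w, t in zip(w_q, target)]     # gradient of 0.5 * sum((w_q - t)**2)
    g_q = fake_quant(grad)                          # quantized gradient
    master = [w - lr * g for w, g in zip(master, g_q)]   # update stays in full precision

final_error = max(abs(w - t) for w, t in zip(fake_quant(master), target))
print(f"largest gap between quantized weights and target: {final_error:.4f}")
```

The master copy accumulates updates that are far smaller than one quantization step, which is what lets the quantized weights settle next to the target instead of stalling at their starting values.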

Inference Optimization

  • direct-cast inference (no calibration)
  • post-training quantization (PTQ) with error diffusion

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Transpose and quantize are non-commutative, requiring separate stored transposed tensors in some flows.
  • Very low-bit variants (MXFP4) can hurt accuracy on some models (e.g., MobileNet v2, some language tasks).
  • Conversion algorithm choices and block axis selection matter and are implementation-defined.
  • Experiments use an emulation CUDA library; native hardware support will affect performance and latency.

When Not To Use

  • On tiny/mobile models that showed large accuracy drops (e.g., MobileNet v2 in some settings).
  • When you cannot afford storing extra transposed tensors or metadata overhead.
  • If your stack lacks backend support and emulation latency outweighs memory/compute gains.

Failure Modes

  • Clamping or overflow if element values exceed representable range; behavior may be implementation-defined.
  • Accuracy regressions for extreme low-bit formats without PTQ or finetuning.
  • Transposing MX tensors can change shared-scale axes and thus values if not handled correctly.
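
The transpose issue is easy to reproduce on a 2x2 matrix. In this sketch each row is treated as one block with its own shared power-of-two scale, and the elements are 4-bit signed integers so the effect is visible. The helper names are hypothetical and the quantizer is a simplified stand-in for an MX format.

```python
import math

def fake_quant_rows(matrix, bits=4):
    """Quantize each row as one block with its own shared power-of-two scale."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        if amax == 0.0:
            out.append(list(row))
            continue
        m, e = math.frexp(amax / qmax)   # amax / qmax == m * 2**e, with 0.5 <= m < 1
        scale = 2.0 ** (e - 1 if m == 0.5 else e)
        out.append([max(-qmax, min(qmax, round(v / scale))) * scale for v in row])
    return out

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

x = [[100.0, 1.0],
     [2.0, 3.0]]

a = transpose(fake_quant_rows(x))   # quantize along rows, then transpose
b = fake_quant_rows(transpose(x))   # transpose, then quantize along the new rows

print(a)   # [[96.0, 2.0], [0.0, 3.0]]
print(b)   # [[96.0, 0.0], [1.0, 3.0]]
```

The two results differ because the large value 100.0 shares a block with 1.0 in one ordering and with 2.0 in the other, so a different small value is rounded to zero each time. A flow that needs both layouts therefore has to quantize and store each one separately.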

Core Entities

Models

  • GPT3-175B
  • LLaMA-7B
  • GPT-1.5B
  • GPT-300M
  • GPT-150M
  • GPT-20M
  • BERT-Base
  • BERT-Large
  • DeiT-Tiny
  • DeiT-Small
  • ResNet-18
  • ResNet-50
  • MobileNet v2
  • Wav2Vec 2.0
  • DLRM

Metrics

  • BLEU
  • Accuracy
  • WER
  • AUC
  • Perplexity / Training loss

Datasets

  • WMT-17
  • WMT-16
  • ImageNet ILSVRC12
  • LibriSpeech
  • Criteo Terabyte
  • Lambada
  • Wikitext-2 (subset)
  • ARC (easy/challenge)
  • Hendrycks Test subset

Benchmarks

  • LM Eval Harness
  • OCP Microscaling Specification