Microscaling (MX): block-level scales let you run and train models at sub-8-bit with minimal accuracy loss

Overview

Decision SnapshotReady For Pilot

Strong empirical results across many models and tasks support practical use, with open code and a spec; caveats exist for some architectures and for transpose/representation edge cases.

Citations8

Evidence Strength0.82

Confidence0.90

Risk Signals10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 70%

Authors

Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, Eric Chung

Links

Abstract / PDF / Code

Why It Matters For Business

Microscaling cuts memory and compute by moving to narrow, block-scaled formats while keeping model quality close to FP32, enabling cheaper inference and denser training without reengineering training recipes.

Who Should Care

ML Engineer Engineering Lead CTO Product Manager

Summary TLDR

Microscaling (MX) is a family of block-scaled narrow data formats that store a shared scale per small block plus low-bit elements. Across many vision, language, speech, and recommendation benchmarks the authors show MXINT8 can replace FP32 for direct inference with almost no accuracy loss. MXFP6 enables the first demonstrations of training generative language models with sub-8-bit weights, activations, and gradients to near-FP32 parity using the same training recipe. MXFP4 mixed with MXFP6 activations gives small extra loss. A PyTorch/CUDA library and an OCP specification are provided.

Problem Statement

Modern large models are costly to run and store. Tensor-level scaling for sub-8-bit formats has limited dynamic range and hurts accuracy. The paper tests a block-level (micro) scaling scheme that aims to preserve model quality while lowering bit-width, and to do so with low integration friction.

Main Contribution

Define and evaluate MX: block-level shared scale plus narrow element types (FP8/FP6/FP4/INT8).

Show MXINT8 is a low-friction, drop-in substitute for FP32 inference on many tasks.

Key Findings

MXINT8 closely matches FP32 for direct-cast inference across many models.

NumbersGPT3 ARC easy: FP32 0.744 → MXINT8 0.740 (∆ −0.004)

Practical UseFor low-friction inference, try MXINT8 first — it often keeps accuracy within measurement noise.

Evidence RefTable 5 (GPT3-175B direct-cast)

6-bit MX (MXFP6) can train generative language models to near-FP32 loss using the same training recipe.

NumbersGPT-1.5B train loss: FP32 2.74 → MXFP6 2.75 (∆ +0.01)

Practical UseYou can train medium-to-large GPT-style models with 6-bit formats without retuning optimizer or recipe.

Evidence RefTable 7 (training losses)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.740	FP32 0.744	−0.004	ARC easy	Direct-cast inference with MXINT8 on GPT3-175B	Table 5
Accuracy	77.15	FP32 77.40	−0.25	ImageNet	Error diffusion PTQ to MXFP6	Table 3

What To Try In 7 Days

Swap FP32 → MXINT8 for inference on a representative model to measure latency/memory gains.

Run MXFP6 PTQ or short finetune on a vision or translation model to confirm accuracy parity.

Clone the Microscaling PyTorch library and run a direct-cast experiment on a small GPT model.

Optimization Features

Infra Optimization

emulation via CUDA extension for current GPUsformats designed to map to hardware-friendly 8-bit exponent scale

Model Optimization

per-block (micro) scalingsub-8-bit element formats (FP6/FP4)INT8 block-scaled format (MXINT8)

System Optimization

reduced memory footprint via block-level scalesreuse of FP32 master weights with quantized compute flow

Training Optimization

training with quantized weights/activations/gradients (MXFP6)mixed-precision training (MXFP4 weights + MXFP6 activations)

Inference Optimization

direct-cast inference (no calibration)post-training quantization with error diffusion (PTQ)

Reproducibility

Code AvailableYes

Data AvailableNo

Open Source StatusYes

LicenseUnknown

Code URLs

https://github.com/microsoft/microxcaling

Risks & Boundaries

Limitations

Transpose and quantize are non-commutative, requiring separate stored transposed tensors in some flows.

Very low-bit variants (MXFP4) can hurt accuracy on some models (e.g., MobileNet v2, some language tasks).

When Not To Use

On tiny/mobile models that showed large accuracy drops (e.g., MobileNet v2 in some settings).

When you cannot afford storing extra transposed tensors or metadata overhead.

Failure Modes

Clamping or overflow if element values exceed representable range; behavior may be implementation-defined.

Accuracy regressions for extreme low-bit formats without PTQ or finetuning.

Core Entities

Models

GPT3-175BLLaMA-7BGPT-1.5BGPT-300MGPT-150MGPT-20MBERT-BaseBERT-LargeDeiT-TinyDeiT-SmallResNet-18ResNet-50MobileNet v2Wav2Vec 2.0DLRM

Metrics

BLEUAccuracyWERAUCPerplexity / Training loss

Datasets

WMT-17WMT-16ImageNet ILSVRC12LibriSpeechCriteo TerabyteLambadaWikitext-2 (subset)ARC (easy/challenge)Hendryck's test subset

Benchmarks

LM Eval HarnessOCP Microscaling Specification

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

MXINT8 closely matches FP32 for direct-cast inference across many models.

6-bit MX (MXFP6) can train generative language models to near-FP32 loss using the same training recipe.

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Code URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Quantizing large multilingual LLMs often hides big drops for non‑Latin languages and hard tasks

Key finding

Systematic benchmark shows small models can reason if trained and compressed carefully

Key finding

Pipeline that combines synthetic-data distillation, LoRA, Muon and GPTQ to make task-specialized LLMs fit on edge devices

Key finding

A practical survey of compression and speed tricks to run large language models on limited hardware

Key finding

Large-scale empirical benchmark showing how attention variants, PEFT, MoE, and int4 quantization trade performance for memory, latency, and

Key finding