Use FP8 activations and FP4 weights to keep LLM quality while cutting memory and using H100 FP support

Overview

Decision Snapshot: Ready For Pilot

The evaluation covers multiple model families and datasets using GPTQ-based PTQ and shows consistent FP8/FP4 advantages. However, the tests focus on perplexity and use a small calibration set (128 C4 sequences), so real-world performance should be validated before rollout.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Xiaoxia Wu, Zhewei Yao, Yuxiong He

Why It Matters For Business

Switching activations to FP8 and weights to FP4 can cut memory and exploit H100 FP8 hardware while keeping model quality, which helps when deploying large LLMs on constrained inference servers.

Who Should Care

ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

This paper shows that post-training floating-point quantization (FP8 activations, FP4/FP8 weights) preserves LLM quality better than integer quantization (INT8/INT4), especially for models >1B parameters. The authors adapt GPTQ-style PTQ, add Low Rank Compensation (LoRC) to reduce quantization error, and propose power-of-two scale constraints (two methods, M1 and M2) that allow fast FP4→FP8 casting on H100 hardware with little quality loss. Tests on LLaMA and OPT (1.3B–30B) across Wikitext-2, PTB, and C4 support these claims.

Problem Statement

Post-training quantization for LLMs must reduce memory and improve inference speed without hurting output quality. Integer uniform quantization (INT8/INT4) struggles with activation outliers and skewed distributions. The paper asks: can low-bit floating-point formats (FP8, FP4) plus light corrections keep quality while enabling efficient execution on FP-capable hardware?
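To make the outlier problem concrete, here is a minimal sketch that compares a uniform INT8 grid with a simplified FP8-like grid on synthetic activations containing one large outlier. The E4M3-style layout (3 mantissa bits, maximum value 448, subnormals below 2^-6), the synthetic data, and all function names are assumptions for illustration; this is not the paper's quantizer and is not bit-exact with any hardware.

```python
import math
import random

def quantize_int8(values):
    """Symmetric uniform INT8 with one scale for the whole vector."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) * scale for v in values]

def round_to_e4m3(x):
    """Round to the nearest point on a simplified E4M3-like grid (assumed layout)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), 448.0)              # clip to the assumed maximum value
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)                  # below 2**-6 the grid is subnormal
    step = 2.0 ** (exp - 3)               # 3 mantissa bits: 8 steps per binade
    return math.copysign(min(round(mag / step) * step, 448.0), x)

def quantize_fp8(values):
    """Scale so the largest magnitude maps to 448, then round on the FP grid."""
    scale = max(abs(v) for v in values) / 448.0
    return [round_to_e4m3(v / scale) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [60.0]   # one outlier
mse_int8 = mse(acts, quantize_int8(acts))
mse_fp8 = mse(acts, quantize_fp8(acts))
print(f"INT8 MSE: {mse_int8:.5f}  FP8-like MSE: {mse_fp8:.5f}")
```

The uniform grid spends its 255 levels evenly across the range set by the outlier, so small values are rounded coarsely. The floating-point grid keeps roughly constant relative precision, which is why its error on the bulk of the values is lower in this toy setup.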

Main Contribution

Show FP8 activations outperform INT8 for LLM PTQ, with larger gains on models >1B parameters.

Demonstrate FP4/FP8 weight quantization matches or beats INT4/INT8, enabling W4A8 FP deployment.

Key Findings

FP8 activations beat INT8 activations in perplexity across models, with larger wins for larger models.

Numbers: LLaMA-7b W8A8: PPL 10.63 (INT) → 10.38 (FP); drop 0.25

Practical Use: Prefer FP8 for activations when you care about generation quality, especially on models ≥6.7B.

Evidence Ref: Main Results; Table 2

Switching weights to FP4 with FP8 activations recovers quality versus INT4+FP8 and can improve results.

Numbers: LLaMA-7b W4A8: PPL 11.48 (INT4) → 11.08 (FP4); drop 0.40

Practical Use: Use FP4 weights (with FP8 activations) to get lower memory than FP8 while keeping or improving accuracy; simpler to support on H100 than INT4.

Evidence Ref: Main Results; Table 2
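The memory claim can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage only (no activations or KV cache) for a nominal 7B-parameter model. The exact parameter count and the choice of one 16-bit scale per group are assumptions for illustration; the group size of 256 is the value reported in this summary.

```python
def weight_memory_gb(num_params, weight_bits, group_size=None, scale_bits=16):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, add one scale of scale_bits per group
    (an assumed storage format, not taken from the paper).
    """
    bits = num_params * weight_bits
    if group_size:
        bits += (num_params / group_size) * scale_bits
    return bits / 8 / 1e9

params = 7e9  # nominal size, assumed for illustration
fp16_gb = weight_memory_gb(params, 16)
fp8_gb = weight_memory_gb(params, 8, group_size=256)
fp4_gb = weight_memory_gb(params, 4, group_size=256)
print(f"FP16: {fp16_gb:.2f} GB, FP8: {fp8_gb:.2f} GB, FP4: {fp4_gb:.2f} GB")
```

Under these assumptions the per-group scales add well under 0.1 GB, so 4-bit weights take roughly a quarter of the FP16 footprint and about half of the FP8 footprint.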

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity	LLaMA-7b W8A8 INT→FP: 10.63 → 10.38	W8A8 INT activation	-0.25	average over Wikitext-2/PTB/C4 (Table 2)	Main Results paragraph and Table 2	Table 2
Perplexity	LLaMA-7b W4A8 INT4→FP4: 11.48 → 11.08	W4A8 INT weights	-0.40	average over Wikitext-2/PTB/C4 (Table 2)	Main Results paragraph and Table 2	Table 2

What To Try In 7 Days

On a small LLM (1–7B), quantize activations to FP8 and weights to FP4 with GPTQ and evaluate PPL on a held-out set.

Add LoRC post-quantization to check if it recovers quality for your smaller models.

Implement M2 (grouped power-of-two scales) to enable fast FP4→FP8 casting and measure end-to-end latency on H100.
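For the LoRC step, the general idea of low-rank compensation is to approximate the quantization error E = W - Q(W) with a low-rank matrix and add it back at inference. The sketch below illustrates that idea in pure Python with a rank-1 term found by power iteration. The uniform 4-bit quantizer is a stand-in (not FP4 and not GPTQ), and the rank, matrix size, and helper names are assumptions; a real pipeline would typically use an SVD from a numerical library and a higher rank.

```python
import random

def quantize_row(row, levels=7):
    """Stand-in 4-bit symmetric uniform quantizer (not FP4, not GPTQ)."""
    scale = max(abs(w) for w in row) / levels
    return [round(w / scale) * scale for w in row]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rank1_approx(E, iters=50, seed=0):
    """Rank-1 approximation of E via power iteration on E^T E."""
    rng = random.Random(seed)
    Et = [list(col) for col in zip(*E)]
    v = [rng.gauss(0, 1) for _ in E[0]]
    for _ in range(iters):
        w = matvec(Et, matvec(E, v))
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]          # v stays unit length
    Ev = matvec(E, v)                      # equals sigma * u
    return [[ev_i * v_j for v_j in v] for ev_i in Ev]

def frob(A):
    return sum(x * x for row in A for x in row) ** 0.5

rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
W_q = [quantize_row(row) for row in W]
E = [[w - q for w, q in zip(rw, rq)] for rw, rq in zip(W, W_q)]
C = rank1_approx(E)                        # low-rank correction term
residual = [[e - c for e, c in zip(re, rc)] for re, rc in zip(E, C)]
err_before = frob(E)
err_after = frob(residual)
print(f"error before: {err_before:.4f}, after rank-1 correction: {err_after:.4f}")
```

In practice the correction would be stored as two thin factors rather than a dense matrix, so the extra memory stays small relative to the quantized weights. How much error it recovers depends on how much low-rank structure the error has, which is worth measuring per model.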

Optimization Features

Token Efficiency

Token-wise activation quantization (reduces per-token overhead)

Infra Optimization

Design for FP8 hardware (H100) to gain throughput

Model Optimization

FP8 activation quantization, FP4/FP8 weight quantization, LoRC low-rank correction

System Optimization

Target H100 FP8 execution path, Group-size 256 fine-grained quantization

Inference Optimization

Power-of-two scale constraints for fast casting, Token-wise activation quantization for latency
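Token-wise activation quantization means each token's hidden vector gets its own scale instead of sharing one scale across the whole tensor. The sketch below shows the scale computation only. It uses a uniform 8-bit grid as a stand-in for the FP8 grid, and the toy activations and function names are assumptions for illustration.

```python
def quantize_with_scale(values, scale, qmax=127):
    """Round to a uniform grid (stand-in for an FP8 grid) and dequantize."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def per_tensor(acts, qmax=127):
    """One scale shared by every token."""
    scale = (max(abs(v) for row in acts for v in row) / qmax) or 1.0
    return [quantize_with_scale(row, scale, qmax) for row in acts]

def per_token(acts, qmax=127):
    """One scale per row, where each row is one token's hidden vector."""
    out = []
    for row in acts:
        scale = (max(abs(v) for v in row) / qmax) or 1.0  # guard all-zero rows
        out.append(quantize_with_scale(row, scale, qmax))
    return out

def max_abs_error(original, quantized):
    return max(abs(a - b) for a, b in zip(original, quantized))

acts = [
    [0.01, -0.02, 0.015, 0.005],   # quiet token
    [50.0, -3.0, 12.0, 0.5],       # token with an outlier channel
]
quiet_err_tensor = max_abs_error(acts[0], per_tensor(acts)[0])
quiet_err_token = max_abs_error(acts[0], per_token(acts)[0])
print(f"quiet token error: per-tensor {quiet_err_tensor:.5f}, per-token {quiet_err_token:.7f}")
```

With a shared scale, the outlier in the second token sets the step size for the first token too, so its small values collapse toward zero. Per-token scales isolate that effect at the cost of computing one extra maximum per token at runtime.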

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/microsoft/DeepSpeed

Risks & Boundaries

Limitations

Evaluation metric is mainly perplexity; downstream task effects not measured.

PTQ calibration uses 128 C4 sequences (a lightweight calibration set), which may not cover all data regimes.

When Not To Use

If your deployment hardware lacks FP8/FP4 support (no H100), integer schemes may be faster.

If you need quantization-aware training or compatibility with integer-only toolchains (for example, INT4-only).

Failure Modes

Outlier activations not covered by calibration may still cause quality drops.

Power-of-two scaling (M1) can degrade accuracy more than grouped (M2) if applied naively.
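Multiplying a floating-point number by a power of two only changes its exponent, which is why power-of-two scales can make casting cheap. The sketch below contrasts two ways of imposing that constraint: snapping every scale directly to a power of two, and keeping one real-valued reference scale per block while expressing the other scales as the reference times a power of two. Both functions are one plausible reading of the direct and grouped variants, written as assumptions; the paper's exact M1 and M2 definitions may differ, and the example scales are made up.

```python
import math

def snap_up_pow2(scale):
    """Smallest power of two >= scale (rounding up avoids extra clipping)."""
    return 2.0 ** math.ceil(math.log2(scale))

def snap_relative_pow2(scales):
    """Keep the largest scale as a real-valued reference; every other scale
    becomes reference * 2**k with integer k <= 0 (assumed grouped variant)."""
    ref = max(scales)
    return [ref * 2.0 ** math.ceil(math.log2(s / ref)) for s in scales]

scales = [0.013, 0.021, 0.0095, 0.030]      # hypothetical per-group scales
direct = [snap_up_pow2(s) for s in scales]
relative = snap_relative_pow2(scales)

# Inflation = snapped scale / original scale; larger means coarser steps.
direct_inflation = sum(d / s for d, s in zip(direct, scales)) / len(scales)
relative_inflation = sum(r / s for r, s in zip(relative, scales)) / len(scales)
print(f"mean inflation: direct {direct_inflation:.3f}, relative {relative_inflation:.3f}")
```

On these example scales the relative variant inflates the scales less on average because the largest scale is kept exactly. That is not guaranteed for every set of scales, so the accuracy impact of either variant is worth measuring on your own model.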

Core Entities

Models

LLaMA-1.3b, LLaMA-3b, LLaMA-7b, LLaMA-13b, LLaMA-30b, OPT-1.3b, OPT-6.7b, OPT-13b, OPT-30b

Metrics

Perplexity (lower is better)

Datasets

Wikitext-2, PTB, C4
