Use FP8 activations and FP4 weights to keep LLM quality while cutting memory and using H100 FP support

July 19, 2023 (7 min read)

Overview

Decision Snapshot: Ready For Pilot

The evaluation covers multiple model families and datasets using GPTQ-based PTQ and shows consistent FP8/FP4 advantages. However, the tests focus on perplexity and use a limited calibration set (128 C4 sequences), so real-world performance should be validated before rollout.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Xiaoxia Wu, Zhewei Yao, Yuxiong He


Why It Matters For Business

Switching activations to FP8 and weights to FP4 can cut memory and exploit H100 FP8 hardware while keeping model quality, which helps when deploying large LLMs on memory-constrained inference servers.

Summary TLDR

This paper shows that post-training floating-point quantization (FP8 activations, FP4/FP8 weights) preserves LLM quality better than integer quantization (INT8/INT4), especially for models with more than 1B parameters. The authors adapt GPTQ-style PTQ, add Low Rank Compensation (LoRC) to reduce quantization error, and propose power-of-two scale constraints (two methods, M1 and M2) that allow fast FP4→FP8 casting on H100 hardware with little quality loss. Tests on LLaMA and OPT (1.3B–30B) across Wikitext-2, PTB, and C4 back the claims.

Problem Statement

Post-training quantization for LLMs must reduce memory and improve inference speed without hurting output quality. Integer uniform quantization (INT8/INT4) struggles with activation outliers and skewed distributions. The paper asks: can low-bit floating-point formats (FP8, FP4) plus light corrections keep quality while enabling efficient execution on FP-capable hardware?
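To make the outlier argument above concrete, here is a minimal sketch that fake-quantizes the same activation vector with a symmetric INT8 grid and with a simulated FP8 grid, then compares mean squared error. It assumes the E4M3 variant of FP8 (3 mantissa bits, largest finite value 448); the summary does not say which FP8 encoding the paper uses, and the synthetic data (Gaussian values plus one large outlier) is purely illustrative.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)

def round_fp8_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_E4M3_MAX)
    _, ex = math.frexp(a)      # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)        # clamp to the minimum normal exponent
    step = 2.0 ** (e - 3)      # 3 mantissa bits: 8 steps per power-of-two interval
    y = min(round(a / step) * step, FP8_E4M3_MAX)
    return math.copysign(y, x)

def fake_quant_int8(xs):
    """Symmetric per-tensor INT8: uniform steps of size amax / 127."""
    scale = max(abs(v) for v in xs) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in xs]

def fake_quant_fp8(xs):
    """Per-tensor FP8: scale so amax maps to 448, then round to the FP8 grid."""
    scale = max(abs(v) for v in xs) / FP8_E4M3_MAX
    return [round_fp8_e4m3(v / scale) * scale for v in xs]

def mse(xs, ys):
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)

rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(4096)] + [100.0]  # one outlier
int8_mse = mse(acts, fake_quant_int8(acts))
fp8_mse = mse(acts, fake_quant_fp8(acts))
print(f"INT8 MSE: {int8_mse:.5f}  FP8 MSE: {fp8_mse:.5f}")
```

The single outlier stretches the INT8 step size for every value, while the floating-point grid keeps fine steps near zero where most of the values sit.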

Main Contribution

Show FP8 activations outperform INT8 for LLM PTQ, with larger gains on models >1B parameters.

Demonstrate FP4/FP8 weight quantization matches or beats INT4/INT8, enabling W4A8 FP deployment.
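The memory side of a W4A8 deployment can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement from the paper: it assumes a round 7e9 parameters, one 16-bit scale per group of 256 weights (the group size listed under System Optimization), and it ignores activations and the KV cache.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB, optionally adding per-group scales."""
    bits = n_params * bits_per_weight
    if group_size:
        # one scale of `scale_bits` bits for every `group_size` weights (assumed)
        bits += (n_params / group_size) * scale_bits
    return bits / 8 / 1e9

n_params = 7e9  # hypothetical round number for a "7B" model
fp16_gb = weight_memory_gb(n_params, 16)
w8_gb = weight_memory_gb(n_params, 8, group_size=256)
w4_gb = weight_memory_gb(n_params, 4, group_size=256)
print(f"FP16: {fp16_gb:.2f} GB  8-bit: {w8_gb:.2f} GB  4-bit: {w4_gb:.2f} GB")
```

Under these assumptions the per-group scales add only 16/256 of a bit per weight, so 4-bit weights take roughly a quarter of the FP16 footprint.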

Key Findings

FP8 activations beat INT8 activations in perplexity across models, with larger wins for larger models.

Numbers: LLaMA-7b W8A8: PPL 10.63 (INT) → 10.38 (FP); drop 0.25

Practical Use: Prefer FP8 for activations when you care about generation quality, especially on models ≥6.7B.

Evidence Ref: Main Results; Table 2

Switching weights to FP4 with FP8 activations recovers quality versus INT4+FP8 and can improve results.

Numbers: LLaMA-7b W4A8: PPL 11.48 (INT4) → 11.08 (FP4); drop 0.40

Practical Use: Use FP4 weights (with FP8 activations) to get lower memory than FP8 while keeping or improving accuracy; simpler to support on H100 than INT4.

Evidence Ref: Main Results; Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (LLaMA-7b W8A8, INT→FP) | 10.38 | 10.63 (W8A8 INT activation) | -0.25 | average over Wikitext-2/PTB/C4 (Table 2) | Main Results paragraph and Table 2 | Table 2
Perplexity (LLaMA-7b W4A8, INT4→FP4) | 11.08 | 11.48 (W4A8 INT weights) | -0.40 | average over Wikitext-2/PTB/C4 (Table 2) | Main Results paragraph and Table 2 | Table 2

What To Try In 7 Days

On a small LLM (1–7B), quantize activations to FP8 and weights to FP4 with GPTQ and evaluate PPL on a held-out set.
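For the held-out evaluation, perplexity is the exponential of the mean negative log-likelihood per token. The helper below is a minimal sketch of that formula; it assumes you already have natural-log probabilities for each target token from your own evaluation loop, and the function name is illustrative.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Sanity check: a model that assigns probability 1/4 to every target token
# is as uncertain as a uniform choice among four options.
uniform_ppl = perplexity([math.log(0.25)] * 10)
print(f"{uniform_ppl:.2f}")
```

Comparing this number between the FP16 baseline and each quantized variant on identical text gives deltas in the same form as the Results table.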

Add LoRC post-quantization to check if it recovers quality for your smaller models.
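A low-rank correction of this kind can be prototyped in a few lines. The sketch below assumes LoRC amounts to a truncated SVD of the quantization residual W - W_q, stored as two thin factors; the summary only names the technique, so treat the details as an assumption. The 4-bit integer quantizer is a simple stand-in for whatever quantizer you actually use, and the matrix size and rank are arbitrary.

```python
import numpy as np

def fake_quant_4bit(w):
    """Stand-in quantizer: symmetric per-row 4-bit integer grid."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

def low_rank_compensation(w, w_q, rank):
    """Return thin factors (u_r, v_r) whose product approximates w - w_q."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    root_s = np.sqrt(s[:rank])
    u_r = u[:, :rank] * root_s           # shape (rows, rank)
    v_r = root_s[:, None] * vt[:rank]    # shape (rank, cols)
    return u_r, v_r

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))
w_q = fake_quant_4bit(w)
u_r, v_r = low_rank_compensation(w, w_q, rank=8)

err_before = np.linalg.norm(w - w_q)
err_after = np.linalg.norm(w - (w_q + u_r @ v_r))
print(f"Frobenius error before: {err_before:.3f}  after: {err_after:.3f}")
```

The two factors add rank × (rows + cols) extra values per matrix, so the rank trades a small amount of memory for a smaller residual.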

Implement M2 (grouped power-of-two scales) to enable fast FP4→FP8 casting and measure end-to-end latency on H100.
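The appeal of power-of-two scales is that multiplying a floating-point value by 2^k only changes its exponent, which is what makes the FP4→FP8 cast cheap. The sketch below shows one plausible reading of the two variants, since this summary does not give the formulas: M1 snaps each scale to a power of two on its own, while M2 keeps the largest scale in a group unchanged and forces every ratio s_max / s_i to be a power of two. The rounding direction (never shrinking a scale, to avoid clipping) is a choice made here and may differ from the paper.

```python
import math

def pow2_scale_m1(scale):
    """M1 (assumed): snap a single scale up to the next power of two."""
    return 2.0 ** math.ceil(math.log2(scale))

def pow2_scales_m2(scales):
    """M2 (assumed): keep s_max as is; make each s_max / s_i a power of two."""
    s_max = max(scales)
    return [s_max / 2.0 ** math.floor(math.log2(s_max / s)) for s in scales]

group_scales = [0.8, 0.3, 0.11]  # hypothetical per-group weight scales
m1 = [pow2_scale_m1(s) for s in group_scales]
m2 = pow2_scales_m2(group_scales)
print("M1:", m1)
print("M2:", m2)
```

In this reading, M2 leaves one real-valued scale per group to be applied once, and the remaining scales differ from it only by exponent shifts.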

Optimization Features

Token Efficiency
Token-wise activation quantization (reduces per-token overhead)
Infra Optimization
Design for FP8 hardware (H100) to gain throughput
Model Optimization
FP8 activation quantization
FP4/FP8 weight quantization
LoRC low-rank correction
System Optimization
Target H100 FP8 execution path
Group-size 256 fine-grained quantization
Inference Optimization
Power-of-two scale constraints for fast casting
Token-wise activation quantization for latency
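Token-wise activation quantization, listed above, means each token row gets its own scale instead of sharing one scale across the whole tensor. The sketch below illustrates only the scaling granularity and uses an integer grid as a stand-in for the rounding step; the function names and the toy activations are illustrative assumptions.

```python
def quantize_rows(acts, scales, qmax=127):
    """Quantize each row with its given scale, then dequantize (fake quant)."""
    return [
        [max(-qmax, min(qmax, round(v / s))) * s for v in row]
        for row, s in zip(acts, scales)
    ]

def per_token_scales(acts, qmax=127):
    """One scale per token row, from that row's own absolute maximum."""
    return [max(abs(v) for v in row) / qmax for row in acts]

def per_tensor_scales(acts, qmax=127):
    """A single shared scale, repeated for every row."""
    amax = max(abs(v) for row in acts for v in row)
    return [amax / qmax] * len(acts)

# Two tokens: one with small activations, one containing a large outlier.
acts = [[0.01, -0.02], [50.0, -100.0]]
tokenwise = quantize_rows(acts, per_token_scales(acts))
tensorwise = quantize_rows(acts, per_tensor_scales(acts))
print("per-token :", tokenwise[0])
print("per-tensor:", tensorwise[0])
```

With a shared scale, the outlier token sets the step size for everyone and the small token collapses to zero; with per-token scales it keeps its own resolution.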

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation metric is mainly perplexity; downstream task effects not measured.

PTQ calibration uses only 128 C4 sequences (a lightweight calibration set), which may not cover all data regimes.

When Not To Use

If your deployment hardware lacks FP8/FP4 support (no H100), integer schemes may be faster.

If you need quantization-aware training or must stay compatible with integer-only (INT4) toolchains.

Failure Modes

Outlier activations not covered by calibration may still cause quality drops.

Power-of-two scaling (M1) can degrade accuracy more than grouped (M2) if applied naively.

Core Entities

Models

LLaMA-1.3b, LLaMA-3b, LLaMA-7b, LLaMA-13b, LLaMA-30b, OPT-1.3b, OPT-6.7b, OPT-13b, OPT-30b

Metrics

Perplexity (lower is better)

Datasets

Wikitext-2, PTB, C4