Use FP8 activations and FP4 weights to keep LLM quality while cutting memory and using H100 FP support

July 19, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

8

Authors

Xiaoxia Wu, Zhewei Yao, Yuxiong He

Links

Abstract / PDF

Why It Matters For Business

Switching activations to FP8 and weights to FP4 can cut memory and exploit H100 FP8 hardware while keeping model quality, which suits deploying large LLMs on memory-constrained inference servers.

Summary TLDR

This paper shows that post-training floating-point quantization (FP8 activations, FP4/FP8 weights) preserves LLM quality better than integer quantization (INT8/INT4), especially for models with more than 1B parameters. The authors adapt GPTQ-style PTQ, add Low Rank Compensation (LoRC) to reduce quantization error, and propose two power-of-two scale constraints (M1 and M2) that allow fast FP4→FP8 casting on H100 hardware with little quality loss. Tests on LLaMA and OPT (1.3B–30B) across Wikitext-2, PTB, and C4 support these claims.

Problem Statement

Post-training quantization for LLMs must reduce memory and improve inference speed without hurting output quality. Integer uniform quantization (INT8/INT4) struggles with activation outliers and skewed distributions. The paper asks: can low-bit floating-point formats (FP8, FP4) plus light corrections keep quality while enabling efficient execution on FP-capable hardware?
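
The outlier problem can be illustrated with a small simulation. The following is a minimal sketch, not the paper's code: it assumes symmetric per-tensor scaling, synthetic Gaussian activations with a few injected outliers, and a simulated E4M3-style FP8 grid (4 exponent bits, 3 mantissa bits, maximum 448). The function names `fake_quant_int8` and `fake_quant_fp8` are hypothetical.

```python
import numpy as np

def fake_quant_int8(x):
    """Symmetric per-tensor INT8: uniform steps across the whole range."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def fake_quant_fp8(x, exp_bits=4, man_bits=3, max_val=448.0):
    """Simulated FP8 (E4M3-style grid): step size grows with magnitude."""
    scale = np.abs(x).max() / max_val          # map the largest value to max_val
    y = np.clip(x / scale, -max_val, max_val)
    mag = np.abs(y)
    min_exp = 1 - (2 ** (exp_bits - 1) - 1)    # smallest normal exponent (-6)
    # Per-value exponent, floored at min_exp so subnormals share one step.
    e = np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp)))
    step = 2.0 ** (e - man_bits)
    return np.sign(y) * np.round(mag / step) * step * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=4096)
acts[:4] = 60.0                                # a few large outliers (synthetic)

mse_int8 = np.mean((acts - fake_quant_int8(acts)) ** 2)
mse_fp8 = np.mean((acts - fake_quant_fp8(acts)) ** 2)
print(f"INT8 MSE: {mse_int8:.5f}  FP8 MSE: {mse_fp8:.5f}")
```

With outliers stretching the range, the uniform INT8 grid spends most of its levels on values that rarely occur, while the floating-point grid keeps fine steps near zero where most activations lie.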

Main Contribution

Show FP8 activations outperform INT8 for LLM PTQ, with larger gains on models >1B parameters.

Demonstrate FP4/FP8 weight quantization matches or beats INT4/INT8, enabling W4A8 FP deployment.

Introduce two power-of-two scale constraints (M1, M2) to allow efficient FP4→FP8 casting; M2 preserves quality better than M1.

Show Low Rank Compensation (LoRC) reduces W4A8 quantization errors, particularly in smaller models.

Provide systematic tests across LLaMA and OPT families on Wikitext-2, PTB, and C4.

Key Findings

FP8 activations beat INT8 activations in perplexity across models, with larger wins for larger models.

Numbers: LLaMA-7b W8A8: PPL 10.63 (INT) → 10.38 (FP); drop 0.25

Switching weights to FP4 with FP8 activations recovers quality versus INT4+FP8 and can improve results.

Numbers: LLaMA-7b W4A8: PPL 11.48 (INT4) → 11.08 (FP4); drop 0.40

LoRC reduces W4A8 quantization error and helps smaller models recover quality under constrained settings.

Numbers: Small-model improvements observed (e.g., small LLaMA/OPT sizes show ~0.1–0.4 PPL recovery)

Constraining weight scales to powers-of-two (for efficient bit-shift casting) causes only minor quality loss if done carefully.

Numbers: W4A8 with power-of-two scales shows small PPL change vs unconstrained; M2 performs better than M1
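
To see why power-of-two scales make casting cheap, consider the sketch below. It is an illustration only and does not reproduce the paper's M1 or M2 procedures: it assumes each scale is simply rounded up to the next power of two, and the helper name `pow2_scale` is hypothetical. Multiplying a float by a power of two changes only its exponent field, which `np.ldexp` makes explicit.

```python
import numpy as np

def pow2_scale(scale):
    """Round each positive scale up to the nearest power of two (assumed rule)."""
    return 2.0 ** np.ceil(np.log2(scale))

scales = np.array([0.013, 0.07, 0.4])      # illustrative per-group scales
p2 = pow2_scale(scales)

# frexp writes p2 as mantissa * 2**exponent with mantissa in [0.5, 1);
# for an exact power of two the mantissa is always 0.5.
mantissa, exponent = np.frexp(p2)
shift = exponent - 1                        # p2 == 2.0 ** shift

x = np.array([1.5, -2.25, 3.0])
scaled_by_multiply = x * p2
scaled_by_shift = np.ldexp(x, shift)        # exponent adjustment only

print(p2, shift)
```

Because the two results are bit-identical, dequantization under this constraint needs no floating-point multiply, only an exponent adjustment. The cost is that each scale can be up to 2× larger than the unconstrained one, which is where the small quality loss comes from.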

Results

Perplexity

Value: LLaMA-7b W8A8 INT→FP: 10.63 → 10.38

Baseline: W8A8 INT activation

Perplexity

Value: LLaMA-7b W4A8 INT4→FP4: 11.48 → 11.08

Baseline: W4A8 INT weights

Perplexity (scale constraint effect)

Value: W4A8 FP4 weights with power-of-two scales: minor PPL changes vs unconstrained

Baseline: unconstrained FP4 scales

Who Should Care

What To Try In 7 Days

On a small LLM (1–7B), quantize activations to FP8 and weights to FP4 with GPTQ and evaluate PPL on a held-out set.

Add LoRC post-quantization to check if it recovers quality for your smaller models.

Implement M2 (grouped power-of-two scales) to enable fast FP4→FP8 casting and measure end-to-end latency on H100.
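
For the LoRC step, a low-rank correction can be sketched as a truncated SVD of the quantization error. The code below is a minimal illustration under stated assumptions: the quantizer `fake_quant_4bit` is a simple per-row integer stand-in (not GPTQ and not FP4), the weights are random, and the function names and the rank of 8 are hypothetical choices.

```python
import numpy as np

def fake_quant_4bit(w):
    """Stand-in quantizer: symmetric per-row 4-bit integer rounding."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def low_rank_compensation(w, w_q, rank):
    """Approximate the error w - w_q with two thin matrices via truncated SVD."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    root = np.sqrt(s[:rank])                # split singular values across factors
    u_hat = u[:, :rank] * root              # shape (rows, rank)
    v_hat = root[:, None] * vt[:rank]       # shape (rank, cols)
    return u_hat, v_hat

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))             # synthetic weight matrix
w_q = fake_quant_4bit(w)
u_hat, v_hat = low_rank_compensation(w, w_q, rank=8)

err_before = np.linalg.norm(w - w_q)
err_after = np.linalg.norm(w - (w_q + u_hat @ v_hat))
print(f"error before: {err_before:.3f}  after: {err_after:.3f}")
```

The correction adds `rank * (rows + cols)` extra parameters per layer and one extra thin matmul at inference, so it is worth measuring both the perplexity gain and the overhead on your own model.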

Optimization Features

Token Efficiency

  • Token-wise activation quantization (reduces per-token overhead)

Infra Optimization

  • Design for FP8 hardware (H100) to gain throughput

Model Optimization

  • FP8 activation quantization
  • FP4/FP8 weight quantization
  • LoRC low-rank correction

System Optimization

  • Target H100 FP8 execution path
  • Group-size 256 fine-grained quantization
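
The storage cost of group-wise scales is easy to estimate. The arithmetic sketch below assumes one 16-bit scale per group of 256 weights and a 7-billion-parameter model; both numbers are illustrative assumptions rather than figures from the paper, and `weight_bytes` is a hypothetical helper.

```python
def weight_bytes(n_params, bits, group_size=None, scale_bits=16):
    """Bytes needed for weights, plus one scale per group if group_size is set."""
    total_bits = n_params * bits
    if group_size is not None:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8

n = 7e9                                          # illustrative parameter count
fp16_bytes = weight_bytes(n, 16)
w4_bytes = weight_bytes(n, 4, group_size=256)    # 4-bit weights + group scales

print(f"FP16: {fp16_bytes / 1e9:.2f} GB")
print(f"4-bit, group 256: {w4_bytes / 1e9:.2f} GB")
print(f"effective bits per weight: {w4_bytes * 8 / n:.4f}")
```

Under these assumptions the scales add 16/256 = 0.0625 bits per weight, so the group size of 256 costs little memory relative to the 4-bit payload.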

Inference Optimization

  • Power-of-two scale constraints for fast casting
  • Token-wise activation quantization for latency
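
Token-wise quantization gives each token (row of the activation matrix) its own scale. The sketch below compares it with a single per-tensor scale. It is a simplified illustration: it uses INT8 rounding as a stand-in for the number format, synthetic activations, and one artificially large token, all of which are assumptions.

```python
import numpy as np

def quant_int8(x, scale):
    """Symmetric INT8 fake-quantization with a given (broadcastable) scale."""
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 512))            # (tokens, hidden), synthetic
acts[3] *= 50.0                             # one token with much larger magnitude

tensor_scale = np.abs(acts).max() / 127.0                       # one scale overall
token_scale = np.abs(acts).max(axis=1, keepdims=True) / 127.0   # one per token

mse_tensor = np.mean((acts - quant_int8(acts, tensor_scale)) ** 2)
mse_token = np.mean((acts - quant_int8(acts, token_scale)) ** 2)
print(f"per-tensor MSE: {mse_tensor:.5f}  per-token MSE: {mse_token:.5f}")
```

With a per-tensor scale, the single large token sets the step size for every other token; per-token scales confine that cost to the row that needs it.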

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation metric is mainly perplexity; downstream task effects not measured.
  • PTQ calibration uses 128 C4 sentences, a lightweight set that may not cover all data regimes.
  • FP formats used are QTorch variants; slight differences vs NVIDIA H100 FP8 exist.
  • Latency claims assume H100-like FP8 hardware support; other accelerators may differ.

When Not To Use

  • If your deployment hardware lacks FP8/FP4 support (no H100), integer schemes may be faster.
  • If you need quantization-aware training or compatibility with lower-bit integer toolchains (e.g., INT4-only).
  • When downstream tasks require metrics beyond perplexity without further validation.

Failure Modes

  • Outlier activations not covered by calibration may still cause quality drops.
  • Power-of-two scaling (M1) can degrade accuracy more than grouped (M2) if applied naively.
  • LoRC adds compute and memory overhead; may not help large models as much as small ones.

Core Entities

Models

  • LLaMA-1.3b
  • LLaMA-3b
  • LLaMA-7b
  • LLaMA-13b
  • LLaMA-30b
  • OPT-1.3b
  • OPT-6.7b
  • OPT-13b
  • OPT-30b

Metrics

  • Perplexity (lower better)
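
For reference, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have natural-log token probabilities from your model (the function name `perplexity` is illustrative):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_log_probs = [math.log(0.25)] * 10
ppl = perplexity(uniform_log_probs)
print(ppl)
```

Comparing this value before and after quantization on the same held-out text is how the deltas reported above are measured.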

Datasets

  • Wikitext-2
  • PTB
  • C4