Overview
The evaluation covers multiple model families and datasets using GPTQ-based PTQ and shows consistent FP8/FP4 advantages. However, the tests focus on perplexity and use a small calibration set (128 C4 sequences), so real-world performance should be validated separately.
Citations: 8
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Switching activations to FP8 and weights to FP4 can cut memory use and exploit H100 FP8 hardware while preserving model quality, which helps when deploying large LLMs on memory-constrained inference servers.
Summary TLDR
This paper shows that post-training floating-point quantization (FP8 activations, FP4/FP8 weights) preserves LLM quality better than integer quantization (INT8/INT4), especially for models with more than 1B parameters. The authors adapt GPTQ-style PTQ, add Low Rank Compensation (LoRC) to reduce quantization error, and propose two power-of-two scale constraints (M1 and M2) that allow fast FP4→FP8 casting on H100 hardware with little quality loss. Tests on LLaMA and OPT (1.3B–30B) across Wikitext-2, PTB, and C4 support the claims.
Problem Statement
Post-training quantization for LLMs must reduce memory and improve inference speed without hurting output quality. Integer uniform quantization (INT8/INT4) struggles with activation outliers and skewed distributions. The paper asks: can low-bit floating-point formats (FP8, FP4) plus light corrections keep quality while enabling efficient execution on FP-capable hardware?
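To make the outlier argument concrete, here is a minimal pure-Python sketch that compares simulated INT8 and FP8 rounding on toy activations containing one large outlier. It is an illustration under stated assumptions, not the paper's pipeline: the FP8 layout is assumed to be E4M3 (maximum 448), scaling is a single per-tensor absmax scale, and the names `quantize_fp`, `fake_quant_int8`, and `fake_quant_fp8` are hypothetical.

```python
import math


def quantize_fp(x, exp_bits=4, man_bits=3, max_val=448.0):
    """Round x to the nearest value of a simulated low-bit float format.

    Defaults mimic FP8 E4M3 (an assumption); values saturate at max_val.
    """
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias                      # smallest normal exponent
    a = min(abs(x), max_val)                # saturate instead of overflowing
    _, e = math.frexp(a)                    # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, min_exp)               # below min_exp the grid is subnormal
    step = 2.0 ** (exp - man_bits)          # spacing of representable values
    return math.copysign(min(round(a / step) * step, max_val), x)


def fake_quant_int8(xs):
    """Symmetric per-tensor INT8: uniform grid set by the largest magnitude."""
    scale = max(abs(x) for x in xs) / 127.0
    return [max(-127, min(127, round(x / scale))) * scale for x in xs]


def fake_quant_fp8(xs):
    """Per-tensor FP8: map the largest magnitude onto the format maximum."""
    scale = max(abs(x) for x in xs) / 448.0
    return [quantize_fp(x / scale) * scale for x in xs]


def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy "activations": many small values plus a single large outlier.
acts = [0.01 * i for i in range(1, 101)] + [60.0]
int8_err = mse(acts, fake_quant_int8(acts))
fp8_err = mse(acts, fake_quant_fp8(acts))
print(f"INT8 MSE: {int8_err:.6f}  FP8 MSE: {fp8_err:.6f}")
```

The outlier stretches the uniform INT8 grid so that most small values collapse onto a few levels, while the floating-point grid keeps roughly constant relative precision. Calling `quantize_fp(x, exp_bits=2, man_bits=1, max_val=6.0)` simulates an FP4 layout, assumed here to be E2M1.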
Main Contribution
Show FP8 activations outperform INT8 for LLM PTQ, with larger gains on models >1B parameters.
Demonstrate FP4/FP8 weight quantization matches or beats INT4/INT8, enabling W4A8 FP deployment.
Key Findings
FP8 activations beat INT8 activations in perplexity across models, with larger wins for larger models.
Switching weights to FP4 with FP8 activations recovers quality versus INT4+FP8 and can improve results.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity | LLaMA-7b W8A8 INT→FP: 10.63 → 10.38 | W8A8 INT activation | -0.25 | average over Wikitext-2/PTB/C4 (Table 2) | Main Results paragraph and Table 2 | Table 2 |
| Perplexity | LLaMA-7b W4A8 INT4→FP4: 11.48 → 11.08 | W4A8 INT weights | -0.40 | average over Wikitext-2/PTB/C4 (Table 2) | Main Results paragraph and Table 2 | Table 2 |
What To Try In 7 Days
On a small LLM (1–7B), quantize activations to FP8 and weights to FP4 with GPTQ and evaluate PPL on a held-out set.
Add LoRC post-quantization to check if it recovers quality for your smaller models.
Implement M2 (grouped power-of-two scales) to enable fast FP4→FP8 casting and measure end-to-end latency on H100.
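For the LoRC step in the list above, the following is a minimal sketch assuming NumPy. It uses plain round-to-nearest per-row INT4 as a stand-in for GPTQ-quantized weights, and the function names, matrix size, and rank are hypothetical choices for illustration. The idea shown is to factor the quantization error with a truncated SVD and add the low-rank product back to the quantized weights.

```python
import numpy as np


def fake_quantize_int4(w):
    """Stand-in quantizer (assumption): symmetric per-row round-to-nearest INT4."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale


def lorc_factors(w, w_q, rank):
    """Return two thin matrices whose product approximates the error w - w_q."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    root = np.sqrt(s[:rank])                 # split singular values across both factors
    u_hat = u[:, :rank] * root               # shape (rows, rank)
    v_hat = root[:, None] * vt[:rank, :]     # shape (rank, cols)
    return u_hat, v_hat


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))                # toy weight matrix
w_q = fake_quantize_int4(w)
u_hat, v_hat = lorc_factors(w, w_q, rank=8)

err_before = np.linalg.norm(w - w_q)
err_after = np.linalg.norm(w - (w_q + u_hat @ v_hat))
print(f"Frobenius error before: {err_before:.4f}  after: {err_after:.4f}")
```

The two factors add `rank * (rows + cols)` extra parameters per matrix, so the rank trades memory against recovered accuracy. Whether the recovered weight error translates into lower perplexity has to be measured on the actual model.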
Risks & Boundaries
Limitations
Evaluation metric is mainly perplexity; downstream task effects not measured.
PTQ calibration uses only 128 C4 sequences (a lightweight calibration set), which may not cover all data regimes.
When Not To Use
If your deployment hardware lacks FP8/FP4 support (no H100), integer schemes may be faster.
If you need quantization-aware training or compatibility with integer-only toolchains (for example, INT4-only).
Failure Modes
Outlier activations not covered by calibration may still cause quality drops.
Power-of-two scaling (M1) can degrade accuracy more than the grouped variant (M2) if applied naively.
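The sketch below illustrates why power-of-two scales make the FP4→FP8 cast cheap, and what per-scale versus grouped snapping might look like. It is an interpretation, not the paper's definition of M1 and M2: `snap_pow2` rounds each scale up to a power of two on its own, while `snap_group_pow2` keeps the group's largest scale and forces every other scale to differ from it by a power-of-two ratio. The FP4 grid is assumed to be E2M1 and the FP8 target E4M3; all names and example scales are hypothetical.

```python
import math

FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # assumed FP4 grid


def snap_pow2(scale):
    """Per-scale variant: round a positive scale up to the next power of two."""
    return 2.0 ** math.ceil(math.log2(scale))


def snap_group_pow2(scales):
    """Grouped variant: keep the largest scale, make all ratios powers of two."""
    s_max = max(scales)
    return [s_max * 2.0 ** math.ceil(math.log2(s / s_max)) for s in scales]


def representable_e4m3(x):
    """True if x sits exactly on the (assumed) FP8 E4M3 grid."""
    if x == 0.0:
        return True
    a = abs(x)
    if a > 448.0:
        return False
    _, e = math.frexp(a)                     # a = m * 2**e with 0.5 <= m < 1
    step = 2.0 ** (max(e - 1, -6) - 3)       # 3 mantissa bits, min normal exponent -6
    return (a / step).is_integer()


scales = [0.013, 0.02, 0.0071, 0.05]
per_scale = [snap_pow2(s) for s in scales]
grouped = snap_group_pow2(scales)

# Multiplying by 2**k only shifts the exponent, so FP4 values stay on the FP8 grid.
exact_with_pow2 = all(representable_e4m3(v * 2.0 ** -3) for v in FP4_E2M1_VALUES)
exact_with_raw = all(representable_e4m3(v * 0.013) for v in FP4_E2M1_VALUES)
print(per_scale, grouped, exact_with_pow2, exact_with_raw)
```

With an arbitrary real scale, the product generally needs rounding to land on the FP8 grid, which is the extra work the power-of-two constraint avoids. In the grouped variant, one real-valued multiplier per group can be applied once, outside the cast.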

