Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
8
Why It Matters For Business
Switching activations to FP8 and weights to FP4 can cut memory use and exploit H100 FP8 hardware while preserving model quality, which helps when deploying large LLMs on memory-constrained inference servers.
Summary TLDR
This paper shows that post-training floating-point quantization (FP8 activations, FP4/FP8 weights) preserves LLM quality better than integer quantization (INT8/INT4), especially for models >1B parameters. The authors adapt GPTQ-style PTQ, add Low Rank Compensation (LoRC) to reduce quantization error, and propose two power-of-two scale constraints (M1 and M2) that allow fast FP4→FP8 casting on H100 hardware with little quality loss. Tests on LLaMA and OPT (1.3B–30B) across Wikitext-2, PTB, and C4 support these claims.
Problem Statement
Post-training quantization for LLMs must reduce memory and improve inference speed without hurting output quality. Integer uniform quantization (INT8/INT4) struggles with activation outliers and skewed distributions. The paper asks: can low-bit floating-point formats (FP8, FP4) plus light corrections keep quality while enabling efficient execution on FP-capable hardware?
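The outlier problem described above can be illustrated with a small NumPy simulation. This is a sketch under stated assumptions, not the paper's setup: `quantize_fp8_e4m3` is a hypothetical helper that mimics an E4M3-like grid (maximum 448, no NaN/Inf handling), which may differ slightly from both the QTorch and H100 formats, and the synthetic activations with four injected outliers are invented for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 fake quantization (quantize, then dequantize)."""
    qmax = 127
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quantize_fp8_e4m3(x, exp_bits=4, man_bits=3, max_value=448.0):
    """Simulated E4M3-like fake quantization (assumed format, no NaN/Inf)."""
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias                      # subnormals share this exponent
    scale = np.max(np.abs(x)) / max_value   # map abs-max onto the largest value
    y = x / scale
    with np.errstate(divide="ignore"):      # log2(0) = -inf is handled below
        exp = np.floor(np.log2(np.abs(y)))
    exp = np.maximum(exp, min_exp)
    step = 2.0 ** (exp - man_bits)          # grid spacing grows with magnitude
    q = np.clip(np.round(y / step) * step, -max_value, max_value)
    return q * scale

def compare_mse(seed=0):
    """Return (int8_mse, fp8_mse) on synthetic activations with outliers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=4096)
    x[:4] = [100.0, -100.0, 80.0, -60.0]    # illustrative outliers
    int8_mse = np.mean((x - quantize_int8(x)) ** 2)
    fp8_mse = np.mean((x - quantize_fp8_e4m3(x)) ** 2)
    return int8_mse, fp8_mse
```

With a uniform integer grid, the outliers stretch the step size for every value; the floating-point grid keeps finer spacing near zero, where most activations lie.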
Main Contribution
Show FP8 activations outperform INT8 for LLM PTQ, with larger gains on models >1B parameters.
Demonstrate FP4/FP8 weight quantization matches or beats INT4/INT8, enabling W4A8 FP deployment.
Introduce two power-of-two scale constraints (M1, M2) to allow efficient FP4→FP8 casting; M2 is better.
Show Low Rank Compensation (LoRC) reduces W4A8 quantization errors, particularly in smaller models.
Provide systematic tests across LLaMA and OPT families on Wikitext-2, PTB, and C4.
Key Findings
FP8 activations beat INT8 activations in perplexity across models, with larger wins for larger models.
Switching weights to FP4 with FP8 activations recovers quality versus INT4+FP8 and can improve results.
LoRC reduces W4A8 quantization error and helps smaller models recover quality under constrained settings.
Constraining weight scales to powers-of-two (for efficient bit-shift casting) causes only minor quality loss if done carefully.
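A minimal sketch of what power-of-two scale constraints could look like follows. The two functions are hypothetical and reflect one plausible reading of M1 (round each scale up to a power of two independently) and M2 (keep the group maximum as a real number and force every other scale to be that maximum times a power of two); the paper's exact definitions should be checked before relying on this.

```python
import numpy as np

def pow2_scales_m1(scales):
    """M1-style (assumed): round every scale up to a power of two on its own."""
    return 2.0 ** np.ceil(np.log2(scales))

def pow2_scales_m2(scales):
    """M2-style (assumed): within a group, keep the largest scale unchanged
    and make every other scale equal to it times a power of two."""
    s_max = scales.max()
    shifts = np.ceil(np.log2(scales / s_max))   # integers <= 0
    return s_max * 2.0 ** shifts

# Illustrative per-group scales (made-up values).
scales = np.array([0.013, 0.07, 0.2, 0.031])
m1 = pow2_scales_m1(scales)
m2 = pow2_scales_m2(scales)
```

In both variants, the ratio between any two scales in a group is a power of two, so rescaling a value from one scale to another only changes its exponent field, which is what makes a shift-based cast possible.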
Results
[Figures: Perplexity; Perplexity (scale constraint effect)]
Who Should Care
What To Try In 7 Days
On a small LLM (1–7B), quantize activations to FP8 and weights to FP4 with GPTQ and evaluate PPL on a held-out set.
Add LoRC post-quantization to check if it recovers quality for your smaller models.
Implement M2 (grouped power-of-two scales) to enable fast FP4→FP8 casting and measure end-to-end latency on H100.
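For the LoRC step, a rough NumPy sketch is given below. It assumes LoRC means approximating the quantization error matrix with a truncated SVD and adding the two low-rank factors back at inference time; `fake_quantize` is a simple round-to-nearest stand-in rather than GPTQ, and both function names and the rank of 8 are illustrative choices.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Stand-in quantizer: symmetric per-row round-to-nearest (not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def low_rank_compensation(w, w_q, rank=8):
    """Return factors (u_hat, v_hat) whose product approximates w - w_q."""
    err = w - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    u_hat = u[:, :rank] * sqrt_s           # shape (rows, rank)
    v_hat = sqrt_s[:, None] * vt[:rank]    # shape (rank, cols)
    return u_hat, v_hat

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))             # illustrative weight matrix
w_q = fake_quantize(w)
u_hat, v_hat = low_rank_compensation(w, w_q, rank=8)
w_compensated = w_q + u_hat @ v_hat
```

The two factors add `rank * (rows + cols)` extra parameters per layer, which is the compute and memory overhead to weigh against the recovered quality.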
Optimization Features
Token Efficiency
- Token-wise activation quantization (reduces per-token overhead)
Infra Optimization
- Design for FP8 hardware (H100) to gain throughput
Model Optimization
- FP8 activation quantization
- FP4/FP8 weight quantization
- LoRC low-rank correction
System Optimization
- Target H100 FP8 execution path
- Group-size 256 fine-grained quantization
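Group-wise weight quantization with a group size of 256 might be sketched as follows. The FP4 grid is assumed to be E2M1 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), plain nearest-value rounding replaces GPTQ, and `quantize_fp4_grouped` is a hypothetical helper name.

```python
import numpy as np

# Representable magnitudes of an E2M1 FP4 format (assumed here).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_grouped(w, group_size=256):
    """Fake-quantize each row in groups of `group_size` columns.

    Returns the dequantized weights and one scale per (row, group).
    """
    rows, cols = w.shape
    if cols % group_size != 0:
        raise ValueError("column count must be a multiple of group_size")
    g = w.reshape(rows, cols // group_size, group_size)
    # One scale per group, mapping the group's abs-max onto the largest grid value.
    scale = np.abs(g).max(axis=-1, keepdims=True) / FP4_E2M1_GRID[-1]
    mag = np.abs(g) / scale
    # Nearest grid value for every element.
    idx = np.abs(mag[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    q = np.sign(g) * FP4_E2M1_GRID[idx]
    return (q * scale).reshape(rows, cols), scale.squeeze(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 512))              # illustrative weights
w_hat, group_scales = quantize_fp4_grouped(w, group_size=256)
```

Smaller groups give each scale fewer values to cover, at the cost of storing more scales.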
Inference Optimization
- Power-of-two scale constraints for fast casting
- Token-wise activation quantization for latency
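Token-wise activation quantization can be contrasted with per-tensor quantization in a few lines. The sketch below is illustrative only: it uses a uniform 8-bit grid as a stand-in for the FP8 grid so that the scale granularity is the only thing that varies, and all function names are hypothetical.

```python
import numpy as np

def fake_quant(x, scale, qmax=127):
    """Round to a uniform grid with the given scale, then dequantize."""
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quantize_per_tensor(x, qmax=127):
    """One scale shared by every token in the activation matrix."""
    return fake_quant(x, np.abs(x).max() / qmax, qmax)

def quantize_per_token(x, qmax=127):
    """One scale per token (row), computed from that token's abs-max."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return fake_quant(x, scale, qmax)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))   # 4 tokens, hidden size 64 (made-up shapes)
x[0] *= 100.0                  # one token with much larger activations

# Error on the three ordinary tokens under each scheme.
err_tensor = np.mean((x[1:] - quantize_per_tensor(x)[1:]) ** 2)
err_token = np.mean((x[1:] - quantize_per_token(x)[1:]) ** 2)
```

With a shared scale, the single large token sets the step size for all tokens; per-token scales confine its effect to its own row.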
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation metric is mainly perplexity; downstream task effects not measured.
- PTQ calibration uses only 128 C4 sentences, which may not cover all data regimes.
- FP formats used are QTorch variants; slight differences vs NVIDIA H100 FP8 exist.
- Latency claims assume H100-like FP8 hardware support; other accelerators may differ.
When Not To Use
- If your deployment hardware lacks FP8/FP4 support (no H100), integer schemes may be faster.
- If you need quantization-aware training or compatibility with lower-bit integer toolchains (e.g. INT4-only).
- When downstream tasks require metrics beyond perplexity without further validation.
Failure Modes
- Outlier activations not covered by calibration may still cause quality drops.
- Per-scale power-of-two rounding (M1) can degrade accuracy more than grouped scaling (M2) if applied naively.
- LoRC adds compute and memory overhead; may not help large models as much as small ones.
Core Entities
Models
- LLaMA-1.3b
- LLaMA-3b
- LLaMA-7b
- LLaMA-13b
- LLaMA-30b
- OPT-1.3b
- OPT-6.7b
- OPT-13b
- OPT-30b
Metrics
- Perplexity (lower is better)
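As a reminder of how the metric is computed, perplexity is the exponential of the mean negative log-likelihood per token. The helper below is a generic sketch, not the paper's evaluation script; it assumes natural-log token probabilities have already been obtained from the model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)
```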
Datasets
- Wikitext-2
- PTB
- C4

