Overview
The paper reports quantitative latency, memory, and accuracy results on OPT models, along with ablations. The results should be reproducible in similar GPU environments, but they depend on INT4 TensorCores and custom kernels.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
FlattenQuant cuts compute time and GPU memory in large-batch/long-sequence inference by enabling INT4/INT8 TensorCore math, which can lower infrastructure cost and increase throughput when hardware supports INT4.
Summary TLDR
FlattenQuant flattens (expands) channels that contain outlier values so that the whole tensor becomes amenable to per-tensor quantization. That lets large linear layers run mixed INT4/INT8 matrix multiplies on TensorCores instead of FP16. On OPT-family models the authors convert ~48% of linear layers to INT4 with small accuracy loss, report up to 2× inference speedup and up to 2.3× GPU memory reduction in compute-bound settings, and show that the tensor-flattening step adds negligible overhead. The method works best when inference is compute-bound and the hardware supports INT4 TensorCores.
Problem Statement
Current LLM quantization often targets memory bottlenecks but still uses FP16 for compute, so large batch sizes or long sequences become compute-bound and slow. Fine-grained (per-channel/group) quantization preserves accuracy but prevents direct low-bit matrix multiplication on TensorCores. The problem: enable accurate low-bit per-tensor quantization so matrix multiplies run at INT4/INT8 speeds without large accuracy loss.
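A minimal sketch of why outlier channels make per-tensor quantization hard: it compares the 4-bit reconstruction error of a single tensor-wide scale against per-channel scales. The data is synthetic and the helper names (`quantize_dequantize`, `per_tensor_error`, `per_channel_error`) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def quantize_dequantize(x, scale, bits):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def per_tensor_error(x, bits):
    """Mean squared error when one scale is shared by the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return float(np.mean((x - quantize_dequantize(x, scale, bits)) ** 2))

def per_channel_error(x, bits):
    """Mean squared error when each column (channel) has its own scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=0, keepdims=True) / qmax
    return float(np.mean((x - quantize_dequantize(x, scale, bits)) ** 2))

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 64))   # synthetic activations (tokens x channels)
acts[:, 3] *= 50.0                  # one synthetic outlier channel

err_tensor = per_tensor_error(acts, bits=4)
err_channel = per_channel_error(acts, bits=4)
print(f"per-tensor MSE:  {err_tensor:.4f}")
print(f"per-channel MSE: {err_channel:.4f}")
```

With a single shared scale, the outlier channel stretches the quantization grid so far that the ordinary channels lose almost all resolution. Per-channel scales avoid that, but they are the fine-grained scheme that blocks direct low-bit GEMM.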
Main Contribution
Introduce FlattenQuant: detect channels with large values, expand those channels and repeat weight channels so tensors become flatter and per-tensor quantizable.
Show a quantization pipeline that mixes INT4 and INT8 per-tensor quantization, plus channel smoothing and outlier suppression, to preserve accuracy.
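The flattening idea can be sketched as follows. This is one possible realization under stated assumptions, not the authors' implementation: a channel whose largest magnitude exceeds a threshold is split into several clipped pieces that sum back to the original, and the matching weight row is repeated once per piece so the matrix product is unchanged. The function name `flatten_channels` and the fixed `threshold` argument are hypothetical.

```python
import numpy as np

def flatten_channels(x, w, threshold):
    """Split activation channels whose max |value| exceeds `threshold`.

    x: (tokens, channels) activations; w: (channels, out_features) weights.
    Returns (x_flat, w_flat) such that x_flat @ w_flat equals x @ w.
    """
    x_cols, w_rows = [], []
    for j in range(x.shape[1]):
        col = x[:, j]
        # Number of pieces needed so every piece stays within the threshold.
        pieces = max(1, int(np.ceil(np.abs(col).max() / threshold)))
        remaining = col.copy()
        for _ in range(pieces):
            part = np.clip(remaining, -threshold, threshold)
            x_cols.append(part)
            w_rows.append(w[j])          # repeat the matching weight row
            remaining = remaining - part
    return np.stack(x_cols, axis=1), np.stack(w_rows, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))
x[:, 2] *= 20.0                          # synthetic outlier channel
w = rng.normal(size=(6, 4))

x_flat, w_flat = flatten_channels(x, w, threshold=4.0)
print("channels before/after:", x.shape[1], x_flat.shape[1])
print("max |x| before/after:", np.abs(x).max(), np.abs(x_flat).max())
print("product preserved:", np.allclose(x @ w, x_flat @ w_flat))
```

The flattened tensor has a much smaller dynamic range, which is what makes a single per-tensor scale workable; the price is the extra channels, which grow both the activation and the weight matrix.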
Key Findings
FlattenQuant can convert roughly half of transformer linear layers to INT4 with small accuracy loss.
Inference speedup reaches about 2× versus FP16 in compute-bound settings.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| INT4 layer coverage | 48.29% of linear layers | — | — | OPT-30B (Table 4) | Table 4 reports 48.29% of OPT-30B linear layers in INT4 | Table 4 |
| Inference speed (compute-bound) | up to 2× faster | FP16 | ≈2× | OPT models, large batch/long sequence (Figure 4) | Figure 4 and Abstract report up to 2× speedup vs FP16 | Figure 4 |
What To Try In 7 Days
Run a calibration pass and compute per-layer KL scores to test which layers tolerate INT4, following the paper’s γ threshold.
Implement channel smoothing and outlier suppression on a small model to measure PPL/accuracy change before full rollout.
Benchmark mixed INT4/INT8 per-tensor GEMM on your A100-class GPUs (for example, with CUTLASS kernels) to measure real speedups.
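The per-layer KL check in the first step could be prototyped along these lines. The paper's exact KL definition and γ value are not reproduced here; this sketch assumes the score is the histogram KL divergence between a layer's FP output and its simulated INT4 output, and that a layer is assigned INT4 when the score is at most `gamma`. All function names are hypothetical.

```python
import numpy as np

def fake_quant(x, bits):
    """Simulate symmetric per-tensor quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def kl_divergence(p_samples, q_samples, bins=128, eps=1e-8):
    """Histogram-based KL(P || Q) over a shared value range."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p / p.sum() + eps                # smooth empty bins
    q = q / q.sum() + eps
    p, q = p / p.sum(), q / q.sum()      # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))

def int4_kl_score(acts, weight):
    """KL between the FP layer output and the simulated INT4 output."""
    ref = acts @ weight
    approx = fake_quant(acts, 4) @ fake_quant(weight, 4)
    return kl_divergence(ref.ravel(), approx.ravel())

def choose_bits(layers, gamma):
    """Assumed rule: INT4 if the KL score is at most gamma, otherwise INT8."""
    choices = {}
    for name, (acts, weight) in layers.items():
        score = int4_kl_score(acts, weight)
        choices[name] = (4 if score <= gamma else 8, score)
    return choices

rng = np.random.default_rng(0)
layers = {
    "fc1": (rng.normal(size=(128, 32)), rng.normal(size=(32, 16))),
    "fc2": (rng.normal(size=(128, 16)), rng.normal(size=(16, 32))),
}
for name, (bits, score) in choose_bits(layers, gamma=0.05).items():
    print(f"{name}: INT{bits} (KL score {score:.4f})")
```

Under this rule a larger `gamma` pushes more layers to INT4, so sweeping it on calibration data gives a direct view of the accuracy versus INT4-coverage trade-off.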
Risks & Boundaries
Limitations
Benefits mostly in compute-bound scenarios (large batches or long sequences).
Requires hardware with INT4 TensorCore support and custom low-bit GEMM kernels.
When Not To Use
When workloads are memory-bound and FP16 already fits and performs well.
On hardware without INT4/INT8 TensorCore support.
Failure Modes
Over-aggressive INT4 allocation (γ too large) causes a noticeable accuracy drop (Table 10).
Choosing β too small forces a flatten ratio above 30% and raises GPU memory use without a commensurate accuracy gain (Table 8).
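The memory side of the β trade-off can be explored with a rough sketch like the one below. It assumes "flatten ratio" means the relative growth in channel count, and it sweeps the clipping threshold directly rather than β, since the mapping from β to the threshold is defined in the paper and not reproduced here. The data is synthetic and `flatten_ratio` is a hypothetical helper.

```python
import numpy as np

def flatten_ratio(x, threshold):
    """Fraction of extra channels added when channel j is split into
    ceil(max|x_j| / threshold) pieces (at least one piece per channel)."""
    channel_max = np.abs(x).max(axis=0)
    pieces = np.maximum(1, np.ceil(channel_max / threshold)).astype(int)
    return float(pieces.sum() / x.shape[1] - 1.0)

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 128))   # synthetic activations (tokens x channels)
acts[:, :4] *= 30.0                  # four synthetic outlier channels

for t in (4.0, 8.0, 16.0, 32.0):
    print(f"threshold={t:5.1f}  flatten ratio={flatten_ratio(acts, t):.1%}")
```

A lower threshold flattens the tensor more aggressively but adds channels to both activations and weights, which is the memory growth this failure mode describes.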

