Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
FlattenQuant cuts compute time and GPU memory in large-batch/long-sequence inference by enabling INT4/INT8 TensorCore math, which can lower infrastructure cost and increase throughput when hardware supports INT4.
Summary TLDR
FlattenQuant flattens (expands) channels that contain outlier values so that the whole tensor becomes amenable to per-tensor quantization. This lets large linear layers run mixed INT4/INT8 matrix multiplies on TensorCores instead of FP16. On OPT-family models the authors convert ~48% of linear layers to INT4 with small accuracy loss, report up to 2× inference speedup and up to 2.3× GPU memory reduction in compute-bound settings, and show that the tensor flatten step adds negligible overhead. The method works best when inference is compute-bound and the hardware supports INT4 TensorCores.
Problem Statement
Current LLM quantization often targets memory bottlenecks but still uses FP16 for compute, so inference with large batch sizes or long sequences becomes compute-bound and slow. Fine-grained (per-channel/group) quantization preserves accuracy but prevents direct low-bit matrix multiplication on TensorCores. The goal is accurate low-bit per-tensor quantization, so that matrix multiplies run at INT4/INT8 speeds without large accuracy loss.
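To make the accuracy problem concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The helper names and the toy values are illustrative assumptions, not taken from the paper. It shows how a single outlier stretches the shared scale so far that the remaining values collapse to zero at INT4.

```python
def quantize_per_tensor(values, bits):
    """Symmetric per-tensor quantization: one scale shared by all values."""
    qmax = 2 ** (bits - 1) - 1              # 7 for INT4, 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def mean_abs_error(values, bits):
    """Average |original - dequantized| after a quantize/dequantize round trip."""
    q, scale = quantize_per_tensor(values, bits)
    restored = [qi * scale for qi in q]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Toy data (hypothetical): a "flat" tensor, and the same tensor plus one outlier.
flat = [0.9, -0.7, 0.5, -0.3, 0.8, -0.6]
with_outlier = flat + [60.0]

err_flat = mean_abs_error(flat, bits=4)
err_outlier = mean_abs_error(with_outlier, bits=4)
print(f"INT4 error without outlier: {err_flat:.3f}")
print(f"INT4 error with outlier:    {err_outlier:.3f}")
```

With the outlier present, the scale is set by 60.0, so every small value rounds to the zero level. Per-channel quantization avoids this by giving each channel its own scale, which is exactly what blocks a single low-bit GEMM over the whole tensor.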
Main Contribution
Introduce FlattenQuant: detect channels with large values, expand those channels, and repeat the corresponding weight channels so that tensors become flatter and per-tensor quantizable.
Show a quantization pipeline that mixes INT4 and INT8 per-tensor quantization, plus channel smoothing and outlier suppression, to preserve accuracy.
Demonstrate practical gains: ~48% of linear layers in OPT can run in INT4, giving up to 2× speedup and up to 2.3× GPU memory reduction in compute-bound inference.
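The flatten-and-repeat idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the splitting rule (clip each value to the threshold and carry the remainder into the next copy), the function names, and the threshold value are all hypothetical. The point it demonstrates is that splitting an activation channel into several bounded channels and repeating the matching weight row leaves the matrix product unchanged.

```python
import math


def flatten_activation(x, threshold):
    """Split channels whose max |value| exceeds `threshold` into several
    channels with entries in [-threshold, threshold].

    Returns the flattened activation and, for every new channel, the index
    of the original channel it came from."""
    n_channels = len(x[0])
    copies = []
    for j in range(n_channels):
        col_max = max(abs(row[j]) for row in x)
        copies.append(max(1, math.ceil(col_max / threshold)))
    source = [j for j in range(n_channels) for _ in range(copies[j])]

    flat = []
    for row in x:
        new_row = []
        for j in range(n_channels):
            remaining = row[j]
            for _ in range(copies[j]):
                # Assumed rule: take at most `threshold`, carry the rest over.
                piece = max(-threshold, min(threshold, remaining))
                new_row.append(piece)
                remaining -= piece
        flat.append(new_row)
    return flat, source


def repeat_weight_rows(w, source):
    """Repeat weight rows so they line up with the flattened channels."""
    return [list(w[j]) for j in source]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


x = [[1.0, -9.0, 0.5],      # channel 1 holds the outliers
     [-2.0, 5.0, 1.5]]
w = [[1.0, 2.0],
     [0.5, -1.0],
     [3.0, 0.0]]

x_flat, source = flatten_activation(x, threshold=4.0)
w_rep = repeat_weight_rows(w, source)
print(source)                 # which original channel each new channel maps to
print(matmul(x, w))
print(matmul(x_flat, w_rep))  # same product, but no entry of x_flat exceeds 4.0
```

Because the flattened activation has a much smaller maximum, a single per-tensor scale wastes far fewer quantization levels. The cost is a wider inner dimension, which is why the flatten ratio matters for memory.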
Key Findings
FlattenQuant can convert roughly half of transformer linear layers to INT4 with small accuracy loss.
Inference speedup reaches about 2× versus FP16 in compute-bound settings.
GPU memory use drops by up to 2.3× in compute-bound settings and is lower than with both FP16 and SmoothQuant.
Low-bit GEMM is much faster than FP16 GEMM and tensor flattening is cheap.
Channel smoothing and suppressing outliers improve quantization accuracy.
Results
INT4 layer coverage
Inference speed (compute-bound)
GPU memory reduction
GEMM latency
Flatten operation overhead
Who Should Care
What To Try In 7 Days
Run a calibration pass and compute per-layer KL scores to test which layers tolerate INT4, following the paper’s γ threshold.
Implement channel smoothing and outlier suppression on a small model to measure PPL/accuracy change before full rollout.
Benchmark mixed INT4/INT8 per-tensor GEMM on A100-class GPUs (for example with CUTLASS kernels) to measure real speedups.
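As a starting point for the first step above, the sketch below scores a layer by the KL divergence between the histogram of its calibration values and the histogram of the same values after an INT4 quantize/dequantize round trip, then picks INT4 when the score is at most γ. The exact KL definition, the bin count, and the value of γ used in the paper are not given here, so treat all three as assumptions. The layer names and calibration values are made up.

```python
import math


def fake_quantize(values, bits):
    """Symmetric per-tensor quantize followed by dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def histogram(values, lo, hi, bins):
    """Fixed-range histogram; values outside [lo, hi] go to the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) over two histograms, with additive smoothing `eps`."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def choose_bits(values, gamma, bins=16):
    """Return (bits, kl): INT4 if the INT4 KL score is at most gamma, else INT8."""
    lo, hi = min(values), max(values)
    if hi == lo:
        hi = lo + 1.0                     # avoid a zero-width histogram
    ref = histogram(values, lo, hi, bins)
    kl4 = kl_divergence(ref, histogram(fake_quantize(values, 4), lo, hi, bins))
    return (4 if kl4 <= gamma else 8), kl4


# Hypothetical calibration samples for two layers.
calib = {
    "layer_a": [i / 10 for i in range(-20, 21)],
    "layer_b": [i / 100 for i in range(-50, 51)] + [40.0],
}
gamma = 0.05                              # assumed threshold, tune per model
plan = {name: choose_bits(vals, gamma)[0] for name, vals in calib.items()}
print(plan)
```

Raising `gamma` in this sketch assigns more layers to INT4, which matches the failure mode of γ being set too large. In practice the scores would come from real activations collected during the calibration pass.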
Optimization Features
Infra Optimization
- implemented with CUTLASS kernels on A100 GPUs
- requires TensorCores that support INT4
Model Optimization
- per-tensor INT4/INT8 quantization
- channel smoothing (move scales between activation and weight)
- flattening outlier channels and repeating weight channels
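Channel smoothing can be illustrated with the SmoothQuant-style rule, in which each input channel j gets a scale s_j = max|X_j|^α / max|W_j|^(1-α), activations are divided by s_j, and the matching weight row is multiplied by it. Whether FlattenQuant uses exactly this formula or this α is an assumption here. The sketch only shows that the scale moves between activation and weight without changing the product.

```python
def smooth_channels(x, w, alpha=0.5):
    """Move per-channel scale from activations into weights.

    x: list of rows (tokens x channels); w: list of rows (channels x outputs).
    Returns (x_smoothed, w_smoothed, scales)."""
    n_channels = len(w)
    scales = []
    for j in range(n_channels):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        if act_max == 0 or w_max == 0:
            scales.append(1.0)            # nothing to balance for this channel
        else:
            scales.append(act_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / scales[j] for j in range(n_channels)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(n_channels)]
    return x_s, w_s, scales


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


x = [[1.0, 64.0],            # channel 1 has a much larger activation range
     [-2.0, 16.0]]
w = [[1.0, 0.5],
     [0.25, -0.25]]

x_s, w_s, scales = smooth_channels(x, w, alpha=0.5)
print(scales)
print(matmul(x, w))
print(matmul(x_s, w_s))      # same product up to floating-point rounding
```

With α = 0.5 the activation and weight maxima of each channel end up equal, so neither side carries all of the dynamic range. Smoothing narrows the spread between channels, and flattening then handles the outliers that remain.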
System Optimization
- operator fusion advised (flatten + repeat + GEMM in one kernel)
Inference Optimization
- use INT4/INT8 GEMM on TensorCores
- select per-layer precision via KL-divergence threshold γ
- pad flattened channels to multiples of 32 for block GEMM
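The padding step in the last bullet can be sketched as follows. Zero-filling is an assumption (it is the natural choice because zero columns and rows contribute nothing to the product), and the helper name is hypothetical. The multiple of 32 comes from the bullet above.

```python
def pad_channels(x, w, multiple=32):
    """Zero-pad the shared inner dimension of x (tokens x K) and w (K x N)
    up to the next multiple of `multiple`."""
    k = len(w)
    padded_k = -(-k // multiple) * multiple      # ceiling to a multiple
    extra = padded_k - k
    n_out = len(w[0])
    x_pad = [list(row) + [0.0] * extra for row in x]
    w_pad = [list(row) for row in w] + [[0.0] * n_out for _ in range(extra)]
    return x_pad, w_pad


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Toy case: 35 channels after flattening, padded up to 64.
x = [[float(i % 5) for i in range(35)],
     [float(i % 3) for i in range(35)]]
w = [[1.0, -1.0] for _ in range(35)]

x_pad, w_pad = pad_channels(x, w)
print(len(x_pad[0]), len(w_pad))
print(matmul(x, w) == matmul(x_pad, w_pad))
```

Padding is the reason flattening has a memory cost beyond the expanded channels themselves: in this toy case 35 channels become 64, so the flatten ratio and the block size should be considered together.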
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Benefits mostly in compute-bound scenarios (large batches or long sequences).
- Requires hardware with INT4 TensorCore support and custom low-bit GEMM kernels.
- Does not quantize key-value cache to 8/4 bits in experiments; KV cache still often needs higher precision.
- Flattening increases channel count (padding), so aggressive flattening can raise memory use.
When Not To Use
- When workloads are memory-bound and FP16 already fits and performs well.
- On hardware without INT4/INT8 TensorCore support.
- When absolute highest accuracy is required and per-channel quantization is mandated.
Failure Modes
- Over-aggressive INT4 allocation (γ too large) causes noticeable accuracy drop (Table 10).
- Choosing β too small forces >30% flatten ratio and raises GPU memory without commensurate accuracy gain (Table 8).
- Poor outlier suppression can bias truncation threshold and hurt quantization scaling.
Core Entities
Models
- OPT-125M
- OPT-1.3B
- OPT-6.7B
- OPT-13B
- OPT-30B
- OPT-66B
Metrics
- accuracy
- perplexity
- latency (ms)
- GPU memory (GB)
- INT4 layer ratio (%)
- flatten ratio (%)
Datasets
- OpenBookQA
- LAMBADA (OpenAI)
- PIQA
- HellaSwag
- WinoGrande
- WikiText2
- lm-eval-harness (evaluation framework rather than a dataset)
Benchmarks
- accuracy
- language modeling perplexity
- inference latency

