Use channel flattening to enable per-tensor INT4/INT8 math and halve compute time for large-batch LLM inference

February 28, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

Paper provides quantitative latency, memory, and accuracy results on OPT models and ablations; results are reproducible in similar GPU environments but require INT4 TensorCores and custom kernels.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yi Zhang, Fei Yang, Shuang Peng, Fangyu Wang, Aimin Pan

Why It Matters For Business

FlattenQuant cuts compute time and GPU memory in large-batch/long-sequence inference by enabling INT4/INT8 TensorCore math, which can lower infrastructure cost and increase throughput when hardware supports INT4.

Summary (TL;DR)

FlattenQuant flattens (expands) channels that contain outlier values so the whole tensor becomes amenable to per-tensor quantization. That lets large linear layers run mixed INT4/INT8 matrix multiplies on TensorCores instead of FP16. On OPT-family models the authors convert ~48% of linear layers to INT4 with small accuracy loss, report up to 2× inference speedup and up to 2.3× GPU memory reduction in compute-bound settings, and show that the tensor flatten step adds negligible overhead. The method works best when inference is compute-bound and the hardware supports INT4 TensorCores.

Problem Statement

Current LLM quantization often targets memory bottlenecks but still uses FP16 for compute, so large batch sizes or long sequences become compute-bound and slow. Fine-grained (per-channel/group) quantization preserves accuracy but prevents direct low-bit matrix multiplication on TensorCores. The problem: enable accurate low-bit per-tensor quantization so matrix multiplies run at INT4/INT8 speeds without large accuracy loss.
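To make the accuracy problem concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The activation values are hypothetical and the helper names are illustrative, not taken from the paper. It shows how a single outlier stretches the shared scale so that the remaining values lose nearly all resolution at INT4.

```python
def quantize_per_tensor(values, bits):
    """Symmetric per-tensor quantization: one scale shared by every element."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4, 127 for INT8
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def mean_abs_error(values, bits):
    """Average absolute error after a quantize/dequantize round trip."""
    codes, scale = quantize_per_tensor(values, bits)
    restored = [c * scale for c in codes]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical activations: a flat tensor, and the same tensor with one outlier.
flat = [0.5, -0.8, 0.3, 0.9, -0.6, 0.7]
with_outlier = flat + [40.0]

err_flat = mean_abs_error(flat, bits=4)
err_outlier = mean_abs_error(with_outlier, bits=4)
print(f"INT4 error, flat tensor:    {err_flat:.4f}")
print(f"INT4 error, with outlier:   {err_outlier:.4f}")
```

In the outlier case every small value rounds to code 0, which is the failure that per-channel schemes avoid and that flattening is meant to remove while keeping a single scale.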

Main Contribution

Introduce FlattenQuant: detect channels with large values, expand those channels and repeat weight channels so tensors become flatter and per-tensor quantizable.

Show a quantization pipeline that mixes INT4 and INT8 per-tensor quantization, plus channel smoothing and outlier suppression, to preserve accuracy.
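The flattening idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the even split of an outlier channel into k copies, the `threshold` parameter, and the function names are all hypothetical, and the paper's exact splitting rule may differ. The point it demonstrates is that expanding an activation channel and repeating the matching weight row leaves the matrix product unchanged while lowering the tensor's peak value.

```python
import math


def flatten_channels(X, W, threshold):
    """Split outlier activation channels and repeat the matching weight rows.

    X: list of rows, shape (tokens, c_in). W: list of rows, shape (c_in, c_out).
    Returns (X_flat, W_flat) whose product equals X @ W up to float rounding.
    """
    # Copies per input channel: 1 when the channel already fits under threshold.
    copies = []
    for j in range(len(W)):
        peak = max(abs(row[j]) for row in X)
        copies.append(max(1, math.ceil(peak / threshold)))

    X_flat = []
    for row in X:
        new_row = []
        for j, k in enumerate(copies):
            new_row.extend([row[j] / k] * k)        # even split (assumption)
        X_flat.append(new_row)

    W_flat = []
    for j, k in enumerate(copies):
        W_flat.extend(list(W[j]) for _ in range(k))  # repeat weight row k times
    return X_flat, W_flat


def matmul(A, B):
    """Plain nested-list matrix multiply, used only to check equivalence."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical data: channel 1 carries the outliers.
X = [[0.5, 12.0, -0.2],
     [0.1, -9.0, 0.4]]
W = [[1.0, 2.0],
     [0.5, -1.0],
     [3.0, 0.0]]

X_flat, W_flat = flatten_channels(X, W, threshold=4.0)
print("input channels before/after:", len(W), len(W_flat))
print("peak |x| after flattening:", max(abs(v) for row in X_flat for v in row))
```

A lower threshold gives a flatter tensor but more channels, which is the memory trade-off listed under Failure Modes.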

Key Findings

FlattenQuant can convert roughly half of transformer linear layers to INT4 with small accuracy loss.

Numbers: 48.29% INT4 layers on OPT-30B (Table 4)

Practical Use: You can run about half of large-model linear ops at INT4 precision to cut compute time while keeping accuracy near baseline.

Evidence Ref: Table 4

Inference speedup reaches about 2× versus FP16 in compute-bound settings.

Numbers: up to 2× speedup (Abstract, Figure 4)

Practical Use: For large-batch or long-sequence workloads, replace FP16 matmuls with FlattenQuant's INT4/INT8 per-tensor math to cut latency roughly in half.

Evidence Ref: Figure 4, Abstract

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
INT4 layer coverage | ≈48.29% of linear layers | - | - | OPT-30B (Table 4) | Table 4 shows INT4 layers 48.29% for OPT-30B | Table 4
Inference speed (compute-bound) | up to 2× faster | FP16 | ≈2× | OPT models, large batch/long sequence (Figure 4) | Figure 4 and Abstract report up to 2× speedup vs FP16 | Figure 4

What To Try In 7 Days

Run a calibration pass and compute per-layer KL scores to test which layers tolerate INT4, following the paper’s γ threshold.

Implement channel smoothing and outlier suppression on a small model to measure PPL/accuracy change before full rollout.

Benchmark mixed INT4/INT8 per-tensor GEMM on your A100-like GPUs (or CUTLASS kernels) to measure real speedups.
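The per-layer precision step from the first item can be prototyped as below. This is a hedged sketch: the layer names and KL scores are made-up placeholders, and how the paper computes each layer's KL score is not specified in this summary, so `kl_divergence` is only the textbook definition for two discrete distributions. The one rule taken from the source is that a larger γ assigns more layers to INT4.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def assign_precision(kl_scores, gamma):
    """Map each layer to 4 or 8 bits: INT4 only when its KL score is within gamma."""
    return {name: (4 if score <= gamma else 8) for name, score in kl_scores.items()}


def int4_ratio(plan):
    """Fraction of layers assigned to INT4."""
    return sum(1 for bits in plan.values() if bits == 4) / len(plan)


# Hypothetical per-layer KL scores from a calibration pass (illustrative only).
kl_scores = {
    "layer0.q_proj": 0.002,
    "layer0.fc1": 0.015,
    "layer0.fc2": 0.090,
    "layer1.q_proj": 0.004,
}

# Sweep gamma to see how the INT4 share grows as the threshold loosens.
for gamma in (0.001, 0.01, 0.1):
    plan = assign_precision(kl_scores, gamma)
    print(f"gamma={gamma}: INT4 ratio = {int4_ratio(plan):.0%}")
```

Sweeping γ like this and re-measuring perplexity at each setting is one way to find the point where extra INT4 layers start to cost accuracy.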

Optimization Features

Infra Optimization: implemented with CUTLASS kernels on A100 GPUs; requires TensorCores that support INT4
Model Optimization: per-tensor INT4/INT8 quantization; channel smoothing (move scales between activation and weight); flattening outlier channels and repeating weight channels
System Optimization: operator fusion advised (flatten + repeat + GEMM in one kernel)
Inference Optimization: use INT4/INT8 GEMM on TensorCores; select per-layer precision via KL-divergence threshold γ; pad flattened channels to multiples of 32 for block GEMM
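Two of the items above, channel smoothing and padding to multiples of 32, are small enough to sketch directly. The scale formula used here, s_j = max|X_j|^α / max|W_j|^(1−α), is borrowed from SmoothQuant as an assumption; this summary does not state the exact formula FlattenQuant uses, and `alpha`, the function names, and the data are illustrative.

```python
def smoothing_scales(X, W, alpha=0.5):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_peak = max(abs(row[j]) for row in X)
        w_peak = max(abs(w) for w in W[j])
        if act_peak == 0 or w_peak == 0:
            scales.append(1.0)                      # nothing to rebalance
        else:
            scales.append(act_peak ** alpha / w_peak ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation channels by s_j and multiply weight rows by s_j."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in W[j]] for j, s in enumerate(scales)]
    return X_s, W_s


def pad_to_multiple(n, multiple=32):
    """Round a channel count up to the next multiple (32 for block GEMM)."""
    return -(-n // multiple) * multiple


# Hypothetical data: activation channel 1 is much larger than channel 0.
X = [[0.5, 12.0],
     [0.1, -9.0]]
W = [[1.0, 2.0],
     [0.5, -1.0]]

scales = smoothing_scales(X, W)
X_s, W_s = apply_smoothing(X, W, scales)
print("activation peak before:", max(abs(v) for row in X for v in row))
print("activation peak after: ", max(abs(v) for row in X_s for v in row))
print("33 flattened channels pad to:", pad_to_multiple(33))
```

Because each activation channel is divided by the same factor that multiplies its weight row, the layer output is unchanged; only the difficulty of quantization moves from activations to weights.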

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Benefits mostly in compute-bound scenarios (large batches or long sequences).

Requires hardware with INT4 TensorCore support and custom low-bit GEMM kernels.

When Not To Use

When workloads are memory-bound and FP16 already fits and performs well.

On hardware without INT4/INT8 TensorCore support.

Failure Modes

Over-aggressive INT4 allocation (γ too large) causes noticeable accuracy drop (Table 10).

Choosing β too small forces >30% flatten ratio and raises GPU memory without commensurate accuracy gain (Table 8).

Core Entities

Models

OPT-125M, OPT-1.3B, OPT-6.7B, OPT-13B, OPT-30B, OPT-66B

Metrics

Accuracy, perplexity, latency (ms), GPU memory (GB), INT4 layer ratio (%), flatten ratio (%)

Datasets

OpenBookQA, LAMBADA (OpenAI), PIQA, HellaSwag, WinoGrande, WikiText2, lm-eval-harness

Benchmarks

Accuracy, language modeling perplexity, inference latency