Use channel flattening to enable per-tensor INT4/INT8 math and halve compute time for large-batch LLM inference

February 28, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Yi Zhang, Fei Yang, Shuang Peng, Fangyu Wang, Aimin Pan

Why It Matters For Business

FlattenQuant cuts compute time and GPU memory in large-batch/long-sequence inference by enabling INT4/INT8 TensorCore math, which can lower infrastructure cost and increase throughput when hardware supports INT4.

Summary TLDR

FlattenQuant flattens (expands) channels that contain outlier values so the whole tensor becomes amenable to per-tensor quantization. That lets large linear layers run mixed INT4/INT8 matrix multiplies on TensorCores instead of FP16. On OPT-family models the authors convert ~48% of linear layers to INT4 with small accuracy loss, report up to 2× inference speedup and up to 2.3× GPU memory reduction in compute-bound settings, and show that the tensor-flattening step adds little overhead. The method works best when inference is compute-bound and the hardware supports INT4 TensorCores.

Problem Statement

Current LLM quantization often targets memory bottlenecks but still uses FP16 for compute, so large batch sizes or long sequences become compute-bound and slow. Fine-grained (per-channel/group) quantization preserves accuracy but prevents direct low-bit matrix multiplication on TensorCores. The problem: enable accurate low-bit per-tensor quantization so matrix multiplies run at INT4/INT8 speeds without large accuracy loss.
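
To make the per-tensor difficulty concrete, the following is a minimal sketch (not the paper's code) of symmetric per-tensor quantization applied to an activation matrix with and without one outlier channel. The data, the 50× outlier factor, and the helper names are all hypothetical; the point is only that a single shared scale is dictated by the largest value, so the ordinary channels lose most of their resolution at INT4.

```python
import numpy as np

def quantize_per_tensor(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for INT4, 127 for INT8
    scale = np.abs(x).max() / qmax             # the largest value sets the scale
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                           # dequantized, for error measurement

def rel_error_on_normal_channels(x: np.ndarray, bits: int, outlier_idx: int) -> float:
    """Relative quantization error measured only on the non-outlier channels."""
    xq = quantize_per_tensor(x, bits)
    mask = np.ones(x.shape[1], dtype=bool)
    mask[outlier_idx] = False
    return float(np.linalg.norm(xq[:, mask] - x[:, mask]) / np.linalg.norm(x[:, mask]))

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(64, 16))     # hypothetical activations
acts_outlier = acts.copy()
acts_outlier[:, 3] *= 50.0                     # hypothetical outlier channel

err_clean = rel_error_on_normal_channels(acts, bits=4, outlier_idx=3)
err_outlier = rel_error_on_normal_channels(acts_outlier, bits=4, outlier_idx=3)
print(f"INT4 error without outlier: {err_clean:.3f}, with outlier: {err_outlier:.3f}")
```

With the outlier present, the shared scale becomes so coarse that the ordinary channels round to zero, which is the situation per-channel schemes avoid and flattening is meant to remove.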

Main Contribution

Introduce FlattenQuant: detect channels with large values, expand those channels and repeat weight channels so tensors become flatter and per-tensor quantizable.

Show a quantization pipeline that mixes INT4 and INT8 per-tensor quantization, plus channel smoothing and outlier suppression, to preserve accuracy.

Demonstrate practical gains: ~48% of linear layers in OPT can run in INT4, giving up to 2× speedup and up to 2.3× GPU memory reduction in compute-bound inference.
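
The flattening idea in the first contribution can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the summary does not say exactly how values are split or how the threshold is chosen, so the clip-and-carry-the-residual split and the `threshold` parameter below are hypothetical. What the sketch does show is the key property: expanding an outlier activation channel and repeating the matching weight row leaves the matrix product unchanged while capping the activation range.

```python
import numpy as np

def flatten_channels(x: np.ndarray, w: np.ndarray, threshold: float):
    """x: (tokens, channels) activations, w: (channels, out) weights.

    Any channel whose max |x| exceeds `threshold` is split into k pieces,
    each bounded by `threshold`; the matching weight row is repeated k times.
    """
    col_max = np.abs(x).max(axis=0)
    x_cols, w_rows = [], []
    for j in range(x.shape[1]):
        k = max(1, int(np.ceil(col_max[j] / threshold)))
        remaining = x[:, j].copy()
        for _ in range(k):
            piece = np.clip(remaining, -threshold, threshold)
            x_cols.append(piece)               # bounded slice of the channel
            w_rows.append(w[j])                # repeated weight row
            remaining = remaining - piece      # carry the residual forward
    return np.stack(x_cols, axis=1), np.stack(w_rows, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))
x[:, 2] *= 20.0                                # hypothetical outlier channel
w = rng.normal(size=(6, 5))

threshold = 4.0                                # hypothetical flattening threshold
x_flat, w_flat = flatten_channels(x, w, threshold)
print(x.shape, "->", x_flat.shape)
print("product preserved:", np.allclose(x_flat @ w_flat, x @ w))
```

Because the flattened activation tensor has a bounded range, a single per-tensor scale no longer wastes resolution on one channel; the cost is a larger inner dimension for the GEMM.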

Key Findings

FlattenQuant can convert roughly half of transformer linear layers to INT4 with small accuracy loss.

Numbers: 48.29% INT4 layers on OPT-30B (Table 4)

Inference speedup reaches about 2× versus FP16 in compute-bound settings.

Numbers: up to 2× speedup (Abstract, Figure 4)

GPU memory use drops significantly versus both FP16 and SmoothQuant.

Numbers: up to 2.3× memory reduction (Abstract); FlattenQuant-O3 26.1–60.2 GB vs FP16 59.2–162.6 GB (Table 6)

Low-bit GEMM is much faster than FP16 GEMM and tensor flattening is cheap.

Numbers: GEMM FP16 3.12 ms vs INT4 1.57 ms; flatten 0.19 ms (Table 5)

Channel smoothing and suppressing outliers improve quantization accuracy.

Numbers: WikiText PPL drops from 12.16 to 11.68 for OPT-6.7B with smoothing (Tables 7 and 9)
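
As a quick back-of-the-envelope check on the Table 5 latencies quoted above, the arithmetic below compares the FP16 GEMM time with the INT4 GEMM time, with and without the flatten step. It assumes, purely for illustration, that the flatten step runs unfused and its time simply adds to the INT4 GEMM; the variable names are not from the paper.

```python
# Latencies quoted from Table 5 (milliseconds).
fp16_gemm_ms = 3.12
int4_gemm_ms = 1.57
flatten_ms = 0.19

# Speedup of the GEMM alone.
gemm_only_speedup = fp16_gemm_ms / int4_gemm_ms

# Assumption: flatten is a separate, unfused step whose time adds to the GEMM.
with_flatten_speedup = fp16_gemm_ms / (int4_gemm_ms + flatten_ms)
flatten_share = flatten_ms / (int4_gemm_ms + flatten_ms)

print(f"GEMM only: {gemm_only_speedup:.2f}x")
print(f"GEMM + flatten: {with_flatten_speedup:.2f}x")
print(f"flatten share of INT4 path: {flatten_share:.1%}")
```

Under that additive assumption the per-GEMM speedup is roughly 1.99× without flattening and roughly 1.77× with it, which suggests why fusing flatten into the GEMM kernel is worth considering.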

Results

INT4 layer coverage

Value: ≈48.29% of linear layers

Inference speed (compute-bound)

Value: up to 2× faster

Baseline: FP16

GPU memory reduction

Value: up to 2.3× lower

Baseline: FP16

GEMM latency

Value: FP16 3.12 ms → INT4 1.57 ms

Baseline: FP16 GEMM

Flatten operation overhead

Value: 0.19 ms

Baseline: GEMM times (Table 5)

Who Should Care

What To Try In 7 Days

Run a calibration pass and compute per-layer KL scores to test which layers tolerate INT4, following the paper’s γ threshold.

Implement channel smoothing and outlier suppression on a small model to measure PPL/accuracy change before full rollout.

Benchmark mixed INT4/INT8 per-tensor GEMM on your A100-like GPUs (or CUTLASS kernels) to measure real speedups.
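
For the first step above, a per-layer KL screen might look like the sketch below. The paper's exact KL formulation and the precise rule involving γ are not reproduced in this summary, so everything here is an assumption: the score is the KL divergence between histograms of the original and fake-quantized calibration activations, and a layer is assigned INT4 when its INT4 score is at most `gamma`. The layer names, calibration data, and γ value are hypothetical.

```python
import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def kl_score(x: np.ndarray, bits: int, n_bins: int = 128) -> float:
    """KL(P || Q) between histograms of original and quantized values."""
    m = np.abs(x).max()
    edges = np.linspace(-m, m, n_bins + 1)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(fake_quant(x, bits), bins=edges)
    eps = 1e-8                                  # avoids log(0) on empty bins
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def choose_bits(x: np.ndarray, gamma: float) -> int:
    """Assumed rule: use INT4 if its KL score is within gamma, else INT8."""
    return 4 if kl_score(x, 4) <= gamma else 8

rng = np.random.default_rng(0)
calib = {                                       # hypothetical calibration activations
    "fc1": rng.normal(size=4096),
    "fc2": rng.standard_t(df=3, size=4096),
}
gamma = 5.0                                     # hypothetical threshold
plan = {name: choose_bits(a, gamma) for name, a in calib.items()}
print(plan)
```

Under this assumed rule, raising `gamma` admits more layers to INT4, which matches the direction of the γ failure mode noted later in this summary.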

Optimization Features

Infra Optimization

  • implemented with CUTLASS kernels on A100 GPUs
  • requires TensorCores that support INT4

Model Optimization

  • per-tensor INT4/INT8 quantization
  • channel smoothing (move scales between activation and weight)
  • flattening outlier channels and repeating weight channels
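
The channel smoothing bullet above (moving scales between activation and weight) can be illustrated with a SmoothQuant-style rescaling. The summary does not give the paper's exact formula, so the per-channel factor and the `alpha` parameter below are assumptions borrowed from the SmoothQuant formulation; the sketch only demonstrates that dividing an activation channel by a factor and multiplying the matching weight row by the same factor preserves the layer output.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """x: (tokens, in) activations, w: (in, out) weights.

    Per input channel j: s_j = max|x_j|^alpha / max|w_j|^(1 - alpha).
    Activations are divided by s, weight rows multiplied by s.
    """
    act_max = np.abs(x).max(axis=0)
    w_max = np.abs(w).max(axis=1)
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return x / s, w * s[:, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
x[:, 1] *= 30.0                                 # hypothetical outlier channel
w = rng.normal(size=(8, 4))

x_s, w_s = smooth(x, w, alpha=0.5)
print("activation max before/after:", np.abs(x).max(), np.abs(x_s).max())
print("output preserved:", np.allclose(x_s @ w_s, x @ w))
```

Smoothing shrinks the activation outliers at the expense of a wider weight range, so it complements flattening rather than replacing it.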

System Optimization

  • operator fusion advised (flatten + repeat + GEMM in one kernel)

Inference Optimization

  • use INT4/INT8 GEMM on TensorCores
  • select per-layer precision via KL-divergence threshold γ
  • pad flattened channels to multiples of 32 for block GEMM
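
The padding rule in the last bullet is simple integer arithmetic. The sketch below rounds a flattened channel count up to the next multiple of 32 and reports the resulting growth relative to the original width. The channel counts are hypothetical, and defining the flatten ratio as relative growth in channel count is an assumption, since the summary does not define it.

```python
def pad_to_multiple(channels: int, multiple: int = 32) -> int:
    """Round a channel count up to the next multiple (ceiling division)."""
    return -(-channels // multiple) * multiple

original_channels = 4096        # hypothetical layer width
flattened_channels = 4300       # hypothetical count after flattening outlier channels

padded_channels = pad_to_multiple(flattened_channels)
# Assumed definition: extra channels relative to the original width.
flatten_ratio = (padded_channels - original_channels) / original_channels

print(padded_channels)                  # 4320
print(f"{flatten_ratio:.2%}")           # 5.47%
```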

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Benefits mostly in compute-bound scenarios (large batches or long sequences).
  • Requires hardware with INT4 TensorCore support and custom low-bit GEMM kernels.
  • Does not quantize key-value cache to 8/4 bits in experiments; KV cache still often needs higher precision.
  • Flattening increases channel count (padding), so aggressive flattening can raise memory use.
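
Whether a workload falls in the compute-bound regime named in the first limitation can be estimated with a rough roofline check. The sketch below is not from the paper: it compares the arithmetic intensity of a single FP16 linear layer with the ratio of peak compute to peak memory bandwidth. The hardware figures are assumed nominal A100 values and should be replaced with the numbers from your own GPU's datasheet.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for y = x @ W with x: (tokens, d_in), W: (d_in, d_out)."""
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / bytes_moved

def is_compute_bound(tokens: int, d_in: int, d_out: int,
                     peak_flops: float, peak_bandwidth: float) -> bool:
    """True when the layer's intensity exceeds the hardware's FLOP/byte ridge point."""
    return arithmetic_intensity(tokens, d_in, d_out) > peak_flops / peak_bandwidth

# Assumed nominal figures, for illustration only.
PEAK_FLOPS = 312e12        # FP16 TensorCore FLOP/s
PEAK_BANDWIDTH = 1.555e12  # bytes/s

for tokens in (1, 64, 4096):   # tokens = batch size x sequence length in flight
    bound = is_compute_bound(tokens, 4096, 4096, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(tokens, "compute-bound" if bound else "memory-bound")
```

With these assumed figures a single-token decode step is firmly memory-bound, while a few thousand tokens in flight push the same layer into the compute-bound regime where low-bit GEMM pays off.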

When Not To Use

  • When workloads are memory-bound and FP16 already fits and performs well.
  • On hardware without INT4/INT8 TensorCore support.
  • When absolute highest accuracy is required and per-channel quantization is mandated.

Failure Modes

  • Over-aggressive INT4 allocation (γ too large) causes noticeable accuracy drop (Table 10).
  • Choosing β too small forces >30% flatten ratio and raises GPU memory without commensurate accuracy gain (Table 8).
  • Poor outlier suppression can bias truncation threshold and hurt quantization scaling.

Core Entities

Models

  • OPT-125M
  • OPT-1.3B
  • OPT-6.7B
  • OPT-13B
  • OPT-30B
  • OPT-66B

Metrics

  • Accuracy
  • perplexity
  • latency (ms)
  • GPU memory (GB)
  • INT4 layer ratio (%)
  • flatten ratio (%)

Datasets

  • OpenBookQA
  • LAMBADA (OpenAI)
  • PIQA
  • HellaSwag
  • WinoGrande
  • WikiText2
  • lm-eval-harness

Benchmarks

  • Accuracy
  • language modeling perplexity
  • inference latency