Train quantized LLMs without original data and quantize KV cache to reach practical 4-bit weights

May 29, 2023 · 7 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

15

Authors

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

Why It Matters For Business

LLM-QAT lets you reduce model memory and improve throughput by quantizing weights and KV cache to 4 bits while keeping quality close to full precision, which can lower hosting cost and enable longer contexts on the same hardware.

Summary TLDR

LLM-QAT is a practical recipe for fine-tuning large language models so that their weights, activations, and key-value (KV) cache can run at low bit-widths. It generates training data from the original model (data-free distillation) and applies quantization-aware training (QAT) with symmetric MinMax quantizers, per-channel weight quantization, per-token activation quantization, and logits distillation. The method preserves downstream accuracy and perplexity much better than several post-training quantization (PTQ) baselines at 8 bits and below, enabling useful 4-bit weight + 8-bit activation configurations and KV cache compression for LLaMA-7B/13B/30B on standard benchmarks. (See Tables 1–3, 6.)
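As a rough illustration of the quantizers named above, the following is a minimal pure-Python sketch of symmetric MinMax quantize-dequantize ("fake quantization"), applied once per weight row and once per activation row. The function names, the row-equals-channel layout, and the list-of-lists tensors are assumptions made for this example, not the authors' code.

```python
def minmax_fake_quant(values, bits):
    """Symmetric MinMax quantize-dequantize for one group of floats."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)               # all-zero group: nothing to quantize
    scale = max_abs / qmax                # range set by the largest magnitude, no clipping
    out = []
    for v in values:
        q = max(-qmax - 1, min(qmax, round(v / scale)))  # integer code
        out.append(q * scale)             # map back to the float grid
    return out


def quantize_weight_per_channel(weight, bits=4):
    # Assumption: each row of `weight` is one output channel with its own scale.
    return [minmax_fake_quant(row, bits) for row in weight]


def quantize_activation_per_token(acts, bits=8):
    # Assumption: `acts` is [tokens][hidden]; each token gets its own scale.
    return [minmax_fake_quant(row, bits) for row in acts]


weights = [[0.7, -0.35, 0.1], [7.0, -3.5, 1.0]]
print(quantize_weight_per_channel(weights, bits=4))
```

Because each row has its own scale, the two rows above keep the same relative precision even though their magnitudes differ by 10×. In real QAT the rounding step would also need a gradient approximation such as a straight-through estimator, which this sketch omits.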

Problem Statement

Post-training quantization methods break down below 8 bits and do not quantize the KV cache. Quantization-aware training (QAT) could help but needs large, representative training data. The paper asks: can we do QAT for LLMs without access to the original pretraining data, and compress the KV cache too?

Main Contribution

Data-free distillation: generate training sequences from the pre-trained model and use teacher logits as soft labels for QAT.

Apply QAT to LLMs including simultaneous quantization of weights, activations and the KV cache.

Show LLaMA-7B/13B/30B can be quantized to 4-bit weights + 8-bit activations with much smaller quality loss than PTQ, preserving perplexity and downstream task performance.
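The logits-distillation objective described above can be sketched as a cross-entropy between the teacher's softmax output (the soft labels) and the quantized student's predictions. This is a pure-Python illustration under assumed names and list-based inputs, not the paper's training code.

```python
import math


def softmax(logits):
    m = max(logits)                                   # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def logit_distillation_loss(teacher_logits, student_logits):
    """Mean per-token cross-entropy between teacher soft labels and student."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)                            # teacher distribution
        log_q = log_softmax(s_row)                    # student log-probabilities
        total -= sum(pi * lqi for pi, lqi in zip(p, log_q))
    return total / len(teacher_logits)


teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]        # [tokens][vocab], toy values
student = [[1.5, 0.7, -0.8], [0.2, 0.9, -0.1]]
print(logit_distillation_loss(teacher, student))
```

The loss is smallest when the student reproduces the teacher's distribution exactly, which is why no ground-truth labels are needed for the generated sequences.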

Key Findings

Fine-tuning on data-free generated samples with logits distillation outperforms fine-tuning on real data for zero-shot tasks.

Numbers: Generated-data (hybrid sampling) avg zero-shot 63.1 vs C4 61.5 (Table 3)

LLM-QAT preserves output distribution and downstream accuracy much better than PTQ at sub-8-bit settings.

Numbers: 30B, 8-8-4 (weight-activation-KV cache bits): LLM-QAT avg zero-shot 69.7 vs SmoothQuant 50.7 (Table 1)

Perplexity on held-out text is close to full precision after QAT even at 4-bit weights.

Numbers: LLaMA-7B, 4-8-4 perplexity C4=8.6 vs FP=7.2 (Table 2)

Results

Accuracy

Value: 30B 8-8-4 LLM-QAT avg 69.7

Baseline: 30B SmoothQuant 8-8-4 avg 50.7

Accuracy

Value: 7B 4-8-4 LLM-QAT avg 60.7

Baseline: 7B SmoothQuant 4-8-4 avg 43.2

Perplexity (C4)

Value: 7B 4-8-4 LLM-QAT 8.6

Baseline: 7B RTN 4-8-4 55.1, SmoothQuant 81.1

KV cache memory

Value: LLaMA-30B 4-bit KV cache at 1k tokens = 0.19 GB

Baseline: 16-bit KV cache at 1k tokens = 0.76 GB
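The KV cache figures follow from simple proportional scaling: cache size grows linearly with both the bit-width and the number of cached tokens. The sketch below starts from the reported 16-bit figure and rescales it; the function name is hypothetical, "1k" is taken as 1000 tokens, and the small overhead of storing per-token scales is ignored.

```python
def kv_cache_gb(baseline_gb_16bit, kv_bits, tokens, baseline_tokens=1000):
    """Rescale a reported 16-bit KV cache size to another bit-width and length."""
    bit_ratio = kv_bits / 16                 # 4-bit cache is a quarter of 16-bit
    length_ratio = tokens / baseline_tokens  # cache grows linearly with tokens
    return baseline_gb_16bit * bit_ratio * length_ratio


# LLaMA-30B, 1k tokens: 0.76 GB at 16 bits
print(round(kv_cache_gb(0.76, kv_bits=4, tokens=1000), 2))   # 0.19
# Same model, extrapolated to a longer context (illustrative only)
print(round(kv_cache_gb(0.76, kv_bits=4, tokens=4000), 2))   # 0.76
```

Under this scaling, a 4-bit cache holds roughly four times as many tokens as a 16-bit cache in the same memory budget.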

Who Should Care

What To Try In 7 Days

Generate 100k sequences from your FP model using hybrid sampling (top-1 for the first few tokens, then stochastic sampling) and save the teacher logits.

Run small QAT on a copy of your model: per-channel MinMax weights, per-token activations, logit distillation.

Quantize the KV cache per-token and measure memory/throughput for your target sequence lengths.
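The first step above, hybrid generation, can be sketched as follows: take the top-1 token for the first few positions, then sample from the softmax distribution, keeping the teacher logits at every step. The `next_logits` callback stands in for your full-precision model, and `greedy_steps=3`, the toy model, and all function names are assumptions chosen for illustration.

```python
import math
import random


def sample_next_token(logits, step, greedy_steps, rng):
    """Top-1 for the first `greedy_steps` positions, softmax sampling after."""
    if step < greedy_steps:
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]      # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, start_token, length, greedy_steps=3, seed=0):
    """Return generated tokens plus the teacher logits for each position."""
    rng = random.Random(seed)
    tokens = [start_token]
    teacher_logits = []
    for step in range(length):
        logits = next_logits(tokens)                 # teacher forward pass
        teacher_logits.append(logits)                # saved as soft labels
        tokens.append(sample_next_token(logits, step, greedy_steps, rng))
    return tokens, teacher_logits


def toy_next_logits(tokens):
    # Hypothetical 5-token "model": logits depend only on the last token.
    return [float((tokens[-1] + i) % 5) for i in range(5)]


tokens, logits = generate(toy_next_logits, start_token=0, length=8)
print(tokens)
```

The saved `teacher_logits` are what a logits-distillation loss would consume during QAT, so no external dataset is required.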

Optimization Features

Token Efficiency

  • per-token activation and KV quantization to reduce runtime memory

Infra Optimization

  • reduces KV cache memory (e.g., 4× at 1k tokens for 30B model)

Model Optimization

  • weight quantization to 4-bit
  • per-channel weight quantization
  • symmetric MinMax quantizers

System Optimization

  • compatible with SmoothQuant weight-activation rescale in some low-bit cases

Training Optimization

  • quantization-aware training (QAT)
  • logits-based knowledge distillation
  • data-free generation of training samples (~100k)

Inference Optimization

  • KV cache quantization (per-token)
  • activation quantization to 6–8 bits

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • 4-bit activation quantization not solved; experiments found it failed in their settings.
  • No end-to-end hardware implementation included — 4-bit inference hardware is not generally available.
  • Evaluation limited to LLaMA family and standard NLP benchmarks; behavior on instruction-tuned or RL-tuned models not tested.

When Not To Use

  • You need 4-bit activation inference today but lack specialized hardware.
  • You cannot run even modest QAT (authors used an 8-GPU node).
  • You must preserve models trained with special instruction/RLHF stages without re-running stage-specific distillation.

Failure Modes

  • Clipping outliers (clipping-based quantizers) causes extremely high perplexity and poor recovery.
  • Label-only distillation or hidden/attention distillation can underperform or harm accuracy.
  • Mixing SmoothQuant with W4A8 may hurt accuracy according to ablations.
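To make the first failure mode concrete, here is a toy comparison of a MinMax range against a clipped range on a single activation vector containing one outlier. It only shows the numeric effect on the outlier and does not reproduce the perplexity results; the clip threshold of 1.0 and the helper name are hypothetical.

```python
def fake_quant(values, bits, max_abs):
    """Symmetric quantize-dequantize with a caller-chosen range limit."""
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax
    # Values beyond max_abs saturate at the largest integer code.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


acts = [0.1, -0.2, 0.05, 8.0]                        # last entry is an outlier

minmax = fake_quant(acts, bits=8, max_abs=max(abs(v) for v in acts))
clipped = fake_quant(acts, bits=8, max_abs=1.0)      # hypothetical clip threshold

outlier_err_minmax = abs(minmax[-1] - acts[-1])      # outlier is kept
outlier_err_clipped = abs(clipped[-1] - acts[-1])    # outlier saturates at 1.0

print("MinMax outlier error: ", outlier_err_minmax)
print("Clipped outlier error:", outlier_err_clipped)
```

Clipping gives the small values a finer grid but collapses the outlier to the threshold, which is the kind of distortion the clipping-based quantizers introduce.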

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-30B

Metrics

  • Accuracy
  • perplexity
  • KV cache memory (GB)

Datasets

  • C4
  • WikiText2
  • WikiText103
  • Wiki2
  • BoolQ
  • PIQA
  • SIQA
  • HellaSwag
  • WinoGrande
  • ARC
  • OBQA
  • MMLU
  • TriviaQA

Benchmarks

  • Zero-shot common sense reasoning
  • Few-shot MMLU
  • TriviaQA
  • Perplexity (WikiText2, C4)