Train quantized LLMs without original data and quantize KV cache to reach practical 4-bit weights

May 29, 2023 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper gives repeated quantitative comparisons across multiple model sizes and benchmarks showing QAT beats PTQ at sub-8-bit settings and reports KV memory savings; hardware support for 4-bit activations is still missing.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

Links

Abstract / PDF

Why It Matters For Business

LLM-QAT lets you reduce model memory and improve throughput by quantizing weights and KV cache to 4 bits while keeping quality close to full precision, which can lower hosting cost and enable longer contexts on the same hardware.

Who Should Care

Summary TLDR

LLM-QAT is a practical recipe to fine-tune large language models so their weights, activations and key-value (KV) cache can run at low bit-widths. It generates training data from the original model (data-free distillation) and uses quantization-aware training (QAT) with symmetric MinMax quantizers, per-channel weight and per-token activation quantization, and logits distillation. The method preserves downstream accuracy and perplexity much better than several post-training quantization (PTQ) baselines when bits ≤ 8, enabling useful 4-bit weight + 8-bit activation configurations and KV cache compression for LLaMA-7B/13B/30B on standard benchmarks. (See Tables 1–3, 6.)
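The quantizer named above can be illustrated with a minimal, dependency-free sketch. It assumes the usual symmetric MinMax form (scale = largest absolute value in the row divided by the largest positive integer level) and applies it row by row, where a row stands for one weight output channel or one token's activations. The function names and toy values are illustrative and are not taken from the paper's code.

```python
def minmax_fake_quant(row, bits):
    """Symmetric MinMax fake-quantization of one row of floats.

    The row is one output channel of a weight matrix (per-channel) or the
    vector of one token (per-token). Returns de-quantized floats.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(x) for x in row)
    if max_abs == 0.0:
        return list(row)                    # all-zero row: nothing to scale
    scale = max_abs / qmax                  # MinMax: the largest value maps to qmax
    # Python's round() resolves exact ties to the nearest even integer.
    return [round(x / scale) * scale for x in row]


def quantize_rows(matrix, bits):
    """Quantize every row with its own scale."""
    return [minmax_fake_quant(row, bits) for row in matrix]


# Toy values (assumed for illustration), one row per output channel.
weights = [[7.0, -3.4, 0.2, 1.6],
           [0.07, -0.02, 0.01, 0.035]]
weights_q4 = quantize_rows(weights, bits=4)
```

Because each row gets its own scale, the small-magnitude second row keeps its structure; with a single scale shared across both rows it would round to all zeros.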

Problem Statement

Post-training quantization (PTQ) methods break down below 8 bits and do not quantize the KV cache. Quantization-aware training (QAT) could help, but it needs large, representative training data. The paper asks: can QAT be done for LLMs without access to the original pretraining data, while also compressing the KV cache?

Main Contribution

Data-free distillation: generate training sequences from the pre-trained model and use teacher logits as soft labels for QAT.

Apply QAT to LLMs including simultaneous quantization of weights, activations and the KV cache.
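A minimal sketch of the soft-label objective behind logits distillation, assuming the standard form: the cross-entropy between the teacher's softmax distribution and the student's softmax distribution, averaged over token positions. The function names are illustrative, and real training would use a tensor library rather than Python lists.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def logit_distillation_loss(teacher_logits, student_logits):
    """Mean over positions of the cross-entropy H(teacher, student).

    Both arguments are lists of per-position logit lists of equal shape.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # Teacher probabilities act as soft labels for the student.
        total += -sum(math.exp(tl) * sl for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)
```

When the student matches the teacher exactly, the loss equals the teacher's entropy, which is its minimum; any mismatch raises it.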

Key Findings

Fine-tuning on data-free generated samples with logits distillation outperforms fine-tuning on real data for zero-shot tasks.

Numbers: Generated-data (hybrid sampling) avg zero-shot 63.1 vs C4 61.5 (Table 3)

Practical Use: If you lack pretraining data, sample sequences from the teacher model (hybrid sampling: top-1 first, then stochastic sampling) and use logits distillation for better generalization than using C4 or Wiki subsets.

Evidence Ref: Table 3
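The hybrid sampling idea can be sketched as follows: pick the top-1 token for the first few steps, then sample from the model's distribution for the rest. Everything named here is an assumption for illustration: `next_token_probs` is a hypothetical callable standing in for the teacher model, and `n_greedy`, `start_token`, and `seed` are placeholder parameters whose values this summary does not specify.

```python
import random


def hybrid_sample(next_token_probs, start_token, length, n_greedy=3, seed=0):
    """Generate token ids: top-1 for the first n_greedy steps, then stochastic.

    next_token_probs(tokens) is a hypothetical stand-in for the teacher model;
    it must return a list of probabilities over the vocabulary.
    """
    rng = random.Random(seed)
    tokens = [start_token]
    for step in range(length):
        probs = next_token_probs(tokens)
        if step < n_greedy:
            # Deterministic phase: take the most likely token.
            nxt = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Stochastic phase: sample in proportion to the probabilities.
            nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(nxt)
    return tokens
```

In a real pipeline the teacher's logits at each step would be saved alongside the tokens so they can serve as soft labels during QAT.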

LLM-QAT preserves output distribution and downstream accuracy much better than PTQ at sub-8-bit settings.

Numbers: 30B, 8-8-4 (weight-activation-KV cache bits): LLM-QAT avg zero-shot 69.7 vs SmoothQuant 50.7 (Table 1)

Practical Use: For production use when any bit-width drops below 8, prefer LLM-QAT over PTQ to avoid large accuracy drops, especially for large models.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 30B 8-8-4 LLM-QAT avg 69.7 | 30B SmoothQuant 8-8-4 avg 50.7 | +19.0 | Zero-shot common sense reasoning (Table 1) | Table 1 rows and text citing 8-8-4 comparison | Table 1
Accuracy | 7B 4-8-4 LLM-QAT avg 60.7 | 7B SmoothQuant 4-8-4 avg 43.2 | +17.5 | Zero-shot common sense reasoning (Table 1) | Table 1 rows 2-4 | Table 1

What To Try In 7 Days

Generate 100k sequences from your FP model using hybrid top-1 then stochastic sampling and save teacher logits.

Run small QAT on a copy of your model: per-channel MinMax weights, per-token activations, logit distillation.

Quantize the KV cache per-token and measure memory/throughput for your target sequence lengths.
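For the memory measurement in the last step, a back-of-the-envelope estimate is a useful sanity check before profiling. The sketch below assumes standard multi-head attention (keys and values each store `hidden_size` numbers per token per layer) and ignores the small overhead of storing per-token scales. The 60-layer, 6656-wide shape is the commonly reported LLaMA-30B configuration and is an assumption here; check it against your own checkpoint.

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, bits, batch_size=1):
    """Bytes needed for keys and values across all layers.

    Ignores the storage for quantization scales (an assumption of this sketch).
    """
    elements = 2 * n_layers * hidden_size * seq_len * batch_size  # 2 = keys + values
    return elements * bits / 8


# Assumed LLaMA-30B shape; verify against your model config.
N_LAYERS, HIDDEN = 60, 6656

fp16_bytes = kv_cache_bytes(N_LAYERS, HIDDEN, seq_len=1024, bits=16)
int4_bytes = kv_cache_bytes(N_LAYERS, HIDDEN, seq_len=1024, bits=4)
savings = fp16_bytes / int4_bytes   # ratio of 16-bit to 4-bit cache size
```

Under these assumptions the 16-bit cache at 1,024 tokens is about 1.52 GiB and the 4-bit cache is one quarter of that, in line with the 4× figure quoted for the 30B model.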

Optimization Features

Token Efficiency: per-token activation and KV quantization to reduce runtime memory

Infra Optimization: reduces KV cache memory (e.g., 4× at 1k tokens for 30B model)

Model Optimization: weight quantization to 4-bit; per-channel weight quantization; symmetric MinMax quantizers

System Optimization: compatible with SmoothQuant weight-activation rescale in some low-bit cases

Training Optimization: quantization-aware training (QAT); logits-based knowledge distillation; data-free generation of training samples (~100k)

Inference Optimization: KV cache quantization (per-token); activation quantization to 6–8 bits

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

4-bit activation quantization not solved; experiments found it failed in their settings.

No end-to-end hardware implementation is included; hardware support for 4-bit activation inference is not generally available.

When Not To Use

You need 4-bit activation inference today but lack specialized hardware.

You cannot run even modest QAT (authors used an 8-GPU node).

Failure Modes

Clipping outliers (clipping-based quantizers) causes extremely high perplexity and poor recovery.

Label-only distillation or hidden/attention distillation can underperform or harm accuracy.
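The first failure mode can be illustrated with a toy comparison of a MinMax range against a clipped range on a row that contains one large outlier. This is only a sketch of the mechanism: the values and the clip threshold of 4.0 are made up for illustration, and the function name is hypothetical.

```python
def quantize_symmetric(row, bits, max_abs):
    """Symmetric fake-quantization with an explicit range [-max_abs, max_abs].

    Values outside the range saturate at the largest integer level.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in row]


row = [0.5, -1.0, 2.0, 64.0]        # toy activations with one large outlier

# MinMax: the range covers the outlier, so it survives quantization.
minmax = quantize_symmetric(row, bits=8, max_abs=max(abs(x) for x in row))

# Clipping: a tighter, hypothetical threshold saturates the outlier.
clipped = quantize_symmetric(row, bits=8, max_abs=4.0)

outlier_error_minmax = abs(minmax[-1] - row[-1])
outlier_error_clipped = abs(clipped[-1] - row[-1])
```

Clipping represents the small values more finely but collapses the outlier from 64 to roughly 4, while MinMax keeps the outlier at the cost of coarser steps for the small values.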

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-30B

Metrics

Accuracy, perplexity, KV cache memory (GB)

Datasets

C4, WikiText2, WikiText103, Wiki2, BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OBQA, MMLU, TriviaQA

Benchmarks

Zero-shot common sense reasoning, Few-shot MMLU, TriviaQA, Perplexity (WikiText2, C4)