Train quantized LLMs without original data and quantize KV cache to reach practical 4-bit weights

Overview

Decision Snapshot: Ready For Pilot

The paper gives repeated quantitative comparisons across multiple model sizes and benchmarks showing QAT beats PTQ at sub-8-bit settings and reports KV memory savings; hardware support for 4-bit activations is still missing.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra


Why It Matters For Business

LLM-QAT lets you reduce model memory and improve throughput by quantizing weights and KV cache to 4 bits while keeping quality close to full precision, which can lower hosting cost and enable longer contexts on the same hardware.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

LLM-QAT is a practical recipe to fine-tune large language models so their weights, activations and key-value (KV) cache can run at low bit-widths. It generates training data from the original model (data-free distillation) and uses quantization-aware training (QAT) with symmetric MinMax quantizers, per-channel weight and per-token activation quantization, and logits distillation. The method preserves downstream accuracy and perplexity much better than several post-training quantization (PTQ) baselines when bits ≤ 8, enabling useful 4-bit weight + 8-bit activation configurations and KV cache compression for LLaMA-7B/13B/30B on standard benchmarks. (See Tables 1–3, 6.)

Problem Statement

Post-training quantization (PTQ) methods break down below 8 bits and do not quantize the KV cache. Quantization-aware training (QAT) could help but needs large, representative training data. The paper asks: can we do QAT for LLMs without access to the original pretraining data, and compress the KV cache too?

Main Contribution

Data-free distillation: generate training sequences from the pre-trained model and use teacher logits as soft labels for QAT.

Apply QAT to LLMs including simultaneous quantization of weights, activations and the KV cache.
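The logits-distillation objective named above can be sketched as a cross-entropy between the teacher's softmax output (the soft labels) and the student's predictions. This is a minimal pure-Python illustration, not the authors' training code; the function names and the toy logits are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def logit_distillation_loss(teacher_logits, student_logits):
    """Cross-entropy between teacher soft labels and student predictions,
    averaged over token positions. Each argument is a list of per-token
    logit lists (hypothetical layout chosen for this sketch)."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_teacher = softmax(t_row)
        log_p_student = log_softmax(s_row)
        total += -sum(p * lq for p, lq in zip(p_teacher, log_p_student))
    return total / len(teacher_logits)

# Toy example: two token positions, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
matched = logit_distillation_loss(teacher, teacher)
perturbed = logit_distillation_loss(teacher, [[0.0, 0.0, 0.0], [3.0, 0.2, 0.1]])
print(matched, perturbed)
```

The loss is smallest when the student reproduces the teacher's distribution, which is why the quantized student is pulled toward the full-precision outputs rather than toward one-hot labels.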

Key Findings

Data-free generated samples fine-tuned with logits distillation outperform real-data finetuning for zero-shot tasks.

Numbers: Generated data (hybrid sampling) avg zero-shot 63.1 vs C4 61.5 (Table 3)

Practical Use: If you lack pretraining data, sample sequences from the teacher model (hybrid sampling: top-1 for the first tokens, then stochastic sampling) and use logits distillation for better generalization than using C4 or Wiki subsets.

Evidence Ref: Table 3
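As a rough illustration of hybrid sampling, the sketch below picks the first few tokens greedily (top-1) and samples the rest from the softmax distribution, keeping the teacher logits at every step for later distillation. `toy_next_logits` is a hypothetical stand-in for a real teacher forward pass, and the number of greedy steps (`n_greedy`) is an assumption, since this summary does not state it.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_next_logits(tokens, vocab_size=8):
    """Hypothetical teacher: deterministic pseudo-random logits per prefix."""
    seed = sum((i + 1) * t for i, t in enumerate(tokens))
    rng = random.Random(seed)
    return [rng.uniform(-2.0, 2.0) for _ in range(vocab_size)]

def generate_hybrid(next_logits, start_token, length, n_greedy, rng):
    """Generate `length` tokens: top-1 for the first `n_greedy` steps,
    stochastic sampling afterwards. Returns the tokens and the
    (token, teacher_logits) records saved for distillation."""
    tokens = [start_token]
    records = []
    for step in range(length):
        logits = next_logits(tokens)
        if step < n_greedy:
            # Deterministic top-1 choice.
            nxt = max(range(len(logits)), key=logits.__getitem__)
        else:
            # Sample from the teacher's predicted distribution.
            nxt = rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]
        records.append((nxt, logits))
        tokens.append(nxt)
    return tokens, records

tokens, records = generate_hybrid(toy_next_logits, 0, 12, 3, random.Random(0))
print(tokens)
```

With a real model, `next_logits` would wrap the full-precision model's forward pass, and the saved logits would serve as soft labels during QAT.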

LLM-QAT preserves output distribution and downstream accuracy much better than PTQ at sub-8-bit settings.

Numbers: 30B, 8-8-4 (weight-activation-KV cache bits): LLM-QAT avg zero-shot 69.7 vs SmoothQuant 50.7 (Table 1)

Practical Use: For production use when bits < 8, prefer LLM-QAT over PTQ to avoid large accuracy drops, especially for large models.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	30B 8-8-4 LLM-QAT avg 69.7	30B SmoothQuant 8-8-4 avg 50.7	+19.0	Zero-shot common sense reasoning (Table 1)	Table 1 rows and text citing 8-8-4 comparison	Table 1
Accuracy	7B 4-8-4 LLM-QAT avg 60.7	7B SmoothQuant 4-8-4 avg 43.2	+17.5	Zero-shot common sense reasoning (Table 1)	Table 1 rows 2-4	Table 1

What To Try In 7 Days

Generate 100k sequences from your FP model using hybrid top-1 then stochastic sampling and save teacher logits.

Run small QAT on a copy of your model: per-channel MinMax weights, per-token activations, logit distillation.

Quantize the KV cache per-token and measure memory/throughput for your target sequence lengths.
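The quantizer used in the last two steps can be sketched as a symmetric MinMax "fake quantization": one scale per row, where a row is an output channel for weights or a token for activations and KV entries. This is a forward-pass-only illustration in pure Python; a real QAT setup would also need a gradient path through the rounding step, which is not shown. Function names and the example matrix are assumptions.

```python
def minmax_fake_quant(values, n_bits):
    """Symmetric MinMax quantization of one row.
    scale = max|x| / (2^(n_bits - 1) - 1); values are rounded to the
    nearest integer level and mapped back to floats."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [round(v / scale) * scale for v in values]

def quantize_rows(matrix, n_bits):
    """Apply one scale per row (per-channel for weights, per-token for
    activations or KV cache entries)."""
    return [minmax_fake_quant(row, n_bits) for row in matrix]

# Toy 2x3 matrix: the second row contains an outlier.
weights = [[0.5, -1.0, 0.25], [8.0, 0.1, -0.1]]
quantized = quantize_rows(weights, 4)
for original, q in zip(weights, quantized):
    print(original, "->", q)
```

Because MinMax keeps the largest value representable, the outlier in the second row survives while the small entries in that row collapse to zero; giving each row its own scale confines this effect to the row that contains the outlier.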

Optimization Features

Token Efficiency

per-token activation and KV quantization to reduce runtime memory

Infra Optimization

reduces KV cache memory (e.g., 4× at 1k tokens for 30B model)
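A back-of-the-envelope check of the KV cache saving: the cache stores one key and one value vector per layer per token, so its size scales linearly with bit-width. The model shape below (60 layers, hidden size 6656) is an assumption about LLaMA-30B that this summary does not state, and the estimate ignores the small overhead of storing per-token scales.

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, bits, batch_size=1):
    """Keys and values: 2 tensors per layer, each of shape
    [batch_size, seq_len, hidden_size], stored at `bits` bits per element."""
    elements = 2 * n_layers * hidden_size * seq_len * batch_size
    return elements * bits / 8

# Assumed LLaMA-30B shape; adjust for your own model.
fp16 = kv_cache_bytes(n_layers=60, hidden_size=6656, seq_len=1024, bits=16)
int4 = kv_cache_bytes(n_layers=60, hidden_size=6656, seq_len=1024, bits=4)
print(f"FP16: {fp16 / 2**30:.2f} GiB")
print(f"4-bit: {int4 / 2**30:.2f} GiB")
print(f"ratio: {fp16 / int4:.0f}x")
```

The ratio depends only on the bit-widths, so the same factor applies at any sequence length; the absolute savings grow with context length and batch size.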

Model Optimization

weight quantization to 4-bit, per-channel weight quantization, symmetric MinMax quantizers

System Optimization

compatible with SmoothQuant weight-activation rescale in some low-bit cases

Training Optimization

quantization-aware training (QAT), logits-based knowledge distillation, data-free generation of training samples (~100k)

Inference Optimization

KV cache quantization (per-token), activation quantization to 6–8 bits

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

4-bit activation quantization is not solved; it failed in the authors' experimental settings.

No end-to-end hardware implementation is included, and 4-bit inference hardware is not generally available.

When Not To Use

You need 4-bit activation inference today but lack specialized hardware.

You cannot run even modest QAT (authors used an 8-GPU node).

Failure Modes

Clipping outliers (clipping-based quantizers) causes extremely high perplexity and poor recovery.

Label-only distillation or hidden/attention distillation can underperform or harm accuracy.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-30B

Metrics

Accuracy, perplexity, KV cache memory (GB)

Datasets

C4, WikiText2, WikiText103, Wiki2, BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OBQA, MMLU, TriviaQA

Benchmarks

Zero-shot common sense reasoning, Few-shot MMLU, TriviaQA, Perplexity (WikiText2, C4)

