Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.85
Citation Count
1
Why It Matters For Business
BitDistiller makes deploying 2–3 bit LLMs practical: it retains much of the original model's reasoning and code accuracy while cutting quantization time and GPU cost, which enables cheaper on-prem or edge inference.
Summary TLDR
BitDistiller combines quantization-aware training (QAT) with self-knowledge-distillation to make 3-bit and 2-bit versions of large language models much more usable. Key ingredients: asymmetric quantization with an initial asymmetric clipping step, and a Confidence-Aware KL divergence (CAKLD) that blends forward and reverse KL based on the teacher's token confidence. On LLaMA-2 and on domain models (WizardCoder, MetaMath), BitDistiller improves perplexity and reasoning/code accuracy over state-of-the-art PTQ and QAT baselines while cutting training cost dramatically (e.g., quantizing WizardCoder-7B in ~3 GPU hours on one A100). Code is provided.
Problem Statement
Ultra-low-bit (sub-4-bit) quantization severely hurts LLM accuracy. Post-training quantization (PTQ) often fails at 2–3 bits, and prior QAT methods need lots of data and GPU time. The practical gap: how to preserve weight fidelity and effectively train low-bit models with limited resources.
Main Contribution
BitDistiller: a practical QAT + self-distillation pipeline for sub-4-bit LLMs.
Asymmetric quantization plus a single-shot asymmetric clipping initialization to reduce weight outliers and preserve fidelity.
CAKLD: a Confidence-Aware KL objective that blends forward/reverse KL using teacher token confidence, improving convergence.
Demonstrated gains on general language and reasoning/code benchmarks at 3-bit and 2-bit, with far lower GPU hours than prior QAT.
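A minimal sketch of how a confidence-weighted blend of forward and reverse KL could be computed, written in plain Python over toy distributions. The function names and toy probabilities are hypothetical. The summary does not say which term the confidence coefficient multiplies, so the choice that gamma weights the reverse term is an assumption made for illustration. This is not the repository's API.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def teacher_confidence(token_probs):
    """Mean probability the teacher assigned to the tokens it produced."""
    return sum(token_probs) / len(token_probs)

def cakld(teacher, student, gamma):
    """Blend reverse and forward KL with a confidence coefficient gamma in [0, 1].

    Assumption: gamma weights the reverse (mode-seeking) term KL(student || teacher)
    and (1 - gamma) weights the forward (mode-covering) term KL(teacher || student).
    """
    reverse = kl(student, teacher)
    forward = kl(teacher, student)
    return gamma * reverse + (1.0 - gamma) * forward

# Toy next-token distributions over a 3-word vocabulary (hypothetical values).
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
gamma = teacher_confidence([0.9, 0.8, 0.7])  # hypothetical per-token teacher probabilities
loss = cakld(teacher, student, gamma)
```

Under this assumption, a confident teacher (gamma near 1) pushes the student toward the teacher's dominant mode, while an uncertain teacher (gamma near 0) pushes it to cover the whole teacher distribution.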
Key Findings
BitDistiller yields better language modeling and QA accuracy than prior PTQ and QAT on LLaMA-2-7B.
On reasoning and code, BitDistiller preserves much more accuracy at 2-bit than alternatives.
Asymmetric clipping at initialization strongly stabilizes sub-4-bit QAT and reduces perplexity.
CAKLD (confidence-weighted KL) converges faster and outperforms other distillation objectives in experiments.
BitDistiller is much cheaper to run than prior QAT pipelines in practice.
Results
LLaMA-2-7B PPL (3-bit g128)
LLaMA-2-7B MMLU (5-shot, 2-bit g128)
Accuracy
WizardCoder-7B HumanEval Pass@1 (2-bit g128)
Quantization time for WizardCoder-7B
Who Should Care
What To Try In 7 Days
Run the repo's 2-bit QAT recipe on a 7B model using the provided small calibration set and asymmetric clipping.
Replace your QAT loss with CAKLD and train on teacher-generated data (sampling temperature 0.7) to speed convergence.
Measure trade-offs: compare 4-bit baseline, 3-bit, and 2-bit BitDistiller outputs on a small reasoning workload.
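For the teacher-generated data step, the following sketch shows what a sampling temperature of 0.7 does to a next-token distribution. The logits and function names are hypothetical, and a real pipeline would use the teacher model's own generation routine rather than this toy sampler.

```python
import math
import random

def softmax_with_temperature(logits, temperature=0.7):
    """Softmax over logits divided by the temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical teacher logits for three candidate tokens
probs_t07 = softmax_with_temperature(logits, 0.7)
probs_t10 = softmax_with_temperature(logits, 1.0)
token = sample_token(logits, 0.7, random.Random(0))
```

A temperature below 1.0 sharpens the distribution, so samples stay closer to the teacher's preferred tokens while still varying between draws.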
Optimization Features
Token Efficiency
- Use of teacher-generated samples to expand distillation data cheaply
Infra Optimization
- Reported ability to quantize WizardCoder-7B in ~3 GPU hours on one A100
Model Optimization
- Asymmetric quantization (NF for >2-bit, INT for 2-bit)
- Group-wise quantization (group size 128; 64 for 3B)
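The two items above can be sketched as follows: each group of 128 weights gets its own scale and offset, and the grid is asymmetric because it spans the group's minimum to maximum instead of being centered on zero. This is a plain-Python illustration of the uniform INT case only. The helper names and the floating-point offset are assumptions, and the NF variant is not shown.

```python
def quantize_group(weights, bits=2):
    """Asymmetric uniform quantization of one group: codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:  # constant group: nothing to quantize
        return [0] * len(weights), list(weights)
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    dequant = [c * scale + w_min for c in codes]  # offset w_min plays the zero-point role
    return codes, dequant

def quantize_groupwise(weights, bits=2, group_size=128):
    """Quantize a flat weight list group by group, each with its own scale and offset."""
    all_codes, all_dequant = [], []
    for start in range(0, len(weights), group_size):
        codes, dequant = quantize_group(weights[start:start + group_size], bits)
        all_codes.extend(codes)
        all_dequant.extend(dequant)
    return all_codes, all_dequant

# Two groups with very different ranges (hypothetical values).
weights = [i / 100 for i in range(128)] + [-i / 10 for i in range(128)]
codes, dequant = quantize_groupwise(weights, bits=2, group_size=128)
```

Because the second group's range is ten times wider, sharing one scale across both groups would waste most of the four available levels on the first group; per-group scales avoid that.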
System Optimization
- Single-shot asymmetric clipping initialization to avoid iterative expensive clipping
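A hedged sketch of a one-time asymmetric clipping search: the lower and upper bounds of a weight group are shrunk independently on a small grid, and the pair with the lowest error after quantization is kept. As a simplification, this sketch minimizes weight reconstruction error rather than an error measured on calibration data, and it assumes the group straddles zero. The function names, grid size, and `min_keep` parameter are hypothetical.

```python
import random

def fake_quant(weights, lo, hi, bits=2):
    """Clip to [lo, hi], then quantize and dequantize on a uniform asymmetric grid."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    if scale <= 0:
        return [lo for _ in weights]
    out = []
    for w in weights:
        clipped = min(hi, max(lo, w))
        code = round((clipped - lo) / scale)
        out.append(code * scale + lo)
    return out

def sq_error(a, b):
    """Sum of squared differences between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_asym_clip(weights, bits=2, steps=20, min_keep=0.5):
    """Grid-search independent shrink factors for the min and max bounds."""
    w_min, w_max = min(weights), max(weights)
    best_err, best_lo, best_hi = float("inf"), w_min, w_max
    for i in range(steps + 1):
        lo = w_min * (1.0 - (1.0 - min_keep) * i / steps)
        for j in range(steps + 1):
            hi = w_max * (1.0 - (1.0 - min_keep) * j / steps)
            err = sq_error(weights, fake_quant(weights, lo, hi, bits))
            if err < best_err:
                best_err, best_lo, best_hi = err, lo, hi
    return best_err, best_lo, best_hi

# One group of small weights plus two unequal outliers (hypothetical data).
rng = random.Random(0)
group = [rng.gauss(0.0, 0.02) for _ in range(126)] + [-0.15, 0.30]
baseline_err = sq_error(group, fake_quant(group, min(group), max(group)))
best_err, lo, hi = search_asym_clip(group)
```

The grid includes the unclipped bounds, so the result is never worse than plain min/max quantization under the chosen error measure. Whether clipping actually helps depends on the weight distribution.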
Training Optimization
- Quantization-aware training (QAT) with self-distillation
- CAKLD distillation objective (confidence-weighted forward/reverse KL)
Inference Optimization
- Sub-4-bit weights (2-bit and 3-bit) for lower memory and compute
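A back-of-the-envelope estimate of the weight memory involved, as a sketch. It assumes a nominal 7e9 parameters, one 16-bit scale and one 16-bit offset stored per group of 128 weights, and that every weight is quantized. None of these storage details come from the source.

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_bits=16):
    """Bits per weight once per-group scale and offset metadata are amortized."""
    return weight_bits + (scale_bits + zero_bits) / group_size

def weight_gigabytes(num_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 7e9  # nominal size of a "7B" model (assumption)

for bits in (4, 3, 2):
    eff = effective_bits(bits, group_size=128)
    print(f"{bits}-bit g128: {eff:.2f} bits/weight, {weight_gigabytes(NUM_PARAMS, eff):.2f} GB")
print(f"fp16: {weight_gigabytes(NUM_PARAMS, 16):.2f} GB")
```

Under these assumptions, 2-bit g128 costs 2.25 bits per weight, or about 1.97 GB for the weights, against 14 GB in FP16.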
Reproducibility
Data Urls
- Alpaca (public)
- WikiText-2 (public)
- Evol-Instruct-Code (public repo referenced)
- MetaMathQA (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Empirical evidence only; some findings (e.g., teacher size effects) have no theoretical explanation yet.
- Current work focuses on scalar quantization; vector quantization and integration with QuIP# remain future work.
- Not yet evaluated for 1-bit (binary) weights; 2-bit is the lowest demonstrated.
When Not To Use
- If you require strict, provable worst-case accuracy guarantees from PTQ-only pipelines.
- When vector quantization initializations (QuIP# style) are already integrated and validated in your stack; integration with BitDistiller is untested.
- If you cannot run a short QAT/distillation step (it needs a small amount of GPU time and access to the teacher model).
Failure Modes
- Training collapse at 2-bit without asymmetric clipping initialization.
- Teacher-student mismatch: larger teacher does not always improve student (reported 13B→7B case).
- Possible reduced benefit on models or tasks not evaluated (other architectures or extreme edge devices).
Core Entities
Models
- LLaMA-2
- WizardCoder
- MetaMath
- OpenLLaMA
- BitDistiller (method)
Metrics
- Perplexity (PPL)
- MMLU (5-shot)
- HumanEval Pass@1
- Accuracy
Datasets
- Alpaca
- WikiText-2
- Evol-Instruct-Code
- MetaMathQA
- GSM8K
Benchmarks
- WikiText-2
- MMLU
- PIQA
- HellaSwag
- WinoGrande
- ARC
- HumanEval
- GSM8K

