Use self-distillation plus asymmetric sub-4-bit quantization to get practical 2–3 bit LLMs

Overview

Decision Snapshot: Ready For Pilot

The method shows clear empirical gains on multiple models and tasks with low compute, but the results are mainly empirical and need broader replication across architectures and deployment stacks.

Citations: 1

Evidence Strength: 0.78

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 60%

Novelty: 60%

Authors

Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

BitDistiller makes deploying 2–3 bit LLMs practical: it retains much of the original reasoning and code accuracy while sharply reducing quantization time and GPU cost, enabling cheaper on-prem or edge inference.

Who Should Care

ML Engineer, Engineering Lead, CTO, Founder, Product Manager

Summary TLDR

BitDistiller combines quantization-aware training (QAT) with self-knowledge-distillation to make 3-bit and 2-bit versions of large language models much more usable. Key ingredients: asymmetric quantization with an initial asymmetric clipping step, and a Confidence-Aware KL divergence (CAKLD) that blends forward and reverse KL based on the teacher's token confidence. On LLaMA-2 and domain models (WizardCoder, MetaMath) BitDistiller improves perplexity and reasoning/code accuracy versus state-of-the-art PTQ/QAT baselines, while cutting training cost dramatically (e.g., quantizing WizardCoder-7B in ~3 GPU hours on one A100). Code is provided.
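The blend of forward and reverse KL described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, and it assumes that the confidence coefficient `gamma` is the mean teacher probability on the reference tokens and that `gamma` weights the reverse KL while `1 - gamma` weights the forward KL. Check the paper and code for the exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def teacher_confidence(teacher_logits_per_token, target_ids):
    """Mean teacher probability on the reference tokens (a value in [0, 1])."""
    probs = [softmax(logits)[t]
             for logits, t in zip(teacher_logits_per_token, target_ids)]
    return sum(probs) / len(probs)

def cakld(teacher_logits, student_logits, gamma):
    """Confidence-weighted blend of reverse and forward KL for one position.

    Assumption: gamma weights reverse KL(student || teacher) and
    (1 - gamma) weights forward KL(teacher || student).
    """
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    forward = kl_divergence(p_t, p_s)
    reverse = kl_divergence(p_s, p_t)
    return gamma * reverse + (1.0 - gamma) * forward

# Toy data: two token positions over a 3-word vocabulary.
teacher = [[4.0, 1.0, 0.0], [0.5, 0.4, 0.3]]
student = [[3.0, 1.5, 0.2], [0.1, 0.6, 0.2]]
targets = [0, 1]

gamma = teacher_confidence(teacher, targets)
loss = sum(cakld(t, s, gamma) for t, s in zip(teacher, student)) / len(teacher)
print(f"gamma={gamma:.3f} loss={loss:.4f}")
```

With `gamma` near 1 (a confident teacher) the loss behaves like reverse KL, which is mode-seeking; with `gamma` near 0 it behaves like forward KL, which covers the teacher's full distribution.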

Problem Statement

Ultra-low-bit (sub-4-bit) quantization severely hurts LLM accuracy. Post-training quantization (PTQ) often fails at 2–3 bits, and prior QAT methods need lots of data and GPU time. The practical gap: how to preserve weight fidelity and effectively train low-bit models with limited resources.

Main Contribution

BitDistiller: a practical QAT + self-distillation pipeline for sub-4-bit LLMs.

Asymmetric quantization plus a single-shot asymmetric clipping initialization to reduce weight outliers and preserve fidelity.

Key Findings

BitDistiller yields better language modeling and QA accuracy than prior PTQ and QAT on LLaMA-2-7B.

Numbers: 2-bit g128 MMLU 29.25 vs. 23.62 for LLM-QAT (Table 1)

Practical Use: If you need a usable 2-bit 7B model for general tasks, try BitDistiller rather than straight LLM-QAT or PTQ.

Evidence Ref: Table 1

On reasoning and code, BitDistiller preserves much more accuracy at 2-bit than alternatives.

Numbers: MetaMath GSM8K, 2-bit: 61.33% vs. 36.64% for LLM-QAT (Table 2)

Practical Use: For math/code tasks, use BitDistiller to keep practical pass rates when pushing to 2 bits.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLaMA-2-7B PPL (3-bit g128)	5.97 (BitDistiller)	6.02 (LLM-QAT), 6.10 (OmniQuant)	≈ -0.05 vs LLM-QAT	WikiText-2 / general language	Table 1 BF16 vs quantized entries	Table 1
LLaMA-2-7B MMLU (5-shot, 2-bit g128)	29.25 (BitDistiller)	23.62 (LLM-QAT)	+5.63 pp	MMLU (5-shot)	Table 1 2-bit results	Table 1

What To Try In 7 Days

Run the repo's 2-bit QAT recipe on a 7B model using the provided small calibration set and asymmetric clipping.

Replace your QAT loss with CAKLD and test teacher-generated data (temperature 0.7) to speed convergence.

Measure trade-offs: compare 4-bit baseline, 3-bit, and 2-bit BitDistiller outputs on a small reasoning workload.
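For the teacher-generated data step, the sketch below shows what sampling at temperature 0.7 means at the level of a single token. It is a toy illustration with made-up logits and hypothetical function names, not the repository's data-generation script; in practice the teacher model's own generation utilities would do this.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; lower temperature sharpens the output."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)               # fixed seed so runs are repeatable
logits = [2.0, 1.0, 0.5, -1.0]       # made-up teacher logits for one position
samples = [sample_token(logits, 0.7, rng) for _ in range(1000)]

p_07 = softmax_with_temperature(logits, 0.7)
p_10 = softmax_with_temperature(logits, 1.0)
print("top-token prob at T=0.7:", round(p_07[0], 3))
print("top-token prob at T=1.0:", round(p_10[0], 3))
```

A temperature below 1.0 concentrates probability on the teacher's preferred tokens while still leaving some diversity in the generated samples.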

Optimization Features

Token Efficiency

Use of teacher-generated samples to expand distillation data cheaply

Infra Optimization

Reported ability to quantize WizardCoder-7B in ~3 GPU hours on one A100

Model Optimization

Asymmetric quantization (NF for >2-bit, INT for 2-bit)

Group-wise quantization (group size 128; 64 for 3B)
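A minimal sketch of group-wise asymmetric integer quantization, assuming plain min/max ranges per group and a floating-point offset in place of an integer zero point. It covers only the uniform INT case; the NF format used for more than 2 bits is not shown, and the function names are illustrative rather than taken from the repository.

```python
import random

def quantize_group(weights, bits):
    """Asymmetric min/max quantization of one group.

    Returns integer codes plus the (scale, offset) needed to dequantize.
    """
    qmax = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    codes = [min(qmax, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float weights."""
    return [c * scale + offset for c in codes]

def quantize_groupwise(weights, bits=2, group_size=128):
    """Quantize a flat weight row in independent groups of group_size."""
    return [quantize_group(weights[start:start + group_size], bits)
            for start in range(0, len(weights), group_size)]

rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(256)]   # synthetic weights

groups = quantize_groupwise(row, bits=2, group_size=128)
restored = [w for codes, scale, offset in groups
            for w in dequantize_group(codes, scale, offset)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups={len(groups)} max_abs_error={max_err:.5f}")
```

Because each group has its own scale and offset, the range does not have to be symmetric around zero, so a skewed weight distribution wastes fewer of the four available 2-bit levels.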

System Optimization

Single-shot asymmetric clipping initialization to avoid iterative expensive clipping
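The idea of a one-time asymmetric clipping search can be sketched as a grid search over independent lower and upper bounds for each weight group. This is a simplified proxy: it minimizes weight reconstruction error on a synthetic group, whereas the actual objective and search schedule used by BitDistiller are not given here. The function names and the `max_shrink` parameter are hypothetical.

```python
def fake_quant(weights, lo, hi, bits):
    """Clip to [lo, hi], quantize uniformly to 2**bits levels, dequantize."""
    qmax = (1 << bits) - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    out = []
    for w in weights:
        clipped = min(hi, max(lo, w))
        out.append(round((clipped - lo) / scale) * scale + lo)
    return out

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def search_asym_clip(weights, bits=2, steps=20, max_shrink=0.4):
    """Grid-search separate lower and upper clipping bounds for one group.

    Each bound may move inward by up to max_shrink of the full range;
    max_shrink < 0.5 guarantees lo < hi at every grid point.
    """
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min
    best_err = mse(weights, fake_quant(weights, w_min, w_max, bits))
    best_lo, best_hi = w_min, w_max
    for i in range(steps + 1):
        lo = w_min + max_shrink * span * i / steps
        for j in range(steps + 1):
            hi = w_max - max_shrink * span * j / steps
            err = mse(weights, fake_quant(weights, lo, hi, bits))
            if err < best_err:
                best_err, best_lo, best_hi = err, lo, hi
    return best_err, best_lo, best_hi

# Synthetic group: 127 small weights plus one large positive outlier.
group = [-0.1 + 0.2 * k / 126 for k in range(127)] + [1.0]
baseline = mse(group, fake_quant(group, min(group), max(group), 2))
err, lo, hi = search_asym_clip(group, bits=2)
print(f"no clipping mse={baseline:.5f}, clipped mse={err:.5f}, range=[{lo:.3f}, {hi:.3f}]")
```

Searching the two bounds independently lets the range shrink on the outlier side only, which a symmetric clipping ratio cannot do. Running the search once before training, rather than at every step, is what makes it cheap.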

Training Optimization

Quantization-aware training (QAT) with self-distillation

CAKLD distillation objective (confidence-weighted forward/reverse KL)

Inference Optimization

Sub-4-bit weights (2-bit and 3-bit) for lower memory and compute
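A rough back-of-the-envelope estimate of weight memory at different bit widths, including the per-group metadata that group-wise quantization adds. The assumptions are illustrative and not taken from the paper: a nominal 7 billion parameters, one 16-bit scale and one 16-bit zero point per group of 128 weights, and every weight quantized. Activations, KV cache, and any layers kept in higher precision are ignored.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=16):
    """Bits per weight once per-group scale and zero-point storage is included."""
    return bits + (scale_bits + zero_bits) / group_size

def weight_memory_gib(num_params, bits, group_size=128,
                      scale_bits=16, zero_bits=16):
    """Approximate weight storage in GiB under the assumptions above."""
    total_bits = num_params * effective_bits_per_weight(
        bits, group_size, scale_bits, zero_bits)
    return total_bits / 8 / 2**30

params = 7_000_000_000                      # nominal "7B" model (assumption)
fp16_gib = params * 16 / 8 / 2**30          # unquantized 16-bit baseline

print(f"16-bit: {fp16_gib:.2f} GiB")
for bits in (4, 3, 2):
    gib = weight_memory_gib(params, bits)
    eff = effective_bits_per_weight(bits, 128)
    print(f"{bits}-bit g128: {gib:.2f} GiB ({eff:.2f} effective bits/weight)")
```

The metadata overhead is small at group size 128 (a quarter of a bit per weight under these assumptions) but grows as the group size shrinks, which is worth checking when using group size 64.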

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/DD-DuDa/BitDistiller

Data URLs

Alpaca (public)

WikiText-2 (public)

Evol-Instruct-Code (public repo referenced)

MetaMathQA (public)

Risks & Boundaries

Limitations

Evidence is empirical only; some findings (e.g., teacher size effects) have no theoretical explanation.

Current work focuses on scalar quantization; vector quantization and integration with QuIP# remain future work.

When Not To Use

If you require strict, provable worst-case accuracy guarantees from PTQ-only pipelines.

When vector quantization (QuIP#-style) is already integrated and validated in your stack; combining it with BitDistiller is untested.

Failure Modes

Training collapse at 2-bit without asymmetric clipping initialization.

Teacher-student mismatch: larger teacher does not always improve student (reported 13B→7B case).

Core Entities

Models

LLaMA-2, WizardCoder, MetaMath, OpenLLaMA, BitDistiller (method)

Metrics

Perplexity (PPL), MMLU (5-shot), HumanEval Pass@1, Accuracy

Datasets

Alpaca, WikiText-2, Evol-Instruct-Code, MetaMathQA, GSM8K

Benchmarks

WikiText-2, MMLU, PIQA, HellaSwag, WinoGrande, ARC, HumanEval, GSM8K

