Use self-distillation plus asymmetric sub-4-bit quantization to get practical 2–3 bit LLMs

February 16, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.85

Citation Count

1

Authors

Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu

Links

Abstract / PDF

Why It Matters For Business

BitDistiller makes deploying 2–3 bit LLMs practical: it preserves much of the model's reasoning and code accuracy while sharply cutting quantization time and GPU cost, which enables cheaper on-prem or edge inference.

Summary TLDR

BitDistiller combines quantization-aware training (QAT) with self-knowledge-distillation to make 3-bit and 2-bit versions of large language models much more usable. Key ingredients: asymmetric quantization with an initial asymmetric clipping step, and a Confidence-Aware KL divergence (CAKLD) that blends forward and reverse KL based on the teacher's token confidence. On LLaMA-2 and domain models (WizardCoder, MetaMath) BitDistiller improves perplexity and reasoning/code accuracy versus state-of-the-art PTQ/QAT baselines, while cutting training cost dramatically (e.g., quantizing WizardCoder-7B in ~3 GPU hours on one A100). Code is provided.

Problem Statement

Ultra-low-bit (sub-4-bit) quantization severely hurts LLM accuracy. Post-training quantization (PTQ) often fails at 2–3 bits, and prior QAT methods need lots of data and GPU time. The practical gap: how to preserve weight fidelity and effectively train low-bit models with limited resources.

Main Contribution

BitDistiller: a practical QAT + self-distillation pipeline for sub-4-bit LLMs.

Asymmetric quantization plus a single-shot asymmetric clipping initialization to reduce weight outliers and preserve fidelity.

CAKLD: a Confidence-Aware KL objective that blends forward/reverse KL using teacher token confidence, improving convergence.

Demonstrated gains on general language and reasoning/code benchmarks at 3-bit and 2-bit, with far lower GPU hours than prior QAT.
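As an illustration of the CAKLD idea above, here is a minimal NumPy sketch. It assumes the blend coefficient `gamma` is the teacher's mean probability on the reference tokens and that `gamma` weights the reverse KL term while `1 - gamma` weights the forward KL term. The function names (`teacher_confidence`, `cakld`) are hypothetical and are not taken from the BitDistiller repository.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl(log_p, log_q):
    """KL(p || q) per position, summed over the vocabulary axis."""
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def teacher_confidence(teacher_logits, target_ids):
    """Assumed definition: mean teacher probability on the reference tokens."""
    log_p = log_softmax(teacher_logits)
    picked = np.take_along_axis(log_p, target_ids[..., None], axis=-1)
    return float(np.exp(picked).mean())

def cakld(teacher_logits, student_logits, gamma):
    """Blend reverse and forward KL with a confidence coefficient gamma."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    forward = kl(log_t, log_s)  # KL(teacher || student), mode-covering
    reverse = kl(log_s, log_t)  # KL(student || teacher), mode-seeking
    return float((gamma * reverse + (1.0 - gamma) * forward).mean())

# Toy example: 4 token positions, vocabulary of 10.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student = teacher + 0.5 * rng.normal(size=(4, 10))
targets = teacher.argmax(axis=-1)  # stand-in for reference tokens

gamma = teacher_confidence(teacher, targets)
loss = cakld(teacher, student, gamma)
print(f"gamma={gamma:.3f}  loss={loss:.4f}")
```

Under this assumption, a confident teacher pushes the loss toward the mode-seeking reverse KL, and an uncertain teacher pushes it toward the mode-covering forward KL.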

Key Findings

BitDistiller yields better language modeling and QA accuracy than prior PTQ and QAT on LLaMA-2-7B.

Numbers: MMLU at 2-bit g128 is 29.25 vs 23.62 for LLM-QAT (Table 1)

On reasoning and code, BitDistiller preserves much more accuracy at 2-bit than alternatives.

Numbers: MetaMath GSM8K at 2-bit is 61.33% vs 36.64% for LLM-QAT (Table 2)

Asymmetric clipping at initialization strongly stabilizes sub-4-bit QAT and reduces perplexity.

Numbers: LLaMA-2-7B 2-bit PPL, start→end: 340→16.94 with Clip-Asym (Table 3)

CAKLD (confidence-weighted KL) converges faster and outperforms other distillation objectives in experiments.

Numbers: Convergence and downstream gains shown vs TSLD and other objectives (Figures 6–7)

BitDistiller is much cheaper to run than prior QAT pipelines in practice.

Numbers: WizardCoder-7B quantization takes ~3.02 GPU hours vs ~280.6 GPU hours for LLM-QAT, re-estimated for a single GPU (Table 6)

Results

LLaMA-2-7B PPL (3-bit g128)

Value: 5.97 (BitDistiller)

Baseline: 6.02 (LLM-QAT), 6.10 (OmniQuant)

LLaMA-2-7B MMLU (5-shot, 2-bit g128)

Value: 29.25 (BitDistiller)

Baseline: 23.62 (LLM-QAT)

MetaMath GSM8K accuracy (2-bit)

Value: 61.33% (BitDistiller)

Baseline: 36.64% (LLM-QAT)

WizardCoder-7B HumanEval Pass@1 (2-bit g128)

Value: 36.59% (BitDistiller)

Baseline: 14.63% (LLM-QAT)

Quantization time for WizardCoder-7B

Value: ≈3.02 GPU hours on a single A100-80G (BitDistiller)

Baseline: ≈280.64 GPU hours (LLM-QAT, re-estimated for a single GPU)

Who Should Care

What To Try In 7 Days

Run the repo's 2-bit QAT recipe on a 7B model using the provided small calibration set and asymmetric clipping.

Replace your QAT loss with CAKLD and test teacher-generated data (temperature 0.7) to speed convergence.

Measure trade-offs: compare 4-bit baseline, 3-bit, and 2-bit BitDistiller outputs on a small reasoning workload.

Optimization Features

Token Efficiency

  • Use of teacher-generated samples to expand distillation data cheaply

Infra Optimization

  • Reported ability to quantize WizardCoder-7B in ~3 GPU hours on one A100

Model Optimization

  • Asymmetric quantization (NF for >2-bit, INT for 2-bit)
  • Group-wise quantization (group size 128; 64 for 3B)
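To make the group-wise asymmetric scheme concrete, here is a minimal NumPy sketch of uniform (INT) asymmetric fake-quantization with one scale and one offset per group. It is a generic textbook formulation, not the repository's implementation. It does not cover the NF variant, and the helper name `quantize_asym` is hypothetical.

```python
import numpy as np

def quantize_asym(weights, bits=2, group_size=128):
    """Group-wise asymmetric uniform quantization of a flat weight array.

    Each group gets its own [min, max] range, so the 2**bits levels are
    not forced to be symmetric around zero. Assumes weights.size is a
    multiple of group_size.
    """
    levels = 2 ** bits - 1
    groups = weights.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    # Integer codes in [0, levels]; w_min acts as the per-group offset.
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels)
    dequant = codes * scale + w_min
    return (codes.astype(np.uint8).reshape(weights.shape),
            dequant.reshape(weights.shape))

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)  # two groups of 128
codes, w_hat = quantize_asym(w, bits=2, group_size=128)
print("distinct codes:", np.unique(codes))
print("mean squared error:", float(np.mean((w - w_hat) ** 2)))
```

With 2 bits there are only four levels per group, which is why the choice of range (and therefore clipping) matters so much at this precision.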

System Optimization

  • Single-shot asymmetric clipping initialization to avoid iterative expensive clipping
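One way to picture a one-time asymmetric clipping search is a small grid search that shrinks the low and high ends of a weight group independently and keeps the pair with the lowest reconstruction error. The sketch below uses plain weight-space MSE as the criterion, which is an assumption made for illustration. The criterion used by BitDistiller may differ, and the names `fake_quant` and `search_asym_clip` are hypothetical.

```python
import numpy as np

def fake_quant(group, lo, hi, bits):
    """Quantize-dequantize a group onto 2**bits uniform levels in [lo, hi]."""
    levels = 2 ** bits - 1
    scale = max(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((group - lo) / scale), 0, levels)
    return codes * scale + lo

def search_asym_clip(group, bits=2, steps=20, max_shrink=0.5):
    """Grid-search separate clip points for the low and high ends.

    Assumes the group straddles zero (min < 0 < max), so scaling each
    end by a ratio in [1 - max_shrink, 1] narrows the range.
    """
    w_min, w_max = float(group.min()), float(group.max())
    best = (w_min, w_max)
    best_err = float(np.mean((group - fake_quant(group, w_min, w_max, bits)) ** 2))
    ratios = 1.0 - max_shrink * np.arange(steps + 1) / steps
    for r_lo in ratios:
        for r_hi in ratios:
            lo, hi = r_lo * w_min, r_hi * w_max
            err = float(np.mean((group - fake_quant(group, lo, hi, bits)) ** 2))
            if err < best_err:
                best, best_err = (lo, hi), err
    return best, best_err

rng = np.random.default_rng(0)
group = rng.normal(scale=0.02, size=128)
group[0] = 0.2  # a single positive outlier stretches the range

(lo, hi), err = search_asym_clip(group, bits=2)
baseline = float(np.mean((group - fake_quant(group, group.min(), group.max(), 2)) ** 2))
print(f"clip range: [{lo:.4f}, {hi:.4f}]")
print(f"MSE without clipping: {baseline:.3e}, with clipping: {err:.3e}")
```

Because the search runs once before training rather than inside every training step, its cost is paid a single time.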

Training Optimization

  • Quantization-aware training (QAT) with self-distillation
  • CAKLD distillation objective (confidence-weighted forward/reverse KL)

Inference Optimization

  • Sub-4-bit weights (2-bit and 3-bit) for lower memory and compute
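The memory side of this can be estimated with simple arithmetic. The sketch below assumes a nominal 7e9 parameters, all of them quantized, and 32 bits of per-group metadata (for example a 16-bit scale plus a 16-bit offset). These storage assumptions are illustrative and are not taken from the paper.

```python
def quantized_weight_bytes(n_params, bits, group_size=128, meta_bits_per_group=32):
    """Approximate weight storage for group-wise quantization.

    Effective bits per weight = payload bits + per-group metadata
    amortized over the group.
    """
    bits_per_weight = bits + meta_bits_per_group / group_size
    return n_params * bits_per_weight / 8

n_params = 7e9  # nominal "7B"; real checkpoints differ slightly
fp16_gb = n_params * 16 / 8 / 1e9

for bits in (4, 3, 2):
    gb = quantized_weight_bytes(n_params, bits) / 1e9
    print(f"{bits}-bit g128: {gb:.2f} GB vs FP16 {fp16_gb:.2f} GB")
```

Under these assumptions, group size 128 adds 0.25 bits per weight, so 2-bit weights cost about 2.25 bits each in practice.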

Reproducibility

Data Urls

  • Alpaca (public)
  • WikiText-2 (public)
  • Evol-Instruct-Code (public repo referenced)
  • MetaMathQA (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Empirical evidence only; theoretical reasons for some findings (e.g., teacher size effects) are unexplained.
  • Current work focuses on scalar quantization; vector quantization and integration with QuIP# remain future work.
  • Not yet evaluated for 1-bit (binary) weights; 2-bit is the lowest demonstrated.

When Not To Use

  • If you require strict, provable worst-case accuracy guarantees from PTQ-only pipelines.
  • When vector quantization initializations (QuIP# style) are already integrated and validated in your stack; combining them with BitDistiller is untested.
  • If you cannot run a short QAT/distillation step (it requires a small amount of GPU time and access to the teacher model).

Failure Modes

  • Training collapse at 2-bit without asymmetric clipping initialization.
  • Teacher-student mismatch: larger teacher does not always improve student (reported 13B→7B case).
  • Possible reduced benefit on models or tasks not evaluated (other architectures or extreme edge devices).

Core Entities

Models

  • LLaMA-2
  • WizardCoder
  • MetaMath
  • OpenLLaMA
  • BitDistiller (method)

Metrics

  • Perplexity (PPL)
  • MMLU (5-shot)
  • HumanEval Pass@1
  • Accuracy

Datasets

  • Alpaca
  • WikiText-2
  • Evol-Instruct-Code
  • MetaMathQA
  • GSM8K

Benchmarks

  • WikiText-2
  • MMLU
  • PIQA
  • HellaSwag
  • WinoGrande
  • ARC
  • HumanEval
  • GSM8K