A practical guide to distilling big language models: methods, robustness tests, and domain apps

Overview

Decision Snapshot: Needs Validation

The survey compiles many scalable KD recipes and tests their robustness. The methods are practically useful, but they require per‑model tuning, and the choice of method is constrained by teacher access (white‑box vs black‑box).

Citations: 2

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, Yiqiang Chen

Why It Matters For Business

Distillation cuts LLM inference cost and memory while keeping most capabilities; picking the right KD style (white‑box, black‑box, CoT, retrieval‑augmented) matters for accuracy, robustness, and deployment budget.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This is a 28‑page survey that organizes knowledge distillation (KD) for large language models (LLMs) into white‑box (logits and feature/hint) and black‑box (in‑context, chain‑of‑thought, instruction) methods. It reviews algorithm families, compares evaluation practices, runs unified robustness tests (adversarial and out‑of‑distribution) across teacher/student families, and surveys real applications in healthcare, education and law. The paper highlights method trade‑offs, gaps in unified benchmarks, and practical recommendations for choosing KD strategies.

Problem Statement

LLMs are powerful but expensive to run. We need reliable ways to shrink them (speed, memory) without losing key capabilities. Existing KD work started on smaller models and lacks unified evaluations and practical guidance for LLMs that are often closed‑source or multi‑capability.

Main Contribution

Taxonomy of KD for LLMs: white‑box (logits, hint) vs black‑box (ICL, CoT, instruction)

Survey and comparison of representative distillation methods and empirical tasks

Key Findings

Hint‑based (feature) distillation often transfers richer information and yields higher task accuracy than logits‑only KD.

Numbers: TinyBERT reaches ~97% of BERT on GLUE; MiniLM retains >99% on SQuAD/GLUE with 50% of the Transformer parameters

Practical Use: When you can access teacher internals, prefer hint/feature distillation to preserve accuracy when compressing models.

Evidence Ref: Table 1; Sec. 3.1.2
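As a rough illustration of what hint/feature distillation optimizes, the sketch below matches one student hidden vector to one teacher hidden vector through a learned linear projection, using mean squared error. The vector sizes, the learning rate, and the helper names (`project`, `hint_loss`, `sgd_step`) are assumptions made for this example and do not come from the survey. A real setup would also update the student's own weights and work on batches of layer outputs.

```python
import random

def project(hidden, weight):
    """Map a student hidden vector (width d_s) to teacher width (d_t) with a d_t x d_s matrix."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def hint_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_hidden, weight)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_hidden)) / len(teacher_hidden)

def sgd_step(student_hidden, teacher_hidden, weight, lr=0.05):
    """One gradient step on the projection only (student features held fixed for brevity)."""
    projected = project(student_hidden, weight)
    n = len(teacher_hidden)
    for i, row in enumerate(weight):
        grad_scale = 2.0 * (projected[i] - teacher_hidden[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_scale * student_hidden[j]

random.seed(0)
student_hidden = [0.5, -1.0]        # toy student layer output, width 2 (assumed)
teacher_hidden = [1.0, 0.0, -0.5]   # toy teacher layer output, width 3 (assumed)
weight = [[random.uniform(-0.1, 0.1) for _ in student_hidden] for _ in teacher_hidden]

loss_before = hint_loss(student_hidden, teacher_hidden, weight)
for _ in range(200):
    sgd_step(student_hidden, teacher_hidden, weight)
loss_after = hint_loss(student_hidden, teacher_hidden, weight)
print(f"hint loss before: {loss_before:.4f}, after: {loss_after:.2e}")
```

The projection is needed because student and teacher layers usually differ in width; it is discarded after training.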

Sequence‑level and reverse‑KL‑based KD (e.g., MINILLM) improve generation robustness and long‑response quality versus standard KD.

Numbers: MINILLM outperforms KD baselines on human/GPT‑4 feedback and ASR/OOD tests (see tables)

Practical Use: For generative LLM distillation, use sequence‑level objectives or reverse‑KL variants to reduce exposure bias and improve OOD behavior.

Evidence Ref: Sec. 3.1.1, Tables 2–4
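The difference between forward and reverse KL can be seen on a toy three‑outcome distribution. The sketch below is only an illustration with made‑up probabilities; it is not MINILLM's training procedure. It compares two hypothetical students, one that spreads mass over everything and one that commits to a single teacher mode.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy bimodal teacher with a low-probability middle outcome (assumed values).
teacher = [0.495, 0.01, 0.495]

# Two hypothetical students with limited capacity.
students = {
    "covering": [0.34, 0.32, 0.34],  # spreads mass, including on the unlikely outcome
    "seeking": [0.98, 0.01, 0.01],   # commits to one teacher mode
}

# Forward KL (standard KD): expectation under the teacher.
forward = {name: kl(teacher, q) for name, q in students.items()}
# Reverse KL: expectation under the student.
reverse = {name: kl(q, teacher) for name, q in students.items()}

for name in students:
    print(f"{name:9s} forward={forward[name]:.3f} reverse={reverse[name]:.3f}")
```

On these numbers, forward KL favors the covering student, while reverse KL favors the mode‑seeking one because it penalizes placing mass where the teacher assigns little. That is the usual intuition for why reverse‑KL objectives discourage a small generative student from producing low‑probability text.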

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GLUE performance retention	DistilBERT retains ~97% of BERT	BERTbase	−3% relative	GLUE	DistilBERT reported ~97% performance vs BERT on GLUE	Table 1
SQuAD/GLUE retention	MiniLM retains >99% accuracy	BERTbase	≈0% relative	SQuAD2 / GLUE	MiniLM retained >99% on SQuAD2 and GLUE with 50% Transformer params	Sec.3.1.2

What To Try In 7 Days

Run a quick distillation baseline: apply logits KD and fine‑tune a small student on task data, then measure the accuracy drop relative to the teacher

If teacher internals exist, try hint/feature KD; otherwise collect LLM prompts or CoT outputs for black‑box distillation

Benchmark student on adversarial and OOD checks (AdvGLUE/ANLI or a small domain holdout) to catch robustness regressions
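For the logits KD baseline in the first step, a minimal sketch of the classic temperature‑scaled objective is shown below: a KL term between softened teacher and student distributions, mixed with ordinary cross‑entropy on the label. The temperature, the mixing weight `alpha`, and the toy logits are assumptions chosen for illustration; a real run would compute this per batch inside a training framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

teacher_logits = [1.0, 2.0, 3.0]  # toy teacher output for one example (assumed)
label = 2                         # gold class index

matched = kd_loss([1.0, 2.0, 3.0], teacher_logits, label)
mismatched = kd_loss([3.0, 2.0, 1.0], teacher_logits, label)
print(f"matched student: {matched:.4f}, mismatched student: {mismatched:.4f}")
```

The `temperature ** 2` factor keeps the gradient scale of the soft term comparable across temperatures. Tracking teacher accuracy minus student accuracy on the same held‑out set gives the accuracy drop mentioned in the first step.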

Optimization Features

Token Efficiency

CoT distillation reduces labelled data needs (>50%)

MiniMA shows best size ratio ~40% student size

Model Optimization

logits-based KD

hint/feature distillation

sequence-level (SeqKD) and f‑divergence

reverse‑KL generation KD

System Optimization

avoid backprop through teacher; reduce activation memory via data/mapping choices

Training Optimization

teacher mixed sampling

single‑step decomposition

length normalization

meta‑distillation (teacher fixed, pilot updates)

Inference Optimization

smaller student model for lower latency

retrieval‑augmented distillation for OOD gains

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Survey relies on published papers and some closed‑source LLM results; direct reproducibility of closed models is limited

No single unified benchmark exists for KD; results depend on model family, task, and data used

When Not To Use

If you need layer‑wise feature transfer but cannot access teacher internals, white‑box KD is not possible

If you cannot afford LLM API costs, large-scale black‑box data generation is impractical

Failure Modes

Student overfits teacher quirks or low‑quality generated labels when teacher filtering is weak

KD recipe selected for one model family may underperform or worsen robustness on another family
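One common guard against the first failure mode is to filter teacher generations before they reach the student. The sketch below assumes a small labelled set is available and that teacher rationales end with an `Answer:` marker; both are assumptions for this example, and the field names and helpers are hypothetical rather than taken from the survey.

```python
def extract_answer(rationale):
    """Hypothetical convention: the final answer follows the last 'Answer:' marker."""
    marker = "Answer:"
    if marker not in rationale:
        return None
    return rationale.rsplit(marker, 1)[1].strip()

def filter_teacher_samples(samples):
    """Keep samples whose extracted answer matches the gold label; drop exact duplicates."""
    kept, seen = [], set()
    for sample in samples:
        key = (sample["question"], sample["rationale"])
        if extract_answer(sample["rationale"]) == sample["gold"] and key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept

# Toy teacher outputs (made up for illustration).
samples = [
    {"question": "2 + 3 * 4 = ?", "rationale": "3 * 4 = 12, then 2 + 12 = 14. Answer: 14", "gold": "14"},
    {"question": "2 + 3 * 4 = ?", "rationale": "2 + 3 = 5, then 5 * 4 = 20. Answer: 20", "gold": "14"},
    {"question": "10 / 4 = ?", "rationale": "It is about two and a half.", "gold": "2.5"},
    {"question": "2 + 3 * 4 = ?", "rationale": "3 * 4 = 12, then 2 + 12 = 14. Answer: 14", "gold": "14"},
]

train_set = filter_teacher_samples(samples)
print(f"kept {len(train_set)} of {len(samples)} teacher samples")
```

Answer matching only checks the final result, so a rationale with flawed reasoning and a correct answer still passes. Stricter filters trade data volume for label quality.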

Core Entities

Models

GPT‑3, GPT‑3.5, GPT‑4, GPT‑2, LLaMA, LLaMA2, OPT, T5, BERT, DistilBERT, MiniLM, TinyBERT, MINILLM, MiniMA

Metrics

ASR (Attack Success Rate), F1, ROUGE‑L, human judgment / GPT‑4 feedback, relative % vs teacher (performance retention)
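Two of these metrics reduce to simple ratios, sketched below. ASR definitions vary between papers; this version assumes ASR is the share of originally correct predictions that an attack flips to wrong, which may differ from the exact definition used in the survey. The scores and per‑example outcomes are made‑up numbers.

```python
def attack_success_rate(clean_correct, adversarial_correct):
    """Share of originally correct examples that the attack flips to wrong."""
    attacked = [adv for clean, adv in zip(clean_correct, adversarial_correct) if clean]
    if not attacked:
        return 0.0
    return sum(1 for adv in attacked if not adv) / len(attacked)

def retention(student_score, teacher_score):
    """Student score as a fraction of the teacher score on the same benchmark."""
    return student_score / teacher_score

# Per-example correctness on clean inputs and on their adversarial versions (toy data).
clean = [True, True, True, False, True]
adv = [True, False, False, False, True]
asr = attack_success_rate(clean, adv)

# Hypothetical benchmark scores for a teacher and its distilled student.
ret = retention(77.6, 80.0)
relative_delta = ret - 1.0

print(f"ASR: {asr:.2f}, retention: {ret:.2%}, relative delta: {relative_delta:+.1%}")
```

Lower ASR means a more robust model, so a student should be compared with its teacher on both retention and ASR rather than on accuracy alone.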

Datasets

GLUE, SQuAD2, MMLU, BIGBench, HELM, GSM8K, BBH, Dolly, AdvGLUE, ANLI, Flipkart reviews, DDXPlus, Dolly 1

Benchmarks

Adversarial GLUE (AdvGLUE), ANLI, MMLU, GLUE, BBH, GSM8K
