A practical guide to distilling big language models: methods, robustness tests, and domain apps

July 2, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The survey compiles many scalable KD recipes and tests their robustness. The methods are practically useful, but they require per‑model tuning and are limited by how much access you have to the teacher (white‑box vs black‑box).

Citations: 2

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, Yiqiang Chen


Why It Matters For Business

Distillation cuts LLM inference cost and memory while keeping most capabilities; picking the right KD style (white‑box, black‑box, CoT, retrieval‑augmented) matters for accuracy, robustness, and deployment budget.

Who Should Care

Summary TLDR

This is a 28‑page survey that organizes knowledge distillation (KD) for large language models (LLMs) into white‑box (logits and feature/hint) and black‑box (in‑context, chain‑of‑thought, instruction) methods. It reviews algorithm families, compares evaluation practices, runs unified robustness tests (adversarial and out‑of‑distribution) across teacher/student families, and surveys real applications in healthcare, education and law. The paper highlights method trade‑offs, gaps in unified benchmarks, and practical recommendations for choosing KD strategies.

Problem Statement

LLMs are powerful but expensive to run. We need reliable ways to shrink them (speed, memory) without losing key capabilities. Existing KD work started on smaller models and lacks unified evaluations and practical guidance for LLMs that are often closed‑source or multi‑capability.

Main Contribution

Taxonomy of KD for LLMs: white‑box (logits, hint) vs black‑box (ICL, CoT, instruction)

Survey and comparison of representative distillation methods and empirical tasks

Key Findings

Hint‑based (feature) distillation often transfers richer information and yields higher task accuracy than logits‑only KD.

Numbers: TinyBERT: ~97% GLUE vs BERT; MiniLM: >99% SQuAD/GLUE with 50% Transformer size

Practical Use: When you can access teacher internals, prefer hint/feature distillation to keep accuracy when compressing models.

Evidence Ref: Table 1; Sec. 3.1.2
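A minimal sketch of a hint (feature) loss, assuming the student's hidden vector is narrower than the teacher's and is mapped to the teacher's width by a linear projection before a mean‑squared‑error comparison. The function names, shapes, and toy values are illustrative assumptions, not taken from the survey; a real setup would use a tensor library and learn the projection jointly with the student.

```python
def project(hidden, weight):
    """Map a student hidden vector (length d_s) to teacher width (d_t).

    `weight` is a d_t x d_s matrix given as a list of rows.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def hint_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_hidden, weight)
    if len(projected) != len(teacher_hidden):
        raise ValueError("projection width must match teacher width")
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(squared)


# Toy example: student width 2, teacher width 3 (hypothetical values).
student_h = [1.0, 2.0]
teacher_h = [1.0, 2.0, 3.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical projection matrix
loss = hint_loss(student_h, teacher_h, W)
```

In practice this term is usually added to a task or logits loss rather than used alone, and which teacher layer maps to which student layer is a tuning choice.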

Sequence‑level and reverse‑KL‑based KD (MINILLM) improves generation robustness and long‑response quality compared with standard KD.

Numbers: MINILLM outperforms KD baselines on human/GPT‑4 feedback and ASR/OOD tests (see tables)

Practical Use: For generative LLM distillation, use sequence‑level objectives or reverse‑KL variants to reduce exposure bias and improve OOD behavior.

Evidence Ref: Sec. 3.1.1, Tables 2–4
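One commonly cited intuition for reverse KL is that it is mode‑seeking: it penalizes the student for putting mass where the teacher has little, whereas forward KL penalizes the student for missing any teacher mass. The toy comparison below illustrates that difference on a three‑token vocabulary. All distributions are hypothetical, and this is a sketch of the divergence only, not of MINILLM's training procedure.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.49, 0.49, 0.02]         # two strong modes, one unlikely token
student_spread = [0.34, 0.33, 0.33]  # covers everything, including the tail
student_peaked = [0.96, 0.02, 0.02]  # commits to a single teacher mode

for name, student in [("spread", student_spread), ("peaked", student_peaked)]:
    forward = kl(teacher, student)  # standard KD direction: KL(teacher || student)
    reverse = kl(student, teacher)  # reverse direction: KL(student || teacher)
    print(f"{name}: forward={forward:.3f} reverse={reverse:.3f}")
```

With these toy numbers, forward KL scores the spread student lower, while reverse KL scores the peaked student lower, because the spread student places a third of its mass on a token the teacher considers unlikely.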

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| GLUE performance retention | DistilBERT retains ~97% of BERT | BERT-base | −3% relative | GLUE | DistilBERT reported ~97% performance vs BERT on GLUE | Table 1 |
| SQuAD/GLUE retention | MiniLM retains >99% accuracy | BERT-base | ≈0% relative | SQuAD2 / GLUE | MiniLM retained >99% on SQuAD2 and GLUE with 50% Transformer params | Sec. 3.1.2 |

What To Try In 7 Days

Run a quick distillation baseline: logits KD + finetune a small student on task data to measure accuracy drop

If teacher internals exist, try hint/feature KD; otherwise collect LLM prompts or CoT outputs for black‑box distillation

Benchmark student on adversarial and OOD checks (AdvGLUE/ANLI or a small domain holdout) to catch robustness regressions
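The logits KD baseline in the first step can be sketched as a soft‑target loss: a temperature‑softened KL term between teacher and student distributions, blended with ordinary cross‑entropy on the gold label. This follows the common soft‑target formulation with the T² scaling; the `temperature` and `alpha` defaults are assumptions to tune, and a real run would compute this per batch in a tensor library.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


# Hypothetical logits for a 3-class task; the gold label is class 0.
example = kd_loss([1.0, 0.5, -0.5], [2.0, 0.2, -1.0], label=0)
```

Setting `alpha=0.0` reduces this to plain fine‑tuning, which gives a useful reference point when measuring how much the distillation term actually helps.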

Optimization Features

Token Efficiency
CoT distillation reduces labelled data needs (>50%); MiniMA shows best size ratio ~40% student size
Model Optimization
logits-based KD; hint/feature distillation; sequence-level (SeqKD) and f‑divergence; reverse‑KL generation KD
System Optimization
avoid backprop through teacher; reduce activation memory via data/mapping choices
Training Optimization
teacher mixed sampling; single‑step decomposition; length normalization; meta‑distillation (teacher fixed, pilot updates)
Inference Optimization
student model smaller size for lower latency; retrieval‑augmented distillation for OOD gains

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey relies on published papers and some closed‑source LLM results; direct reproducibility of closed models is limited

No single unified benchmark exists for KD; results depend on the model family, task, and data used

When Not To Use

If you need layer‑wise feature transfer but cannot access teacher internals, white‑box KD is not possible

If you cannot afford LLM API costs, large-scale black‑box data generation is impractical

Failure Modes

Student overfits teacher quirks or low‑quality generated labels when teacher filtering is weak

KD recipe selected for one model family may underperform or worsen robustness on another family

Core Entities

Models

GPT‑3, GPT‑3.5, GPT‑4, GPT‑2, LLaMA, LLaMA2, OPT, T5, BERT, DistilBERT, MiniLM, TinyBERT, MINILLM, MiniMA

Metrics

ASR (Attack Success Rate), F1, ROUGE‑L, human judgment / GPT‑4 feedback, relative % vs teacher (performance retention)
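Two of these metrics are simple enough to compute by hand when benchmarking a student. The sketch below assumes the common definition of ASR (the share of originally correct predictions that an attack flips to incorrect) and treats retention as the student score divided by the teacher score; the survey's exact definitions may differ, and all inputs are hypothetical.

```python
def attack_success_rate(clean_correct, adversarial_correct):
    """Share of originally correct examples that become wrong under attack."""
    attacked = [
        adv for clean, adv in zip(clean_correct, adversarial_correct) if clean
    ]
    if not attacked:
        return 0.0  # nothing was correct to begin with, so nothing to flip
    return sum(1 for adv in attacked if not adv) / len(attacked)


def retention(student_score, teacher_score):
    """Student score as a percentage of the teacher score."""
    return 100.0 * student_score / teacher_score


# Hypothetical per-example correctness before and after an attack.
clean = [True, True, True, False, True]
adv = [True, False, False, False, True]
asr = attack_success_rate(clean, adv)  # 2 of the 4 originally correct flip

# Hypothetical benchmark scores for a student and its teacher.
ret = retention(77.6, 80.0)  # about 97% of teacher performance
```

Lower ASR is better; comparing a student's ASR against its teacher's on the same attack set shows whether distillation introduced a robustness regression.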

Datasets

GLUE, SQuAD2, MMLU, BIGBench, HELM, GSM8K, BBH, Dolly, AdvGLUE, ANLI, Flipkart reviews, DDXPlus, Dolly 1

Benchmarks

Adversarial GLUE (AdvGLUE), ANLI, MMLU, GLUE, BBH, GSM8K