A practical guide to distilling big language models: methods, robustness tests, and domain apps

July 2, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

2

Authors

Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, Yiqiang Chen

Why It Matters For Business

Distillation cuts LLM inference cost and memory while keeping most capabilities; picking the right KD style (white‑box, black‑box, CoT, retrieval‑augmented) matters for accuracy, robustness, and deployment budget.

Summary TLDR

This is a 28‑page survey that organizes knowledge distillation (KD) for large language models (LLMs) into white‑box (logits and feature/hint) and black‑box (in‑context, chain‑of‑thought, instruction) methods. It reviews algorithm families, compares evaluation practices, runs unified robustness tests (adversarial and out‑of‑distribution) across teacher/student families, and surveys real applications in healthcare, education and law. The paper highlights method trade‑offs, gaps in unified benchmarks, and practical recommendations for choosing KD strategies.

Problem Statement

LLMs are powerful but expensive to run. We need reliable ways to shrink them (speed, memory) without losing key capabilities. Existing KD work started on smaller models and lacks unified evaluations and practical guidance for LLMs that are often closed‑source or multi‑capability.

Main Contribution

Taxonomy of KD for LLMs: white‑box (logits, hint) vs black‑box (ICL, CoT, instruction)

Survey and comparison of representative distillation methods and empirical tasks

Unified robustness evaluation (adversarial and OOD) of several KD algorithms on the GPT‑2, OPT, and LLaMA families

Discussion of practical applications (healthcare, education, law) and open challenges (benchmarks, interpretability, multimodal KD)

Key Findings

Hint‑based (feature) distillation often transfers richer information and yields higher task accuracy than logits‑only KD.

Numbers: TinyBERT: ~97% GLUE vs BERT; MiniLM: >99% SQuAD/GLUE with 50% Transformer size
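To make the hint/feature idea concrete, here is a minimal sketch of a feature-matching loss in plain Python. It assumes the student's hidden vector is narrower than the teacher's and is mapped up by a linear projection before a mean-squared-error comparison. The function names, the toy widths, and the projection weights are all hypothetical and are not taken from the survey or from any specific method it covers.

```python
def project(hidden, weight):
    """Map a student hidden vector to the teacher's width with a linear layer (no bias)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def hint_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)


# Toy example with hypothetical sizes: student width 2, teacher width 3.
student_hidden = [1.0, 2.0]
teacher_hidden = [1.0, 2.0, 3.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3x2 projection

loss = hint_loss(student_hidden, teacher_hidden, weight)
```

In a real setup the projection would be trained jointly with the student, and the loss would be summed over the chosen layer pairs; this sketch only shows the per-vector comparison.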

Sequence‑level and reverse‑KL‑based KD (MINILLM) improve generation robustness and long‑response quality versus standard KD.

Numbers: MINILLM outperforms KD baselines on human/GPT‑4 feedback and ASR/OOD tests (see tables)
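The difference between the usual forward KL and the reverse KL can be seen on a toy next-token distribution. The sketch below uses made-up probabilities (an assumption for illustration, not data from the survey): one student spreads mass over every token, the other commits to a single teacher mode. Forward KL, KL(teacher || student), favors the spread-out student, while reverse KL, KL(student || teacher), favors the one that avoids the teacher's low-probability token.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three candidate tokens.
teacher = [0.495, 0.495, 0.01]       # two plausible tokens, one unlikely
student_cover = [0.34, 0.33, 0.33]   # covers everything, including the unlikely token
student_mode = [0.98, 0.01, 0.01]    # commits to one of the teacher's modes

forward_prefers_cover = kl(teacher, student_cover) < kl(teacher, student_mode)
reverse_prefers_mode = kl(student_mode, teacher) < kl(student_cover, teacher)
```

Both flags evaluate to true for these numbers, which matches the common description of reverse KL as mode-seeking.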

Black‑box CoT (chain‑of‑thought) distillation boosts multi‑step reasoning in small models, sometimes surpassing much larger models.

Numbers: Step‑by‑step distillation reduced training data by >50% and a 770M model beat a 540B LLM after fine‑tuning
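One way to picture black-box CoT distillation is as a data-formatting step: each teacher-annotated item becomes two student training pairs, one that predicts the answer and one that predicts the rationale. The sketch below is a simplified illustration under that assumption. The `[label]` and `[rationale]` prefixes, the field names, and the sample question are hypothetical placeholders, not a confirmed reproduction of any method's exact format.

```python
def build_cot_examples(question, rationale, answer):
    """Turn one teacher-annotated item into two student training pairs."""
    return [
        {"input": "[label] " + question, "target": answer},
        {"input": "[rationale] " + question, "target": rationale},
    ]


# Hypothetical item; in practice the rationale comes from the teacher LLM.
examples = build_cot_examples(
    question="A pen costs 2 dollars. How much do 3 pens cost?",
    rationale="Each pen costs 2 dollars, so 3 pens cost 3 * 2 = 6 dollars.",
    answer="6",
)
```

The student is then fine-tuned on both pairs, so the rationale acts as extra supervision rather than something that has to be generated at inference time.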

Distillation method effectiveness depends on model family and dataset: the same KD method can vary greatly across LLaMA, OPT, and GPT‑2.

Numbers: Tables 2–4 show different best methods per model (MINILLM best for GPT‑2; simple token KD best for OPT; SeqKD/JS vary by

Retrieval‑augmented distillation can improve generalization with small latency cost.

Numbers: ReAugKD: superior performance on 6 datasets with latency overhead <3%
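As a rough illustration of the retrieval-augmented idea, the sketch below shows one generic way a student could use stored teacher outputs at inference: look up the nearest entries in a small knowledge base of (embedding, teacher soft label) pairs and blend them with the student's own prediction. This is an assumption-laden sketch, not ReAugKD's actual procedure. The function names, `k`, `alpha`, and the toy knowledge base are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_and_mix(query_emb, student_probs, knowledge_base, k=2, alpha=0.5):
    """Blend student probabilities with similarity-weighted teacher soft labels."""
    ranked = sorted(knowledge_base, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    top = ranked[:k]
    # Softmax over similarities gives the weight of each retrieved entry.
    exps = [math.exp(cosine(query_emb, emb)) for emb, _ in top]
    weights = [e / sum(exps) for e in exps]
    retrieved = [
        sum(w * probs[i] for w, (_, probs) in zip(weights, top))
        for i in range(len(student_probs))
    ]
    return [alpha * s + (1 - alpha) * r for s, r in zip(student_probs, retrieved)]


# Hypothetical knowledge base: (teacher embedding, teacher soft label).
knowledge_base = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([0.9, 0.1], [0.8, 0.2]),
    ([0.0, 1.0], [0.1, 0.9]),
]
mixed = retrieve_and_mix([1.0, 0.0], [0.5, 0.5], knowledge_base)
```

The extra cost at inference is one nearest-neighbor lookup, which is the kind of overhead the latency figure above refers to.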

Results

GLUE performance retention

Value: DistilBERT retains ~97% of BERT

Baseline: BERT‑base

SQuAD/GLUE retention

Value: MiniLM retains >99% accuracy

Baseline: BERT‑base

ReAugKD latency overhead

Value: <3% latency overhead

Baseline: baseline distillation

CoT data efficiency

Value: >50% reduction in required labelled data

Baseline: fine‑tuning / standard KD

Who Should Care

What To Try In 7 Days

Run a quick distillation baseline: logits KD + finetune a small student on task data to measure accuracy drop

If teacher internals exist, try hint/feature KD; otherwise collect LLM prompts or CoT outputs for black‑box distillation

Benchmark student on adversarial and OOD checks (AdvGLUE/ANLI or a small domain holdout) to catch robustness regressions
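For the first step, a logits-KD baseline usually combines a hard-label cross-entropy term with a temperature-softened KL term against the teacher. The sketch below is a minimal plain-Python version for a single example. It assumes the classic formulation with a T² scaling factor. The default `temperature` and `alpha` values are placeholders, not recommendations from the survey.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Toy logits over three classes (hypothetical values).
matched = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], label=0)
mismatched = kd_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0], label=0)
```

When student and teacher logits agree, the soft term vanishes and only the weighted cross-entropy remains; a disagreeing teacher raises the loss. In practice the same quantity would be computed batch-wise in a deep learning framework.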

Optimization Features

Token Efficiency

  • CoT distillation reduces labelled data needs (>50%)
  • MiniMA reports the best student‑to‑teacher size ratio at ~40%

Model Optimization

  • logits-based KD
  • hint/feature distillation
  • sequence-level (SeqKD) and f‑divergence
  • reverse‑KL generation KD

System Optimization

  • avoid backprop through teacher; reduce activation memory via data/mapping choices

Training Optimization

  • teacher mixed sampling
  • single‑step decomposition
  • length normalization
  • meta‑distillation (teacher fixed, pilot updates)
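Of the items above, length normalization is the easiest to show in a few lines. The sketch below assumes one common reading of the term: dividing a sequence's summed token log-probabilities by its length, so that longer outputs are not penalized just for having more tokens. The function name and the toy log-probabilities are hypothetical, and the individual methods in the survey may normalize a different quantity.

```python
def sequence_score(token_logprobs, normalize=True):
    """Score a non-empty sequence by summed or per-token average log-probability."""
    total = sum(token_logprobs)
    return total / len(token_logprobs) if normalize else total


# Hypothetical per-token log-probabilities for a short and a long response.
short = [-1.0, -1.0]
long = [-0.6] * 5

raw_prefers_short = sequence_score(short, normalize=False) > sequence_score(long, normalize=False)
normalized_prefers_long = sequence_score(long) > sequence_score(short)
```

Without normalization the short response wins on raw score even though each of its tokens is less likely; with normalization the long response wins.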

Inference Optimization

  • smaller student model for lower latency
  • retrieval‑augmented distillation for OOD gains

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey relies on published papers and some closed‑source LLM results; direct reproducibility of closed models is limited
  • No single unified benchmark for KD—results depend on model family, task, and data used
  • Many black‑box distillation pipelines depend on costly API calls to closed LLMs

When Not To Use

  • If you need layer‑wise feature transfer but have no access to teacher internals, white‑box KD is not possible
  • If you cannot afford LLM API costs, large-scale black‑box data generation is impractical
  • Avoid naive logits KD for complex generative or reasoning tasks without sequence/objective adjustments

Failure Modes

  • Student overfits teacher quirks or low‑quality generated labels when teacher filtering is weak
  • KD recipe selected for one model family may underperform or worsen robustness on another family
  • Feature/hint KD can blow up GPU memory due to activations from large teachers

Core Entities

Models

  • GPT‑3
  • GPT‑3.5
  • GPT‑4
  • GPT‑2
  • LLaMA
  • LLaMA2
  • OPT
  • T5
  • BERT
  • DistilBERT
  • MiniLM
  • TinyBERT
  • MINILLM
  • MiniMA

Metrics

  • ASR (Attack Success Rate)
  • F1
  • ROUGE‑L
  • human judgment / GPT‑4 feedback
  • relative % vs teacher (performance retention)
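Two of these metrics reduce to simple arithmetic. The sketch below assumes one common definition of attack success rate: among examples the model classifies correctly on clean input, the share that become wrong after the attack. Individual benchmarks may define it differently. Performance retention is taken as the student score expressed as a percentage of the teacher score. The function names and the sample numbers are hypothetical.

```python
def attack_success_rate(clean_correct, adversarial_correct):
    """Share of clean-correct examples that the attack turns into errors."""
    attacked = [adv for clean, adv in zip(clean_correct, adversarial_correct) if clean]
    if not attacked:
        return 0.0
    return sum(1 for adv in attacked if not adv) / len(attacked)


def retention(student_score, teacher_score):
    """Student score as a percentage of the teacher score."""
    return 100.0 * student_score / teacher_score


# Hypothetical per-example correctness flags before and after an attack.
asr = attack_success_rate(
    clean_correct=[True, True, False, True],
    adversarial_correct=[True, False, False, False],
)
kept = retention(student_score=77.6, teacher_score=80.0)
```

Here three examples were correct on clean input and two of them flip under attack, so the rate is 2/3; a student scoring 77.6 against a teacher's 80.0 retains about 97%.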

Datasets

  • GLUE
  • SQuAD2
  • MMLU
  • BIGBench
  • HELM
  • GSM8K
  • BBH
  • Dolly
  • AdvGLUE
  • ANLI
  • Flipkart reviews
  • DDXPlus
  • Dolly 1

Benchmarks

  • Adversarial GLUE (AdvGLUE)
  • ANLI
  • MMLU
  • GLUE
  • BBH
  • GSM8K