Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.45
Citation Count
1
Why It Matters For Business
Smoothing training labels with knowledge distillation (KD) can reduce fabricated facts in summaries and answers, improving trustworthiness in high-stakes applications while preserving core model accuracy.
Summary TLDR
The paper shows that replacing hard one-hot labels with teacher-generated soft probability labels (knowledge distillation, KD) during instruction finetuning reduces faithfulness hallucination in several LLM families. The authors finetune student models on Dolly using larger teachers, evaluate on CNN/Daily Mail and XSUM with ROUGE-L, a factual-consistency classifier, and an attention-based factual-rate metric, and find consistent reductions in hallucination while general NLP task performance is maintained or improved.
Problem Statement
Hard one-hot labels force models to be overconfident and ignore plausible alternatives. That overconfidence can cause models to invent facts (hallucinate). The paper asks: can teacher-generated soft labels (KD) smooth supervision and reduce faithfulness hallucination without hurting general performance?
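The contrast between hard and soft supervision can be written out as a worked objective. This is a sketch under assumptions, not the paper's stated loss: it assumes token-level KD with an additive weight α, where x is the instruction, y_t the gold token at step t, p_θ the student distribution, q_T the teacher distribution, and V the vocabulary. The paper's exact formulation may differ (for example, it may interpolate between the two terms instead of adding them).

```latex
% Hard-label loss: only the gold token y_t receives credit.
\mathcal{L}_{\text{hard}} = -\sum_{t} \log p_\theta(y_t \mid x, y_{<t})

% Soft-label (KD) loss: the student matches the teacher's full distribution,
% so plausible alternative tokens also receive probability mass.
\mathcal{L}_{\text{KD}} = \sum_{t} \sum_{v \in V}
    q_T(v \mid x, y_{<t}) \,
    \log \frac{q_T(v \mid x, y_{<t})}{p_\theta(v \mid x, y_{<t})}

% Assumed combination, with KD weight \alpha:
\mathcal{L} = \mathcal{L}_{\text{hard}} + \alpha \, \mathcal{L}_{\text{KD}}
```

With α = 0 this reduces to ordinary one-hot finetuning; larger α pulls the student toward the teacher's smoother distribution.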
Main Contribution
Propose smoothed soft-label training via knowledge distillation during instruction finetuning to reduce hallucination.
Run systematic experiments across model families (Llama-2, Llama-3.1, Qwen-2.5) and small models, showing that KD reduces faithfulness hallucination.
Show KD preserves or slightly improves downstream reasoning and comprehension metrics (ARC, HellaSwag, OpenBookQA).
Provide a qualitative case study showing fewer fabricated facts in KD summaries.
Key Findings
KD reduces faithfulness hallucination on summarization benchmarks.
KD maintains or improves general reasoning/comprehension performance.
KD benefits are visible across model sizes, from small (~350M) to mid-sized (7–8B).
Some hallucination metrics disagree: gains in ROUGE-L or factual consistency (FC) do not always match changes in factual rate.
KD effectiveness depends on teacher quality and adds compute during finetuning.
Results
ROUGE-L (CNNDM) for Llama-2-7B
Factual Consistency (CNNDM) for Llama-2-7B
Accuracy
Bloomz-560M TruthfulQA ROUGE-L
Metric disagreement example (Llama-2 on XSUM)
Who Should Care
What To Try In 7 Days
Pick one production finetuning job and add KD: distill from a slightly larger well-calibrated teacher.
Precompute teacher soft labels to avoid repeated teacher inference during student training.
Run a small ablation on KD weight α (try 0.1 and 1.0) and compare ROUGE-L plus a factual-consistency classifier before rollout.
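The α ablation above can be prototyped on toy numbers before touching a real training job. The following is a minimal sketch, assuming an additive loss of the form hard cross-entropy plus α times KL(teacher || student) for a single token position; the function name `kd_token_loss`, the four-token vocabulary, and the logits are all hypothetical, and the paper's actual objective may be weighted differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_token_loss(student_logits, teacher_probs, gold_index, alpha):
    """Hard-label cross-entropy plus alpha * KL(teacher || student) for one token.

    Assumption: an additive combination; other KD setups interpolate instead.
    """
    student_probs = softmax(student_logits)
    hard_ce = -math.log(student_probs[gold_index])
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )
    return hard_ce + alpha * kl


# Toy 4-token vocabulary; index 0 is the gold token (hypothetical values).
student_logits = [2.0, 1.5, 0.1, -1.0]
teacher_probs = softmax([2.5, 2.2, 0.0, -2.0])

# Sweep the KD weight, including alpha = 0.0 as the no-KD baseline.
losses = {
    alpha: kd_token_loss(student_logits, teacher_probs, 0, alpha)
    for alpha in (0.0, 0.1, 1.0)
}
for alpha, loss in losses.items():
    print(f"alpha={alpha}: loss={loss:.4f}")
```

In a real run, the quantity to compare across α values is not the training loss itself but the downstream metrics named above (ROUGE-L and a factual-consistency classifier) on held-out summaries.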
Optimization Features
Infra Optimization
- Requires extra compute for teacher inference; experiments ran on four H100s for ~1 hour
Model Optimization
- Knowledge distillation with teacher soft labels
Training Optimization
- Apply KD during instruction finetuning (sequence-level and word-level)
- Hyperparameter sweep over KD weight α
Inference Optimization
- Precompute teacher logits/soft labels to avoid repeated teacher inference
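Precomputing soft labels amounts to running the teacher once per training example and storing its output distributions for reuse in every student epoch. The sketch below illustrates the idea in plain Python. Everything in it is an assumption for illustration: `fake_teacher` is a hypothetical stand-in for a real teacher forward pass, and keeping only the top-k tokens per position is a storage-saving choice added here, not something the source describes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_soft_label(teacher_logits, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    probs = softmax(teacher_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in top)
    return {i: probs[i] / kept_mass for i in top}


def build_cache(examples, teacher_fn, k=2):
    """Run the teacher once per example and store sparse per-token soft labels."""
    return {
        example_id: [top_k_soft_label(row, k) for row in teacher_fn(tokens)]
        for example_id, tokens in examples.items()
    }


def fake_teacher(tokens):
    # Hypothetical stand-in: returns one row of logits (vocab size 4) per token.
    return [[float(len(tok)), 1.0, 0.5, -1.0] for tok in tokens]


examples = {"ex-0": ["the", "cat"], "ex-1": ["summarize"]}
cache = build_cache(examples, fake_teacher, k=2)

# During student training, look up cache[example_id][position]
# instead of calling the teacher again.
print(cache["ex-0"][0])
```

Truncating to top-k trades fidelity for disk space; storing full distributions is simpler when the vocabulary and dataset are small enough.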
Reproducibility
Data Urls
- Dolly (finetune dataset)
- CNN/Daily Mail
- XSUM
- ARC
- HellaSwag
- OpenBookQA
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- KD effectiveness depends on teacher calibration; a hallucinating teacher can transfer errors.
- Experiments limited to instruction finetuning on Dolly, not full pretraining KD.
- Study focuses on faithfulness hallucination; factuality (real-world correctness) was not systematically tested.
- KD adds compute overhead for teacher inference during training.
When Not To Use
- When you lack a reliable teacher model or cannot afford teacher inference.
- If your primary problem is factuality errors tied to parametric knowledge rather than faithfulness to inputs.
- When human evaluation or alternative metrics are required but unavailable.
Failure Modes
- Student inherits teacher hallucinations if the teacher is incorrect or biased.
- Metric mismatch: improvements in ROUGE-L can mask factual defects detected by attention-based metrics.
- Extra compute cost may make KD impractical for very large teacher ensembles.
Core Entities
Models
- Llama-2-7B
- Llama-2-13B
- Llama-3.1-8B
- Llama-3.1-70B
- Qwen-2.5-7B
- Qwen-2.5-32B
- Mistral-7B
- Pythia-6.9B
- Bloomz-560M
- OPT-350M
- mt0-580M
- Pythia-1B
Metrics
- ROUGE-L
- Factual Consistency (Vectara classifier)
- Factual Rate (LookbackLens attention metric)
- CHRF
- BERTScore
- METEOR
- Accuracy
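To make the ROUGE-L entry concrete, here is a minimal sketch of the metric as an F1 score over the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and equal weighting of precision and recall; standard implementations may add stemming or weight recall differently, so scores from this sketch will not necessarily match published numbers. The example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 with whitespace tokenization (an assumption of this sketch)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
faithful = "the cat sat on the mat"
# Inserts a detail ("red") that the reference does not support.
embellished = "the cat sat on the red mat"

print(rouge_l_f1(reference, faithful))
print(rouge_l_f1(reference, embellished))
```

The embellished summary still scores close to the faithful one, which shows how an overlap metric like ROUGE-L can stay high while an unsupported detail slips in, and why the card pairs it with factual-consistency and factual-rate metrics.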
Datasets
- Dolly
- CNN/Daily Mail
- XSUM
- ARC
- HellaSwag
- OpenBookQA
- CommonsenseQA
- PubMedQA
- DialogSum
- HotpotQA
- TruthfulQA
Benchmarks
- hallucination leaderboard
- lm-evaluation-harness

