Overview
The method is simple to add to finetuning and, in the reported experiments, consistently reduces faithfulness hallucination across model families and sizes. It depends on teacher quality and adds teacher-inference cost.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Smoothing labels with KD can reduce made-up facts in summaries and answers, improving trustworthiness in high-stakes apps while keeping core model accuracy.
Summary TLDR
The paper shows that replacing hard one-hot labels with teacher-generated soft probability labels (knowledge distillation, KD) during instruction finetuning reduces faithfulness hallucination in several LLM families. The authors finetune students on Dolly using larger teachers and evaluate on CNN/Daily Mail and XSUM with ROUGE-L, a factual-consistency classifier, and a factual-rate attention metric. They find consistent reductions in hallucination while general NLP task performance is maintained or improved.
Problem Statement
Hard one-hot labels force models to be overconfident and ignore plausible alternatives. That overconfidence can cause models to invent facts (hallucinate). The paper asks: can teacher-generated soft labels (KD) smooth supervision and reduce faithfulness hallucination without hurting general performance?
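The contrast between hard and soft supervision can be illustrated for a single token position. The snippet below is a minimal sketch, not the paper's code: the 4-token vocabulary, the logits, and the helper names `softmax` and `cross_entropy` are all assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_probs, eps=1e-12):
    """Cross-entropy H(target, student) for one token position."""
    return -sum(
        t * math.log(s + eps)
        for t, s in zip(target_probs, student_probs)
        if t > 0
    )

# Toy 4-token vocabulary; all numbers are illustrative, not from the paper.
student_logits = [2.0, 1.5, 0.2, -1.0]
teacher_logits = [2.2, 1.9, 0.1, -2.0]

student_probs = softmax(student_logits)
hard_label = [1.0, 0.0, 0.0, 0.0]     # one-hot: only token 0 is rewarded
soft_label = softmax(teacher_logits)  # teacher keeps mass on plausible alternatives

hard_loss = cross_entropy(hard_label, student_probs)
soft_loss = cross_entropy(soft_label, student_probs)

print("teacher soft label:", [round(p, 3) for p in soft_label])
print("hard-label loss:", round(hard_loss, 4))
print("soft-label loss:", round(soft_loss, 4))
```

With the one-hot target, the loss only falls as the student pushes probability toward token 0. With the soft target, the loss is minimized when the student matches the teacher's full distribution, so plausible alternatives are not driven to zero.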
Main Contribution
Propose smoothed soft-label training via knowledge distillation during instruction finetuning to reduce hallucination.
Systematic experiments across model families (Llama-2, Llama-3.1, Qwen-2.5) and small models showing KD reduces faithfulness hallucination.
Key Findings
KD reduces faithfulness hallucination on summarization benchmarks.
KD maintains or improves general reasoning/comprehension performance.
Results
| Metric | KD Value | SFT Baseline | Delta (points) | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ROUGE-L (CNNDM) for Llama-2-7B | 28.8% | 28.0% | +0.8 | CNNDM test | KD improves n-gram overlap modestly | Table 1 |
| Factual Consistency (CNNDM) for Llama-2-7B | 87.7% | 86.3% | +1.4 | CNNDM test | Classifier-based grounding improved | Table 1 |
What To Try In 7 Days
Pick one production finetuning job and add KD: distill from a slightly larger, well-calibrated teacher.
Precompute teacher soft labels to avoid repeated teacher inference during student training.
Run a small ablation on KD weight α (try 0.1 and 1.0) and compare ROUGE-L plus a factual-consistency classifier before rollout.
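The steps above can be sketched as a loss that reads cached teacher probabilities and mixes them with the hard label through α. This is a sketch under stated assumptions: the summary does not give the paper's exact objective, so the form (1 - α)·CE + α·KL is an assumed common KD formulation, and `kd_loss`, `cached_teacher_probs`, and all numbers are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kd_loss(student_logits, gold_index, teacher_probs, alpha):
    """Assumed objective: (1 - alpha) * CE(one-hot, student) + alpha * KL(teacher || student)."""
    log_p = log_softmax(student_logits)
    ce = -log_p[gold_index]
    kl = sum(
        q * (math.log(q) - lp)
        for q, lp in zip(teacher_probs, log_p)
        if q > 0
    )
    return (1.0 - alpha) * ce + alpha * kl

# Hypothetical cache: teacher probabilities computed once, stored per token position,
# and reused across student epochs so the teacher is not re-run during training.
cached_teacher_probs = [
    [0.55, 0.40, 0.04, 0.01],
    [0.10, 0.10, 0.70, 0.10],
]
student_logits = [[2.0, 1.5, 0.2, -1.0], [0.1, 0.3, 1.2, 0.0]]
gold_indices = [0, 2]

# The two alpha values suggested for the ablation.
for alpha in (0.1, 1.0):
    per_token = [
        kd_loss(s, g, q, alpha)
        for s, g, q in zip(student_logits, gold_indices, cached_teacher_probs)
    ]
    print(f"alpha={alpha}: mean loss = {sum(per_token) / len(per_token):.4f}")
```

Under this assumed form, α = 1.0 trains on the teacher distribution alone and α = 0.1 stays close to standard supervised finetuning, so the two settings bracket the range worth comparing. Loss values at different α are not directly comparable; the ablation should be judged on ROUGE-L and the factual-consistency classifier.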
Risks & Boundaries
Limitations
KD effectiveness depends on teacher calibration; a hallucinating teacher can transfer errors.
Experiments limited to instruction finetuning on Dolly, not full pretraining KD.
When Not To Use
When you lack a reliable teacher model or cannot afford teacher inference.
When your primary problem is factuality errors tied to parametric knowledge rather than faithfulness to inputs.
Failure Modes
Student inherits teacher hallucinations if the teacher is incorrect or biased.
Metric mismatch: improvements in ROUGE-L can mask factual defects detected by attention-based metrics.
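One way to guard against the metric-mismatch failure mode is to require that no tracked metric regresses before rollout, rather than looking at ROUGE-L alone. The check below is a hypothetical sketch: the function `regressions` and the `misleading` run are invented for illustration, while the SFT and KD numbers are the Llama-2-7B CNNDM values from the Results table.

```python
def regressions(baseline, candidate, tolerance=0.0):
    """Return metric names where candidate is below baseline by more than tolerance."""
    return sorted(
        name
        for name in baseline
        if candidate[name] < baseline[name] - tolerance
    )

# Llama-2-7B on CNNDM, taken from the Results table (Table 1 in the paper).
sft = {"rouge_l": 28.0, "factual_consistency": 86.3}
kd = {"rouge_l": 28.8, "factual_consistency": 87.7}

# Hypothetical run with made-up numbers: overlap improves while grounding degrades.
misleading = {"rouge_l": 29.5, "factual_consistency": 84.0}

print("KD regressions:", regressions(sft, kd))
print("Misleading-run regressions:", regressions(sft, misleading))
```

Other faithfulness metrics, such as the attention-based factual rate, can be added as extra keys in each dictionary so a gain in overlap cannot hide a loss in grounding.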

