Use smoothed soft-label distillation during finetuning to reduce LLM hallucinations

Overview

Decision Snapshot: Ready For Pilot

The method is simple to add to finetuning and shows consistent empirical reductions in faithfulness hallucination across models and sizes, but it depends on teacher quality and adds teacher-inference cost.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 70%

Novelty: 50%

Authors

Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman

Why It Matters For Business

Smoothing labels with KD can reduce made-up facts in summaries and answers, improving trustworthiness in high-stakes apps while keeping core model accuracy.

Who Should Care

ML Engineer, Product Manager, CTO, Engineering Lead, Data Scientist

Summary TLDR

The paper shows that replacing hard one-hot labels with teacher-generated soft probability labels (knowledge distillation, KD) during instruction finetuning reduces faithfulness hallucination in several LLM families. The authors finetune students on Dolly using larger teachers and test on CNN/Daily Mail and XSUM using ROUGE-L, a factual-consistency classifier, and a factual-rate attention metric. They find consistent reductions in hallucination while general NLP task performance is kept or improved.

Problem Statement

Hard one-hot labels force models to be overconfident and ignore plausible alternatives. That overconfidence can cause models to invent facts (hallucinate). The paper asks: can teacher-generated soft labels (KD) smooth supervision and reduce faithfulness hallucination without hurting general performance?
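The difference between hard and soft supervision can be made concrete with a small sketch. It assumes the common KD formulation in which the student loss is cross-entropy on the one-hot label plus α times the KL divergence from the teacher distribution to the student distribution; this summary does not state the paper's exact loss, so treat the formula and every function name below as illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_probs):
    """Cross-entropy of the student against a target distribution."""
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0)

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): zero only when the student matches the teacher."""
    return sum(t * math.log(t / s) for t, s in zip(teacher_probs, student_probs) if t > 0)

def kd_loss(student_logits, teacher_logits, gold_index, alpha):
    """Assumed combined loss: hard-label cross-entropy + alpha * soft-label KL."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student))]
    hard = cross_entropy(one_hot, student)   # pushes all mass onto the gold token
    soft = kl_divergence(teacher, student)   # keeps mass on plausible alternatives
    return hard + alpha * soft
```

With `alpha=0` this reduces to ordinary hard-label finetuning; a larger `alpha` pulls the student toward the teacher's full distribution instead of a single token.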

Main Contribution

Propose smoothed soft-label training via knowledge distillation during instruction finetuning to reduce hallucination.

Run systematic experiments across model families (Llama-2, Llama-3.1, Qwen-2.5) and small models, showing that KD reduces faithfulness hallucination.

Key Findings

KD reduces faithfulness hallucination on summarization benchmarks.

Numbers: Llama-2-7B ROUGE-L 28.0 → 28.8; Factual Consistency 86.3% → 87.7%

Practical Use: If you finetune Llama-2-style models with teacher soft labels, expect modest but reliable gains in faithfulness on summarization tasks.

Evidence Ref: Table 1

KD maintains or improves general reasoning/comprehension performance.

Numbers: Llama-2-7B ARC_Challenge 38.4% → 39.6% (KD)

Practical Use: You can add KD during instruction finetuning without losing accuracy on typical QA and commonsense tasks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-L (CNNDM) for Llama-2-7B	SFT 28.0% → KD 28.8% (+0.8)	SFT 28.0%	+0.8	CNNDM test	KD improves n-gram overlap modestly	Table 1
Factual Consistency (CNNDM) for Llama-2-7B	SFT 86.3% → KD 87.7% (+1.4)	SFT 86.3%	+1.4	CNNDM test	Classifier-based grounding improved	Table 1
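To put the +1.4 point gain in perspective, it can help to look at the complement of the score. The sketch below assumes that 100 minus the factual-consistency score can be read as an inconsistency rate, which is an interpretation and not a figure reported in the tables; the function name is hypothetical.

```python
def relative_error_reduction(baseline_pct, new_pct):
    """Fraction of the baseline error (100 - score) removed by the new model."""
    baseline_err = 100.0 - baseline_pct
    new_err = 100.0 - new_pct
    return (baseline_err - new_err) / baseline_err

# Llama-2-7B factual consistency on CNNDM: SFT 86.3% -> KD 87.7%
reduction = relative_error_reduction(86.3, 87.7)
print(f"{reduction:.1%}")  # prints 10.2%
```

Under that reading, the inconsistency rate falls from 13.7% to 12.3%, roughly a tenth of the baseline errors.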

What To Try In 7 Days

Pick one production finetuning job and add KD: distill from a slightly larger well-calibrated teacher.

Precompute teacher soft labels to avoid repeated teacher inference during student training.

Run a small ablation on KD weight α (try 0.1 and 1.0) and compare ROUGE-L plus a factual-consistency classifier before rollout.
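The ablation step can be sketched as a small selection rule. This is one possible acceptance criterion, not one taken from the paper: keep only runs that do not regress ROUGE-L and that improve factual consistency over the SFT baseline, then pick the most factually consistent. The `EvalResult` type and `pick_alpha` helper are hypothetical, and the two KD rows use placeholder numbers purely to exercise the code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalResult:
    alpha: float                 # KD weight used for this run (0.0 = plain SFT)
    rouge_l: float               # ROUGE-L on the held-out summarization set
    factual_consistency: float   # classifier-based factual consistency, in %

def pick_alpha(baseline: EvalResult, runs: List[EvalResult]) -> Optional[EvalResult]:
    """Return the best eligible run, or None if no run beats the baseline."""
    eligible = [
        r for r in runs
        if r.rouge_l >= baseline.rouge_l
        and r.factual_consistency > baseline.factual_consistency
    ]
    return max(eligible, key=lambda r: r.factual_consistency, default=None)

# Baseline is the SFT row from the results table.
baseline = EvalResult(alpha=0.0, rouge_l=28.0, factual_consistency=86.3)
# PLACEHOLDER values for illustration only; these are not results from the paper.
runs = [
    EvalResult(alpha=0.1, rouge_l=28.1, factual_consistency=86.9),
    EvalResult(alpha=1.0, rouge_l=27.5, factual_consistency=88.0),
]
best = pick_alpha(baseline, runs)
print(best.alpha if best else "keep SFT baseline")
```

Requiring both metrics to hold guards against the metric-mismatch risk, where a gain on one score hides a loss on the other.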

Optimization Features

Infra Optimization

Requires extra compute for teacher inference; experiments ran on four H100s for ~1 hour

Model Optimization

Knowledge distillation with teacher soft labels

Training Optimization

Apply KD during instruction finetuning (sequence and word-level)

Hyperparameter sweep over KD weight α

Inference Optimization

Precompute teacher logits/soft labels to avoid repeated teacher inference
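A minimal sketch of what precomputing could look like follows. It assumes the teacher is run once over the training set and that only the top-k probabilities per token position are cached and renormalized to save storage; the top-k truncation is an assumption added here, not something the summary attributes to the paper, and both function names are hypothetical.

```python
import heapq
import math

def top_k_soft_labels(logits, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top_ids = heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)
    mass = sum(probs[i] for i in top_ids)
    return {i: probs[i] / mass for i in top_ids}

def precompute_soft_labels(teacher_logits_per_position, k):
    """One sparse soft label per token position; store the result once and reuse it."""
    return [top_k_soft_labels(logits, k) for logits in teacher_logits_per_position]

# Toy teacher output: 2 token positions over a 5-token vocabulary.
teacher_logits = [
    [2.0, 0.5, 1.5, -1.0, 0.0],
    [0.1, 3.0, 0.2, 2.5, -0.5],
]
cached = precompute_soft_labels(teacher_logits, k=2)
print([sorted(position) for position in cached])  # token ids kept per position
```

The cached list can be serialized alongside the training examples so that student training never has to load the teacher.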

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Dolly (finetune dataset), CNN/Daily Mail, XSUM, ARC, HellaSwag, OpenBookQA

Risks & Boundaries

Limitations

KD effectiveness depends on teacher calibration; a hallucinating teacher can transfer errors.

Experiments limited to instruction finetuning on Dolly, not full pretraining KD.
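Since the first limitation hinges on teacher calibration, one way to screen a candidate teacher is to measure expected calibration error (ECE) on a labeled held-out set. ECE is a standard measure swapped in here for illustration; the summary does not say how, or whether, the paper measured calibration, and the function below is a hypothetical helper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Toy check: a teacher that is 90% confident but right only 1 time in 4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

A lower value means the teacher's probabilities track its accuracy more closely, which is the property soft labels rely on.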

When Not To Use

When you lack a reliable teacher model or cannot afford teacher inference.

If your primary problem is factuality errors tied to parametric knowledge rather than faithfulness to inputs.

Failure Modes

Student inherits teacher hallucinations if the teacher is incorrect or biased.

Metric mismatch: improvements in ROUGE-L can mask factual defects detected by attention-based metrics.

Core Entities

Models

Llama-2-7B, Llama-2-13B, Llama-3.1-8B, Llama-3.1-70B, Qwen-2.5-7B, Qwen-2.5-32B, Mistral-7B, Pythia-6.9B, Bloomz-560M, OPT-350M, mt0-580M, Pythia-1B

Metrics

ROUGE-L, Factual Consistency (Vectara classifier), Factual Rate (LookbackLens attention metric), CHRF, BERTScore, METEOR, Accuracy

Datasets

Dolly, CNN/Daily Mail, XSUM, ARC, HellaSwag, OpenBookQA, CommonsenseQA, PubMedQA, DialogSum, HotpotQA, TruthfulQA

Benchmarks

hallucination leaderboard, lm-evaluation-harness
