Use smoothed soft-label distillation during finetuning to reduce LLM hallucinations

February 16, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple to add to finetuning and shows consistent empirical reductions in faithfulness hallucination across models and sizes, but it depends on teacher quality and adds teacher-inference cost.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 70%

Novelty: 50%

Authors

Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman

Why It Matters For Business

Smoothing labels with knowledge distillation (KD) can reduce made-up facts in summaries and answers, improving trustworthiness in high-stakes apps while keeping core model accuracy.

Summary TLDR

The paper shows that replacing hard one-hot labels with teacher-generated soft probability labels (knowledge distillation, KD) during instruction finetuning reduces faithfulness hallucination in several LLM families. The authors finetune students on Dolly using larger teachers and evaluate on CNN/Daily Mail and XSUM with ROUGE-L, a factual-consistency classifier, and a factual-rate attention metric. They find consistent reductions in hallucination while general NLP task performance is maintained or improved.

Problem Statement

Hard one-hot labels force models to be overconfident and ignore plausible alternatives. That overconfidence can cause models to invent facts (hallucinate). The paper asks: can teacher-generated soft labels (KD) smooth supervision and reduce faithfulness hallucination without hurting general performance?
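To make the contrast between hard and soft supervision concrete, here is a minimal sketch of the per-token loss under each kind of label. The 4-token vocabulary, the logits, and the teacher distribution are illustrative values chosen for this example, not numbers from the paper.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits):
    """Cross-entropy H(target, student) for a single token position."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

# Toy 4-token vocabulary; every number here is illustrative.
student_logits = [2.0, 1.5, 0.1, -1.0]
hard_label = [1.0, 0.0, 0.0, 0.0]      # one-hot: only token 0 is "correct"
soft_label = [0.6, 0.3, 0.08, 0.02]    # hypothetical teacher distribution

hard_loss = cross_entropy(hard_label, student_logits)
soft_loss = cross_entropy(soft_label, student_logits)

# With a soft target, the loss bottoms out at the teacher's entropy, which is
# reached when the student reproduces the teacher distribution exactly.
teacher_entropy = -sum(p * math.log(p) for p in soft_label)
matched_loss = cross_entropy(soft_label, [math.log(p) for p in soft_label])

print(f"hard-label loss: {hard_loss:.4f}")
print(f"soft-label loss: {soft_loss:.4f}")
print(f"soft-label loss at student == teacher: {matched_loss:.4f}")
```

The one-hot loss keeps shrinking only as the student pushes all probability onto a single token, whereas the soft-label loss is minimized by a student that still assigns mass to the teacher's plausible alternatives.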

Main Contribution

Propose smoothed soft-label training via knowledge distillation during instruction finetuning to reduce hallucination.

Systematic experiments across model families (Llama-2, Llama-3.1, Qwen-2.5) and small models showing KD reduces faithfulness hallucination.

Key Findings

KD reduces faithfulness hallucination on summarization benchmarks.

Numbers: Llama-2-7B ROUGE-L 28.0 → 28.8; Factual Consistency 86.3% → 87.7%

Practical Use: If you finetune Llama-2-style models with teacher soft labels, expect modest but reliable gains in faithfulness on summarization tasks.

Evidence Ref: Table 1

KD maintains or improves general reasoning/comprehension performance.

Numbers: Llama-2-7B ARC_Challenge 38.4% → 39.6% (KD)

Practical Use: You can add KD during instruction finetuning without losing accuracy on typical QA and commonsense tasks.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ROUGE-L (CNNDM) for Llama-2-7B | SFT 28.0% → KD 28.8% (+0.8) | SFT 28.0% | +0.8 | CNNDM test | KD improves n-gram overlap modestly | Table 1
Factual Consistency (CNNDM) for Llama-2-7B | SFT 86.3% → KD 87.7% (+1.4) | SFT 86.3% | +1.4 | CNNDM test | Classifier-based grounding improved | Table 1

What To Try In 7 Days

Pick one production finetuning job and add KD: distill from a slightly larger well-calibrated teacher.

Precompute teacher soft labels to avoid repeated teacher inference during student training.

Run a small ablation on KD weight α (try 0.1 and 1.0) and compare ROUGE-L plus a factual-consistency classifier before rollout.
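The α ablation above can be sketched as a per-token loss that mixes the hard-label term with a distillation term. This summary does not state the paper's exact objective, so the convex combination `(1 - alpha) * CE + alpha * KL(teacher || student)` used here is an assumption (an additive form, `CE + alpha * KL`, is another common choice), and all inputs are illustrative values.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw scores (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kd_loss(student_logits, teacher_probs, gold_index, alpha):
    """Assumed objective: (1 - alpha) * hard CE + alpha * KL(teacher || student)."""
    log_q = log_softmax(student_logits)
    hard_ce = -log_q[gold_index]
    kl = sum(p * (math.log(p) - lq)
             for p, lq in zip(teacher_probs, log_q) if p > 0)
    return (1.0 - alpha) * hard_ce + alpha * kl

# Illustrative single-token inputs over a 4-token vocabulary.
student_logits = [2.0, 1.5, 0.1, -1.0]
teacher_probs = [0.6, 0.3, 0.08, 0.02]   # hypothetical teacher soft label
gold_index = 0                           # index of the hard (one-hot) label

# Sweep the two KD weights suggested above.
results = {}
for alpha in (0.1, 1.0):
    results[alpha] = kd_loss(student_logits, teacher_probs, gold_index, alpha)
    print(f"alpha={alpha}: loss={results[alpha]:.4f}")
```

Under this assumed form, α = 1.0 drops the hard-label term entirely, so the two settings bracket "mostly hard labels" and "teacher only". Loss values at different α are not directly comparable, so the comparison before rollout should rest on the downstream metrics (ROUGE-L and the factual-consistency classifier).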

Optimization Features

Infra Optimization
Requires extra compute for teacher inference; experiments ran on four H100s for ~1 hour
Model Optimization
Knowledge distillation with teacher soft labels
Training Optimization
Apply KD during instruction finetuning (sequence and word-level)
Hyperparameter sweep over KD weight α
Inference Optimization
Precompute teacher logits/soft labels to avoid repeated teacher inference
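
Precomputing teacher soft labels might look like the sketch below: run the teacher once per example, write the per-token distributions to disk, and read them back during student training. Everything named here is hypothetical (`teacher_logits_fn`, `fake_teacher`, the JSONL layout), and keeping only the top-k tokens is a storage-saving assumption of this sketch, not something the source describes.

```python
import json
import math
import os
import tempfile

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_probs(logits, k):
    """Keep the k most likely tokens and renormalize them to sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [[i, probs[i] / mass] for i in top]   # [token_id, probability] pairs

def cache_teacher_labels(examples, teacher_logits_fn, path, k=2):
    """Run the teacher once per example and write one JSON line per example."""
    with open(path, "w", encoding="utf-8") as f:
        for ex_id, text in examples:
            per_token = [top_k_probs(row, k) for row in teacher_logits_fn(text)]
            f.write(json.dumps({"id": ex_id, "soft_labels": per_token}) + "\n")

def load_teacher_labels(path):
    """Read the cache back as {example_id: per-token soft labels}."""
    with open(path, encoding="utf-8") as f:
        return {rec["id"]: rec["soft_labels"] for rec in map(json.loads, f)}

def fake_teacher(text):
    """Hypothetical stand-in for a real teacher: one logit row per word."""
    return [[float(len(word)), 1.0, 0.0] for word in text.split()]

examples = [("ex-1", "the cat sat")]
path = os.path.join(tempfile.mkdtemp(), "teacher_soft_labels.jsonl")
cache_teacher_labels(examples, fake_teacher, path, k=2)
cached = load_teacher_labels(path)
print(cached["ex-1"][0])   # top-2 [token_id, probability] pairs for the first token
```

With a cache like this, the teacher-inference cost is paid once per dataset rather than once per training epoch or hyperparameter run.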

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Dolly (finetune dataset), CNN/Daily Mail, XSUM, ARC, HellaSwag, OpenBookQA

Risks & Boundaries

Limitations

KD effectiveness depends on teacher calibration; a hallucinating teacher can transfer errors.

Experiments limited to instruction finetuning on Dolly, not full pretraining KD.

When Not To Use

When you lack a reliable teacher model or cannot afford teacher inference.

If your primary problem is factuality errors tied to parametric knowledge rather than faithfulness to inputs.

Failure Modes

Student inherits teacher hallucinations if the teacher is incorrect or biased.

Metric mismatch: improvements in ROUGE-L can mask factual defects detected by attention-based metrics.

Core Entities

Models

Llama-2-7B, Llama-2-13B, Llama-3.1-8B, Llama-3.1-70B, Qwen-2.5-7B, Qwen-2.5-32B, Mistral-7B, Pythia-6.9B, Bloomz-560M, OPT-350M, mt0-580M, Pythia-1B

Metrics

ROUGE-L, Factual Consistency (Vectara classifier), Factual Rate (LookbackLens attention metric), CHRF, BERTScore, METEOR, Accuracy

Datasets

Dolly, CNN/Daily Mail, XSUM, ARC, HellaSwag, OpenBookQA, CommonsenseQA, PubMedQA, DialogSum, HotpotQA, TruthfulQA

Benchmarks

hallucination leaderboard, lm-evaluation-harness