Use smoothed soft-label distillation during finetuning to reduce LLM hallucinations

February 16, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.45

Citation Count

1

Authors

Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman

Why It Matters For Business

Smoothing training labels with knowledge distillation (KD) can reduce fabricated facts in summaries and answers, improving trustworthiness in high-stakes applications while preserving core model accuracy.

Summary TLDR

The paper shows that replacing hard one-hot labels with teacher-generated soft probability labels (knowledge distillation, KD) during instruction finetuning reduces faithfulness hallucination in several LLM families. The authors finetune student models on Dolly using larger teachers and evaluate on CNN/Daily Mail and XSUM with ROUGE-L, a factual-consistency classifier, and an attention-based factual-rate metric. They report consistent reductions in hallucination while general NLP task performance is maintained or improved.

Problem Statement

Hard one-hot labels force models to be overconfident and ignore plausible alternatives. That overconfidence can cause models to invent facts (hallucinate). The paper asks: can teacher-generated soft labels (KD) smooth supervision and reduce faithfulness hallucination without hurting general performance?
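To make the contrast between hard and soft supervision concrete, here is a minimal per-token sketch. It adds a KL term against a teacher distribution to the usual hard-label cross-entropy. The function names, the additive `ce + alpha * kl` weighting, and the toy logits are illustrative assumptions; the paper's exact objective may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_probs, target_index):
    """Hard-label loss: only the probability of the gold token matters."""
    return -math.log(student_probs[target_index])

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): penalizes mismatch over the whole vocabulary."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

def kd_token_loss(student_logits, teacher_logits, target_index, alpha):
    """Assumed blend: hard-label CE plus alpha times the soft-label KL term."""
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    ce = cross_entropy(student_probs, target_index)
    kl = kl_divergence(teacher_probs, student_probs)
    return ce + alpha * kl

# Toy 4-token vocabulary: the teacher spreads mass over two plausible tokens,
# while the student is sharply peaked on a single one.
student = [4.0, 0.5, 0.1, 0.1]
teacher = [2.0, 1.8, 0.1, 0.1]
hard_only = kd_token_loss(student, teacher, target_index=0, alpha=0.0)
with_kd = kd_token_loss(student, teacher, target_index=0, alpha=1.0)
print(hard_only, with_kd)
```

With `alpha=0.0` the loss reduces to plain cross-entropy, so the student is never penalized for ignoring the teacher's second plausible token. A positive `alpha` adds that penalty.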

Main Contribution

Propose smoothed soft-label training via knowledge distillation during instruction finetuning to reduce hallucination.

Systematic experiments across model families (Llama-2, Llama-3.1, Qwen-2.5) and small models showing KD reduces faithfulness hallucination.

Show KD preserves or slightly improves downstream reasoning and comprehension metrics (ARC, HellaSwag, OpenBookQA).

Provide a qualitative case study showing fewer fabricated facts in KD summaries.

Key Findings

KD reduces faithfulness hallucination on summarization benchmarks.

Numbers: Llama-2-7B ROUGE-L 28.0→28.8; Factual Consistency 86.3%→87.7%

KD maintains or improves general reasoning/comprehension performance.

Numbers: Llama-2-7B ARC_Challenge 38.4%→39.6% (KD)

KD benefits are visible across model sizes, from small (~350M) to mid (7–8B).

Numbers: Bloomz-560M CNNDM ROUGE-L 20.4→20.8; TruthfulQA ROUGE-L 11.4→13.2

Some hallucination metrics disagree: gains in ROUGE-L or FC may not always match factual-rate changes.

Numbers: Llama-2 XSUM Factual Rate (FR) fell slightly (91.2%→89.7%) while ROUGE-L and Factual Consistency (FC) rose

KD effectiveness depends on teacher quality and adds compute during finetuning.

Numbers: All KD experiments used teacher models; training sessions took ~1 hour on four NVIDIA H100 GPUs

Results

ROUGE-L (CNNDM) for Llama-2-7B

Value: SFT 28.0% → KD 28.8% (+0.8)

Baseline: SFT 28.0%

Factual Consistency (CNNDM) for Llama-2-7B

Value: SFT 86.3% → KD 87.7% (+1.4)

Baseline: SFT 86.3%

Accuracy (ARC_Challenge) for Llama-2-7B

Value: SFT 38.4% → KD 39.6% (+1.2)

Baseline: SFT 38.4%

Bloomz-560M TruthfulQA ROUGE-L

Value: SFT 11.4 → KD 13.2 (+1.8)

Baseline: SFT 11.4

Metric disagreement example (Llama-2 on XSUM)

Value: ROUGE-L/FC up, Factual Rate slightly down
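Because the metrics do not always move together, it can help to compute every delta and flag regressions rather than reading a single headline number. The sketch below does that for the Llama-2 figures reported in this summary. The helper names are hypothetical, and it assumes higher is better for every metric listed.

```python
def metric_deltas(results):
    """Map metric name -> KD score minus SFT score, rounded to one decimal."""
    return {name: round(kd - sft, 1) for name, (sft, kd) in results.items()}

def regressions(deltas):
    """Names of metrics that got worse under KD (assumes higher is better)."""
    return [name for name, delta in deltas.items() if delta < 0]

# (SFT, KD) pairs copied from the figures reported above for Llama-2.
reported = {
    "CNNDM ROUGE-L": (28.0, 28.8),
    "CNNDM Factual Consistency": (86.3, 87.7),
    "ARC_Challenge accuracy": (38.4, 39.6),
    "XSUM Factual Rate": (91.2, 89.7),
}

deltas = metric_deltas(reported)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print("Regressed:", regressions(deltas))  # Regressed: ['XSUM Factual Rate']
```

A check like this makes it harder for an overlap gain to hide a drop in an attention-based faithfulness score.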

Who Should Care

What To Try In 7 Days

Pick one production finetuning job and add KD: distill from a slightly larger well-calibrated teacher.

Precompute teacher soft labels to avoid repeated teacher inference during student training.

Run a small ablation on KD weight α (try 0.1 and 1.0) and compare ROUGE-L plus a factual-consistency classifier before rollout.
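The precompute step above can be sketched as a one-time pass that stores the teacher's per-token distributions on disk, so student training only reads a file. Everything here is an assumption made for illustration: the helper names, the JSON layout, and the choice to keep only the top-k teacher probabilities (renormalized) to limit storage. The paper does not specify a caching format.

```python
import json
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_soft_labels(logits, k):
    """Keep the k most likely token ids and renormalize their probabilities.

    Token ids are stored as strings so the result survives a JSON round trip.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in ranked)
    return {str(i): probs[i] / mass for i in ranked}

def build_cache(teacher_logits_by_example, k):
    """Map example id -> list of per-token top-k soft labels."""
    return {
        example_id: [top_k_soft_labels(token_logits, k) for token_logits in sequence]
        for example_id, sequence in teacher_logits_by_example.items()
    }

def save_cache(cache, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f)

def load_cache(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Toy teacher output: one example, two token positions, 4-token vocabulary.
toy_logits = {"example-0": [[2.0, 1.8, 0.1, 0.1], [0.2, 3.0, 0.1, 2.5]]}
cache = build_cache(toy_logits, k=2)
print(cache["example-0"][0])
```

In a real job the logits would come from a single teacher forward pass over the finetuning set. The α ablation can then reuse the same cache for every run.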

Optimization Features

Infra Optimization

  • Requires extra compute for teacher inference; experiments ran on four H100s for ~1 hour

Model Optimization

  • Knowledge distillation with teacher soft labels

Training Optimization

  • Apply KD during instruction finetuning (sequence-level and word-level)
  • Hyperparameter sweep over KD weight α

Inference Optimization

  • Precompute teacher logits/soft labels to avoid repeated teacher inference

Reproducibility

Data Urls

  • Dolly (finetune dataset)
  • CNN/Daily Mail
  • XSUM
  • ARC
  • HellaSwag
  • OpenBookQA

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • KD effectiveness depends on teacher calibration; a hallucinating teacher can transfer errors.
  • Experiments limited to instruction finetuning on Dolly, not full pretraining KD.
  • Study focuses on faithfulness hallucination; factuality (real-world correctness) was not systematically tested.
  • KD adds compute overhead for teacher inference during training.

When Not To Use

  • When you lack a reliable teacher model or cannot afford teacher inference.
  • If your primary problem is factuality errors tied to parametric knowledge rather than faithfulness to inputs.
  • When human evaluation or alternative metrics are required but unavailable.

Failure Modes

  • Student inherits teacher hallucinations if the teacher is incorrect or biased.
  • Metric mismatch: improvements in ROUGE-L can mask factual defects detected by attention-based metrics.
  • Extra compute cost may make KD impractical for very large teacher ensembles.

Core Entities

Models

  • Llama-2-7B
  • Llama-2-13B
  • Llama-3.1-8B
  • Llama-3.1-70B
  • Qwen-2.5-7B
  • Qwen-2.5-32B
  • Mistral-7B
  • Pythia-6.9B
  • Bloomz-560M
  • OPT-350M
  • mt0-580M
  • Pythia-1B

Metrics

  • ROUGE-L
  • Factual Consistency (Vectara classifier)
  • Factual Rate (LookbackLens attention metric)
  • CHRF
  • BERTScore
  • METEOR
  • Accuracy
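For readers unfamiliar with ROUGE-L, the sketch below shows the core idea: an F-score built from the longest common subsequence (LCS) of candidate and reference tokens. It is a simplified illustration that assumes whitespace tokenization and a balanced F1. Standard implementations may tokenize, stem, and weight recall differently, so its scores will not match the paper's numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """LCS-based F1 between a candidate summary and a reference summary."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Because the score only rewards word overlap with a reference, a summary can score well while still containing an unsupported claim. This is one reason the study pairs ROUGE-L with Factual Consistency and Factual Rate.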

Datasets

  • Dolly
  • CNN/Daily Mail
  • XSUM
  • ARC
  • HellaSwag
  • OpenBookQA
  • CommonsenseQA
  • PubMedQA
  • DialogSum
  • HotpotQA
  • TruthfulQA

Benchmarks

  • hallucination leaderboard
  • lm-evaluation-harness