Teach LLMs to spot and avoid context-based hallucinations by masking retrieval heads and contrastive tuning

Overview

Decision Snapshot: Ready For Pilot

Method is practical for teams that can fine-tune models; gains are demonstrated on a controlled multi-dataset benchmark and validated by human evaluation.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Bing Qin

Why It Matters For Business

RHIO reduces unsupported statements in long answers by teaching models to recognize unfaithful outputs. That lowers risk in customer-facing information systems and can match or beat stronger black‑box models on groundedness for contexts that contain the needed facts.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

The authors introduce RHIO, a training and decoding recipe that (1) creates realistic unfaithful examples by masking "retrieval heads" (attention heads that copy from context), (2) fine-tunes models with control tokens [POS]/[NEG] so they learn to produce faithful vs. unfaithful outputs, and (3) applies a contrastive Self-Induced Decoding (SID) step at inference to amplify the faithful output. They compile GroundBench (5 long-form question answering, or LFQA, datasets) for evaluation. RHIO improves the average faithfulness of Llama-2 7B/13B by about 9.4 absolute points (about 12.8% relative) over supervised fine-tuning and slightly outperforms GPT-4o on human faithfulness scores on the evaluated benchmark.

Problem Statement

Retrieval-augmented LLMs often produce long answers that mix supported facts with hallucinations. Current fixes (denoising, self-reflection, context-aware decoding) compensate for errors but do not teach models to recognize and avoid unfaithful generations. The paper asks: can we create realistic unfaithful examples and train models to explicitly distinguish faithful from unfaithful outputs to reduce hallucination?

Main Contribution

Identify a mechanistic link between retrieval heads (special attention heads that copy from context) and contextual faithfulness; masking these heads reproduces common unfaithful error patterns.

Propose RHIO: (a) mask retrieval heads to generate realistic negative (unfaithful) examples, (b) Faithfulness-Aware Tuning (FAT) with [POS]/[NEG] control tokens, (c) Self-Induced Decoding (SID) that contrasts POS/NEG outputs at inference.
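As a rough illustration of step (b), the sketch below pairs one faithful and one unfaithful answer for the same question and context, and prefixes each prompt with the matching control token. The prompt template, the `TuningExample` class, and the `build_fat_examples` helper are hypothetical names introduced here for illustration; the summary does not give the paper's actual data format.

```python
from dataclasses import dataclass

# Control tokens named in the summary.
POS, NEG = "[POS]", "[NEG]"


@dataclass
class TuningExample:
    prompt: str
    target: str


def build_fat_examples(question, context, faithful, unfaithful):
    """Return one [POS] and one [NEG] example that share question and context.

    The prompt layout is an assumption, not the paper's template.
    """
    base = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return [
        TuningExample(prompt=f"{POS} {base}", target=faithful),
        TuningExample(prompt=f"{NEG} {base}", target=unfaithful),
    ]


examples = build_fat_examples(
    question="Where is the Eiffel Tower?",
    context="The Eiffel Tower is in Paris.",
    faithful="It is in Paris.",
    unfaithful="It is in Lyon.",  # stands in for a masked-head generation
)
for ex in examples:
    print(ex.prompt.split()[0], "->", ex.target)
```

Keeping the question and context identical across the pair means the control token is the only signal that separates the two targets, which is the distinction the tuning step is meant to teach.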

Key Findings

Masking retrieval heads sharply reduces faithfulness in generated LFQA answers.

Numbers: Llama-2-7B-Chat: faithfulness 80.14% (0 masked) → 35.85% (100 masked).

Practical Use: Retrieval heads are central to context copying; generating unfaithful samples by masking them yields realistic model errors you can train on.

Evidence Ref: Appendix A.1 (random vs retrieval-head masking table)
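To make the masking idea concrete, here is a toy sketch that ranks heads by a retrieval score and zeroes the outputs of the top-ranked ones. It operates on plain lists rather than a real model, and it assumes that masking means zeroing a head's output; the summary does not say how the paper detects or masks heads, and `top_k_heads` and `mask_heads` are hypothetical helpers.

```python
def top_k_heads(scores, k):
    """Indices of the k heads with the highest (hypothetical) retrieval scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def mask_heads(head_outputs, masked):
    """Zero the output vector of every head whose index is in `masked`."""
    return [
        [0.0] * len(vec) if i in masked else list(vec)
        for i, vec in enumerate(head_outputs)
    ]


# Three toy heads with two-dimensional outputs; head 1 scores highest.
head_outputs = [[0.2, 0.1], [0.9, 0.8], [0.3, 0.4]]
retrieval_scores = [0.05, 0.90, 0.10]

masked = top_k_heads(retrieval_scores, k=1)
print(mask_heads(head_outputs, masked))
# [[0.2, 0.1], [0.0, 0.0], [0.3, 0.4]]
```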

RHIO substantially increases average faithfulness compared to standard supervised fine-tuning (SFT) on GroundBench.

Numbers: RHIO-7B avg faith 82.35% vs SFT 72.98% (+9.37 pp, +12.84% rel); RHIO-13B 83.77% vs SFT 74.40% (+9.37 pp, +12.59% rel).

Practical Use: If you can fine-tune your model, adding masked-head negatives and control-token training gives a clear, repeatable boost in contextual faithfulness on evaluated datasets.

Evidence Ref: Table 2 & Table 3 (GroundBench main and ablation results)
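The absolute and relative gains quoted above can be reproduced from the reported averages. The short check below uses only the figures in this summary; `faithfulness_gain` is a helper name chosen here.

```python
def faithfulness_gain(new, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_pp = new - baseline
    rel_pct = 100.0 * abs_pp / baseline
    return round(abs_pp, 2), round(rel_pct, 2)


print(faithfulness_gain(82.35, 72.98))  # (9.37, 12.84)  RHIO-7B vs SFT
print(faithfulness_gain(83.77, 74.40))  # (9.37, 12.59)  RHIO-13B vs SFT
```

The absolute gain is the same for both sizes, while the relative gain is slightly smaller for the 13B model because its SFT baseline is higher.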

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Avg. Faithfulness (GroundBench)	82.35%	Llama-2-7B + SFT 72.98%	+9.37 pp	GroundBench (aggregate)	Table 2 & Table 3	Tables 2–3
Avg. Faithfulness (GroundBench)	83.77%	Llama-2-13B + SFT 74.40%	+9.37 pp	GroundBench (aggregate)	Table 2 & Table 3	Tables 2–3

What To Try In 7 Days

Run retrieval-head detection on your model and mask the top retrieval heads to produce negative examples.

Fine-tune a small Llama-2 style model with paired [POS]/[NEG] prefixes using those negatives plus faithful data.

At inference, add SID (α≈0.2) to contrast POS vs NEG outputs and evaluate faithfulness on a controlled set.
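For the decoding step, a minimal sketch on toy next-token logits is shown below. It assumes the common contrastive form `(1 + alpha) * pos - alpha * neg`; the summary only says that SID contrasts POS and NEG outputs with α≈0.2, so the exact formula is an assumption, and `sid_logits` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sid_logits(pos_logits, neg_logits, alpha=0.2):
    """Contrast [POS]-conditioned against [NEG]-conditioned logits.

    Assumed form: (1 + alpha) * pos - alpha * neg, applied per token.
    """
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]


# Toy vocabulary of three tokens. Token 0 is favoured under [POS],
# token 1 under [NEG].
pos = [2.0, 1.0, 0.5]
neg = [0.5, 2.5, 0.5]

contrasted = sid_logits(pos, neg, alpha=0.2)
p_before = softmax(pos)[0]
p_after = softmax(contrasted)[0]
print(p_after > p_before)  # True: the [POS]-preferred token gains probability
```

In a real pipeline the two logit vectors would come from two forward passes over the same context with different control-token prefixes, which is why this step roughly doubles decoding cost.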

Optimization Features

Infra Optimization

DeepSpeed (ZeRO stage 3) used for multi-GPU full fine-tuning

Training Optimization

Faithfulness-aware tuning with control tokens

Data augmentation via masked retrieval heads

Inference Optimization

Self-Induced Decoding (contrastive decoding with POS/NEG)

Tune α (paper finds 0.2 works well)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

GroundBench enforces contexts that contain answers, so RHIO does not address failures where retrieval misses the needed facts.

Experiments primarily use Llama-2 family; transfer to other model families was not shown.

When Not To Use

You cannot fine-tune the model or afford multi-pass decoding (for example, you have no access to model weights, or inference cost is already a constraint).

Your main problem is retrieval failure rather than model synthesis from available context.

Failure Modes

Overfitting to masked-head error patterns that differ from real-world retrieval failures.

Degraded answer coherence if negative samples are too noisy (too many heads masked).

Core Entities

Models

Llama-2-7B, Llama-2-13B, Llama-3.1-70B-Instruct, Llama-3.1-8B-Instruct, Mistral-NeMo-12B-Instruct, GPT-4o, GPT-4o-mini

Metrics

Avg. Faithfulness (MiniCheck Bespoke-MiniCheck-7B), ROUGE-L, Claim recall (ELI5-WebGPT), SEMQA (QuoteSum), Human Likert faithfulness/completeness

Datasets

GroundBench, FRONT (used for training), ELI5-WebGPT, ExpertQA, HAGRID, CLAPNQ, QuoteSum, Natural Questions (subset for training)

Benchmarks

GroundBench
