Teach LLMs to spot and avoid context-based hallucinations by masking retrieval heads and contrastive tuning

January 23, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Method is practical for teams that can fine-tune models; gains are demonstrated on a controlled multi-dataset benchmark and validated by human evaluation.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Bing Qin


Why It Matters For Business

RHIO reduces unsupported statements in long answers by teaching models to recognize unfaithful outputs. That lowers risk in customer-facing information systems and can match or beat stronger black‑box models on groundedness for contexts that contain the needed facts.

Who Should Care

Summary TLDR

The authors introduce RHIO, a training and decoding recipe with three parts: (1) it creates realistic unfaithful examples by masking "retrieval heads" (attention heads that copy from context), (2) it fine-tunes models with control tokens [POS]/[NEG] so they learn to produce faithful vs unfaithful outputs, and (3) it uses a contrastive Self-Induced Decoding (SID) step at inference to amplify the faithful output. The authors also compile GroundBench, a benchmark of five long-form question answering (LFQA) datasets, for evaluation. On GroundBench, RHIO improves the average faithfulness of Llama-2 7B and 13B by about +9.4 absolute points (roughly +12.6% to +12.8% relative) over supervised fine-tuning, and it slightly outperforms GPT-4o on human faithfulness scores.

Problem Statement

Retrieval-augmented LLMs often produce long answers that mix supported facts with hallucinations. Current fixes (denoising, self-reflection, context-aware decoding) compensate for errors but do not teach models to recognize and avoid unfaithful generations. The paper asks: can we create realistic unfaithful examples and train models to explicitly distinguish faithful from unfaithful outputs to reduce hallucination?

Main Contribution

Identify a mechanistic link between retrieval heads (special attention heads that copy from context) and contextual faithfulness; masking these heads reproduces common unfaithful error patterns.

Propose RHIO: (a) mask retrieval heads to generate realistic negative (unfaithful) examples, (b) Faithfulness-Aware Tuning (FAT) with [POS]/[NEG] control tokens, (c) Self-Induced Decoding (SID) that contrasts POS/NEG outputs at inference.
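To make step (c) concrete, the sketch below combines next-token logits from a [POS]-prefixed pass and a [NEG]-prefixed pass for a single decoding step. The combination rule (1 + α) · pos − α · neg is the common contrastive-decoding form and is an assumption here, not a formula confirmed by this summary. The function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sid_next_token_probs(pos_logits, neg_logits, alpha=0.2):
    """Contrast POS- and NEG-conditioned logits for one decoding step.

    Assumed rule: (1 + alpha) * pos - alpha * neg. The paper's exact
    formulation may differ.
    """
    if len(pos_logits) != len(neg_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    combined = [(1 + alpha) * p - alpha * n
                for p, n in zip(pos_logits, neg_logits)]
    return softmax(combined)


# Toy vocabulary of three tokens (illustrative values only).
pos = [2.0, 1.0, 0.5]   # logits when the prompt is prefixed with [POS]
neg = [0.5, 1.0, 2.0]   # logits when the prompt is prefixed with [NEG]

plain = softmax(pos)
contrasted = sid_next_token_probs(pos, neg, alpha=0.2)
```

In this toy case the token favored by the [POS] pass and disfavored by the [NEG] pass gains probability mass after contrasting. Setting alpha to 0 recovers the plain [POS] distribution. Note that this scheme needs two forward passes per decoding step.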

Key Findings

Masking retrieval heads sharply reduces faithfulness in generated LFQA answers.

Numbers: Llama-2-7B-Chat faithfulness 80.14% (0 heads masked) → 35.85% (100 heads masked).

Practical Use: Retrieval heads are central to context copying; generating unfaithful samples by masking them yields realistic model errors you can train on.

Evidence Ref: Appendix A.1 (random vs retrieval-head masking table)
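The masking idea in this finding can be sketched in a framework-free way: rank heads by a retrieval score, pick the top k, and zero their outputs. Zeroing the per-head output vector is an assumption about how masking is done. The score values, the (layer, head) indexing, and both function names are hypothetical. In a real model this logic would be applied inside the attention layers.

```python
def top_k_heads(retrieval_scores, k):
    """Return the k (layer, head) pairs with the highest retrieval score."""
    return sorted(retrieval_scores, key=retrieval_scores.get, reverse=True)[:k]


def mask_heads(head_outputs, heads_to_mask):
    """Zero the output vectors of the selected heads in one layer.

    head_outputs: list of per-head output vectors (lists of floats).
    heads_to_mask: set of head indices within this layer.
    """
    return [
        [0.0] * len(vec) if idx in heads_to_mask else list(vec)
        for idx, vec in enumerate(head_outputs)
    ]


# Hypothetical retrieval scores keyed by (layer, head).
scores = {(0, 0): 0.10, (0, 1): 0.90, (1, 0): 0.70, (1, 1): 0.05}
top = top_k_heads(scores, k=2)

# Toy per-head outputs for layer 0 (two heads, two dimensions each).
layer0_outputs = [[0.5, -0.2], [1.0, 2.0]]
layer0_mask = {head for layer, head in top if layer == 0}
masked = mask_heads(layer0_outputs, layer0_mask)
```

Generating with the masked model on the same prompts then yields candidate unfaithful answers to use as negatives.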

RHIO substantially increases average faithfulness compared to standard supervised fine-tuning (SFT) on GroundBench.

Numbers: RHIO-7B average faithfulness 82.35% vs SFT 72.98% (+9.37 pp, +12.84% relative); RHIO-13B 83.77% vs SFT 74.40% (+9.37 pp, +12.59% relative).

Practical Use: If you can fine-tune your model, adding masked-head negatives and control-token training gives a clear, repeatable boost in contextual faithfulness on evaluated datasets.

Evidence Ref: Table 2 & Table 3 (GroundBench main and ablation results)
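The absolute and relative deltas quoted in this finding follow directly from the reported scores. A small check, using a hypothetical helper name:

```python
def deltas(new, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_pp = new - baseline
    rel_pct = 100 * abs_pp / baseline
    return round(abs_pp, 2), round(rel_pct, 2)


print(deltas(82.35, 72.98))  # RHIO-7B vs SFT  -> (9.37, 12.84)
print(deltas(83.77, 74.40))  # RHIO-13B vs SFT -> (9.37, 12.59)
```

The relative gain is smaller for the 13B model only because its SFT baseline is higher; the absolute gain is the same.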

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Avg. Faithfulness (GroundBench) | 82.35% | Llama-2-7B + SFT 72.98% | +9.37 pp | GroundBench (aggregate) | Table 2 & Table 3 | Tables 2–3
Avg. Faithfulness (GroundBench) | 83.77% | Llama-2-13B + SFT 74.40% | +9.37 pp | GroundBench (aggregate) | Table 2 & Table 3 | Tables 2–3

What To Try In 7 Days

Run retrieval-head detection on your model and mask the top retrieval heads to produce negative examples.

Fine-tune a small Llama-2 style model with paired [POS]/[NEG] prefixes using those negatives plus faithful data.

At inference, add SID (α≈0.2) to contrast POS vs NEG outputs and evaluate faithfulness on a controlled set.
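For the fine-tuning step above, one way to assemble paired training records is shown below. The prompt template, the placement of the control token at the start of the prompt, and the names Example and build_fat_pairs are all assumptions for illustration. The summary only states that [POS]/[NEG] prefixes are paired with faithful and unfaithful outputs.

```python
from dataclasses import dataclass

POS, NEG = "[POS]", "[NEG]"


@dataclass
class Example:
    prompt: str   # model input, starting with a control token
    target: str   # answer the model is trained to produce


def build_fat_pairs(question, context, faithful_answer, unfaithful_answer):
    """Build one [POS] and one [NEG] record from the same question and context.

    The unfaithful answer is assumed to come from the masked-head model.
    """
    base = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return [
        Example(prompt=f"{POS} {base}", target=faithful_answer),
        Example(prompt=f"{NEG} {base}", target=unfaithful_answer),
    ]


# Illustrative data only.
pairs = build_fat_pairs(
    question="When was the bridge opened?",
    context="The bridge opened to traffic in 1937.",
    faithful_answer="It opened in 1937.",
    unfaithful_answer="It opened in 1942.",
)
```

Keeping both records on an identical question and context means the control token is the only input difference the model can use to separate faithful from unfaithful targets.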

Optimization Features

Infra Optimization
DeepSpeed ZeRO stage 3 used for multi-GPU full fine-tuning
Training Optimization
Faithfulness-aware tuning with control tokens
Data augmentation via masked retrieval heads
Inference Optimization
Self-Induced Decoding (contrastive decoding with POS/NEG)
Tune α (paper finds 0.2 works well)

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

GroundBench enforces contexts that contain answers, so RHIO does not address failures where retrieval misses the needed facts.

Experiments primarily use Llama-2 family; transfer to other model families was not shown.

When Not To Use

You cannot fine-tune the model (no access to its weights) or cannot afford multi-pass decoding (SID adds inference cost by contrasting POS and NEG outputs).

Your main problem is retrieval failure rather than model synthesis from available context.

Failure Modes

Overfitting to masked-head error patterns that differ from real-world retrieval failures.

Degraded answer coherence if negative samples are too noisy (too many heads masked).

Core Entities

Models

Llama-2-7B, Llama-2-13B, Llama-3.1-70B-Instruct, Llama-3.1-8B-Instruct, Mistral-NeMo-12B-Instruct, GPT-4o, GPT-4o-mini

Metrics

Avg. Faithfulness (MiniCheck Bespoke-MiniCheck-7B), ROUGE-L, Claim recall (ELI5-WebGPT), SEMQA (QuoteSum), Human Likert faithfulness/completeness

Datasets

GroundBench, FRONT (training used), ELI5-WebGPT, ExpertQA, HAGRID, CLAPNQ, QuoteSum, Natural Questions (subset for training)

Benchmarks

GroundBench