Teach LLMs to spot and avoid context-based hallucinations with retrieval-head masking and contrastive tuning

January 23, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Bing Qin

Why It Matters For Business

RHIO reduces unsupported statements in long answers by teaching models to recognize unfaithful outputs. That lowers risk in customer-facing information systems. In the paper's human evaluation, a 13B open model trained with RHIO slightly exceeded GPT-4o on groundedness, provided the context contains the needed facts.

Summary TLDR

The authors introduce RHIO, a training and decoding recipe that (1) creates realistic unfaithful examples by masking "retrieval heads" (attention heads that copy from context), (2) fine-tunes models with control tokens [POS]/[NEG] so they learn to produce faithful vs unfaithful outputs, and (3) uses a contrastive self-induced decoding (SID) step at inference to amplify the faithful output. They compile GroundBench, built from five long-form question answering (LFQA) datasets, for evaluation. RHIO improves the average faithfulness of Llama-2 7B and 13B by about 9.4 percentage points (12.8% and 12.6% relative, respectively) over supervised fine-tuning. In human evaluation, RHIO-13B slightly outperforms GPT-4o on fully supported answers (87.5% vs 86.5%) on the evaluated benchmark.

Problem Statement

Retrieval-augmented LLMs often produce long answers that mix supported facts with hallucinations. Current fixes (denoising, self-reflection, context-aware decoding) compensate for errors but do not teach models to recognize and avoid unfaithful generations. The paper asks: can we create realistic unfaithful examples and train models to explicitly distinguish faithful from unfaithful outputs to reduce hallucination?

Main Contribution

Identify a mechanistic link between retrieval heads (special attention heads that copy from context) and contextual faithfulness; masking these heads reproduces common unfaithful error patterns.

Propose RHIO: (a) mask retrieval heads to generate realistic negative (unfaithful) examples, (b) Faithfulness-Aware Tuning (FAT) with [POS]/[NEG] control tokens, (c) Self-Induced Decoding (SID) that contrasts POS/NEG outputs at inference.

Release GroundBench, an evaluation suite assembled from five LFQA datasets (ELI5-WebGPT, ExpertQA, HAGRID, CLAPNQ, QuoteSum) with controlled contexts that contain sufficient evidence.

Show through extensive evaluation that RHIO raises average faithfulness to 82.35% for Llama-2 7B and 83.77% for Llama-2 13B (from 72.98% and 74.40% with SFT) and increases the share of answers that human raters judge fully supported.
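
To make the head-masking idea concrete, here is a minimal toy sketch in plain Python. It assumes that "masking" a head means zeroing that head's output before the heads are concatenated; the summary does not say exactly how the paper implements it. The function names (`attend`, `multi_head`) and the tiny vectors are illustrative, not the authors' code.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def multi_head(queries, keys, values, masked_heads=frozenset()):
    """Concatenate per-head outputs, zeroing every head in masked_heads.

    Assumption: masking = replacing the head's output with zeros.
    """
    out = []
    for h, (q, k, v) in enumerate(zip(queries, keys, values)):
        head_out = attend(q, k, v)
        if h in masked_heads:
            head_out = [0.0] * len(head_out)
        out.extend(head_out)
    return out

# Toy input: 2 heads, 2 context positions, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
values = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

full = multi_head(queries, keys, values)
masked = multi_head(queries, keys, values, masked_heads={1})
print(full)
print(masked)
```

In a real model, the masked set would be the top-scoring retrieval heads, and the model's generations under masking would serve as the unfaithful (negative) examples.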

Key Findings

Masking retrieval heads sharply reduces faithfulness in generated LFQA answers.

Numbers: Llama-2-7B-Chat faithfulness 80.14% (0 heads masked) → 35.85% (100 heads masked).

RHIO substantially increases average faithfulness compared to standard supervised fine-tuning (SFT) on GroundBench.

Numbers: RHIO-7B avg. faithfulness 82.35% vs SFT 72.98% (+9.37 pp, +12.84% rel.); RHIO-13B 83.77% vs SFT 74.40% (+9.37 pp, +12.59% rel.).

Self-induced decoding (SID) further improves faithfulness beyond FAT training.

Numbers: The authors report that SID improves average faithfulness by 2.90% (7B) and 4.17% (13B); the ablations in Table 3 show full RHIO outperforming the variant without SID.

Human evaluation shows RHIO-13B yields slightly more fully-supported answers than GPT-4o on the sampled outputs.

Numbers: Human-rated 'fully supported' rate: RHIO-13B 87.5% vs GPT-4o 86.5% (on 200 sampled generations).
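
The absolute (percentage-point) and relative gains quoted for RHIO vs SFT can be checked from the four reported faithfulness scores. The helper name `gains` below is only for illustration.

```python
def gains(new, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)

# Reported average faithfulness on GroundBench: RHIO vs SFT.
print(gains(82.35, 72.98))  # Llama-2 7B  -> (9.37, 12.84)
print(gains(83.77, 74.40))  # Llama-2 13B -> (9.37, 12.59)
```

Both model sizes gain the same 9.37 points, but the relative gain is slightly smaller for 13B because its SFT baseline is higher.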

Results

Avg. Faithfulness (GroundBench), RHIO-7B

Value: 82.35%

Baseline: Llama-2-7B + SFT, 72.98%

Avg. Faithfulness (GroundBench), RHIO-13B

Value: 83.77%

Baseline: Llama-2-13B + SFT, 74.40%

Human-rated 'fully supported' answers

Value: 87.5% (RHIO-13B)

Baseline: GPT-4o, 86.5%

Faithfulness after masking retrieval heads

Value: 35.85% (Llama-2-7B-Chat, 100 heads masked)

Baseline: 80.14% (0 heads masked)

Who Should Care

What To Try In 7 Days

Run retrieval-head detection on your model and mask the top retrieval heads to produce negative examples.

Fine-tune a small Llama-2 style model with paired [POS]/[NEG] prefixes using those negatives plus faithful data.

At inference, add SID (α≈0.2) to contrast POS vs NEG outputs and evaluate faithfulness on a controlled set.
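
The SID step can be sketched as a per-token contrast between the logits the model produces under the [POS] prefix and those under the [NEG] prefix. The combination rule below, (1 + α)·pos − α·neg, is a common contrastive-decoding form and is an assumption here; the paper's exact formula may differ. The function names and toy logits are hypothetical.

```python
def sid_logits(pos_logits, neg_logits, alpha=0.2):
    """Contrast [POS]- and [NEG]-conditioned logits for one decoding step.

    Assumed form: (1 + alpha) * pos - alpha * neg. With alpha = 0 this
    reduces to ordinary decoding under the [POS] prefix.
    """
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]

def greedy_pick(pos_logits, neg_logits, alpha=0.2):
    """Return the index of the highest-scoring token after contrasting."""
    scores = sid_logits(pos_logits, neg_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vocabulary of 3 tokens. Token 1 is narrowly preferred under [POS]
# but strongly preferred under [NEG], so the contrast demotes it.
pos = [2.0, 2.1, 0.1]
neg = [0.5, 3.0, 0.1]

print(greedy_pick(pos, neg, alpha=0.0))  # plain [POS] decoding
print(greedy_pick(pos, neg, alpha=0.2))  # contrastive decoding
```

Each decoding step needs two forward passes (one per prefix), which is the extra inference cost to budget for when evaluating SID.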

Optimization Features

Infra Optimization

  • DeepSpeed ZeRO Stage 3 used for multi-GPU full fine-tuning

Training Optimization

  • Faithfulness-aware tuning with control tokens
  • Data augmentation via masked retrieval heads
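
A minimal sketch of how paired training records for faithfulness-aware tuning might be assembled. The prompt template, the placement of the control token at the start of the prompt, and the record fields are assumptions for illustration; the example question and answers are made up, and in practice the negative answer would come from generation with retrieval heads masked.

```python
POS, NEG = "[POS]", "[NEG]"

def build_fat_pairs(question, context, faithful_answer, unfaithful_answer):
    """Build one [POS] and one [NEG] record for the same question and context.

    Assumption: the control token is prepended to the prompt, so the model
    learns to condition its answer style on the token.
    """
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return [
        {"prompt": f"{POS} {prompt}", "target": faithful_answer},
        {"prompt": f"{NEG} {prompt}", "target": unfaithful_answer},
    ]

# Made-up toy example, not taken from any dataset in the paper.
pairs = build_fat_pairs(
    question="When was the bridge opened?",
    context="The bridge opened to traffic in 1932.",
    faithful_answer="It opened in 1932.",
    unfaithful_answer="It opened in 1945.",
)
for record in pairs:
    print(record)
```

Keeping the question and context identical across the pair means the control token is the only signal that distinguishes the faithful target from the unfaithful one.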

Inference Optimization

  • Self-Induced Decoding (contrastive decoding with POS/NEG)
  • Tune α (paper finds 0.2 works well)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • GroundBench enforces contexts that contain answers, so RHIO does not address failures where retrieval misses the needed facts.
  • Experiments primarily use Llama-2 family; transfer to other model families was not shown.
  • Masking many retrieval heads can produce low-quality negatives; RHIO does not control generation of specific error types.

When Not To Use

  • You cannot fine-tune the model or afford multi-pass decoding (for example, you lack access to model weights, or inference cost is tightly constrained).
  • Your main problem is retrieval failure rather than model synthesis from available context.
  • You need guarantees across out-of-distribution sources not covered by GroundBench.

Failure Modes

  • Overfitting to masked-head error patterns that differ from real-world retrieval failures.
  • Degraded answer coherence if negative samples are too noisy (too many heads masked).
  • SID is sensitive to its hyperparameter α; a poorly tuned value may reduce output quality.

Core Entities

Models

  • Llama-2-7B
  • Llama-2-13B
  • Llama-3.1-70B-Instruct
  • Llama-3.1-8B-Instruct
  • Mistral-NeMo-12B-Instruct
  • GPT-4o
  • GPT-4o-mini

Metrics

  • Avg. Faithfulness (scored with Bespoke-MiniCheck-7B)
  • ROUGE-L
  • Claim recall (ELI5-WebGPT)
  • SEMQA (QuoteSum)
  • Human Likert faithfulness/completeness

Datasets

  • GroundBench
  • FRONT (used for training)
  • ELI5-WebGPT
  • ExpertQA
  • HAGRID
  • CLAPNQ
  • QuoteSum
  • Natural Questions (subset for training)

Benchmarks

  • GroundBench