Overview
The method is practical for teams that can fine-tune models. Gains are demonstrated on a controlled multi-dataset benchmark and validated by human evaluation.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
RHIO reduces unsupported statements in long answers by teaching models to recognize unfaithful outputs. That lowers risk in customer-facing information systems and can match or beat stronger black‑box models on groundedness for contexts that contain the needed facts.
Summary TLDR
The authors introduce RHIO, a training and decoding recipe that (1) creates realistic unfaithful examples by masking "retrieval heads" (attention heads that copy from context), (2) fine-tunes models with [POS]/[NEG] control tokens so they learn to produce faithful versus unfaithful outputs, and (3) applies a contrastive Self-Induced Decoding (SID) step at inference to amplify the faithful output. They compile GroundBench (5 LFQA datasets) for evaluation. On this benchmark, RHIO improves the average faithfulness of Llama-2 7B/13B by about 9.4 percentage points (≈12.8% relative) over supervised fine-tuning and slightly outperforms GPT-4o on human faithfulness scores.
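The control-token step can be pictured as building two training examples per input, one per token. The following is a minimal sketch under stated assumptions: the prompt template, the `Example` class, and `build_pair` are hypothetical names for illustration, and only the [POS]/[NEG] prefixes come from the summary.

```python
from dataclasses import dataclass

POS, NEG = "[POS]", "[NEG]"  # control tokens named in the summary


@dataclass
class Example:
    prompt: str
    target: str


def build_pair(context: str, question: str,
               faithful: str, unfaithful: str) -> list[Example]:
    """Return one [POS] and one [NEG] example for the same input.

    The prompt layout below is an assumption, not the paper's template.
    """
    base = f"Context: {context}\nQuestion: {question}\nAnswer: "
    return [
        Example(prompt=f"{POS} {base}", target=faithful),
        Example(prompt=f"{NEG} {base}", target=unfaithful),
    ]


pair = build_pair(
    context="The Eiffel Tower was completed in 1889.",
    question="When was the Eiffel Tower completed?",
    faithful="It was completed in 1889.",
    unfaithful="It was completed in 1901.",  # contradicts the context
)
for ex in pair:
    print(ex.prompt.split()[0], "->", ex.target)
```

Both examples share the same context and question, so the control token is the only signal that tells the model which kind of answer to produce.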
Problem Statement
Retrieval-augmented LLMs often produce long answers that mix supported facts with hallucinations. Current fixes (denoising, self-reflection, context-aware decoding) compensate for errors but do not teach models to recognize and avoid unfaithful generations. The paper asks: can we create realistic unfaithful examples and train models to explicitly distinguish faithful from unfaithful outputs to reduce hallucination?
Main Contribution
Identify a mechanistic link between retrieval heads (special attention heads that copy from context) and contextual faithfulness; masking these heads reproduces common unfaithful error patterns.
Propose RHIO: (a) mask retrieval heads to generate realistic negative (unfaithful) examples, (b) Faithfulness-Aware Tuning (FAT) with [POS]/[NEG] control tokens, (c) Self-Induced Decoding (SID) that contrasts POS/NEG outputs at inference.
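To make "masking retrieval heads" concrete, here is a toy sketch in plain Python. It assumes masking means zeroing a head's output before the per-token concatenation; the summary does not say how the authors implement it, and `mask_heads` and the toy values are hypothetical.

```python
def mask_heads(head_outputs, masked):
    """Zero the output of selected heads, then concatenate per token.

    head_outputs[h][t] is the output vector of head h at token t.
    Zeroing is an assumed masking strategy for illustration only.
    """
    num_tokens = len(head_outputs[0])
    merged = []
    for t in range(num_tokens):
        row = []
        for h, per_token in enumerate(head_outputs):
            vec = per_token[t]
            # A masked head contributes zeros instead of its output.
            row.extend([0.0] * len(vec) if h in masked else vec)
        merged.append(row)
    return merged


heads = [
    [[1.0, 2.0], [3.0, 4.0]],  # head 0
    [[5.0, 6.0], [7.0, 8.0]],  # head 1, treated here as a retrieval head
]
masked_out = mask_heads(heads, masked={1})
print(masked_out)
```

In a real model, the same idea would be applied inside the attention layers, and the resulting generations would serve as the negative (unfaithful) examples.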
Key Findings
Masking retrieval heads sharply reduces faithfulness in generated LFQA answers.
RHIO substantially increases average faithfulness compared to standard supervised fine-tuning (SFT) on GroundBench.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Avg. Faithfulness (GroundBench) | 82.35% | Llama-2-7B + SFT 72.98% | +9.37 pp | GroundBench (aggregate) | Table 2 & Table 3 | Tables 2–3 |
| Avg. Faithfulness (GroundBench) | 83.77% | Llama-2-13B + SFT 74.40% | +9.37 pp | GroundBench (aggregate) | Table 2 & Table 3 | Tables 2–3 |
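
The deltas in the table can be checked with a few lines of arithmetic. This sketch recomputes the absolute gain in percentage points and the relative gain over the SFT baseline; the helper name `gains` is illustrative only.

```python
def gains(score: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in pp, relative gain in %) over the baseline."""
    delta_pp = score - baseline
    relative_pct = 100.0 * delta_pp / baseline
    return round(delta_pp, 2), round(relative_pct, 2)


rows = {
    "Llama-2-7B": (82.35, 72.98),
    "Llama-2-13B": (83.77, 74.40),
}
for model, (score, baseline) in rows.items():
    delta_pp, relative_pct = gains(score, baseline)
    print(f"{model}: +{delta_pp} pp, +{relative_pct}% relative")
```

Both rows give the same absolute gain because the scores and baselines differ by the same amount. The relative gain is slightly smaller for the 13B row because its baseline is higher.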
What To Try In 7 Days
Run retrieval-head detection on your model and mask the top retrieval heads to produce negative examples.
Fine-tune a small Llama-2 style model with paired [POS]/[NEG] prefixes using those negatives plus faithful data.
At inference, add SID (α≈0.2) to contrast POS vs NEG outputs and evaluate faithfulness on a controlled set.
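The SID step can be sketched as a per-token contrast between the logits obtained under the [POS] and [NEG] prefixes. The formula below, (1 + α)·pos − α·neg, is a common contrastive-decoding form and is an assumption here; the summary gives only α≈0.2 and the idea of contrasting POS against NEG. The function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sid_logits(pos_logits, neg_logits, alpha=0.2):
    """Contrast [POS]- and [NEG]-conditioned logits (assumed form)."""
    return [(1 + alpha) * p - alpha * n
            for p, n in zip(pos_logits, neg_logits)]


# Toy next-token logits over a 3-token vocabulary.
pos = [2.0, 1.0, 0.5]  # from the [POS]-prefixed pass
neg = [0.5, 1.0, 2.0]  # from the [NEG]-prefixed pass

combined = sid_logits(pos, neg)
probs = softmax(combined)
print([round(p, 3) for p in probs])
```

Tokens favored by the [POS] pass and disfavored by the [NEG] pass gain probability, which is the intended amplification. Each decoding step needs two forward passes, which is the inference cost to weigh before adopting this.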
Risks & Boundaries
Limitations
GroundBench only includes contexts that contain the answer, so RHIO does not address failures where retrieval misses the needed facts.
Experiments primarily use Llama-2 family; transfer to other model families was not shown.
When Not To Use
You cannot fine-tune the model or afford multi-pass decoding (for example, you have no access to model weights, or inference cost is a hard constraint).
Your main problem is retrieval failure rather than model synthesis from available context.
Failure Modes
Overfitting to masked-head error patterns that differ from real-world retrieval failures.
Degraded answer coherence if negative samples are too noisy (too many heads masked).

