Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
RHIO reduces unsupported statements in long answers by teaching models to recognize unfaithful outputs. That lowers risk in customer-facing information systems and can match or beat stronger black‑box models on groundedness for contexts that contain the needed facts.
Summary TLDR
The authors introduce RHIO, a training and decoding recipe that (1) creates realistic unfaithful examples by masking "retrieval heads" (attention heads that copy from context), (2) fine-tunes models with the control tokens [POS] and [NEG] so they learn to produce faithful versus unfaithful outputs, and (3) applies contrastive Self-Induced Decoding (SID) at inference to amplify the faithful output. They compile GroundBench, built from five long-form question answering (LFQA) datasets, for evaluation. RHIO improves the average faithfulness of Llama-2 7B/13B by about 9.4 absolute points (about 12.8% relative) over supervised fine-tuning and slightly outperforms GPT-4o on human faithfulness scores on the evaluated benchmark.
Problem Statement
Retrieval-augmented LLMs often produce long answers that mix supported facts with hallucinations. Current fixes (denoising, self-reflection, context-aware decoding) compensate for errors but do not teach models to recognize and avoid unfaithful generations. The paper asks: can we create realistic unfaithful examples and train models to explicitly distinguish faithful from unfaithful outputs to reduce hallucination?
Main Contribution
Identify a mechanistic link between retrieval heads (special attention heads that copy from context) and contextual faithfulness; masking these heads reproduces common unfaithful error patterns.
Propose RHIO: (a) mask retrieval heads to generate realistic negative (unfaithful) examples, (b) Faithfulness-Aware Tuning (FAT) with [POS]/[NEG] control tokens, (c) Self-Induced Decoding (SID) that contrasts POS/NEG outputs at inference.
Release GroundBench, an evaluation suite assembled from five LFQA datasets (ELI5-WebGPT, ExpertQA, HAGRID, CLAPNQ, QuoteSum) with controlled contexts that contain sufficient evidence.
Report an evaluation in which RHIO raises average faithfulness to 82.35% for Llama-2 7B and 83.77% for Llama-2 13B, and increases the share of answers that human raters judge fully supported.
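To make the head-masking idea concrete, here is a minimal sketch of what "masking" an attention head can mean. It assumes that masking zeroes a head's output before the per-head outputs are concatenated; the summary does not say how the paper implements masking, and the function name `combine_heads` is hypothetical.

```python
from typing import List, Set


def combine_heads(head_outputs: List[List[float]], masked: Set[int]) -> List[float]:
    """Concatenate per-head outputs, zeroing every head whose index is in `masked`.

    Assumption: masking a head means replacing its output with zeros, so the
    information that head would have copied from the context is dropped.
    """
    combined: List[float] = []
    for idx, out in enumerate(head_outputs):
        if idx in masked:
            combined.extend([0.0] * len(out))  # masked head contributes nothing
        else:
            combined.extend(out)
    return combined


# Toy example: two heads of width 2; head 1 plays the role of a retrieval head.
print(combine_heads([[1.0, 2.0], [3.0, 4.0]], masked={1}))
```

Generating with the masked model then yields the negative (unfaithful) samples, while the unmasked model or reference answers supply the positives.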
Key Findings
Masking retrieval heads sharply reduces faithfulness in generated LFQA answers.
RHIO substantially increases average faithfulness compared to standard supervised fine-tuning (SFT) on GroundBench.
Self-induced decoding (SID) further improves faithfulness beyond FAT training.
Human evaluation shows RHIO-13B yields slightly more fully-supported answers than GPT-4o on the sampled outputs.
Results
Avg. Faithfulness (GroundBench): 82.35% (Llama-2 7B), 83.77% (Llama-2 13B)
Human-rated 'fully supported' answers
Faithfulness after masking retrieval heads
What To Try In 7 Days
Run retrieval-head detection on your model and mask the top retrieval heads to produce negative examples.
Fine-tune a small Llama-2 style model with paired [POS]/[NEG] prefixes using those negatives plus faithful data.
At inference, add SID (α≈0.2) to contrast POS vs NEG outputs and evaluate faithfulness on a controlled set.
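The first step above, retrieval-head detection, can be sketched as follows. This is a simplified assumption rather than the paper's procedure: a head is scored by how often its most-attended context position holds the token being generated, and the highest-scoring heads are selected for masking. The names `retrieval_score` and `top_retrieval_heads` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def retrieval_score(
    context_tokens: Sequence[str],
    generated_tokens: Sequence[str],
    argmax_positions: Sequence[int],
) -> float:
    """Fraction of generated tokens for which the head's most-attended
    context position holds that same token (a copy-from-context signal)."""
    if not generated_tokens:
        return 0.0
    hits = 0
    for tok, pos in zip(generated_tokens, argmax_positions):
        if 0 <= pos < len(context_tokens) and context_tokens[pos] == tok:
            hits += 1
    return hits / len(generated_tokens)


def top_retrieval_heads(scores: Dict[Head, float], k: int) -> List[Head]:
    """Return the k heads with the highest retrieval score."""
    return sorted(scores, key=lambda h: scores[h], reverse=True)[:k]


context = ["Paris", "is", "the", "capital"]
generated = ["Paris", "capital"]
scores = {
    (0, 1): retrieval_score(context, generated, [0, 3]),  # copies both tokens
    (2, 3): retrieval_score(context, generated, [1, 3]),  # copies one token
}
print(top_retrieval_heads(scores, k=1))
```

In practice the argmax positions would come from the model's attention weights collected during generation; here they are supplied by hand to keep the sketch self-contained.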
Optimization Features
Infra Optimization
- DeepSpeed ZeRO stage 3 used for multi-GPU full fine-tuning
Training Optimization
- Faithfulness-aware tuning with control tokens
- Data augmentation via masked retrieval heads
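As an illustration of how control-token training data might be laid out, the sketch below pairs each question with a faithful and an unfaithful answer, prefixed by [POS] and [NEG]. The prompt template, the field names, and the helper `build_fat_pair` are assumptions; the summary only states that paired [POS]/[NEG] prefixes are used.

```python
from typing import Dict, List


def build_fat_pair(
    question: str, context: str, faithful: str, unfaithful: str
) -> List[Dict[str, str]]:
    """Build one [POS] and one [NEG] training example for the same prompt.

    Assumption: the control token is prepended to the prompt, and the target
    is the faithful answer for [POS] and the masked-head answer for [NEG].
    """
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return [
        {"input": f"[POS] {prompt}", "target": faithful},
        {"input": f"[NEG] {prompt}", "target": unfaithful},
    ]


pair = build_fat_pair(
    question="What is the capital of France?",
    context="Paris is the capital of France.",
    faithful="The capital of France is Paris.",
    unfaithful="The capital of France is Lyon.",
)
for example in pair:
    print(example["input"].split(" ", 1)[0], "->", example["target"])
```

Keeping the prompt identical across the pair means the control token is the only signal that separates the faithful target from the unfaithful one.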
Inference Optimization
- Self-Induced Decoding (contrastive decoding with POS/NEG)
- Tune α (paper finds 0.2 works well)
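The summary does not give the exact SID formula. The sketch below assumes a common contrastive form in which the next-token logits under the [POS] prefix are boosted and the logits under the [NEG] prefix are subtracted, weighted by α: (1 + α)·z_POS − α·z_NEG. The function name `sid_distribution` is hypothetical, and the paper's formulation may differ.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sid_distribution(
    pos_logits: Sequence[float], neg_logits: Sequence[float], alpha: float = 0.2
) -> List[float]:
    """Contrast [POS] and [NEG] next-token logits (assumed form, see lead-in).

    Tokens that the [NEG]-conditioned pass favours are pushed down; with
    alpha = 0 this reduces to ordinary decoding under the [POS] prefix.
    """
    if len(pos_logits) != len(neg_logits):
        raise ValueError("logit vectors must have the same vocabulary size")
    contrasted = [
        (1.0 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)
    ]
    return softmax(contrasted)


# Toy vocabulary of three tokens; token 1 is favoured by the [NEG] pass.
pos = [2.0, 1.0, 0.0]
neg = [2.0, 2.0, 0.0]
print(sid_distribution(pos, neg, alpha=0.2))
```

Each decoding step needs two forward passes, one per control token, which is the multi-pass inference cost noted under "When Not To Use".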
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- GroundBench enforces contexts that contain answers, so RHIO does not address failures where retrieval misses the needed facts.
- Experiments primarily use Llama-2 family; transfer to other model families was not shown.
- Masking many retrieval heads can produce low-quality negatives; RHIO does not control generation of specific error types.
When Not To Use
- You cannot fine-tune the model or afford multi-pass decoding (for example, you have no access to model weights, or inference cost is tightly constrained).
- Your main problem is retrieval failure rather than model synthesis from available context.
- You need guarantees across out-of-distribution sources not covered by GroundBench.
Failure Modes
- Overfitting to masked-head error patterns that differ from real-world retrieval failures.
- Degraded answer coherence if negative samples are too noisy (too many heads masked).
- SID is sensitive to the hyperparameter α; a poorly tuned value may reduce output quality.
Core Entities
Models
- Llama-2-7B
- Llama-2-13B
- Llama-3.1-70B-Instruct
- Llama-3.1-8B-Instruct
- Mistral-NeMo-12B-Instruct
- GPT-4o
- GPT-4o-mini
Metrics
- Avg. Faithfulness (scored with Bespoke-MiniCheck-7B)
- ROUGE-L
- Claim recall (ELI5-WebGPT)
- SEMQA (QuoteSum)
- Human Likert faithfulness/completeness
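Of the metrics listed above, ROUGE-L is simple enough to sketch: it is based on the longest common subsequence (LCS) between a candidate and a reference. The version below is a simplified illustration that assumes lowercased whitespace tokenization and an unweighted F1; the paper's evaluation tooling and settings are not stated in this summary.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev: List[int] = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 using lowercased whitespace tokens (simplifying assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat", "the cat sat on the mat"))
```

ROUGE-L measures overlap with a reference answer rather than support by the context, which is why it is reported alongside the faithfulness metric and not in place of it.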
Datasets
- GroundBench
- FRONT (used for training)
- ELI5-WebGPT
- ExpertQA
- HAGRID
- CLAPNQ
- QuoteSum
- Natural Questions (subset for training)
Benchmarks
- GroundBench

