Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
1
Why It Matters For Business
RLFH reduces factual errors at low annotation cost: the model acts as its own judge and is trained with token-level rewards, so deployed assistants become more reliable without heavy human labeling.
Summary TLDR
RLFH is an on-policy training method that makes a language model act as its own judge. It splits each output into atomic facts, verifies each fact against retrieved documents, scores truthfulness and informativeness, maps those statement-level judgments back to dense token-level rewards, and then runs online RL (PPO). On HotpotQA, SQuADv2, and Biography, RLFH raises FactScore over the base models (e.g., Llama3.1-8B: 0.639 → 0.686) and reduces unverifiable and incorrect statements. Statement-level rewards and using the policy as judge are the key contributors. The reported training runs were small in scale (≈1.5–3 hours on 8-GPU setups).
Problem Statement
Large LMs sometimes fabricate facts (hallucinate). Prior fixes rely on offline finetuning or external editors and give only coarse feedback, which introduces distribution shift and cannot handle answers that mix correct and incorrect statements. We need an online, fine-grained way to teach a model to recognize and correct its own factual errors.
Main Contribution
RLFH: an on-policy self-alignment framework that treats the policy as its own judge to collect real-time, fine-grained feedback.
A pipeline to decompose responses into atomic facts, verify each fact against retrieved documents, score truthfulness and informativeness, and convert statement labels into token-level dense rewards.
Empirical evidence on HotpotQA, SQuADv2, and Biography showing consistent FactScore gains and fewer unverifiable or incorrect statements; ablations indicate that statement-level rewards and on-policy judgment both contribute to the gains.
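The step of converting statement labels into token-level rewards can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Statement` record, the `token_rewards` helper, and the choice to place each statement's score on the last token of its span are all assumptions made for this example, and the summary does not say how spans are attributed to tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Statement:
    # Hypothetical record for one atomic fact extracted from a response.
    start: int    # index of the first response token attributed to the fact
    end: int      # index one past the last attributed token
    score: float  # combined judgment for this fact (assumed scale)


def token_rewards(num_tokens: int, statements: List[Statement]) -> List[float]:
    """Spread statement-level scores over a dense per-token reward vector.

    Assumption: each score is credited to the final token of its span,
    and overlapping spans simply add up.
    """
    rewards = [0.0] * num_tokens
    for s in statements:
        if not 0 <= s.start < s.end <= num_tokens:
            raise ValueError(f"span ({s.start}, {s.end}) is outside the response")
        rewards[s.end - 1] += s.score
    return rewards


# Two facts in a six-token response: one judged correct, one judged wrong.
example = token_rewards(6, [Statement(0, 3, 1.5), Statement(3, 6, -1.0)])
```

A vector like `example` could then be passed to a PPO implementation as the per-token reward, so that a wrong fact penalizes the tokens that produced it rather than the whole response.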
Key Findings
RLFH raises overall FactScore for Llama3.1-8B from 0.639 to 0.686 on evaluated benchmarks.
Statement-level rewards outperform coarser reward granularities.
Using the policy model as the on-policy judge matches or exceeds using fixed external judges.
RLFH tends to make models more conservative (fewer statements) while raising per-statement accuracy and informativeness.
Training cost is modest for the reported runs (≈1.5–3 hours on 8-GPU setups).
Results
Average FactScore (base → RLFH)
Reward granularity ablation (statement vs response)
On-policy judge vs fixed judge
Training runtime
Who Should Care
What To Try In 7 Days
Run statement-level extraction and retrieval for your QA prompts and evaluate with FactScore.
Implement a small on-policy loop: one sample per prompt, self-verify statements, map to token rewards, and run a short PPO fine-tune on a development set.
Tune informativeness vs truthfulness weights to recover helpfulness if the model becomes too conservative.
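To see why the informativeness weight matters, the sketch below scores a short, fully correct answer against a longer answer that mixes correct and incorrect facts. The function `response_reward`, the +1/-1 truthfulness scale, and the specific weights are assumptions chosen for illustration; the summary does not give the actual reward formula.

```python
from typing import Sequence


def response_reward(
    truth: Sequence[float],
    info: Sequence[float],
    w_truth: float,
    w_info: float,
) -> float:
    """Sum of weighted per-statement scores (hypothetical reward shape)."""
    if len(truth) != len(info):
        raise ValueError("truth and info must have one entry per statement")
    return sum(w_truth * t + w_info * i for t, i in zip(truth, info))


# Assumed scale: truthfulness +1 (supported) or -1 (not supported),
# informativeness 1 for every statement.
short_truth, short_info = [1.0], [1.0]
long_truth, long_info = [1.0, 1.0, -1.0, -1.0], [1.0, 1.0, 1.0, 1.0]

# With a low informativeness weight the short answer earns more reward ...
low_short = response_reward(short_truth, short_info, w_truth=1.0, w_info=0.1)
low_long = response_reward(long_truth, long_info, w_truth=1.0, w_info=0.1)

# ... while a higher weight tips the balance toward the longer answer.
high_short = response_reward(short_truth, short_info, w_truth=1.0, w_info=0.5)
high_long = response_reward(long_truth, long_info, w_truth=1.0, w_info=0.5)
```

Under these assumed numbers, raising `w_info` from 0.1 to 0.5 flips the preference from the short answer to the long one, which is the knob to adjust if a fine-tuned model becomes too terse.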
Agent Features
Tool Use
- retrieval (Wikipedia) for verification
Frameworks
- on-policy self-assessment (policy as judge)
Optimization Features
Infra Optimization
- reported runs fit small GPU clusters (one or two 8-GPU nodes)
Training Optimization
- online RL (PPO)
Reproducibility
Code Urls
Data Urls
- HotpotQA, SQuADv2, Biography (FactScore setup), Wikipedia (04/01/2023) - links in paper
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses on factual knowledge; does not address other hallucination types (e.g., logical or safety failures).
- Evaluation benchmarks are limited and may not capture all hallucination behaviors.
- Automated self-verification can be wrong, which may mislead on-policy learning.
When Not To Use
- When no reliable external retrieval corpus exists for the domain.
- When you cannot afford the GPU time that on-policy RL runs require.
- When you need maximum answer coverage and cannot accept conservative/shorter replies.
Failure Modes
- Model might learn to validate its own incorrect claims (self-reinforcing errors).
- Reward hacking where the model minimizes output to avoid penalties, reducing usefulness.
- Coverage loss: fewer statements may omit needed information if informativeness weight is low.
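One simple guard against the last two failure modes is to track how many statements the model emits per response during training and compare that with the base model. The sketch below assumes you already have per-response statement counts; the `coverage_alert` helper and the 0.7 threshold are hypothetical choices, not values from the paper.

```python
from statistics import mean
from typing import Sequence


def coverage_alert(
    baseline_counts: Sequence[int],
    current_counts: Sequence[int],
    min_ratio: float = 0.7,  # arbitrary illustrative threshold
) -> bool:
    """Return True when the mean statements per response has fallen
    below min_ratio times the baseline mean."""
    if not baseline_counts or not current_counts:
        raise ValueError("both count lists must be non-empty")
    base = mean(baseline_counts)
    if base == 0:
        return False  # nothing to compare against
    return mean(current_counts) / base < min_ratio


# Base model averages 6 statements; the checkpoint averages about 2.3.
collapsed = coverage_alert([5, 6, 7], [2, 3, 2])
# A mild drop (about 5.3 statements) stays above the threshold.
healthy = coverage_alert([5, 6, 7], [5, 5, 6])
```

An alert of this kind only signals that outputs are shrinking; whether the shorter answers are still useful has to be checked separately, for example by reviewing informativeness scores.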
Core Entities
Models
- Qwen2.5-7B-Instruct
- Llama3.1-8B-Instruct
- Qwen2.5-72B-Instruct (evaluation judge)
Metrics
- FactScore
- #Cor. (number of correct facts)
- #Inc. (number of incorrect facts)
- %Res. (response ratio)
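As a rough illustration of how these metrics relate, the sketch below aggregates per-response counts of correct and incorrect facts. It is a simplification under stated assumptions: factual precision is taken as correct / (correct + incorrect), abstentions are represented as `None`, and #Cor./#Inc. are averaged over answered responses. The published FactScore definition and the paper's exact aggregation may differ, and the `summarize` helper is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Counts = Optional[Tuple[int, int]]  # (n_correct, n_incorrect), or None if abstained


def summarize(responses: List[Counts]) -> Dict[str, float]:
    """Aggregate simplified factuality metrics over a set of responses."""
    answered = [r for r in responses if r is not None]
    # Precision is only defined for responses with at least one fact.
    precisions = [c / (c + i) for c, i in answered if c + i > 0]
    n = len(answered)
    return {
        "response_ratio": n / len(responses) if responses else 0.0,
        "mean_precision": sum(precisions) / len(precisions) if precisions else 0.0,
        "avg_correct": sum(c for c, _ in answered) / n if n else 0.0,
        "avg_incorrect": sum(i for _, i in answered) / n if n else 0.0,
    }


# Four prompts: three answered, one abstained.
stats = summarize([(3, 1), (1, 1), None, (2, 0)])
```

Reporting precision together with the response ratio and the raw counts helps distinguish a model that became more accurate from one that merely says less.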
Datasets
- HotpotQA
- SQuADv2
- Biography (FactScore setup)
- English Wikipedia (04/01/2023 retrieval corpus)
Benchmarks
- FactScore

