Train a model to judge and correct its own facts with token-level rewards to cut hallucinations

Overview

Decision Snapshot: Ready For Pilot

The method shows repeatable gains on three datasets with automated and human-validated metrics; key strengths are fine-grained rewards and on-policy self-judging, but risks include automated judge errors and reduced helpfulness unless weights are tuned.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 70%

Novelty: 60%

Authors

Xueru Wen, Jie Lou, Xinyu Lu, Ji Yuqiu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, Le Sun

Why It Matters For Business

RLFH lowers factual errors with low annotation cost by using the model as its own judge and token-level rewards, making deployed assistants more reliable without heavy human labeling.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

RLFH is an on-policy training method that makes a language model act as its own judge. It splits each output into atomic facts, verifies each fact against retrieved documents, scores truthfulness and informativeness, maps those statement-level judgments back to dense token-level rewards, and then runs online RL (PPO). On HotpotQA, SQuADv2 and Biography, RLFH raises FactScore over the base models (e.g., Llama3.1-8B: 0.639 → 0.686) and reduces unverifiable or incorrect statements. Statement-level rewards and using the policy as judge are the key contributors. Training runs were small-scale (≈1.5–3 hours on 8-GPU setups).

Problem Statement

Large LMs sometimes fabricate facts (hallucinate). Prior fixes use offline finetuning or external editors and give coarse feedback, which causes distribution shift or misses mixed correct/incorrect answers. We need an online, fine-grained way to teach a model to recognize and correct its own factual errors.

Main Contribution

RLFH: an on-policy self-alignment framework that treats the policy as its own judge to collect real-time, fine-grained feedback.

A pipeline to decompose responses into atomic facts, verify each fact against retrieved documents, score truthfulness and informativeness, and convert statement labels into token-level dense rewards.
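The last step of that pipeline, turning statement labels into token-level rewards, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Statement` class, the whitespace tokenizer, the ±1 reward scale, and the choice to place each reward on the last token of the statement's span are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    """Hypothetical container for one judged atomic fact."""
    start: int     # character offset where the statement's span begins
    end: int       # character offset (exclusive) where the span ends
    reward: float  # statement-level score from the judge (assumed scale)


def token_spans(response: str) -> list[tuple[int, int]]:
    """Whitespace tokenizer returning (start, end) character offsets."""
    spans, pos = [], 0
    for tok in response.split():
        start = response.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans


def token_rewards(response: str, statements: list[Statement]) -> list[float]:
    """Place each statement's reward on the last token overlapping its span."""
    spans = token_spans(response)
    rewards = [0.0] * len(spans)
    for st in statements:
        covered = [i for i, (s, e) in enumerate(spans)
                   if s < st.end and e > st.start]
        if covered:
            rewards[covered[-1]] += st.reward
    return rewards


response = "Paris is the capital of France and has 90 million residents."
good = "Paris is the capital of France"
bad = "has 90 million residents."
statements = [
    Statement(response.index(good), response.index(good) + len(good), +1.0),
    Statement(response.index(bad), response.index(bad) + len(bad), -1.0),
]
rewards = token_rewards(response, statements)
print(rewards)
```

Every token outside a judged span keeps a reward of zero, so the RL step receives a positive signal at the end of the supported claim and a negative one at the end of the unsupported claim within the same response.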

Key Findings

RLFH raises overall FactScore for Llama3.1-8B from 0.639 to 0.686 on evaluated benchmarks.

Numbers: Avg FactScore 0.639 → 0.686 (Δ +0.047)

Practical Use: Expect modest but clear factuality gains by applying RLFH to Llama3.1-8B-scale models on QA-style tasks.

Evidence Ref: Table 1, Table 3

Statement-level rewards outperform coarser reward granularities.

Numbers: Qwen2.5-7B: response 0.651 → statement 0.668 (Δ +0.017); Llama3.1-8B: 0.647 → 0.686 (Δ +0.039)

Practical Use: When designing reward signals, map judgments to the statement/token level rather than a single response score to get better factuality.

Evidence Ref: Table 2
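One way to see why granularity matters is to compare the two reward layouts on a response that mixes a correct and an incorrect statement. The sketch below is illustrative only: the function names, the ±1 labels, the token positions, and the use of a mean for the response-level score are assumptions, not the paper's exact formulation.

```python
def response_level(labels: list[float], n_tokens: int) -> list[float]:
    """One scalar for the whole response, placed on the final token."""
    rewards = [0.0] * n_tokens
    rewards[-1] = sum(labels) / len(labels)
    return rewards


def statement_level(labels: list[float], end_positions: list[int],
                    n_tokens: int) -> list[float]:
    """One reward per statement, placed on the token that ends it."""
    rewards = [0.0] * n_tokens
    for label, pos in zip(labels, end_positions):
        rewards[pos] += label
    return rewards


# Hypothetical response of 11 tokens with one correct and one incorrect fact.
labels = [1.0, -1.0]
end_positions = [5, 10]
n_tokens = 11

coarse = response_level(labels, n_tokens)
fine = statement_level(labels, end_positions, n_tokens)
print("response-level: ", coarse)
print("statement-level:", fine)
```

In this toy case the response-level layout averages the two labels to zero and gives the policy no learning signal, while the statement-level layout still rewards the correct claim and penalizes the incorrect one.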

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average FactScore (base → RLFH)	Qwen2.5-7B: 0.638 → 0.668	Qwen2.5-7B base 0.638	+0.030	Average across HotpotQA, SQuADv2, Biography	Table 1 (Rows: Qwen2.5-7B vs RLFH Qwen2.5-7B)	Table 1
Average FactScore (base → RLFH)	Llama3.1-8B: 0.639 → 0.686	Llama3.1-8B base 0.639	+0.047	Average across HotpotQA, SQuADv2, Biography	Table 1 (Rows: Llama3.1-8B vs RLFH Llama3.1-8B)	Table 1
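The deltas in the table can be rechecked, and expressed as relative improvements, with a few lines of arithmetic. The scores are copied from the rows above; the `summarize` helper and the relative-improvement figure are additions for illustration and do not appear in the source.

```python
# Average FactScore (base, RLFH) as listed in the table above.
results = {
    "Qwen2.5-7B": (0.638, 0.668),
    "Llama3.1-8B": (0.639, 0.686),
}


def summarize(base: float, tuned: float) -> tuple[float, float]:
    """Return (absolute delta, relative improvement in percent)."""
    delta = round(tuned - base, 3)
    relative = round(100 * (tuned - base) / base, 1)
    return delta, relative


for model, (base, tuned) in results.items():
    delta, relative = summarize(base, tuned)
    print(f"{model}: delta {delta:+.3f} ({relative:+.1f}% relative)")
```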

What To Try In 7 Days

Run statement-level extraction and retrieval for your QA prompts and evaluate with FactScore.

Implement a small on-policy loop: one sample per prompt, self-verify statements, map to token rewards, and run a short PPO fine-tune on a development set.

Tune informativeness vs truthfulness weights to recover helpfulness if the model becomes too conservative.
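The weight-tuning step can be explored offline before any training run. The sketch below assumes a simple linear combination of summed truthfulness and informativeness scores; the function names, the weights, and all the numbers are made up for illustration and are not the paper's reward definition.

```python
def response_reward(truth: list[float], info: list[float],
                    w_truth: float = 1.0, w_info: float = 0.0) -> float:
    """Assumed linear reward: weighted sums of per-statement scores."""
    return w_truth * sum(truth) + w_info * sum(info)


# Two hypothetical candidate behaviours for the same prompt.
abstain = {"truth": [], "info": []}               # says nothing checkable
answer = {"truth": [1.0, 1.0, -2.5],              # two correct, one wrong
          "info": [0.5, 0.5, 0.5]}                # each statement is useful


def preferred(w_info: float) -> str:
    """Which behaviour earns the higher reward at this informativeness weight."""
    r_abstain = response_reward(abstain["truth"], abstain["info"], w_info=w_info)
    r_answer = response_reward(answer["truth"], answer["info"], w_info=w_info)
    return "answer" if r_answer > r_abstain else "abstain"


for w in (0.0, 0.2, 0.5):
    print(f"w_info={w}: policy is pushed toward '{preferred(w)}'")
```

With these toy numbers, a truthfulness-only reward favours saying nothing, and the preference flips toward answering once the informativeness weight is large enough. Sweeping the weight this way on logged judgments is a cheap check for the over-conservative behaviour described above.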

Agent Features

Tool Use

retrieval (Wikipedia) for verification

Frameworks

on-policy self-assessment (policy as judge)

Optimization Features

Infra Optimization

reported runs fit small GPU clusters (two 8-GPU nodes or one 8-GPU node)

Training Optimization

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AlignRM/RLFH

Data URLs

HotpotQA, SQuADv2, Biography (FactScore setup), Wikipedia (04/01/2023) - links in paper

Risks & Boundaries

Limitations

Focuses on factual knowledge; does not address other hallucination types (e.g., logical or safety failures).

Evaluation benchmarks are limited and may not capture all hallucination behaviors.

When Not To Use

When no reliable external retrieval corpus exists for the domain.

If you cannot afford GPU time for on-policy RL runs.

Failure Modes

Model might learn to validate its own incorrect claims (self-reinforcing errors).

Reward hacking where the model minimizes output to avoid penalties, reducing usefulness.

Core Entities

Models

Qwen2.5-7B-Instruct

Llama3.1-8B-Instruct

Qwen2.5-72B-Instruct (evaluation judge)

Metrics

FactScore

#Cor. (number correct facts)

#Inc. (number incorrect facts)

%Res. (response ratio)
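Given per-statement verdicts, these metrics reduce to simple counting. The sketch below treats FactScore as the mean, over answered prompts, of the fraction of supported atomic facts, and treats an empty list of statements as a non-response. The aggregation details (totals for #Cor. and #Inc., the handling of abstentions) are assumptions, since this summary does not specify them.

```python
def factuality_metrics(per_response_labels: list[list[bool]]) -> dict:
    """Compute FactScore-style metrics from per-statement verdicts.

    per_response_labels[i] holds one bool per atomic fact in response i
    (True = supported). An empty list is treated as an abstention.
    """
    answered = [labels for labels in per_response_labels if labels]
    precisions = [sum(labels) / len(labels) for labels in answered]
    return {
        "fact_score": sum(precisions) / len(precisions) if precisions else 0.0,
        "n_correct": sum(sum(labels) for labels in answered),
        "n_incorrect": sum(len(labels) - sum(labels) for labels in answered),
        "response_ratio": len(answered) / len(per_response_labels),
    }


# Hypothetical verdicts for three prompts; the third response abstained.
verdicts = [[True, True, False], [True], []]
metrics = factuality_metrics(verdicts)
print(metrics)
```

Tracking the response ratio alongside FactScore helps separate genuine factuality gains from a model that simply answers less often.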

Datasets

HotpotQA

SQuADv2

Biography (FactScore setup)

English Wikipedia (04/01/2023 retrieval corpus)

Benchmarks

FactScore
