Train a model to judge and correct its own facts with token-level rewards to cut hallucinations

June 18, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method shows repeatable gains on three datasets, measured with automated and human-validated metrics. Key strengths are fine-grained rewards and on-policy self-judging; risks include automated judge errors and reduced helpfulness unless the reward weights are tuned.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 70%

Novelty: 60%

Authors

Xueru Wen, Jie Lou, Xinyu Lu, Ji Yuqiu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, Le Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RLFH lowers factual errors with low annotation cost by using the model as its own judge and token-level rewards, making deployed assistants more reliable without heavy human labeling.

Summary TLDR

RLFH is an on-policy training method that makes a language model act as its own judge. It splits outputs into atomic facts, verifies each fact against retrieved documents, scores truthfulness and informativeness, maps those statement-level judgments back to dense token-level rewards, and then runs online RL (PPO). On HotpotQA, SQuADv2 and Biography, RLFH raises FactScore over the base models (e.g., Llama3.1-8B: 0.639 → 0.686) and reduces unverifiable and incorrect statements. Statement-level rewards and using the policy as judge are the key contributors. Training runs were small-scale (≈1.5–3 hours on 8-GPU setups).
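The judging stage described above can be sketched as a plain data flow. This is a minimal illustration, not the paper's implementation: `JudgedStatement`, `judge_response`, the score scales, and all `toy_*` functions are hypothetical names introduced here. In RLFH the policy model itself would play the extract, verify, and rate roles through prompting, and retrieval would hit a real corpus; the stand-ins below only let the sketch run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class JudgedStatement:
    text: str
    truthfulness: float     # assumed scale: +1 supported, 0 unverifiable, -1 contradicted
    informativeness: float  # assumed scale: 0 (trivial) to 1 (highly informative)


def judge_response(
    response: str,
    extract: Callable[[str], Sequence[str]],
    retrieve: Callable[[str], Sequence[str]],
    verify: Callable[[str, Sequence[str]], float],
    rate_info: Callable[[str], float],
) -> List[JudgedStatement]:
    """Split a response into statements and judge each one against retrieved documents."""
    judged = []
    for statement in extract(response):
        documents = retrieve(statement)
        judged.append(
            JudgedStatement(statement, verify(statement, documents), rate_info(statement))
        )
    return judged


# Toy stand-ins (hypothetical) so the sketch runs without a model or a retriever.
CORPUS = ["Paris is the capital of France."]


def toy_extract(response: str) -> List[str]:
    # Naive sentence split; a real system would prompt the policy for atomic facts.
    return [s.strip() + "." for s in response.split(".") if s.strip()]


def toy_retrieve(statement: str) -> List[str]:
    return CORPUS


def toy_verify(statement: str, documents: Sequence[str]) -> float:
    # Exact match counts as supported; everything else is treated as unverifiable.
    return 1.0 if statement in documents else 0.0


def toy_rate_info(statement: str) -> float:
    # Word count as a crude proxy for informativeness, capped at 1.0.
    return min(len(statement.split()) / 10.0, 1.0)


judged = judge_response(
    "Paris is the capital of France. Paris has a large moon base.",
    toy_extract, toy_retrieve, toy_verify, toy_rate_info,
)
for item in judged:
    print(item)
```

Passing the four stages in as callables keeps the data flow visible and makes it easy to swap the toy functions for prompted calls to the policy model.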

Problem Statement

Large LMs sometimes fabricate facts (hallucinate). Prior fixes use offline finetuning or external editors and give coarse feedback, which causes distribution shift or misses mixed correct/incorrect answers. We need an online, fine-grained way to teach a model to recognize and correct its own factual errors.

Main Contribution

RLFH: an on-policy self-alignment framework that treats the policy as its own judge to collect real-time, fine-grained feedback.

A pipeline to decompose responses into atomic facts, verify each fact against retrieved documents, score truthfulness and informativeness, and convert statement labels into token-level dense rewards.
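One simple way to turn statement labels into dense token-level rewards is to place each statement's score on the last token that overlaps the statement's character span. The sketch below assumes that alignment rule, which is not necessarily the mapping used in the paper; `token_rewards` and `whitespace_token_spans` are hypothetical helpers, and whitespace tokenization stands in for a real tokenizer.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_token_spans(text: str) -> List[Span]:
    """Character spans of whitespace-separated tokens (stand-in for a real tokenizer)."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans


def token_rewards(
    token_spans: List[Span],
    statement_spans: List[Tuple[int, int, float]],
) -> List[float]:
    """Assign each statement's score to the last token overlapping its span.

    All other tokens receive 0.0. This is an assumed alignment rule.
    """
    rewards = [0.0] * len(token_spans)
    for s_start, s_end, score in statement_spans:
        overlapping = [
            i for i, (t_start, t_end) in enumerate(token_spans)
            if t_start < s_end and t_end > s_start
        ]
        if overlapping:
            rewards[overlapping[-1]] += score
    return rewards


text = "Paris is in France. It has 90 million people."
spans = whitespace_token_spans(text)
# Two judged statements: the first supported (+1.0), the second contradicted (-1.0).
statements = [(0, 19, 1.0), (20, 45, -1.0)]
rewards = token_rewards(spans, statements)
print(list(zip(text.split(), rewards)))
```

With a Hugging Face fast tokenizer, the character offsets could instead come from `return_offsets_mapping=True`, so the rewards line up with the tokens that PPO actually optimizes.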

Key Findings

RLFH raises the average FactScore for Llama3.1-8B from 0.639 to 0.686 across HotpotQA, SQuADv2 and Biography.

Numbers: Avg FactScore 0.639 → 0.686 (+0.047)

Practical Use: Expect modest but clear factuality gains by applying RLFH to Llama3.1-8B-scale models on QA-style tasks.

Evidence Ref: Table 1, Table 3

Statement-level rewards outperform coarser reward granularities.

Numbers: Qwen2.5-7B: response 0.651 → statement 0.668 (+0.017); Llama3.1-8B: response 0.647 → statement 0.686 (+0.039)

Practical Use: When designing reward signals, map judgments to the statement/token level rather than a single response score to get better factuality.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average FactScore (base → RLFH) | Qwen2.5-7B: 0.638 → 0.668 | Qwen2.5-7B base 0.638 | +0.030 | Average across HotpotQA, SQuADv2, Biography | Table 1 (Rows: Qwen2.5-7B vs RLFH Qwen2.5-7B) | Table 1
Average FactScore (base → RLFH) | Llama3.1-8B: 0.639 → 0.686 | Llama3.1-8B base 0.639 | +0.047 | Average across HotpotQA, SQuADv2, Biography | Table 1 (Rows: Llama3.1-8B vs RLFH Llama3.1-8B) | Table 1

What To Try In 7 Days

Run statement-level extraction and retrieval for your QA prompts and evaluate with FactScore.

Implement a small on-policy loop: one sample per prompt, self-verify statements, map to token rewards, and run a short PPO fine-tune on a development set.

Tune informativeness vs truthfulness weights to recover helpfulness if the model becomes too conservative.
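The weight-tuning step above can be explored offline before any RL run. The sketch below assumes a linear combination of truthfulness and informativeness with hypothetical weights `w_truth` and `w_info`, and it assumes informativeness only earns credit for statements judged true. The paper's exact reward formula is not given here, so treat the function and the toy scores as illustrative only.

```python
from typing import Iterable, Tuple


def statement_reward(
    truthfulness: float,
    informativeness: float,
    w_truth: float = 1.0,
    w_info: float = 0.5,
) -> float:
    """Weighted reward for one statement (assumed linear form)."""
    # Assumption: informativeness is only rewarded when the statement is judged true.
    bonus = w_info * informativeness if truthfulness > 0 else 0.0
    return w_truth * truthfulness + bonus


def response_reward(
    statements: Iterable[Tuple[float, float]],
    w_truth: float = 1.0,
    w_info: float = 0.5,
) -> float:
    """Sum of statement rewards over (truthfulness, informativeness) pairs."""
    return sum(statement_reward(t, i, w_truth, w_info) for t, i in statements)


# Toy judged responses as (truthfulness, informativeness) pairs.
terse = [(1.0, 0.2)]                               # one safe, low-content fact
detailed = [(1.0, 0.8), (1.0, 0.8), (-1.0, 0.8)]   # two good facts, one error

for w_info in (0.0, 0.5, 1.0):
    print(
        f"w_info={w_info}: "
        f"terse={response_reward(terse, w_info=w_info):.2f}, "
        f"detailed={response_reward(detailed, w_info=w_info):.2f}"
    )
```

In this toy case the two responses tie when `w_info` is zero, so nothing pushes the policy toward saying more. Raising `w_info` favors the detailed answer, which is the direction to move if the tuned model becomes too conservative.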

Agent Features

Tool Use: retrieval (Wikipedia) for verification
Frameworks: on-policy self-assessment (policy as judge)

Optimization Features

Infra Optimization: reported runs fit small GPU clusters (two 8-GPU nodes or one 8-GPU node)
Training Optimization: RL

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HotpotQA, SQuADv2, Biography (FactScore setup), Wikipedia (04/01/2023) - links in paper

Risks & Boundaries

Limitations

Focuses on factual knowledge; does not address other hallucination types (e.g., logical or safety failures).

Evaluation benchmarks are limited and may not capture all hallucination behaviors.

When Not To Use

When no reliable external retrieval corpus exists for the domain.

When you cannot afford GPU time for on-policy RL runs.

Failure Modes

Model might learn to validate its own incorrect claims (self-reinforcing errors).

Reward hacking where the model minimizes output to avoid penalties, reducing usefulness.

Core Entities

Models

Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, Qwen2.5-72B-Instruct (evaluation judge)

Metrics

FactScore, #Cor. (number correct facts), #Inc. (number incorrect facts), %Res. (response ratio)
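As a rough illustration of how these metrics relate, the sketch below computes them from per-fact judgments. It assumes FactScore is the fraction of supported atomic facts averaged over answered prompts, that #Cor. and #Inc. are raw totals, and that %Res. is the share of prompts that received an answer. The paper may normalize these differently, and `summarize_factuality` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def summarize_factuality(judgments: List[Optional[List[bool]]]) -> Dict[str, float]:
    """Aggregate per-fact judgments into summary metrics.

    judgments[i] is None when the model declined to answer prompt i,
    otherwise a list with one flag per atomic fact (True = supported).
    """
    answered = [j for j in judgments if j is not None]
    n_correct = sum(sum(j) for j in answered)
    n_incorrect = sum(len(j) - sum(j) for j in answered)
    # Per-response precision; responses with zero extracted facts are skipped.
    per_response = [sum(j) / len(j) for j in answered if j]
    return {
        "fact_score": sum(per_response) / len(per_response) if per_response else 0.0,
        "num_correct": float(n_correct),
        "num_incorrect": float(n_incorrect),
        "response_ratio": len(answered) / len(judgments) if judgments else 0.0,
    }


# Three prompts: one answer with 2 of 3 facts supported, one refusal, one with 1 of 2.
metrics = summarize_factuality([[True, True, False], None, [True, False]])
print(metrics)
```

Tracking the response ratio and the correct-fact count next to FactScore helps catch the case where the score rises only because the model answers less or says less.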

Datasets

HotpotQA, SQuADv2, Biography (FactScore setup), English Wikipedia (04/01/2023 retrieval corpus)

Benchmarks

FactScore