Train a model to judge and correct its own facts with token-level rewards to cut hallucinations

June 18, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

1

Authors

Xueru Wen, Jie Lou, Xinyu Lu, Ji Yuqiu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, Le Sun

Why It Matters For Business

RLFH lowers factual errors with low annotation cost by using the model as its own judge and token-level rewards, making deployed assistants more reliable without heavy human labeling.

Summary TLDR

RLFH is an on-policy training method that makes a language model act as its own judge. It splits each output into atomic facts, verifies each fact against retrieved documents, scores truthfulness and informativeness, maps those statement-level judgments to dense token-level rewards, and then runs online RL (PPO). On HotpotQA, SQuADv2, and Biography, RLFH raises FactScore over the base models (e.g., Llama3.1-8B: 0.639 → 0.686) and reduces unverifiable and incorrect statements. Statement-level rewards and using the policy as judge are the key contributors. The reported training runs were short (≈1.5–3 hours on 8-GPU setups).
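The step from statement-level judgments to token-level rewards can be sketched as follows. This is a minimal illustration, not the paper's exact mapping: it assumes each statement has already been aligned to a character span in the response, uses a whitespace tokenizer as a stand-in for the model tokenizer, and credits the whole statement reward to the last token of the span. The names `Statement`, `token_spans`, and `dense_rewards` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Statement:
    start: int     # character offset where the claim's span starts (assumed known)
    end: int       # character offset where the span ends (exclusive)
    reward: float  # scalar statement-level reward from the judge


def token_spans(text: str) -> list[tuple[int, int]]:
    """Stand-in tokenizer: one token per whitespace-delimited chunk."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def dense_rewards(text: str, statements: list[Statement]) -> list[float]:
    """Credit each statement's reward to the last token overlapping its span."""
    spans = token_spans(text)
    rewards = [0.0] * len(spans)
    for st in statements:
        overlapping = [i for i, (s, e) in enumerate(spans)
                       if s < st.end and e > st.start]
        if overlapping:
            rewards[overlapping[-1]] += st.reward
    return rewards


# Illustrative response: the first claim is judged correct, the second incorrect.
response = "Paris is the capital of France. It has 90 million residents."
first, second = "Paris is the capital of France.", "It has 90 million residents."
statements = [
    Statement(response.find(first), response.find(first) + len(first), +1.0),
    Statement(response.find(second), response.find(second) + len(second), -1.0),
]
rewards = dense_rewards(response, statements)
```

Every token outside the two span-ending positions keeps a reward of zero, so the RL step gets feedback at the point where each claim is completed rather than one score for the whole reply.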

Problem Statement

Large LMs sometimes fabricate facts (hallucinate). Prior fixes rely on offline fine-tuning or external editors and give only coarse feedback: offline data causes distribution shift, and coarse signals cannot separate the correct and incorrect parts of a mixed answer. We need an online, fine-grained way to teach a model to recognize and correct its own factual errors.

Main Contribution

RLFH: an on-policy self-alignment framework that treats the policy as its own judge to collect real-time, fine-grained feedback.

A pipeline to decompose responses into atomic facts, verify each fact against retrieved documents, score truthfulness and informativeness, and convert statement labels into token-level dense rewards.

Empirical evidence on HotpotQA, SQuADv2, and Biography showing consistent FactScore gains and fewer unverifiable or incorrect statements, with ablations indicating that statement-level rewards and on-policy judgment both contribute to the gains.
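The decompose, retrieve, verify, and score pipeline listed above can be outlined as a small function that takes each stage as a callable. This is a sketch under stated assumptions: in RLFH the policy model itself would play the extraction and judging roles through prompting, whereas the `toy_*` functions, the two-sentence `CORPUS`, and the score ranges below are hypothetical stand-ins chosen only so the example runs without a model.

```python
from dataclasses import dataclass


@dataclass
class JudgedFact:
    text: str
    truthfulness: float     # assumed scale: +1.0 supported, -1.0 unsupported
    informativeness: float  # assumed scale: 0.0 to 1.0


def judge_response(response, extract, retrieve, verify, rate_info):
    """Run the four stages over one response and return per-fact judgments."""
    judged = []
    for fact in extract(response):
        docs = retrieve(fact)
        judged.append(JudgedFact(fact, verify(fact, docs), rate_info(fact)))
    return judged


# Toy stand-ins (hypothetical); a real setup would prompt the policy model
# and query a retrieval index such as a Wikipedia dump.
CORPUS = ["Paris is the capital of France", "Berlin is the capital of Germany"]


def toy_extract(response):
    return [s.strip() for s in response.split(".") if s.strip()]


def toy_retrieve(fact):
    words = set(fact.lower().split())
    return [d for d in CORPUS if len(words & set(d.lower().split())) >= 2]


def toy_verify(fact, docs):
    return 1.0 if any(fact.lower() == d.lower() for d in docs) else -1.0


def toy_informativeness(fact):
    return min(len(fact.split()) / 10.0, 1.0)


judged = judge_response(
    "Paris is the capital of France. Berlin is the capital of Spain.",
    toy_extract, toy_retrieve, toy_verify, toy_informativeness,
)
```

Passing the stages as callables keeps the loop identical whether the judge is the policy itself or a fixed external model, which is the comparison the paper's judge ablation makes.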

Key Findings

RLFH raises the average FactScore for Llama3.1-8B from 0.639 to 0.686 across the evaluated benchmarks.

Numbers: Avg FactScore 0.639 → 0.686 (Δ +0.047)

Statement-level rewards outperform coarser reward granularities.

Numbers: Qwen2.5-7B: response 0.651 → statement 0.668 (Δ +0.017); Llama3.1-8B: response 0.647 → statement 0.686 (Δ +0.039)

Using the policy model as the on-policy judge matches or exceeds using fixed external judges.

Numbers: Llama3.1-8B on-policy avg 0.686; fixed judges score lower (Table 3 of the paper)

RLFH tends to make models more conservative (fewer statements) while raising per-statement accuracy and informativeness.

Numbers: Decrease in %Res. and increase in high-accuracy responses (Figures 4–7 and Table 1 of the paper)

Training cost is modest for the reported runs.

Numbers: Typical run <1.5 hours on two 8-GPU nodes or ≈3 hours on a single 8-GPU node

Results

Average FactScore (base → RLFH)

Value: Qwen2.5-7B: 0.638 → 0.668

Baseline: Qwen2.5-7B base 0.638

Average FactScore (base → RLFH)

Value: Llama3.1-8B: 0.639 → 0.686

Baseline: Llama3.1-8B base 0.639

Reward granularity ablation (response vs sentence vs statement)

Value: Qwen2.5-7B: response 0.651, sentence 0.655, statement 0.668

Baseline: Qwen2.5-7B response-level 0.651

On-policy judge vs fixed judge

Value: Llama3.1-8B on-policy 0.686 (highest vs fixed judges)

Baseline: Llama3.1-8B with fixed judges (lower scores in Table 3)

Training runtime

Value: <1.5 hours on two 8-GPU nodes; ≈3 hours on one 8-GPU node

Baseline: Not directly compared

Who Should Care

What To Try In 7 Days

Run statement-level extraction and retrieval for your QA prompts and evaluate with FactScore.

Implement a small on-policy loop: one sample per prompt, self-verify statements, map to token rewards, and run a short PPO fine-tune on a development set.

Tune informativeness vs truthfulness weights to recover helpfulness if the model becomes too conservative.
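The effect of the informativeness weight can be illustrated with a small scoring function. This is a sketch, assuming a simple weighted sum over statements; the paper's actual reward formula may differ, and the weights, the -2.0 penalty for a wrong statement, and the example replies are hypothetical values chosen to show the trade-off.

```python
def response_reward(facts, w_truth=1.0, w_info=0.5):
    """Sum weighted truthfulness and informativeness over a reply's statements.

    `facts` is a list of (truthfulness, informativeness) pairs.
    """
    return sum(w_truth * t + w_info * i for t, i in facts)


# A short reply with one safe claim versus a fuller reply with one error.
conservative = [(1.0, 0.4)]
fuller = [(1.0, 0.8), (1.0, 0.8), (-2.0, 0.8)]


def preferred(w_info):
    """Return which reply the reward favors at a given informativeness weight."""
    c = response_reward(conservative, w_info=w_info)
    f = response_reward(fuller, w_info=w_info)
    return "conservative" if c > f else "fuller"


low_weight_choice = preferred(0.2)   # conservative reply scores higher
high_weight_choice = preferred(1.0)  # fuller reply scores higher
```

With these illustrative numbers the preference flips once `w_info` exceeds 0.5, which is the kind of threshold worth sweeping on a development set if replies become too short after training.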

Agent Features

Tool Use

  • retrieval (Wikipedia) for verification

Frameworks

  • on-policy self-assessment (policy as judge)

Optimization Features

Infra Optimization

  • reported runs fit small GPU clusters (two 8-GPU nodes or one 8-GPU node)

Training Optimization

  • RL

Reproducibility

Data Urls

  • HotpotQA, SQuADv2, Biography (FactScore setup), Wikipedia (04/01/2023) - links in paper

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses on factual knowledge; does not address other hallucination types (e.g., logical or safety failures).
  • Evaluation benchmarks are limited and may not capture all hallucination behaviors.
  • Automated self-verification can be wrong, which may mislead on-policy learning.

When Not To Use

  • When no reliable external retrieval corpus exists for the domain.
  • When you cannot afford GPU time for on-policy RL runs.
  • When you need maximum answer coverage and cannot accept conservative/shorter replies.

Failure Modes

  • Model might learn to validate its own incorrect claims (self-reinforcing errors).
  • Reward hacking where the model minimizes output to avoid penalties, reducing usefulness.
  • Coverage loss: fewer statements may omit needed information if informativeness weight is low.

Core Entities

Models

  • Qwen2.5-7B-Instruct
  • Llama3.1-8B-Instruct
  • Qwen2.5-72B-Instruct (evaluation judge)

Metrics

  • FactScore
  • #Cor. (number of correct facts)
  • #Inc. (number of incorrect facts)
  • %Res. (response ratio)
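
The four metrics above can be computed from per-statement labels roughly as follows. This is a sketch that assumes the common FactScore convention: precision of supported facts per response, averaged over the responses that did not abstain. The paper's exact aggregation may differ, and the function name `summarize` and its dictionary keys are hypothetical.

```python
def summarize(labels_per_response):
    """Aggregate per-statement labels (True = correct) into summary metrics.

    An entry of None or an empty list is treated as an abstained response.
    """
    responded = [labels for labels in labels_per_response if labels]
    n = len(responded)
    if n == 0:
        return {"fact_score": 0.0, "n_correct": 0.0,
                "n_incorrect": 0.0, "pct_responded": 0.0}
    return {
        # Mean per-response fraction of correct statements.
        "fact_score": sum(sum(l) / len(l) for l in responded) / n,
        # Average counts per responded prompt.
        "n_correct": sum(sum(l) for l in responded) / n,
        "n_incorrect": sum(len(l) - sum(l) for l in responded) / n,
        # Share of prompts that received a non-abstaining reply.
        "pct_responded": n / len(labels_per_response),
    }


# Four prompts: three answered, one abstained.
stats = summarize([[True, True, False], [True], None, [False, False, True, True]])
```

Tracking `pct_responded` next to `fact_score` matters here because a model can raise its score simply by answering less often.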

Datasets

  • HotpotQA
  • SQuADv2
  • Biography (FactScore setup)
  • English Wikipedia (04/01/2023 retrieval corpus)

Benchmarks

  • FactScore