FSPO: reward-wise RL that checks factuality at each reasoning step to cut hallucinations and boost reasoning

May 30, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Junyi Li, Hwee Tou Ng


Why It Matters For Business

FSPO reduces step-level hallucinations and raises reasoning accuracy, improving reliability for products that need trustworthy step-by-step explanations such as tutoring, medical assistants, and decision-support.

Summary TLDR

The authors show that standard reinforcement learning (RL) fine-tuning for chain-of-thought reasoning increases hallucinations. They propose FSPO (Factuality-aware Step-wise Policy Optimization), a token-level RL method in which an automated verifier checks the factuality of each reasoning step and the result is used to adjust token advantages during training. Across math and hallucination benchmarks with Qwen2.5 and Llama models, FSPO reduces hallucinations and raises reasoning accuracy compared to vanilla RL baselines.
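The token-level adjustment described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `adjust_token_advantages`, the span representation, and the sign-flip rule (flip a step's tokens when the verifier's verdict disagrees with the sign of the sequence advantage) are all hypothetical choices made for this example.

```python
from typing import List


def adjust_token_advantages(
    sequence_advantage: float,
    step_token_spans: List[range],
    step_is_factual: List[bool],
    num_tokens: int,
) -> List[float]:
    """Spread one sequence-level advantage over tokens, then correct it per step.

    Assumed rule (for illustration only): tokens in a factual step should not be
    penalized, and tokens in a non-factual step should not be rewarded, so the
    advantage sign is flipped whenever the verdict disagrees with it.
    """
    token_adv = [sequence_advantage] * num_tokens
    for span, factual in zip(step_token_spans, step_is_factual):
        disagrees = (factual and sequence_advantage < 0) or (
            not factual and sequence_advantage > 0
        )
        if disagrees:
            for t in span:
                token_adv[t] = -sequence_advantage
    return token_adv


# Wrong final answer (negative advantage), but the first step is factual.
adv = adjust_token_advantages(
    sequence_advantage=-1.0,
    step_token_spans=[range(0, 3), range(3, 5)],
    step_is_factual=[True, False],
    num_tokens=5,
)
print(adv)  # [1.0, 1.0, 1.0, -1.0, -1.0]
```

In a real training loop the verdicts would come from the verifier and the spans from a step segmenter; both are outside the scope of this sketch.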

Problem Statement

Outcome-driven RL that rewards only final answers makes reasoning models more likely to produce unsupported or false intermediate steps (hallucinations). Sparse binary rewards create high-variance gradients, force high entropy (more random outputs), and allow spurious local optima where the model is confidently wrong.
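A simplified way to see why sparse binary rewards are noisy (an illustrative model, not the paper's derivation): treat the final-answer reward $R$ as a Bernoulli variable that equals 1 with probability $p$, where $p$ is the assumed chance that a sampled response is correct.

```latex
\mathbb{E}[R] = p, \qquad
\operatorname{Var}[R] = p(1-p), \qquad
\frac{\mathbb{E}[R]}{\sqrt{\operatorname{Var}[R]}} = \sqrt{\frac{p}{1-p}}
```

For a hard task with $p = 0.05$ the ratio is about 0.23, so the reward signal is small relative to its own spread, and most sampled responses carry no positive signal at all.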

Main Contribution

Empirical finding: RL-tuned reasoning models show higher hallucination rates across multiple benchmarks.

Theoretical analysis showing three causes for RL-induced hallucination: high-variance gradient, entropy-driven randomness, and spurious local optima under binary rewards.

FSPO algorithm: integrate automated step-wise factuality verification into token-level advantage adjustment to reward factual tokens and penalize incorrect ones.

Extensive experiments on math and hallucination benchmarks (Qwen2.5 and Llama backbones) showing FSPO reduces hallucinations while improving or maintaining reasoning scores.

Open-source code: implementation and training recipes released on GitHub.

Key Findings

Reasoning-focused RL increases hallucination rates versus non-RL models on standard benchmarks.

Numbers: HaluEval-QA accuracy: Qwen2.5-7B-Base 48.0% → FSPO 83.0%; TruthfulQA truthful ratio: 38.2% → 58.4%

FSPO improves math reasoning accuracy substantially over the base model.

Numbers: GSM8K Pass@1: Qwen2.5-7B-Base 65.2% → FSPO 89.5% (+24.3 points)

Adding step-wise factuality stabilizes RL updates by providing denser feedback and non-zero gradients even when final answer is wrong.

FSPO generalizes across RL algorithms and data sizes.

Numbers: FSPO works with GRPO and Reinforce++; hallucination-reduction benefits appear with 1K–4K training samples
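The denser-feedback finding above can be illustrated with group-normalized advantages of the kind GRPO computes. The sketch below is a toy under stated assumptions: the per-response factuality scores are made up, and adding them directly to the answer reward is a hypothetical combination chosen for clarity, not necessarily how FSPO combines the two signals.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses, all with a wrong final answer (binary reward 0).
answer_rewards = [0.0, 0.0, 0.0, 0.0]

# Hypothetical verifier output: fraction of reasoning steps judged factual.
factual_fraction = [0.75, 0.25, 0.5, 0.5]

outcome_only = group_normalized_advantages(answer_rewards)
combined = [a + f for a, f in zip(answer_rewards, factual_fraction)]
with_factuality = group_normalized_advantages(combined)

print(outcome_only)     # every entry is 0.0: no learning signal from this group
print(with_factuality)  # first entry positive, second negative
```

With outcome-only rewards the whole group contributes nothing to the update, while the step-level signal still separates better-grounded responses from worse ones.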

Results

GSM8K Pass@1

Value: 89.5%

Baseline: Qwen2.5-7B-Base 65.2%

MATH-500 Pass@1

Value: 75.5%

Baseline: Qwen2.5-7B-Base 35.7%

TruthfulQA (truthful ratio)

Value: 58.4%

Baseline: Qwen2.5-7B-Base 38.2%

HaluEval-QA (accuracy)

Value: 83.0%

Baseline: Qwen2.5-7B-Base 48.0%

HalluQA (truthful ratio)

Value: 52.0%

Baseline: Qwen2.5-7B-Base 39.5%

Who Should Care

What To Try In 7 Days

Run a small FSPO-style fine-tune: add an automated verifier to give token-level rewards on 1K–2K domain examples.

Measure hallucination rate before and after on the same benchmark (TruthfulQA or HaluEval) with the same judge, so the change is directly comparable.

Adopt step-level checks in your evaluation pipeline to catch unsupported intermediate claims early.
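A before/after comparison like the one suggested above only needs a fixed judge and a fixed prompt set. The following is a minimal sketch: `hallucination_rate` and `toy_judge` are hypothetical names, and the keyword-based judge is a stand-in so the example runs; in practice the judge would be the scoring protocol of the chosen benchmark.

```python
from typing import Callable, List


def hallucination_rate(responses: List[str], judge: Callable[[str], bool]) -> float:
    """Fraction of responses that the judge flags as hallucinated."""
    if not responses:
        raise ValueError("need at least one response")
    flagged = sum(1 for response in responses if judge(response))
    return flagged / len(responses)


def toy_judge(response: str) -> bool:
    # Hypothetical stand-in: flags any response containing the word "unsupported".
    return "unsupported" in response


# Responses to the same four prompts, before and after fine-tuning (made-up data).
before = ["unsupported claim A", "grounded claim B", "unsupported claim C", "grounded claim D"]
after = ["grounded claim A", "grounded claim B", "unsupported claim C", "grounded claim D"]

rate_before = hallucination_rate(before, toy_judge)
rate_after = hallucination_rate(after, toy_judge)
print(rate_before, rate_after)  # 0.5 0.25
```

Keeping the judge and prompts identical across both runs is what makes the two rates comparable.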

Optimization Features

Training Optimization

  • token-level advantage reweighting
  • step-wise reward shaping

Reproducibility

Data Urls

  • HotpotQA, 2WikiMultiHopQA, GSM8K, MATH-500, TruthfulQA, HaluEval, HalluQA (public datasets)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Theory focuses on binary (1/0) rewards; extension to arbitrary dense rewards is left to future work.
  • Experiments use 7B–8B models; behavior on much larger models (14B–32B) is not tested due to compute limits.
  • FSPO depends on the quality of the automated verifier; verifier errors can misguide token rewards.

When Not To Use

  • If authoritative evidence sources are not available for your task (FSPO needs evidence to verify steps).
  • When compute budget cannot afford verifier calls during training.
  • For tasks where intermediate steps are not semantically meaningful or are intentionally creative.

Failure Modes

  • Verifier mislabels a correct step as incorrect, causing useful tokens to be penalized.
  • Over-reliance on the verifier leads the model to game the verifier’s heuristics rather than learn true facts.
  • If evidence coverage is low, step-wise rewards may be sparse and fail to prevent spurious optima.

Core Entities

Models

  • Qwen2.5-7B-Base
  • Qwen2.5-7B-Instruct
  • Qwen2.5-14B
  • Qwen2.5-32B
  • Llama3.1-8B-Instruct
  • QwQ-32B
  • DeepSeek-V3
  • DeepSeek-R1
  • R1-Distill-Qwen-7B
  • R1-Distill-Qwen-14B
  • R1-Distill-Qwen-32B
  • R1-Distill-Llama-8B

Metrics

  • Pass@1
  • hallucination rate
  • Accuracy
  • factuality score

Datasets

  • HotpotQA (subset)
  • 2WikiMultiHopQA (subset)
  • SimpleRL
  • TruthfulQA
  • HaluEval
  • HalluQA
  • GSM8K
  • MATH-500
  • AIME 2024
  • AIME 2025

Benchmarks

  • GSM8K
  • MATH-500
  • AIME 2024
  • AIME 2025
  • TruthfulQA
  • HaluEval-QA
  • HalluQA

Context Entities

Models

  • GPT-4o
  • OpenAI o1
  • DeepSeek-R1 (reference)
  • DeepSeek-V3 (reference)