FSPO: reward-wise RL that checks factuality at each reasoning step to cut hallucinations and boost reasoning

May 30, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

FSPO is a practical reward-shaping recipe: it adds a verifier and token-level advantage reweighting during RL to lower hallucinations. It depends on an accurate verifier and adds compute, but it fits existing RL pipelines.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junyi Li, Hwee Tou Ng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FSPO reduces step-level hallucinations and raises reasoning accuracy, improving reliability for products that need trustworthy step-by-step explanations such as tutoring, medical assistants, and decision-support.

Who Should Care

Summary TLDR

The authors show that standard reinforcement learning (RL) fine-tuning for chain-of-thought reasoning increases hallucinations. They propose FSPO, a token-level RL method that adds automated step-wise factuality checks (via a verifier) to shape token advantages during training. Across math and hallucination benchmarks with Qwen2.5 and Llama models, FSPO reduces hallucinations and raises reasoning accuracy compared to vanilla RL baselines.
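
The core idea, shaping per-token advantages with step-level verifier verdicts, can be sketched as follows. This is a minimal illustration and not the authors' exact update rule: the function name, the verdict encoding (+1 supported, 0 unverifiable, -1 contradicted), and the `penalty_scale` parameter are all assumptions made for the example.

```python
from typing import List, Sequence


def reweight_token_advantages(
    sequence_advantage: float,
    step_token_counts: Sequence[int],
    step_verdicts: Sequence[int],
    penalty_scale: float = 1.0,
) -> List[float]:
    """Expand one sequence-level advantage into per-token advantages.

    Hypothetical rule (an assumption, not taken from the paper):
      verdict +1 (supported):    never penalize, so use max(A, 0)
      verdict  0 (unverifiable): keep A unchanged
      verdict -1 (contradicted): always penalize, so use -|A| * penalty_scale
    """
    if len(step_token_counts) != len(step_verdicts):
        raise ValueError("need exactly one verdict per reasoning step")

    token_advantages: List[float] = []
    for n_tokens, verdict in zip(step_token_counts, step_verdicts):
        if verdict > 0:
            step_advantage = max(sequence_advantage, 0.0)
        elif verdict < 0:
            step_advantage = -abs(sequence_advantage) * penalty_scale
        else:
            step_advantage = sequence_advantage
        # Every token in a step shares that step's adjusted advantage.
        token_advantages.extend([step_advantage] * n_tokens)
    return token_advantages


# A correct-answer rollout (A = 0.5) whose second step is contradicted.
print(reweight_token_advantages(0.5, [2, 1, 3], [1, -1, 0]))
```

In this sketch a contradicted step is penalized even when the final answer is right, which is the behavior an outcome-only reward cannot express.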

Problem Statement

Outcome-driven RL that rewards only final answers makes reasoning models more likely to produce unsupported or false intermediate steps (hallucinations). Sparse binary rewards create high-variance gradients, push the policy toward high entropy (more random outputs), and allow spurious local optima where the model is confidently wrong.

Main Contribution

Empirical finding: RL-tuned reasoning models show higher hallucination rates across multiple benchmarks.

Theoretical analysis showing three causes for RL-induced hallucination: high-variance gradient, entropy-driven randomness, and spurious local optima under binary rewards.
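
As a rough illustration of the first cause (this is not the paper's own derivation): suppose a rollout earns a binary reward $R \in \{0, 1\}$ with success probability $p$, a symbol introduced here only for the example. The reward's noise relative to its mean then grows without bound as successes become rare.

```latex
\begin{align}
\mathbb{E}[R] &= p, \\
\operatorname{Var}[R] &= \mathbb{E}[R^2] - \mathbb{E}[R]^2 = p - p^2 = p(1-p), \\
\frac{\sqrt{\operatorname{Var}[R]}}{\mathbb{E}[R]} &= \sqrt{\frac{1-p}{p}} \;\xrightarrow{\;p \to 0\;}\; \infty .
\end{align}
```

At $p = 0.01$, for instance, the ratio is $\sqrt{99} \approx 9.95$. This gives one intuition for why gradient estimates that scale the score function by such a reward can be noisy when correct answers are rare.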

Key Findings

Reasoning-focused RL increases hallucination rates versus non-RL models on standard benchmarks.

Numbers: HaluEval-QA: open-source baseline 48.0% → FSPO 83.0% (accuracy); TruthfulQA: 38.2% → 58.4%

Practical Use: Don’t assume RL fine-tuning always improves factuality; add step-level checks or alternate rewards when training reasoning models.

Evidence Ref: Table 1 (HaluEval-QA, TruthfulQA)

FSPO improves math reasoning accuracy substantially over the base model.

Numbers: GSM8K Pass@1: Qwen2.5-7B-Base 65.2% → FSPO 89.5% (+24.3)

Practical Use: Applying FSPO can lift reasoning performance on math benchmarks without sacrificing factuality; useful when accuracy and traceability both matter.

Evidence Ref: Table 1 (GSM8K)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GSM8K Pass@1 | 89.5% | Qwen2.5-7B-Base 65.2% | +24.3 pts | GSM8K | Table 1 shows FSPO (Qwen-Base) 89.5 vs. base 65.2 | Table 1
MATH-500 Pass@1 | 75.5% | Qwen2.5-7B-Base 35.7% | +39.8 pts | MATH-500 | Table 1 reports FSPO (Qwen-Base) 75.5 vs. base 35.7 | Table 1
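
The deltas in these rows are absolute differences in percentage points, not relative improvements. A short check of the arithmetic, using only the scores reported above (the helper names are made up for this example):

```python
# (metric, FSPO score, base-model score), copied from the results rows.
results = [
    ("GSM8K Pass@1", 89.5, 65.2),
    ("MATH-500 Pass@1", 75.5, 35.7),
]


def delta_points(score: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(score - baseline, 1)


def relative_gain_percent(score: float, baseline: float) -> float:
    """Gain relative to the baseline, in percent, rounded to one decimal."""
    return round(100.0 * (score - baseline) / baseline, 1)


for metric, score, baseline in results:
    print(
        f"{metric}: +{delta_points(score, baseline)} points, "
        f"{relative_gain_percent(score, baseline)}% relative"
    )
```

Reporting both figures avoids reading "+24.3" as a 24.3% relative improvement.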

What To Try In 7 Days

Run a small FSPO-style fine-tune: add an automated verifier to give token-level rewards on 1–2k domain examples.

Measure hallucination rate before/after using the same judge (TruthfulQA or HaluEval) to quantify change.

Adopt step-level checks in your evaluation pipeline to catch unsupported intermediate claims early.
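
The before/after measurement can be as simple as the sketch below. It assumes you already have one boolean per evaluated response from the same judge (`True` means the judge flagged a hallucination); the function names and the toy labels are illustrative only and are not results from the paper.

```python
from typing import Sequence


def hallucination_rate(judged_hallucinated: Sequence[bool]) -> float:
    """Fraction of responses the judge flagged as hallucinated."""
    if not judged_hallucinated:
        raise ValueError("need at least one judged response")
    return sum(judged_hallucinated) / len(judged_hallucinated)


def rate_change(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Change in hallucination rate; negative means fewer hallucinations."""
    return hallucination_rate(after) - hallucination_rate(before)


# Toy judge outputs on the same 8 prompts (made-up values for illustration).
before_finetune = [True, True, False, True, False, False, True, False]
after_finetune = [True, False, False, False, False, False, True, False]

print(f"before: {hallucination_rate(before_finetune):.2f}")
print(f"after:  {hallucination_rate(after_finetune):.2f}")
print(f"change: {rate_change(before_finetune, after_finetune):+.2f}")
```

Keeping the prompts and the judge fixed across both runs is what makes the two rates comparable.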

Optimization Features

Training Optimization
token-level advantage reweighting, step-wise reward shaping

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HotpotQA, 2WikiMultiHopQA, GSM8K, MATH-500, TruthfulQA, HaluEval, HalluQA (public datasets)

Risks & Boundaries

Limitations

Theory focuses on binary (1/0) rewards; extension to arbitrary dense rewards is left to future work.

Experiments use 7B–8B models; behavior on much larger models (14B–32B) is not tested due to compute limits.

When Not To Use

If authoritative evidence sources are not available for your task (FSPO needs evidence to verify steps).

When compute budget cannot afford verifier calls during training.

Failure Modes

Verifier mislabels a correct step as incorrect, causing useful tokens to be penalized.

Over-reliance on the verifier leads the model to game the verifier’s heuristics rather than learn true facts.
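
One way to gauge the first failure mode before training is to spot-check the verifier against a small hand-labeled sample of reasoning steps. The sketch below assumes you have such labels; the function name and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def false_rejection_rate(labeled_steps: Sequence[Tuple[bool, bool]]) -> float:
    """Share of human-verified correct steps that the verifier rejected.

    Each pair is (human_says_correct, verifier_says_correct).
    """
    verdicts_on_correct = [verifier for human, verifier in labeled_steps if human]
    if not verdicts_on_correct:
        raise ValueError("sample contains no human-verified correct steps")
    rejected = sum(1 for verifier_ok in verdicts_on_correct if not verifier_ok)
    return rejected / len(verdicts_on_correct)


# Toy sample (made-up labels): four correct steps, one of which was rejected.
sample = [
    (True, True),
    (True, False),
    (True, True),
    (True, True),
    (False, False),
]
print(f"false rejection rate: {false_rejection_rate(sample):.2f}")
```

A high rate here suggests that useful tokens would be penalized often during training, so the verifier may need tuning before it is used as a reward source.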

Core Entities

Models

Qwen2.5-7B-Base, Qwen2.5-7B-Instruct, Qwen2.5-14B, Qwen2.5-32B, Llama3.1-8B-Instruct, QwQ-32B, DeepSeek-V3, DeepSeek-R1, R1-Distill-Qwen-7B, R1-Distill-Qwen-14B, R1-Distill-Qwen-32B, R1-Distill-Llama-8B

Metrics

Pass@1, hallucination rate, Accuracy, factuality score

Datasets

HotpotQA (subset), 2WikiMultiHopQA (subset), SimpleRL, TruthfulQA, HaluEval, HalluQA, GSM8K, MATH-500, AIME 2024, AIME 2025

Benchmarks

GSM8K, MATH-500, AIME 2024, AIME 2025, TruthfulQA, HaluEval-QA, HalluQA

Context Entities

Models

GPT-4o, GPT-o1, DeepSeek-R1 (reference), DeepSeek-V3 (reference)