FSPO: reward-wise RL that checks factuality at each reasoning step to cut hallucinations and boost reasoning

Overview

Decision Snapshot: Ready For Pilot

FSPO is a practical reward-shaping recipe: it adds verifiers and token-level reweighting during RL to lower hallucinations. It needs verifier accuracy and extra compute but fits existing RL pipelines.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junyi Li, Hwee Tou Ng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FSPO reduces step-level hallucinations and raises reasoning accuracy, improving reliability for products that need trustworthy step-by-step explanations such as tutoring, medical assistants, and decision-support.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The authors show that standard reinforcement learning (RL) fine-tuning for chain-of-thought reasoning increases hallucinations. They propose FSPO, a token-level RL method that adds automated step-wise factuality checks (via a verifier) to shape token advantages during training. Across math and hallucination benchmarks with Qwen2.5 and Llama models, FSPO reduces hallucinations and raises reasoning accuracy compared to vanilla RL baselines.
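
To make "shaping token advantages with step-wise factuality checks" concrete, here is a minimal sketch of one plausible rule. It is an illustration under stated assumptions, not the paper's exact formula: the `Step` type, the three-valued `verdict`, and the `weight` parameter are all hypothetical. Tokens in steps the verifier supports receive a positive advantage, tokens in contradicted steps receive a negative one, and tokens in unverifiable steps keep the sequence-level advantage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step (hypothetical structure for this sketch)."""
    num_tokens: int  # number of tokens the step spans
    verdict: int     # assumed verifier output: +1 supported, -1 contradicted, 0 unverifiable


def shape_token_advantages(
    steps: List[Step],
    sequence_advantage: float,
    weight: float = 1.0,  # assumed scaling knob, not taken from the paper
) -> List[float]:
    """Expand one sequence-level advantage into per-token advantages.

    Supported steps get +|A| * weight, contradicted steps get -|A| * weight,
    and unverifiable steps keep the original advantage A.
    """
    magnitude = abs(sequence_advantage) * weight
    shaped: List[float] = []
    for step in steps:
        if step.verdict > 0:
            value = magnitude
        elif step.verdict < 0:
            value = -magnitude
        else:
            value = sequence_advantage
        shaped.extend([value] * step.num_tokens)
    return shaped


# A wrong final answer (A = -0.5) whose first step was still factually sound.
example = [Step(num_tokens=2, verdict=1), Step(num_tokens=1, verdict=-1), Step(num_tokens=1, verdict=0)]
print(shape_token_advantages(example, sequence_advantage=-0.5))
```

The design point is that a correct intermediate step is no longer penalized just because the final answer was wrong, and a false step is no longer rewarded just because the final answer happened to be right.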

Problem Statement

Outcome-driven RL that rewards only final answers makes reasoning models more likely to produce unsupported or false intermediate steps (hallucinations). Sparse binary rewards create high-variance gradients, force high entropy (more random outputs), and allow spurious local optima where the model is confidently wrong.
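
As a textbook illustration of why a sparse binary reward gives a noisy learning signal (this is not the paper's own derivation, and the symbols below are introduced only for this note), let the final-answer reward be a Bernoulli variable with success probability p:

```latex
R \in \{0, 1\}, \qquad \Pr(R = 1) = p
\quad\Longrightarrow\quad
\mathbb{E}[R] = p, \qquad \operatorname{Var}[R] = p\,(1 - p).
```

The reward variance peaks at p = 1/2, and when p is small most sampled responses receive R = 0 and carry no information about which intermediate steps were sound. A per-step signal is one way to fill that gap.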

Main Contribution

Empirical finding: RL-tuned reasoning models show higher hallucination rates across multiple benchmarks.

Theoretical analysis showing three causes for RL-induced hallucination: high-variance gradient, entropy-driven randomness, and spurious local optima under binary rewards.

Key Findings

Reasoning-focused RL increases hallucination rates versus non-RL models on standard benchmarks.

Numbers: HaluEval-QA: open-source baseline 48.0% → FSPO 83.0% (accuracy), TruthfulQA: 38.2 → 58.4

Practical Use: Don’t assume RL fine-tuning always improves factuality; add step-level checks or alternate rewards when training reasoning models.

Evidence Ref: Table 1 (HaluEval-QA, TruthfulQA)

FSPO improves math reasoning accuracy substantially over the base model.

Numbers: GSM8K Pass@1: Qwen2.5-7B-Base 65.2% → FSPO 89.5% (Δ +24.3)

Practical Use: Applying FSPO can lift reasoning performance on math benchmarks without sacrificing factuality; useful when accuracy and traceability both matter.

Evidence Ref: Table 1 (GSM8K)

Results

Metric	Value	Baseline	Delta (percentage points)	Split / Dataset	Evidence	Evidence Ref
GSM8K Pass@1	89.5%	Qwen2.5-7B-Base 65.2%	+24.3	GSM8K	Table 1 shows FSPO (Qwen-Base) 89.5 vs base 65.2	Table 1
MATH-500 Pass@1	75.5%	Qwen2.5-7B-Base 35.7%	+39.8	MATH-500	Table 1 reports FSPO (Qwen-Base) 75.5 vs base 35.7	Table 1
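
The deltas in the table are absolute differences in percentage points, not relative improvements. The short sketch below recomputes both from the reported Pass@1 values, so the two readings are not confused. The `deltas` helper is a name introduced here for illustration.

```python
from typing import Dict, Tuple


def deltas(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Pass@1 values as reported in the table above: (base model, FSPO).
results: Dict[str, Tuple[float, float]] = {
    "GSM8K Pass@1": (65.2, 89.5),
    "MATH-500 Pass@1": (35.7, 75.5),
}

for name, (base, fspo) in results.items():
    abs_pts, rel_pct = deltas(base, fspo)
    print(f"{name}: +{abs_pts:.1f} points absolute, +{rel_pct:.1f}% relative")
```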

What To Try In 7 Days

Run a small FSPO-style fine-tune: add an automated verifier to give token-level rewards on 1–2k domain examples.

Measure hallucination rate before/after using the same judge (TruthfulQA or HaluEval) to quantify change.

Adopt step-level checks in your evaluation pipeline to catch unsupported intermediate claims early.
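
The before/after measurement can be sketched as follows. This is a minimal harness under stated assumptions: `toy_judge` is a placeholder that flags a literal marker string and stands in for whatever judge you actually use (for example a TruthfulQA or HaluEval scorer), and the response lists are made-up examples, not benchmark data.

```python
from typing import Callable, Iterable


def hallucination_rate(responses: Iterable[str], judge: Callable[[str], bool]) -> float:
    """Fraction of responses the judge flags as hallucinated."""
    items = list(responses)
    if not items:
        raise ValueError("need at least one response to compute a rate")
    flagged = sum(1 for response in items if judge(response))
    return flagged / len(items)


def toy_judge(response: str) -> bool:
    """Placeholder judge: flags responses containing '[unsupported]'."""
    return "[unsupported]" in response


# Made-up outputs from the same prompts before and after fine-tuning.
before = ["Paris is in France.", "The moon is cheese. [unsupported]", "2+2=5 [unsupported]", "Water is H2O."]
after = ["Paris is in France.", "The moon is rock.", "2+2=5 [unsupported]", "Water is H2O."]

rate_before = hallucination_rate(before, toy_judge)
rate_after = hallucination_rate(after, toy_judge)
print(f"before={rate_before:.2f} after={rate_after:.2f} change={rate_after - rate_before:+.2f}")
```

Passing the same `judge` callable to both calls is the point of the harness: a change in the judge between runs would make the two rates incomparable.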

Optimization Features

Training Optimization

token-level advantage reweighting, step-wise reward shaping

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/nusnlp/FSPO

Data URLs

HotpotQA, 2WikiMultiHopQA, GSM8K, MATH-500, TruthfulQA, HaluEval, HalluQA (public datasets)

Risks & Boundaries

Limitations

Theory focuses on binary (1/0) rewards; extension to arbitrary dense rewards is left to future work.

Experiments use 7B–8B models; behavior on much larger models (14B–32B) is not tested due to compute limits.

When Not To Use

If authoritative evidence sources are not available for your task (FSPO needs evidence to verify steps).

When compute budget cannot afford verifier calls during training.
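
To judge whether verifier calls fit a compute budget, a back-of-the-envelope estimate is usually enough. The sketch below assumes one verifier call per reasoning step per sampled rollout. Every number in the example is a hypothetical placeholder, not a figure from the paper.

```python
def verifier_calls(num_prompts: int, rollouts_per_prompt: int,
                   avg_steps_per_rollout: int, epochs: int = 1) -> int:
    """Total verifier calls, assuming one call per reasoning step."""
    return num_prompts * rollouts_per_prompt * avg_steps_per_rollout * epochs


def verifier_hours(calls: int, seconds_per_call: float) -> float:
    """Sequential wall-clock hours spent in the verifier (no batching assumed)."""
    return calls * seconds_per_call / 3600.0


# Hypothetical pilot: 1,500 prompts, 8 rollouts each, about 6 steps per rollout.
calls = verifier_calls(num_prompts=1500, rollouts_per_prompt=8, avg_steps_per_rollout=6)
hours = verifier_hours(calls, seconds_per_call=0.5)
print(f"{calls} verifier calls, roughly {hours:.1f} sequential hours")
```

Batching or parallel verifier replicas would reduce wall-clock time but not the total number of calls.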

Failure Modes

The verifier mislabels a correct step as incorrect, causing useful tokens to be penalized.

Over-reliance on the verifier leads the model to game the verifier’s heuristics rather than learn true facts.
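
One way to watch for the first failure mode is to audit the verifier against a small human-labeled sample of reasoning steps before and during training. The sketch below is a minimal example under stated assumptions: the labels are made up, and `audit_verifier` is a helper name introduced here, not part of the FSPO code.

```python
from typing import Dict, Sequence


def audit_verifier(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Compare verifier verdicts with human labels (True means the step is correct).

    false_reject_rate: share of truly correct steps the verifier marked incorrect.
    false_accept_rate: share of truly incorrect steps the verifier marked correct.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct_steps = sum(1 for g in gold if g)
    incorrect_steps = len(gold) - correct_steps
    false_rejects = sum(1 for g, p in zip(gold, predicted) if g and not p)
    false_accepts = sum(1 for g, p in zip(gold, predicted) if not g and p)
    return {
        "false_reject_rate": false_rejects / correct_steps if correct_steps else 0.0,
        "false_accept_rate": false_accepts / incorrect_steps if incorrect_steps else 0.0,
    }


# Made-up audit sample of five steps.
gold_labels = [True, True, True, False, False]
verifier_labels = [True, False, True, True, False]
print(audit_verifier(gold_labels, verifier_labels))
```

A high false-reject rate means useful tokens are being penalized. A rising false-accept rate over the course of training can be an early sign that the model is learning to exploit the verifier.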

Core Entities

Models

Qwen2.5-7B-Base, Qwen2.5-7B-Instruct, Qwen2.5-14B, Qwen2.5-32B, Llama3.1-8B-Instruct, QwQ-32B, DeepSeek-V3, DeepSeek-R1, R1-Distill-Qwen-7B, R1-Distill-Qwen-14B, R1-Distill-Qwen-32B, R1-Distill-Llama-8B

Metrics

Pass@1, hallucination rate, Accuracy, factuality score

Datasets

HotpotQA (subset), 2WikiMultiHopQA (subset), SimpleRL, TruthfulQA, HaluEval, HalluQA, GSM8K, MATH-500, AIME 2024, AIME 2025
