Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
FSPO reduces step-level hallucinations and raises reasoning accuracy, improving reliability for products that need trustworthy step-by-step explanations such as tutoring, medical assistants, and decision-support.
Summary TLDR
The authors show that standard reinforcement learning (RL) fine-tuning for chain-of-thought reasoning increases hallucinations. They propose FSPO, a token-level RL method that adds automated step-wise factuality checks (via a verifier) to shape token advantages during training. Across math and hallucination benchmarks with Qwen2.5 and Llama models, FSPO reduces hallucinations and raises reasoning accuracy compared to vanilla RL baselines.
Problem Statement
Outcome-driven RL that rewards only final answers makes reasoning models more likely to produce unsupported or false intermediate steps (hallucinations). Sparse binary rewards create high-variance gradients, push the policy toward high entropy (more random outputs), and allow spurious local optima where the model is confidently wrong.
Main Contribution
Empirical finding: RL-tuned reasoning models show higher hallucination rates across multiple benchmarks.
Theoretical analysis showing three causes for RL-induced hallucination: high-variance gradient, entropy-driven randomness, and spurious local optima under binary rewards.
FSPO algorithm: integrates automated step-wise factuality verification into token-level advantage adjustment, rewarding factual tokens and penalizing incorrect ones.
Extensive experiments on math and hallucination benchmarks (Qwen2.5 and Llama backbones) showing FSPO reduces hallucinations while improving or maintaining reasoning scores.
Open-source code: implementation and training recipes released on GitHub.
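The token-level advantage adjustment described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `adjust_token_advantages`, the label encoding (+1 factual, -1 contradicted, 0 unverifiable), and the sign-alignment rule are all assumptions made for this example.

```python
from typing import List


def adjust_token_advantages(seq_advantage: float, token_labels: List[int]) -> List[float]:
    """Align each token's advantage sign with its factuality label.

    Hypothetical rule (an assumption, not taken from the paper):
      +1 (verified factual)  -> positive advantage of the same magnitude
      -1 (contradicted)      -> negative advantage of the same magnitude
       0 (unverifiable)      -> keep the sequence-level advantage unchanged
    """
    magnitude = abs(seq_advantage)
    adjusted = []
    for label in token_labels:
        if label > 0:
            adjusted.append(magnitude)
        elif label < 0:
            adjusted.append(-magnitude)
        else:
            adjusted.append(seq_advantage)
    return adjusted


# A response with a wrong final answer (negative advantage) that still
# contains two factual tokens, one unverifiable token, and one false token.
example = adjust_token_advantages(-0.5, [1, 1, 0, -1])
```

Under this assumed rule, factual tokens in a failed response are still reinforced, and false tokens in a successful response are still penalized.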
Key Findings
Reasoning-focused RL increases hallucination rates versus non-RL models on standard benchmarks.
FSPO improves math reasoning accuracy substantially over the base model.
Adding step-wise factuality stabilizes RL updates by providing denser feedback and non-zero gradients even when the final answer is wrong.
FSPO generalizes across RL algorithms and data sizes.
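To see why a denser signal can yield non-zero gradients when every final answer is wrong, consider a toy example with group-normalized advantages. The group normalization, the `group_advantages` helper, the additive combination, and the factuality scores below are all assumptions for illustration; they are not FSPO's exact mechanism.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses, all with a wrong final answer (binary reward 0).
outcome_only = [0.0, 0.0, 0.0, 0.0]

# Hypothetical per-response factuality scores from a step-wise verifier.
factuality = [0.2, 0.8, 0.5, 0.5]
combined = [o + f for o, f in zip(outcome_only, factuality)]

flat = group_advantages(outcome_only)  # every advantage is zero: no learning signal
shaped = group_advantages(combined)    # responses now differ, so advantages differ
```

With outcome rewards alone the group carries no signal; once the factuality scores are included, the least factual response gets a negative advantage and the most factual a positive one.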
Results
GSM8K Pass@1
MATH-500 Pass@1
TruthfulQA (truthful ratio)
Accuracy
HalluQA (truthful ratio)
Who Should Care
What To Try In 7 Days
Run a small FSPO-style fine-tune: add an automated verifier to give token-level rewards on 1–2k domain examples.
Measure hallucination rate before and after on the same benchmark (e.g., TruthfulQA or HaluEval), using the same judge for both runs, to quantify the change.
Adopt step-level checks in your evaluation pipeline to catch unsupported intermediate claims early.
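The before/after measurement in the steps above could be set up along these lines. This is a sketch only: `hallucination_rate`, `toy_judge`, and the sample answers are placeholders invented for the example, and a real run would substitute a benchmark's own judging procedure.

```python
from typing import Callable, List


def hallucination_rate(answers: List[str], judge: Callable[[str], bool]) -> float:
    """Fraction of answers that the judge flags as hallucinated."""
    if not answers:
        raise ValueError("need at least one answer")
    flagged = sum(1 for answer in answers if judge(answer))
    return flagged / len(answers)


def toy_judge(answer: str) -> bool:
    # Placeholder judge (an assumption): flags answers marked as unsupported.
    return "unsupported" in answer


# Hypothetical model outputs on the same prompts before and after fine-tuning.
before = ["unsupported claim", "grounded claim", "unsupported claim", "grounded claim"]
after = ["grounded claim", "grounded claim", "unsupported claim", "grounded claim"]

rate_before = hallucination_rate(before, toy_judge)
rate_after = hallucination_rate(after, toy_judge)
delta = rate_after - rate_before  # negative means fewer hallucinations
```

Keeping the judge and the prompt set fixed across both runs is what makes the difference attributable to the fine-tune rather than to the evaluation setup.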
Optimization Features
Training Optimization
- token-level advantage reweighting
- step-wise reward shaping
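As a rough sketch of how step-level verifier output might be spread onto tokens for these two features, the example below splits a response into steps, labels each step, and copies the label to every token in it. Period-based step splitting, whitespace tokenization, and the lookup-table `toy_verifier` are simplifying assumptions, not details from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical evidence store: steps the toy verifier accepts as factual.
KNOWN_TRUE_STEPS = {"2 plus 2 is 4"}


def split_steps(response: str) -> List[str]:
    """Split a response into steps at periods (a simplifying assumption)."""
    return [s.strip() for s in response.split(".") if s.strip()]


def toy_verifier(step: str) -> int:
    """Return +1 if the step is in the evidence store, otherwise -1."""
    return 1 if step in KNOWN_TRUE_STEPS else -1


def token_labels_from_steps(
    response: str, verify_step: Callable[[str], int]
) -> List[Tuple[str, int]]:
    """Give every whitespace token the label of the step that contains it."""
    labeled = []
    for step in split_steps(response):
        label = verify_step(step)
        for token in step.split():
            labeled.append((token, label))
    return labeled


pairs = token_labels_from_steps("2 plus 2 is 4. 4 times 3 is 13.", toy_verifier)
labels = [label for _, label in pairs]
```

The resulting per-token labels are the kind of input a token-level advantage reweighting step would consume.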
Reproducibility
Code Urls
Data Urls
- HotpotQA, 2WikiMultiHopQA, GSM8K, MATH-500, TruthfulQA, HaluEval, HalluQA (public datasets)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Theory focuses on binary (1/0) rewards; extension to arbitrary dense rewards is left to future work.
- Experiments use 7B–8B models; behavior on much larger models (14B–32B) is not tested due to compute limits.
- FSPO depends on the quality of the automated verifier; verifier errors can misguide token rewards.
When Not To Use
- If authoritative evidence sources are not available for your task (FSPO needs evidence to verify steps).
- When compute budget cannot afford verifier calls during training.
- For tasks where intermediate steps are not semantically meaningful or are intentionally creative.
Failure Modes
- Verifier mislabels a correct step as incorrect, causing useful tokens to be penalized.
- Over-reliance on the verifier leads the model to game the verifier's heuristics rather than learn true facts.
- If evidence coverage is low, step-wise rewards may be sparse and fail to prevent spurious optima.
Core Entities
Models
- Qwen2.5-7B-Base
- Qwen2.5-7B-Instruct
- Qwen2.5-14B
- Qwen2.5-32B
- Llama3.1-8B-Instruct
- QwQ-32B
- DeepSeek-V3
- DeepSeek-R1
- R1-Distill-Qwen-7B
- R1-Distill-Qwen-14B
- R1-Distill-Qwen-32B
- R1-Distill-Llama-8B
Metrics
- Pass@1
- hallucination rate
- Accuracy
- factuality score
Datasets
- HotpotQA (subset)
- 2WikiMultiHopQA (subset)
- SimpleRL
- TruthfulQA
- HaluEval
- HalluQA
- GSM8K
- MATH-500
- AIME 2024
- AIME 2025
Benchmarks
- GSM8K
- MATH-500
- AIME 2024
- AIME 2025
- TruthfulQA
- HaluEval-QA
- HalluQA
Context Entities
Models
- GPT-4o
- GPT-o1
- DeepSeek-R1 (reference)
- DeepSeek-V3 (reference)

