Overview
The paper presents strong empirical evidence, across multiple models and benchmarks, that RL finetuning with CoT can raise accuracy while reducing robustness and CoT faithfulness. The results are reported as reproducible, but training outcomes are sensitive to the random seed.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 2/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
Higher benchmark scores from RL-tuned VLMs don't guarantee reliable, grounded reasoning; for product use (search, robotics, assistants) you must test for adversarial text cues and CoT faithfulness before deployment.
Summary TLDR
The authors stress-test RL-finetuned vision-language models (VLMs) on simple visual reasoning tasks by injecting misleading captions and misleading chain-of-thought (CoT) starts. Open-source RL-tuned VLMs often lose accuracy or become unfaithful (CoT disagrees with final answer) under these small textual perturbations. Closed-source models show the same failure modes but are substantially more robust and more often produce faithful CoT. RL finetuning increases benchmark accuracy and reduces output entropy, yet often drives a trade-off: higher accuracy with less faithful, less robust reasoning. Data augmentation helps against wrong captions but not reliably against wrong CoT; adding a faith-fi
Problem Statement
Do RL-finetuned multimodal reasoning models truly reason from images, or do they rely on textual cues and produce unfaithful chains-of-thought? The paper probes whether small, controlled textual perturbations (misleading captions or misleading CoT seeds) reveal hidden brittleness and whether RL finetuning amplifies or mitigates these failures.
Main Contribution
A controlled stress-test: add Wrong-Caption and Wrong-Think perturbations to eight visual reasoning benchmarks to probe modality conflicts.
Empirical finding: open-source RL-finetuned VLMs lose accuracy and produce more unfaithful CoT under small textual perturbations.
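The two perturbations can be sketched as simple prompt transformations. This is a minimal illustration, not the paper's templates: the `Sample` type, the caption wording, and the `<think>` tag are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    question: str
    options: List[str]
    answer_idx: int  # index of the correct option


def pick_wrong_option(sample: Sample) -> str:
    """Return the first option that is not the correct one."""
    for i, option in enumerate(sample.options):
        if i != sample.answer_idx:
            return option
    raise ValueError("sample needs at least one incorrect option")


def base_prompt(sample: Sample) -> str:
    """Plain multiple-choice prompt with lettered options."""
    lines = [f"({chr(65 + i)}) {o}" for i, o in enumerate(sample.options)]
    return sample.question + "\n" + "\n".join(lines)


def wrong_caption_prompt(sample: Sample) -> str:
    """Prepend a caption that points at an incorrect option (hypothetical wording)."""
    wrong = pick_wrong_option(sample)
    return f"Caption: The image shows that the answer is {wrong}.\n" + base_prompt(sample)


def wrong_think_prefix(sample: Sample) -> str:
    """Start of a chain-of-thought that commits to an incorrect option.

    With an open model this string would be forced as the beginning of the
    generated response; the <think> tag is an assumed convention.
    """
    wrong = pick_wrong_option(sample)
    return f"<think>Looking at the image, the answer is clearly {wrong}."


sample = Sample("Which object is closer to the camera?", ["chair", "table"], 0)
print(wrong_caption_prompt(sample))
print(wrong_think_prefix(sample))
```

The image input is left untouched in both cases, so any change in the model's answer can be attributed to the injected text.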
Key Findings
Wrong-Think prompts cause substantial accuracy drops for some open-source VLMs.
RL finetuning narrows model output distributions (lower entropy) while increasing headline accuracy.
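To make "narrower output distribution" concrete, the sketch below computes Shannon entropy over a model's answer-option probabilities. The two distributions are invented toy values, not numbers from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution.

    The input is renormalised so that small rounding errors do not matter.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


# Toy four-option distributions (illustrative values only).
spread = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain
peaked = [0.97, 0.01, 0.01, 0.01]  # nearly all mass on one option

print(f"spread: {entropy(spread):.3f} nats")
print(f"peaked: {entropy(peaked):.3f} nats")
```

Entropy only measures how concentrated the distribution is. A peaked distribution can sit on the wrong option just as easily as on the right one.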
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy change under Wrong-Think (SpaceR) | -6.44% ± 3.89 (mean) | Base prompt accuracy | -6.44% | Average across evaluated spatial datasets (Table 4) | Table 4 reports a mean delta of −6.44 ± 3.89 for SpaceR under Wrong-Think | Table 4 |
| AUROC: P_base predicts robustness (Stop-Think, SpaceR) | 0.958 | — | — | Predicting robustness to Stop-Think perturbation (Table 6) | P_base AUROC 0.958 for SpaceR (Table 6) | Table 6 |
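
As a reading aid for the AUROC row, the sketch below computes AUROC directly from its pairwise definition: the probability that a randomly chosen robust sample has a higher P_base than a randomly chosen non-robust one. The scores and labels are toy values, not data from the paper.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: P_base per sample, and whether the answer survived the perturbation.
p_base_scores = [0.95, 0.90, 0.60, 0.40, 0.30]
stayed_correct = [1, 1, 0, 1, 0]

print(round(auroc(p_base_scores, stayed_correct), 3))
```

A value of 0.5 means the score carries no ranking information, and 1.0 means it separates the two groups perfectly.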
What To Try In 7 Days
Run Wrong-Caption and Wrong-Think probes on your VLM to reveal reliance on text context.
Measure P_base (probability on correct option) and entropy per sample; use P_base as a filter for robustness.
Add caption-augmentation to your RL or SFT pipeline to reduce caption-driven failures, then re-evaluate faithfulness separately.
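The P_base measurement can be sketched as a softmax over per-option logits followed by a threshold check. The logits, the `(id, logits, answer index)` tuple layout, and the 0.8 threshold are assumptions for illustration. The sketch also assumes that lower P_base signals lower robustness, and that assumption should be verified on your own data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def p_base(option_logits, answer_idx):
    """Probability the model assigns to the correct option on the unperturbed prompt."""
    return softmax(option_logits)[answer_idx]


def likely_fragile(samples, threshold=0.8):
    """Return ids of samples whose P_base falls below the (assumed) threshold."""
    return [sid for sid, logits, ans in samples if p_base(logits, ans) < threshold]


# Toy samples: (id, logits over four options, index of the correct option).
samples = [
    ("q1", [4.0, 0.0, 0.0, 0.0], 0),  # confident and correct
    ("q2", [1.0, 0.8, 0.0, 0.0], 0),  # correct but only by a narrow margin
]

print(likely_fragile(samples))
```

Samples flagged this way are natural candidates to run through the Wrong-Caption and Wrong-Think probes first.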
Risks & Boundaries
Limitations
High run-to-run variability across random seeds; some effects depend on seed.
Closed-model results are approximate because Wrong-Think must be enforced via prompt rather than enforced sampling.
When Not To Use
Do not treat benchmark accuracy as the sole evidence of model trustworthiness; accuracy alone is misleading.
Do not assume augmentation fixes all adversarial inputs; Wrong-Think perturbations remain hard to defend against.
Failure Modes
Models become confidently wrong (low entropy but incorrect) under adversarial prompts.
CoT–answer decoupling: correct answers paired with unfaithful reasoning traces.
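Both failure modes can be flagged per sample with a simple heuristic. In the sketch below, the regular expression for extracting the answer implied by the CoT, the A to D option letters, and the 0.3 entropy cutoff are all assumptions. A real evaluation would need a more careful parser or a judge model.

```python
import re
from typing import List, Optional

# Heuristic: the last "answer is X" statement in the reasoning trace.
ANSWER_RE = re.compile(r"[Aa]nswer is \(?([A-D])\b")


def cot_answer(cot: str) -> Optional[str]:
    """Return the option letter the reasoning trace commits to, if any."""
    matches = ANSWER_RE.findall(cot)
    return matches[-1] if matches else None


def tag_failures(cot: str, final: str, gold: str,
                 entropy: float, low_entropy: float = 0.3) -> List[str]:
    """Tag a sample with the failure modes it exhibits."""
    tags = []
    implied = cot_answer(cot)
    # CoT-answer decoupling: the trace argues for one option, the output gives another.
    if implied is not None and implied != final:
        tags.append("unfaithful_cot")
    # Confidently wrong: incorrect final answer with a sharply peaked distribution.
    if final != gold and entropy < low_entropy:
        tags.append("confidently_wrong")
    return tags


print(tag_failures("The chair is nearer, so the answer is (B).", "A", "A", 0.10))
print(tag_failures("So the answer is C.", "C", "A", 0.05))
```

The first example is correct yet unfaithful, which is why accuracy and faithfulness have to be tracked as separate metrics.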

