RL fine-tuning raises visual reasoning scores but weakens reasoning faithfulness and robustness to misleading text

February 13, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper gives strong empirical evidence across multiple models and benchmarks that RL-with-CoT can increase accuracy but reduce robustness and CoT faithfulness; results are reproducible but training is seed-sensitive.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal


Why It Matters For Business

Higher benchmark scores from RL-tuned VLMs don't guarantee reliable, grounded reasoning; for product use (search, robotics, assistants) you must test for adversarial text cues and CoT faithfulness before deployment.

Who Should Care

Summary TLDR

The authors stress-test RL-finetuned vision-language models (VLMs) on simple visual reasoning tasks by injecting misleading captions and misleading chain-of-thought (CoT) starts. Open-source RL-tuned VLMs often lose accuracy or become unfaithful (CoT disagrees with final answer) under these small textual perturbations. Closed-source models show the same failure modes but are substantially more robust and more often produce faithful CoT. RL finetuning increases benchmark accuracy and reduces output entropy, yet often drives a trade-off: higher accuracy with less faithful, less robust reasoning. Data augmentation helps against wrong captions but not reliably against wrong CoT; adding a faith-fi

Problem Statement

Do RL-finetuned multimodal reasoning models truly reason from images, or do they rely on textual cues and produce unfaithful chains-of-thought? The paper probes whether small, controlled textual perturbations (misleading captions or misleading CoT seeds) reveal hidden brittleness and whether RL finetuning amplifies or mitigates these failures.

Main Contribution

A controlled stress-test: add Wrong-Caption and Wrong-Think perturbations to eight visual reasoning benchmarks to probe modality conflicts.

Empirical finding: open-source RL-finetuned VLMs lose accuracy and produce more unfaithful CoT under small textual perturbations.
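The two perturbations can be illustrated with a minimal sketch. The prompt templates, the `<think>` tag, and the `Item` structure below are hypothetical and are not taken from the paper; the sketch only shows the general shape of a Wrong-Caption probe (misleading text placed in the prompt) and a Wrong-Think probe (misleading text forced at the start of the model's reasoning).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    question: str
    options: List[str]  # answer texts in letter order A, B, C, ...
    answer: str         # correct letter, e.g. "B"


def pick_wrong_letter(item: Item) -> str:
    """Return the first option letter that is not the correct answer."""
    letters = [chr(ord("A") + i) for i in range(len(item.options))]
    return next(letter for letter in letters if letter != item.answer)


def base_prompt(item: Item) -> str:
    """Unperturbed multiple-choice prompt."""
    lines = [item.question]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(item.options)]
    return "\n".join(lines)


def wrong_caption_prompt(item: Item) -> str:
    """Prepend a caption asserting an incorrect option (hypothetical template)."""
    wrong = pick_wrong_letter(item)
    wrong_text = item.options[ord(wrong) - ord("A")]
    caption = f"Caption: the image shows that the answer is '{wrong_text}'."
    return caption + "\n" + base_prompt(item)


def wrong_think_prefix(item: Item) -> str:
    """Misleading text to force at the start of the reasoning (hypothetical template)."""
    wrong = pick_wrong_letter(item)
    return f"<think>Looking at the image, the answer is clearly {wrong}."
```

For open-weight models the Wrong-Think prefix would be forced as the beginning of the generated response; for closed models it can only be requested through the prompt, which is why the source treats those results as approximate.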

Key Findings

Wrong-Think prompts cause substantial accuracy drops for some open-source VLMs.

Numbers: SpaceR mean accuracy delta: −6.44% (Wrong-Think, Table 4)

Practical Use: Stress-test models by injecting misleading CoT and expect a multi-percent accuracy loss; use the size of the drop to detect brittle reasoning pipelines.

Evidence Ref: Table 4

RL finetuning narrows model output distributions (lower entropy) while increasing headline accuracy.

Practical Use: Don't trust rising benchmark scores alone; track entropy and faithfulness metrics to spot overconfidence and hidden fragility.

Evidence Ref: Figure 9
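As a rough illustration of the entropy finding, the sketch below computes letter entropy and P(Correct Letter) from the logits a model assigns to the option letters. How the paper extracts these logits is not stated here, so the input format is an assumption, and the `max_entropy` threshold in `is_confidently_wrong` is an arbitrary illustrative value.

```python
import math
from typing import Dict


def letter_distribution(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax restricted to the option letters (e.g. A-D)."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def letter_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of the distribution over option letters."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def p_correct(probs: Dict[str, float], answer: str) -> float:
    """Probability mass on the correct letter."""
    return probs[answer]


def is_confidently_wrong(probs: Dict[str, float], answer: str,
                         max_entropy: float = 0.5) -> bool:
    """Low entropy but wrong top letter; the threshold is an arbitrary choice."""
    predicted = max(probs, key=probs.get)
    return predicted != answer and letter_entropy(probs) < max_entropy
```

A uniform distribution over four letters gives the maximum entropy of log 4; a model that has collapsed onto one letter gives an entropy near zero whether or not that letter is correct, which is why entropy needs to be tracked alongside accuracy.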

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | −6.44% | Base prompt accuracy | −6.44% | Average across evaluated spatial datasets (Table 4) | Table 4 reports SpaceR −6.44 ± 3.89 mean delta under Wrong-Think | Table 4
AUROC: P_base predicts robustness (Stop-Think, SpaceR) | 0.958 | (none) | (none) | Predicting robustness to Stop-Think perturbation (Table 6) | P_base AUROC 0.958 for SpaceR (Table 6) | Table 6
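To make the AUROC row concrete, here is a minimal sketch of how an AUROC can be computed when P_base is used as a score for predicting robustness. The labelling convention (1 if a sample stays correct under the perturbation, 0 otherwise) is an assumption; the paper's exact definition may differ.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. This is the pairwise (Mann-Whitney) form of AUROC,
    quadratic in the number of samples but fine for small evaluation sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: P_base per sample and whether the sample stayed correct
# under the perturbation (assumed labelling, not the paper's data).
p_base = [0.9, 0.8, 0.3, 0.2]
robust = [1, 1, 0, 0]
print(auroc(p_base, robust))
```

An AUROC of 0.5 means base confidence carries no information about robustness, while values near 1.0 mean high-confidence samples are the ones that survive the perturbation. For larger sets, `sklearn.metrics.roc_auc_score(y_true, y_score)` computes the same quantity.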

What To Try In 7 Days

Run Wrong-Caption and Wrong-Think probes on your VLM to reveal reliance on text context.

Measure P_base (probability on correct option) and entropy per sample; use P_base as a filter for robustness.

Add caption-augmentation to your RL or SFT pipeline to reduce caption-driven failures, then re-evaluate faithfulness separately.
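A probe run of this kind boils down to comparing accuracy with and without the perturbation. The sketch below assumes per-dataset lists of correctness flags from a base run and a perturbed run (the data layout is hypothetical) and reports the mean delta with its spread across datasets. Whether the ± values in the paper are taken across datasets or across seeds is not stated here.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def accuracy(correct_flags: List[bool]) -> float:
    """Accuracy in percent."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


def accuracy_deltas(base: Dict[str, List[bool]],
                    perturbed: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-dataset accuracy change (perturbed minus base), in percentage points."""
    return {name: accuracy(perturbed[name]) - accuracy(base[name]) for name in base}


def summarize(deltas: Dict[str, float]) -> Tuple[float, float]:
    """Mean delta and sample standard deviation across datasets."""
    values = list(deltas.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

A consistently negative mean delta under Wrong-Caption or Wrong-Think suggests the model is leaning on the injected text rather than on the image.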

Optimization Features

Model Optimization
GRPO
Training Optimization
R_correct, faithfulness-as-reward (LLM-as-judge check)
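As a sketch of how these pieces could fit together, the code below combines a correctness reward with a faithfulness reward and converts the rewards of one sampled group into GRPO-style group-relative advantages. The additive combination and the `faith_weight` parameter are assumptions made for illustration; the source does not say how the paper weights the two terms, and the faithfulness flag would come from an LLM judge rather than being given.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(is_correct: bool, is_faithful: bool,
                    faith_weight: float = 0.5) -> float:
    """Correctness reward plus a weighted faithfulness bonus (assumed form)."""
    return float(is_correct) + faith_weight * float(is_faithful)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is only rewarded relative to its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (correct?, faithful?)
group = [(True, True), (True, False), (False, True), (False, False)]
rewards = [combined_reward(c, f) for c, f in group]
print(group_relative_advantages(rewards))
```

With this shaping, a correct but unfaithful response still gets a positive advantage over incorrect ones, but a smaller one than a response that is both correct and faithful.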

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

3DSRBench, CV-Bench, Spatial-MM, WhatsUp, MMBench, V*-Bench, MME-RealWorld-Lite (datasets named in paper)

Risks & Boundaries

Limitations

High run-to-run variability across random seeds; some effects depend on seed.

Closed-model results are approximate because Wrong-Think must be enforced via prompt rather than enforced sampling.

When Not To Use

As sole evidence of model trustworthiness—accuracy alone is misleading.

To assume augmentation fixes all adversarial inputs—Wrong-Think remains hard.

Failure Modes

Models become confidently wrong (low entropy but incorrect) under adversarial prompts.

CoT–answer decoupling: correct answers paired with unfaithful reasoning traces.
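CoT and answer decoupling can be quantified as a faithfulness proportion: the share of samples whose reasoning trace concludes with the same option as the final answer. The paper uses LLM judges for this check; the sketch below swaps in a simple regular expression as a stand-in, so it only handles traces that state their conclusion in a form such as "the answer is B" or "option B".

```python
import re
from typing import List, Optional, Tuple

# Matches "answer is B", "answer B", "option (B)" and similar phrasings.
_CONCLUSION = re.compile(r"\b(?:[Aa]nswer|[Oo]ption)\s+(?:is\s+)?\(?([A-D])\b")


def cot_conclusion(cot: str) -> Optional[str]:
    """Last option letter the reasoning commits to, or None if none is found."""
    matches = _CONCLUSION.findall(cot)
    return matches[-1] if matches else None


def is_faithful(cot: str, final_answer: str) -> bool:
    """True when the trace's conclusion matches the final answer.

    Traces with no parseable conclusion are counted as unfaithful.
    """
    concluded = cot_conclusion(cot)
    return concluded is not None and concluded == final_answer.upper()


def faithfulness_proportion(samples: List[Tuple[str, str]]) -> float:
    """Fraction of (cot, final_answer) pairs that agree."""
    return sum(is_faithful(cot, ans) for cot, ans in samples) / len(samples)
```

Because faithfulness is measured independently of correctness, a model can score well on accuracy while scoring poorly here, which is exactly the decoupling this failure mode describes.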

Core Entities

Models

Qwen2.5-VL-7B-Instruct, SpaceR, Video-R1, Vision-R1, VLAA-Thinker, ViGoRL-Spatial, Qwen3-32B (judge), GPT-OSS-120B (judge), Llama3.1-70B-Instruct (judge), o3, o4-mini, Gemini-2.5-Pro

Metrics

Accuracy, letter entropy, P(Correct Letter), AUROC (predicting robustness from base confidence), faithfulness proportion (CoT vs answer)

Datasets

3DSRBench, CV-Bench, Spatial-MM, WhatsUp, V*-Bench, MMBench, MME-RealWorld-Lite, SAT2, Pixmo-Count, Geometry3K

Benchmarks

3DSRBench, CV-Bench, Spatial-MM, WhatsUp, MMBench, V*-Bench, MME-RealWorld-Lite