Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Better automated judges make preference data and alignment training more reliable. Adding a modest set of structured MCTS pairs (about 13k in the paper) yields accuracy gains of up to ~11.6 points on M-JudgeBench and reduces costly mislabeling.
Summary TLDR
The paper builds M-JudgeBench, a capability-oriented multimodal benchmark (3,712 contrastive pairs) that tests judges on two dimensions: result errors (pairwise Chain‑of‑Thought comparisons and length‑bias cases) and process errors (visual, logical, incidental). It proposes Judge‑MCTS, an MCTS rollout method that generates four controlled reasoning types (short/long × correct/error). Injecting 13k MCTS pairs into ~142k open-source pairwise samples yields the M-Judger models. Compared with the evaluated baselines, M-Judger variants improve pairwise judge accuracy by several points, and by up to ~11.6 points overall on M-JudgeBench. They especially reduce length bias and improve CoT discrimination. The code
Problem Statement
Existing multimodal judge benchmarks group instances by task type and final-answer correctness, but they miss the core judging abilities humans use: robust cross-style comparison, resistance to length bias, and detection of process-level reasoning errors. As a result, current judge models struggle to tell apart similar‑length Chain‑of‑Thought responses and overvalue fluent long reasoning.
Main Contribution
M-JudgeBench: a capability-oriented multimodal judge benchmark with 3,712 curated pairwise instances across 10 subtasks (pairwise CoT, length-bias, process errors).
Judge‑MCTS: an MCTS-based pipeline that synthesizes structured reasoning trajectories (short/long × correct/error) to create fine‑grained pairwise supervision.
M-Judger models: SFT and RL variants trained with 13k MCTS‑augmented samples mixed into ~142k open-source pairs, yielding consistent accuracy gains on M-JudgeBench and existing judge benchmarks.
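The four controlled reasoning types (short/long × correct/error) can be turned into pairwise supervision roughly as sketched below. This is a minimal illustration, not the paper's pipeline: the MCTS rollout itself is not shown, and the `Trajectory` class, the `build_contrastive_pairs` function, and the `kind` labels are hypothetical names introduced here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trajectory:
    """One reasoning trace; all field names are illustrative assumptions."""
    text: str
    is_correct: bool
    is_long: bool


def build_contrastive_pairs(trajectories):
    """Pair every correct trajectory with every incorrect one.

    Each pair is tagged by how the lengths relate, so that length-bias
    cases (short correct vs. long incorrect) can be tracked separately.
    """
    correct = [t for t in trajectories if t.is_correct]
    wrong = [t for t in trajectories if not t.is_correct]
    pairs = []
    for chosen, rejected in product(correct, wrong):
        if chosen.is_long == rejected.is_long:
            kind = "similar_length"
        elif rejected.is_long:
            kind = "length_bias"  # short correct vs. long incorrect
        else:
            kind = "long_correct_vs_short_error"
        pairs.append(
            {"chosen": chosen.text, "rejected": rejected.text, "kind": kind}
        )
    return pairs


# One trajectory of each of the four types for a single question.
example = [
    Trajectory("short correct", is_correct=True, is_long=False),
    Trajectory("long correct ...", is_correct=True, is_long=True),
    Trajectory("short error", is_correct=False, is_long=False),
    Trajectory("long error ...", is_correct=False, is_long=True),
]
example_pairs = build_contrastive_pairs(example)
```

With one trajectory of each type, this yields four pairs: two of similar length, one length-bias case, and one long-correct versus short-error case.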
Key Findings
Judge models struggle to discriminate similar-length Chain‑of‑Thought pairs.
Length bias persists and causes judges to prefer long but incorrect CoTs over short correct answers.
Adding MCTS‑augmented data meaningfully improves judge performance.
M-JudgeBench composition and size
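The length-bias and similar-length findings above can be checked on any judge by scoring pairwise accuracy per category. The sketch below assumes a text-only `judge(response_a, response_b)` callable that returns "A" or "B" (image inputs are omitted for brevity) and a hypothetical record layout with `response_a`, `response_b`, `label`, and `category` keys; none of these names come from the benchmark's actual release.

```python
from collections import defaultdict


def evaluate_judge(judge, pairs):
    """Return pairwise accuracy per category.

    `judge` is any callable mapping two responses to "A" or "B".
    Each pair is a dict with keys: response_a, response_b, label, category.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in pairs:
        verdict = judge(p["response_a"], p["response_b"])
        totals[p["category"]] += 1
        hits[p["category"]] += int(verdict == p["label"])
    return {category: hits[category] / totals[category] for category in totals}


def longer_wins_judge(response_a, response_b):
    """A deliberately biased toy judge that always prefers the longer text."""
    return "A" if len(response_a) >= len(response_b) else "B"


toy_pairs = [
    {   # short correct answer vs. long incorrect reasoning
        "response_a": "x = 4",
        "response_b": "First we expand, then we simplify, so x = 5",
        "label": "A",
        "category": "length_bias",
    },
    {   # here the correct response happens to be the longer one
        "response_a": "The chart peaks in 2019, so the answer is 2019",
        "response_b": "Answer: 2017",
        "label": "A",
        "category": "pairwise_cot",
    },
]
scores = evaluate_judge(longer_wins_judge, toy_pairs)
```

A judge that simply prefers longer text scores 0.0 on the toy length-bias category and 1.0 on the other, which is why reporting accuracy per category is more informative than a single overall number.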
Results
M-JudgeBench size: 3,712 contrastive pairs
Open-source pairwise training data: ~142k samples
MCTS-augmented training samples: 13k
Accuracy: gains of up to ~11.6 points overall on M-JudgeBench versus baselines
Who Should Care
What To Try In 7 Days
Run M-JudgeBench (3,712 pairs) against your judge model to find length bias and CoT blind spots.
Add a small MCTS-like synthetic set (≈10–20k structured pairs) to SFT and measure pairwise accuracy lift.
Validate judge outputs on short‑answer vs long‑CoT cases before using them to generate preference data.
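For the second step, a simple way to mix a small synthetic set into an existing SFT pool and report the lift is sketched below. The function names are hypothetical, and plain concatenation with a seeded shuffle is an assumption made here, not a sampling scheme confirmed by the paper.

```python
import random


def mix_training_sets(base_pairs, synthetic_pairs, seed=0):
    """Concatenate both sets and shuffle deterministically (assumed scheme)."""
    mixed = list(base_pairs) + list(synthetic_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed


def augmentation_share(n_base, n_synthetic):
    """Fraction of the mixed set that comes from the synthetic pairs."""
    return n_synthetic / (n_base + n_synthetic)


def accuracy_lift(acc_before, acc_after):
    """Lift in percentage points, given accuracies in [0, 1]."""
    return 100.0 * (acc_after - acc_before)


# With the paper's reported sizes: 13k MCTS pairs mixed into ~142k pairs.
share = augmentation_share(142_000, 13_000)

# Tiny stand-in datasets to show the mixing call.
mixed = mix_training_sets(["base"] * 5, ["mcts"] * 2, seed=0)
```

At the sizes reported in the paper, the synthetic pairs make up about 8.4% of the mixed training set (13k out of roughly 155k), so the augmentation is small relative to the base data.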
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- M-JudgeBench is focused on multimodal reasoning and excludes some safety‑style cases (authors excluded unclear safety items).
- MCTS augmentation size is modest (13k); effects on very large production judges need further validation.
- Benchmark emphasizes process/result judgment but still relies on exact‑match filtering and curated seeds, which may miss some real‑world diversity.
When Not To Use
- If your judge task is purely single‑modal or only cares about final answer correctness, this capability benchmark may be overkill.
- Don't assume MCTS data will fix all errors for very large proprietary judges without further tuning and evaluation.
Failure Modes
- Judges overvalue fluent long CoT and may prefer persuasive but incorrect reasoning.
- Models can fail to detect incidental or subtle process errors when answers are identical.
- Judge performance is tied to base multimodal understanding; weak perception models limit gains.
Core Entities
Models
- SFT
- Qwen3-VL (2B/4B/8B)
- Qwen2.5-VL-7B-Instruct
- Gemini 2.5 Pro
- GPT-4.1
- GPT-5
- GLM-4.5V
- UnifiedReward
- UnifiedReward-Think
- R1-Reward
- InternVL3.5
Metrics
- Accuracy
Datasets
- M-JudgeBench
- MMMU
- MMMU-Pro
- MMStar
- MMReason
- M3CoT
- MathVision
- MathVerse
- Open-source pairwise mixture (~142k)
Benchmarks
- M-JudgeBench
- VL-RewardBench
- MMRB
- JudgeAnything

