M-JudgeBench: a capability-focused multimodal judge benchmark plus Judge‑MCTS data that boosts judge model accuracy with a small synthetic-­

February 28, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides a concrete benchmark, data counts (3,712 benchmark pairs; 13k MCTS samples; 142k open training pairs), and multi‑table evaluations showing consistent accuracy gains across backbones; results are reproducible in principle though some training details will be released later.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Zeyu Chen, Huanjin Yao, Ziwang Zhao, Min Yang

Links

Abstract / PDF / Code

Why It Matters For Business

Better automated judges make preference data and alignment training more reliable. In the paper, adding a modest set of structured MCTS pairs (13k) raises judge accuracy on M-JudgeBench by up to ~11.6 points, which can reduce costly mislabeling.

Summary TLDR

The paper builds M-JudgeBench, a capability-oriented multimodal benchmark (3,712 contrastive pairs) that tests judges on two dimensions: result errors (pairwise Chain‑of‑Thought comparisons and length‑bias cases) and process errors (visual, logical, incidental). It proposes Judge‑MCTS, a Monte Carlo Tree Search (MCTS) rollout method that generates four controlled reasoning types (short/long × correct/error). Training on ~142k open-source pairwise samples augmented with 13k MCTS pairs yields the M-Judger models. Across the evaluated backbones, M-Judger variants raise overall pairwise accuracy on M-JudgeBench by several points, up to ~11.6 points over their baselines, with notable improvements in length-bias resistance and CoT discrimination. The code

Problem Statement

Existing multimodal judge benchmarks group by task type and final-answer correctness, but they miss the core judging abilities humans use: robust cross-style comparison, resistance to length bias, and detection of process-level reasoning errors. As a result, current judge models confuse similar‑length Chain‑of‑Thoughts and overvalue fluent long reasoning.

Main Contribution

M-JudgeBench: a capability-oriented multimodal judge benchmark with 3,712 curated pairwise instances across 10 subtasks (pairwise CoT, length-bias, process errors).

Judge‑MCTS: an MCTS-based pipeline that synthesizes structured reasoning trajectories (short/long × correct/error) to create fine‑grained pairwise supervision.
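To make the short/long × correct/error idea concrete, here is a minimal sketch of how four trajectory types for one question could be turned into pairwise supervision. It does not perform any tree search and is not the authors' pipeline: the `Trajectory` class, the `build_pairs` function, and the rule "pair every correct trajectory with every erroneous one" are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trajectory:
    """One reasoning trace (hypothetical schema, not from the paper)."""
    question_id: str
    text: str
    length: str    # "short" or "long"
    correct: bool  # True if the reasoning and answer are correct


def build_pairs(trajectories):
    """Pair each correct trajectory with each erroneous one per question.

    Assumed pairing rule for illustration only. Pairs are tagged as
    "same_length" (CoT-vs-CoT style) or "length_contrast" (length-bias style).
    """
    by_question = {}
    for t in trajectories:
        by_question.setdefault(t.question_id, []).append(t)

    pairs = []
    for qid, group in by_question.items():
        good = [t for t in group if t.correct]
        bad = [t for t in group if not t.correct]
        for chosen, rejected in product(good, bad):
            same = chosen.length == rejected.length
            pairs.append({
                "question_id": qid,
                "chosen": chosen.text,
                "rejected": rejected.text,
                "chosen_length": chosen.length,
                "rejected_length": rejected.length,
                "kind": "same_length" if same else "length_contrast",
            })
    return pairs
```

With one trajectory of each type for a question, this yields four pairs: two same-length contrasts and two length contrasts, including the short-correct versus long-error case that the length-bias subtasks target.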

Key Findings

Judge models struggle to discriminate similar-length Chain‑of‑Thought pairs.

Numbers: CoT pairwise accuracy ≈50%–70% across models

Practical Use: Don't rely on off‑the‑shelf judge models to spot subtle reasoning errors; add capability‑focused training or test with same‑style CoT pairs.

Evidence Ref: Main text, Section 4.2.1; Table 1

Length bias persists and causes judges to prefer long but incorrect CoTs over short correct answers.

Numbers: Many models show random or imbalanced preferences; for example, some baselines prefer the long CoT in the length-bias tests

Practical Use: When using judge scores in training, explicitly test and correct length bias (e.g., include length-contrast pairs).

Evidence Ref: Section 4.2.1 discussion; Table 1
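One way to check for length bias in your own judge is to log its verdicts on pairs whose two responses differ in length and count how often it picks the longer one, and how often that choice is also wrong. The sketch below assumes a hypothetical record schema (`len_a`, `len_b`, `label`, `judged`); it is not the benchmark's official scoring script.

```python
def length_bias_report(records):
    """Summarize judge behaviour on pairs with unequal response lengths.

    Each record is a dict with hypothetical keys:
      len_a, len_b : token (or character) counts of responses A and B
      label        : "A" or "B", the response that is actually correct
      judged       : "A" or "B", the response the judge preferred
    """
    # Equal-length pairs carry no length signal, so they are skipped.
    usable = [r for r in records if r["len_a"] != r["len_b"]]
    if not usable:
        raise ValueError("need at least one pair with unequal lengths")

    correct = picked_longer = wrong_and_longer = 0
    for r in usable:
        longer = "A" if r["len_a"] > r["len_b"] else "B"
        chose_longer = r["judged"] == longer
        is_correct = r["judged"] == r["label"]
        correct += is_correct
        picked_longer += chose_longer
        wrong_and_longer += chose_longer and not is_correct

    n = len(usable)
    return {
        "n": n,
        "accuracy": correct / n,
        "picked_longer": picked_longer / n,
        "wrong_and_longer": wrong_and_longer / n,
    }
```

A `picked_longer` rate far from the share of pairs in which the longer response is actually correct, combined with a high `wrong_and_longer` rate, suggests the judge is rewarding length rather than correctness.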

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
M-JudgeBench size | 3,712 pairs (1,364 CoT, 1,610 length-bias, 738 process-error) | - | - | M-JudgeBench | Section 2.3 | Section 2.3
Accuracy | ≈50%–70% pairwise accuracy | - | - | Pairwise CoT comparison (M-JudgeBench) | Section 4.2.1; Table 1 | Table 1

What To Try In 7 Days

Run M-JudgeBench (3,712 pairs) against your judge model to find length bias and CoT blind spots.

Add a small MCTS-like synthetic set (≈10–20k structured pairs) to SFT and measure pairwise accuracy lift.

Validate judge outputs on short‑answer vs long‑CoT cases before using them to generate preference data.
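The steps above reduce to scoring a judge's A/B verdicts per subtask and comparing a baseline against a tuned model. The following is a minimal sketch under stated assumptions: the pair schema (`prompt`, `response_a`, `response_b`, `label`, `subtask`) and the `judge` callable are hypothetical placeholders, not the benchmark's released format or API.

```python
from collections import defaultdict


def pairwise_accuracy(pairs, judge):
    """Return accuracy per subtask plus an "overall" entry.

    `judge(prompt, response_a, response_b)` is a caller-supplied function
    that returns "A" or "B" (assumed interface).
    """
    if not pairs:
        raise ValueError("no pairs to evaluate")
    hits = defaultdict(int)
    counts = defaultdict(int)
    for p in pairs:
        verdict = judge(p["prompt"], p["response_a"], p["response_b"])
        counts[p["subtask"]] += 1
        hits[p["subtask"]] += verdict == p["label"]
    report = {name: hits[name] / counts[name] for name in counts}
    report["overall"] = sum(hits.values()) / sum(counts.values())
    return report


def accuracy_lift(baseline, tuned):
    """Difference in percentage points (tuned minus baseline) per shared key."""
    return {k: round(100 * (tuned[k] - baseline[k]), 1)
            for k in baseline if k in tuned}
```

Reporting the lift per subtask rather than only overall makes it visible whether a gain comes from the length-bias cases, the same-length CoT comparisons, or the process-error cases.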

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

M-JudgeBench is focused on multimodal reasoning and excludes some safety‑style cases (authors excluded unclear safety items).

MCTS augmentation size is modest (13k); effects on very large production judges need further validation.
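To put the 13k figure in proportion, the sketch below mixes a synthetic set into a larger open-source set and reports the synthetic share. The function name, the `source` tag, and the shuffle-then-train setup are assumptions for illustration; the paper's actual mixing and sampling procedure is not described in this summary.

```python
import random


def mix_training_sets(open_pairs, mcts_pairs, seed=0):
    """Concatenate two lists of pair dicts, tag their origin, and shuffle.

    Returns the mixed list and the fraction of examples that are synthetic.
    """
    mixed = [dict(p, source="open") for p in open_pairs]
    mixed += [dict(p, source="mcts") for p in mcts_pairs]
    random.Random(seed).shuffle(mixed)  # seeded for repeatable ordering
    share = len(mcts_pairs) / len(mixed) if mixed else 0.0
    return mixed, share
```

With the approximate counts reported here (13k synthetic, ~142k open-source), the synthetic share is 13 / 155, or roughly 8% of the training mixture.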

When Not To Use

If your judge task is purely single‑modal or only cares about final answer correctness, this capability benchmark may be overkill.

Don't assume MCTS data will fix all errors for very large proprietary judges without further tuning and evaluation.

Failure Modes

Judges overvalue fluent long CoT and may prefer persuasive but incorrect reasoning.

Models can fail to detect incidental or subtle process errors when answers are identical.

Core Entities

Models

SFT, Qwen3-VL (2B/4B/8B), Qwen2.5-VL-7B-Instruct, Gemini 2.5 Pro, GPT-4.1, GPT-5, GLM-4.5V, Unified Reward, UnifiedReward-Think, R1-Reward, InternVL3.5

Metrics

Accuracy

Datasets

M-JudgeBench, MMMU, MMMU-Pro, MMStar, MMReason, M3CoT, MathVision, MathVerse, Open-source pairwise mixture (~142k)

Benchmarks

M-JudgeBench, VL-RewardBench, MMRB, JudgeAnything