M-JudgeBench: a capability-focused multimodal judge benchmark plus Judge‑MCTS data that boosts judge model accuracy with a small synthetic-

Overview

Decision Snapshot: Ready For Pilot

The paper provides a concrete benchmark, data counts (3,712 benchmark pairs; 13k MCTS samples; 142k open training pairs), and multi‑table evaluations showing consistent accuracy gains across backbones; results are reproducible in principle though some training details will be released later.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Zeyu Chen, Huanjin Yao, Ziwang Zhao, Min Yang

Why It Matters For Business

Better automated judges make preference data and alignment training more reliable; adding a modest set of structured MCTS pairs (thousands) yields large accuracy gains and reduces costly mislabeling.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper builds M-JudgeBench, a capability-oriented multimodal benchmark (3,712 contrastive pairs) that tests judges on two dimensions: result errors (pairwise Chain‑of‑Thought comparisons and length‑bias cases) and process errors (visual, logical, incidental). It proposes Judge‑MCTS, an MCTS rollout method that generates four controlled reasoning types (short/long × correct/error). Adding 13k MCTS pairs to ~142k open-source pairwise samples and training on the mixture yields the M-Judger models. On the evaluated backbones, M-Judger variants improve overall pairwise judge accuracy on M-JudgeBench over their baselines by several points, up to about 11.6 points, and in particular reduce length bias and improve CoT discrimination. The code

Problem Statement

Existing multimodal judge benchmarks group by task type and final-answer correctness, but they miss the core judging abilities humans use: robust cross-style comparison, resistance to length bias, and detection of process-level reasoning errors. As a result, current judge models confuse similar‑length Chain‑of‑Thoughts and overvalue fluent long reasoning.

Main Contribution

M-JudgeBench: a capability-oriented multimodal judge benchmark with 3,712 curated pairwise instances across 10 subtasks (pairwise CoT, length-bias, process errors).

Judge‑MCTS: an MCTS-based pipeline that synthesizes structured reasoning trajectories (short/long × correct/error) to create fine‑grained pairwise supervision.
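The four trajectory types (short/long × correct/error) can be illustrated with a minimal sketch. It does not implement the tree search itself; it only shows how already-generated rollouts might be bucketed and paired. The names `Rollout`, `bucket_rollouts`, `make_pairs`, and the `length_threshold` cutoff are hypothetical, and the all-against-all pairing rule is an assumption rather than the authors' procedure.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rollout:
    text: str
    is_correct: bool  # assumption: final answer was checked against ground truth
    num_tokens: int   # assumption: length is measured in tokens


def bucket_rollouts(rollouts, length_threshold):
    """Group rollouts into the four types: (short|long, correct|error)."""
    buckets = {key: [] for key in product(("short", "long"), ("correct", "error"))}
    for r in rollouts:
        length = "short" if r.num_tokens <= length_threshold else "long"
        outcome = "correct" if r.is_correct else "error"
        buckets[(length, outcome)].append(r)
    return buckets


def make_pairs(buckets):
    """Pair each correct rollout (chosen) with each error rollout (rejected)."""
    pairs = []
    for good_len, bad_len in product(("short", "long"), repeat=2):
        for good in buckets[(good_len, "correct")]:
            for bad in buckets[(bad_len, "error")]:
                pairs.append({
                    "chosen": good.text,
                    "rejected": bad.text,
                    "kind": f"{good_len}-correct vs {bad_len}-error",
                })
    return pairs


# Toy rollouts for a single question (illustrative only).
rollouts = [
    Rollout("A: 4", True, 3),
    Rollout("Step 1 ... so the answer is 4", True, 40),
    Rollout("A: 5", False, 3),
    Rollout("Step 1 ... so the answer is 5", False, 42),
]
pairs = make_pairs(bucket_rollouts(rollouts, length_threshold=16))
for p in pairs:
    print(p["kind"])
```

The "short-correct vs long-error" pairs are the ones that target length bias, while same-length pairs target CoT discrimination.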

Key Findings

Judge models struggle to discriminate similar-length Chain‑of‑Thought pairs.

Numbers: CoT pairwise accuracy ≈50%–70% across models

Practical Use: Don't rely on off‑the‑shelf judge models to spot subtle reasoning errors; add capability‑focused training or test with same‑style CoT pairs.

Evidence Ref: Main text, Section 4.2.1; Table 1

Length bias persists and causes judges to prefer long but incorrect CoTs over short correct answers.

Numbers: Many models show random/imbalanced preferences; example baselines prefer long CoT in length tests

Practical Use: When using judge scores in training, explicitly test and correct length bias (e.g., include length-contrast pairs).

Evidence Ref: Section 4.2.1 discussion; Table 1
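Both findings can be checked on your own judge with a small harness. The sketch below is a minimal illustration, assuming a judge exposed as a callable that returns "a" or "b"; the field names (`question`, `response_a`, `response_b`, `label`), the `evaluate_judge` helper, and the use of character count as the length measure are all assumptions, not the benchmark's actual schema.

```python
def evaluate_judge(judge, pairs):
    """Return pairwise accuracy and how often the judge picks the longer response."""
    correct = 0
    picked_longer = 0
    length_contrast = 0  # pairs whose two responses differ in length
    for pair in pairs:
        choice = judge(pair["question"], pair["response_a"], pair["response_b"])
        correct += choice == pair["label"]
        len_a, len_b = len(pair["response_a"]), len(pair["response_b"])
        if len_a != len_b:
            length_contrast += 1
            longer = "a" if len_a > len_b else "b"
            picked_longer += choice == longer
    return {
        "accuracy": correct / len(pairs),
        "prefers_longer": picked_longer / length_contrast if length_contrast else None,
    }


def longest_wins(question, response_a, response_b):
    """Toy stand-in for a length-biased judge: always prefers the longer text."""
    return "a" if len(response_a) >= len(response_b) else "b"


# Two toy length-contrast pairs: the short answer is correct in both.
pairs = [
    {"question": "2 + 2 = ?", "response_a": "4",
     "response_b": "First add 2 and 2, carry the one, so the answer is 5.",
     "label": "a"},
    {"question": "3 * 3 = ?",
     "response_a": "Multiply 3 by 3, then add 1 for rounding, giving 10.",
     "response_b": "9", "label": "b"},
]
report = evaluate_judge(longest_wins, pairs)
print(report)
```

A `prefers_longer` rate far from 0.5 on pairs where correctness is balanced across lengths suggests the judge is reacting to length rather than content.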

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
M-JudgeBench size	3,712 pairs (1,364 CoT, 1,610 length-bias, 738 process-error)	—	—	M-JudgeBench	Section 2.3	Section 2.3
Accuracy	≈50%–70% pairwise accuracy	—	—	Pairwise CoT comparison (M-JudgeBench)	Section 4.2.1; Table 1	Table 1
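The subset counts in the first row add up to the stated total, and their shares matter when reading an overall score: length-bias pairs make up roughly 43% of the benchmark, CoT pairs about 37%, and process-error pairs about 20%. The short sketch below reproduces that arithmetic. The `micro_average` helper assumes the overall accuracy is a pair-weighted average of subset accuracies, which the summary does not confirm.

```python
# Subset sizes as reported for M-JudgeBench.
subsets = {"pairwise CoT": 1364, "length-bias": 1610, "process-error": 738}
total = sum(subsets.values())

shares = {name: count / total for name, count in subsets.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")


def micro_average(per_subset_accuracy):
    """Pair-weighted overall accuracy (assumed aggregation, not confirmed by the source)."""
    return sum(per_subset_accuracy[name] * subsets[name] for name in subsets) / total
```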

What To Try In 7 Days

Run M-JudgeBench (3,712 pairs) against your judge model to find length bias and CoT blind spots.

Add a small MCTS-like synthetic set (≈10–20k structured pairs) to SFT and measure pairwise accuracy lift.

Validate judge outputs on short‑answer vs long‑CoT cases before using them to generate preference data.
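For the second step, the mixing itself is simple; a minimal sketch follows, assuming both sets are already in the same pairwise format. The `mix_training_pairs` helper and the `source` field are hypothetical, and the counts use the paper's rounded figures (~142k open-source pairs, 13k MCTS pairs), under which the synthetic share of the mixture is about 8%.

```python
import random


def mix_training_pairs(base_pairs, synthetic_pairs, seed=0):
    """Concatenate and shuffle so synthetic pairs are spread through the SFT data."""
    mixed = list(base_pairs) + list(synthetic_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed


# Placeholder records sized to the paper's rounded counts.
base = [{"source": "open"} for _ in range(142_000)]
synthetic = [{"source": "mcts"} for _ in range(13_000)]

mixed = mix_training_pairs(base, synthetic)
synthetic_share = sum(p["source"] == "mcts" for p in mixed) / len(mixed)
print(f"synthetic share: {synthetic_share:.1%}")
```

Keeping the `source` tag on each record makes it easy to ablate the synthetic set later and attribute any accuracy lift to it.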

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/czythu/M Judger

Risks & Boundaries

Limitations

M-JudgeBench focuses on multimodal reasoning; the authors excluded safety‑style items they considered unclear.

MCTS augmentation size is modest (13k); effects on very large production judges need further validation.

When Not To Use

If your judge task is purely single‑modal or only cares about final answer correctness, this capability benchmark may be overkill.

Don't assume MCTS data will fix all errors for very large proprietary judges without further tuning and evaluation.

Failure Modes

Judges overvalue fluent long CoT and may prefer persuasive but incorrect reasoning.

Models can fail to detect incidental or subtle process errors when answers are identical.

Core Entities

Models

SFT, Qwen3-VL (2B/4B/8B), Qwen2.5-VL-7B-Instruct, Gemini 2.5 Pro, GPT-4.1, GPT-5, GLM-4.5V, Unified Reward, UnifiedReward-Think, R1-Reward, InternVL3.5

Metrics

Accuracy

Datasets

M-JudgeBench, MMMU, MMMU-Pro, MMStar, MMReason, M3CoT, MathVision, MathVerse, Open-source pairwise mixture (~142k)

Benchmarks

M-JudgeBench, VL-RewardBench, MMRB, JudgeAnything
