M-JudgeBench: a capability-focused multimodal judge benchmark plus Judge‑MCTS data that boosts judge model accuracy with a small synthetic-­

February 28, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Zeyu Chen, Huanjin Yao, Ziwang Zhao, Min Yang

Why It Matters For Business

Better automated judges make preference data and alignment training more reliable; adding a modest set of structured MCTS pairs (thousands) yields large accuracy gains and reduces costly mislabeling.

Summary TLDR

The paper builds M-JudgeBench, a capability-oriented multimodal benchmark of 3,712 contrastive pairs that tests judges on two dimensions: result errors (pairwise Chain‑of‑Thought comparisons and length‑bias cases) and process errors (visual, logical, incidental). It also proposes Judge‑MCTS, an MCTS rollout method that generates four controlled reasoning types (short/long × correct/error). Mixing 13k MCTS pairs into ~142k open-source pairwise samples yields the M-Judger models. On the evaluated models, M-Judger variants improve overall pairwise accuracy on M-JudgeBench by several points, up to ~11.6 points, over their baselines, with the clearest gains in length-bias resistance and CoT discrimination. The code
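The four reasoning types (short/long × correct/error) can be turned into contrastive pairs in several ways. The sketch below shows one plausible scheme, not the authors' pipeline: the `Trajectory` class, the `build_pairs` helper, and the `pair_type` labels are hypothetical names introduced for illustration, and the MCTS rollout itself is assumed to have already produced the labelled trajectories.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trajectory:
    """One reasoning trajectory, assumed to come from an MCTS rollout."""
    text: str
    length: str    # "short" or "long"
    correct: bool  # whether the final answer is correct


def build_pairs(trajectories):
    """Pair every correct trajectory with every erroneous one (hypothetical scheme)."""
    good = [t for t in trajectories if t.correct]
    bad = [t for t in trajectories if not t.correct]
    pairs = []
    for chosen, rejected in product(good, bad):
        pairs.append({
            "chosen": chosen.text,
            "rejected": rejected.text,
            # e.g. "short_correct_vs_long_error" is a length-bias style pair
            "pair_type": f"{chosen.length}_correct_vs_{rejected.length}_error",
        })
    return pairs


# Toy trajectories for a single question; real ones would be model outputs.
rollouts = [
    Trajectory("A: 4", "short", True),
    Trajectory("Step 1 ... so the answer is 4", "long", True),
    Trajectory("A: 5", "short", False),
    Trajectory("Step 1 ... so the answer is 5", "long", False),
]
pairs = build_pairs(rollouts)
for p in pairs:
    print(p["pair_type"])
```

Tagging each pair with its type makes it possible to sample length-matched pairs (CoT discrimination) and length-mismatched pairs (length bias) separately.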

Problem Statement

Existing multimodal judge benchmarks group items by task type and final-answer correctness, so they miss the core abilities human judges rely on: robust cross-style comparison, resistance to length bias, and detection of process-level reasoning errors. As a result, current judge models struggle to tell apart similar‑length Chain‑of‑Thought responses and overvalue fluent, long reasoning.

Main Contribution

M-JudgeBench: a capability-oriented multimodal judge benchmark with 3,712 curated pairwise instances across 10 subtasks, grouped into pairwise CoT, length-bias, and process-error categories.

Judge‑MCTS: an MCTS-based pipeline that synthesizes structured reasoning trajectories (short/long × correct/error) to create fine‑grained pairwise supervision.

M-Judger models: SFT and RL variants trained with 13k MCTS‑augmented samples mixed into ~142k open-source pairs, yielding consistent accuracy gains on M-JudgeBench and existing judge benchmarks.
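To get a feel for how small the synthetic portion is, the sketch below mixes a synthetic set into an open-source set and computes its share. It assumes the ~142k figure excludes the 13k MCTS pairs; `mix_pairs` and `synthetic_share` are hypothetical helpers, and the toy records stand in for real training samples.

```python
import random


def mix_pairs(open_pairs, mcts_pairs, seed=0):
    """Tag each pair with its source, concatenate, and shuffle reproducibly."""
    mixed = [{**p, "source": "open"} for p in open_pairs]
    mixed += [{**p, "source": "mcts"} for p in mcts_pairs]
    random.Random(seed).shuffle(mixed)
    return mixed


def synthetic_share(n_open, n_mcts):
    """Fraction of the mixed training set that is synthetic."""
    return n_mcts / (n_open + n_mcts)


# Sizes reported in the summary (assumed to be disjoint sets).
share = synthetic_share(142_000, 13_000)
print(f"MCTS share of the mixture: {share:.1%}")

# Toy mixture at 1/1000 scale to show the helper in use.
mixed = mix_pairs([{"id": i} for i in range(142)], [{"id": i} for i in range(13)])
```

Under that assumption the MCTS pairs make up roughly one in twelve training samples. Keeping a `source` tag on each record makes later ablations (training with and without the synthetic subset) straightforward.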

Key Findings

Judge models struggle to discriminate similar-length Chain‑of‑Thought pairs.

Numbers: CoT pairwise accuracy ≈50%–70% across models

Length bias persists and causes judges to prefer long but incorrect CoTs over short correct answers.

Numbers: Many models show random or imbalanced preferences; for example, baselines prefer the long CoT in length-bias tests

Adding MCTS‑augmented data meaningfully improves judge performance.

Numbers: Qwen3‑VL‑8B overall accuracy 50.78 → 62.42 (+11.6 points) on M-JudgeBench

M-JudgeBench composition and size

Numbers: 3,712 total pairs = 1,364 CoT + 1,610 length bias + 738 process error
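The composition figures can be checked with a few lines of arithmetic. The snippet below only uses the counts reported above; the dictionary keys are illustrative labels, not official subtask names.

```python
# Pair counts per category, as reported in the summary.
composition = {"cot": 1364, "length_bias": 1610, "process_error": 738}

total = sum(composition.values())
shares = {name: count / total for name, count in composition.items()}

print(f"total pairs: {total}")
for name, frac in shares.items():
    print(f"{name}: {frac:.1%}")
```

The three categories sum to the stated 3,712 pairs, with length-bias cases forming the largest group and process-error cases the smallest.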

Results

M-JudgeBench size

Value: 3,712 pairs (1,364 CoT, 1,610 length-bias, 738 process-error)

Accuracy (CoT pairwise, across models)

Value: ≈50%–70% pairwise accuracy

Open-source pairwise training data

Value: ≈142k pairs

MCTS-augmented training samples

Value: ≈13k pairs

Accuracy (Qwen3‑VL‑8B, overall)

Value: 50.78 → 62.42

Baseline: 50.78 (Qwen3‑VL‑8B)

Accuracy (Qwen3‑VL‑4B, overall)

Value: 50.00 → 60.96

Baseline: 50.00 (Qwen3‑VL‑4B)
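As a quick sanity check on the reported gains, the snippet below recomputes the absolute and relative improvements from the before/after accuracies listed above. Nothing here goes beyond those four numbers; the variable names are illustrative.

```python
# (baseline accuracy, M-Judger accuracy) in percent, as reported above.
results = {
    "Qwen3-VL-8B": (50.78, 62.42),
    "Qwen3-VL-4B": (50.00, 60.96),
}

gains = {}
for name, (base, tuned) in results.items():
    gain = round(tuned - base, 2)  # absolute gain in percentage points
    gains[name] = gain
    print(f"{name}: +{gain:.2f} points ({gain / base:.1%} relative to baseline)")
```

Both baselines sit close to 50%, which is what random guessing would score if one assumes a forced two-way choice on balanced pairs, so the gains move the models from near-chance to clearly above it.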

Who Should Care

What To Try In 7 Days

Run M-JudgeBench (3,712 pairs) against your judge model to find length bias and CoT blind spots.

Add a small MCTS-like synthetic set (≈10–20k structured pairs) to SFT and measure pairwise accuracy lift.

Validate judge outputs on short‑answer vs long‑CoT cases before using them to generate preference data.
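The first and third steps above can be prototyped with a small harness like the one below. It is a minimal sketch under stated assumptions: the record fields (`subtask`, `question`, `chosen`, `rejected`) are a hypothetical schema rather than the benchmark's actual format, `judge` is any callable returning `"A"` or `"B"`, and character count is used as a crude proxy for response length.

```python
from collections import defaultdict


def evaluate(judge, records):
    """Return per-subtask accuracy and the rate of picking the longer response."""
    hits, totals = defaultdict(int), defaultdict(int)
    picked_longer = comparisons = 0
    for r in records:
        # Show each pair in both orders so position bias cannot inflate accuracy.
        for swap in (False, True):
            if swap:
                first, second, expected = r["rejected"], r["chosen"], "B"
            else:
                first, second, expected = r["chosen"], r["rejected"], "A"
            verdict = judge(r["question"], first, second)
            hits[r["subtask"]] += int(verdict == expected)
            totals[r["subtask"]] += 1
            if len(first) != len(second):
                winner = first if verdict == "A" else second
                picked_longer += int(len(winner) == max(len(first), len(second)))
                comparisons += 1
    accuracy = {k: hits[k] / totals[k] for k in totals}
    long_rate = picked_longer / comparisons if comparisons else float("nan")
    return accuracy, long_rate


def longest_wins(question, a, b):
    """Toy judge with maximal length bias: always prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


records = [
    {"subtask": "length_bias", "question": "2+2?",
     "chosen": "4", "rejected": "First, 2+2 is 5 because of carrying."},
    {"subtask": "cot", "question": "2+2?",
     "chosen": "2 plus 2 equals 4, so the answer is 4.", "rejected": "The answer is 5."},
]
accuracy, long_rate = evaluate(longest_wins, records)
print(accuracy, long_rate)
```

A judge whose long-preference rate sits far from 50% on a set balanced between long-correct and short-correct pairs is likely reacting to length rather than correctness, which is worth knowing before its verdicts are used to generate preference data.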

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • M-JudgeBench focuses on multimodal reasoning; the authors excluded safety‑style items they judged unclear.
  • MCTS augmentation size is modest (13k); effects on very large production judges need further validation.
  • Benchmark emphasizes process/result judgment but still relies on exact‑match filtering and curated seeds, which may miss some real‑world diversity.

When Not To Use

  • If your judge task is purely single‑modal or only cares about final answer correctness, this capability benchmark may be overkill.
  • Don't assume MCTS data will fix all errors for very large proprietary judges without further tuning and evaluation.

Failure Modes

  • Judges overvalue fluent long CoT and may prefer persuasive but incorrect reasoning.
  • Models can fail to detect incidental or subtle process errors when answers are identical.
  • Judge performance is tied to base multimodal understanding; weak perception models limit gains.

Core Entities

Models

  • SFT
  • Qwen3-VL (2B/4B/8B)
  • Qwen2.5-VL-7B-Instruct
  • Gemini 2.5 Pro
  • GPT-4.1
  • GPT-5
  • GLM-4.5V
  • UnifiedReward
  • UnifiedReward-Think
  • R1-Reward
  • InternVL3.5

Metrics

  • Accuracy

Datasets

  • M-JudgeBench
  • MMMU
  • MMMU-Pro
  • MMStar
  • MMReason
  • M3CoT
  • MathVision
  • MathVerse
  • Open-source pairwise mixture (~142k)

Benchmarks

  • M-JudgeBench
  • VL-RewardBench
  • MMRB
  • JudgeAnything