A public benchmark that tests whether multimodal LLMs can judge other model outputs across scoring, pairwise, and ranking tasks.

Overview

Decision Snapshot: Needs Validation

The paper provides broad, repeatable experiments and human labels; results show clear strengths (pairwise) and weaknesses (scoring, ranking), so apply models conservatively with human checks.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MLLMs can speed and scale human-like pairwise evaluation, but current models still fail at reliable numeric scoring and list ranking; use them to triage or pre-filter outputs, not to fully automate decisions.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors build MLLM-as-a-Judge, a multimodal benchmark (4,414 image–instruction pairs) and two curated datasets (HQ and HARD) to measure how well Multimodal LLMs (MLLMs) act as evaluators. They test 11 MLLMs (e.g., GPT‑4V, Gemini, LLaVA, Qwen) across three tasks: Scoring (1–5), Pair Comparison (A vs B vs tie), and Batch Ranking (rank all). Key results: GPT‑4V closely matches humans on pairwise comparisons (avg ≈0.77 accuracy), but scoring and ranking are much weaker (Pearson ≈0.49 for scores; Levenshtein ≈0.36 for ranks). Chain-of-Thought reduces hallucinations but does not consistently improve alignment. Vision descriptions fed to LLMs can boost judging when models lack direct vision.
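The three metrics named above can be illustrated with a minimal, standard-library sketch. It assumes that rankings are encoded as strings of candidate labels and that the Levenshtein distance is normalized by the longer string's length; the paper's exact normalization may differ. The function names are illustrative, not taken from the benchmark code. Because the Levenshtein figure is a distance, lower values mean closer agreement with the human ranking.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)

def pair_accuracy(model_choices, human_choices):
    """Fraction of pairwise verdicts ("A", "B", "tie") that match the human label."""
    if len(model_choices) != len(human_choices) or not human_choices:
        raise ValueError("need two equal-length, non-empty lists")
    hits = sum(m == h for m, h in zip(model_choices, human_choices))
    return hits / len(human_choices)

def normalized_levenshtein(a, b):
    """Edit distance between two ranking strings, divided by the longer length (assumed normalization)."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b))
```

For example, a model ranking "ABDC" against a human ranking "ABCD" needs two substitutions, giving a normalized distance of 0.5.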

Problem Statement

There is no public, human‑annotated benchmark that measures how well multimodal LLMs can act as judges across image+instruction tasks. We need to know whether MLLMs can replace or assist humans when judging model outputs, and where they fail.

Main Contribution

A multimodal judge benchmark (MLLM‑as‑a‑Judge) covering 3 judging modes: Scoring, Pair Comparison, and Batch Ranking.

Two released datasets: MLLM‑AS‑A‑JUDGE‑HQ (high human agreement) and MLLM‑AS‑A‑JUDGE‑HARD (hallucinations and inconsistent cases).

Key Findings

MLLMs are reliable at pairwise comparisons but not at scoring or ranking.

Numbers: Pair (no tie) GPT‑4V avg=0.773; Score Pearson GPT‑4V=0.490; Batch Levenshtein GPT‑4V=0.361

Practical Use: Use MLLMs (GPT‑4V best) for head‑to‑head checks (A vs B). Do not rely on them alone to give scalar quality scores or to produce full ranked lists without human checks.

Evidence Ref: Table 2; Table 9

GPT‑4V agrees with human judgments far more often than the other MLLMs tested.

Numbers: Human agreement GPT‑4V ≈70% average; Pair agreement peaks ≈79%–92% on some datasets

Practical Use: If you have budget for only one automated judge, prioritize GPT‑4V for pairwise evaluation and keep human verification for the other tasks.

Evidence Ref: Table 3; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pair Comparison (no tie) - GPT‑4V average	0.773	—	—	Average across 10 datasets	High alignment with human pairwise decisions	Table 2
Scoring Evaluation Pearson - GPT‑4V average	0.490	—	—	Average across 10 datasets	Moderate correlation with human scores	Table 2

What To Try In 7 Days

Run pairwise A/B checks with GPT‑4V to shortlist best outputs before human review.

Generate a careful image description and feed it to a strong LLM if you must judge without vision models.

Add a short CoT step to your judge pipeline to reduce hallucinations, then sample human checks on failures.
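The shortlisting step in the list above could look roughly like the following sketch. It runs a round-robin of pairwise comparisons and keeps the top-k candidates by win count for human review. The `judge` callable is hypothetical: it stands in for whatever wraps your MLLM call and is assumed to return "A", "B", or "tie". `length_judge` is only a toy stand-in so the example runs without any API.

```python
from itertools import combinations

def shortlist(candidates, judge, k=2):
    """Rank candidates by round-robin pairwise wins and return the top k.

    `judge(a, b)` is a hypothetical callable returning "A", "B", or "tie".
    A tie gives half a point to each side.
    """
    wins = [0.0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        verdict = judge(candidates[i], candidates[j])
        if verdict == "A":
            wins[i] += 1
        elif verdict == "B":
            wins[j] += 1
        else:
            wins[i] += 0.5
            wins[j] += 0.5
    # Stable sort: candidates with equal wins keep their input order.
    order = sorted(range(len(candidates)), key=lambda idx: wins[idx], reverse=True)
    return [candidates[idx] for idx in order[:k]]

def length_judge(a, b):
    """Toy stand-in for an MLLM judge: prefers the longer answer."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"
```

A round-robin needs n(n-1)/2 judge calls, so this is only practical for small candidate sets; the shortlisted outputs still go to a human reviewer.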

Agent Features

Planning

Analyze-then-Judge prompting, Chain-of-Thought prompting (CoT)

Tool Use

Vision expert description (textual proxy for image), JSON mode for structured outputs (GPT‑4V)

Frameworks

Analyze-then-Judge, Multi-step CoT

Architectures

Vision-Language Models (encoder/decoder fusion), Multimodal instruction‑tuned LLMs

Collaboration

Human-in-the-loop annotation and agreement checks
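As an illustration of how the Analyze-then-Judge prompting and structured JSON output listed above might fit together, here is a minimal sketch. The prompt wording and the `analysis` and `verdict` field names are assumptions made for this example, not the paper's actual prompt or schema. The parser returns `None` for anything malformed so that those cases can be routed to a human.

```python
import json

VALID_VERDICTS = {"A", "B", "tie"}

def build_prompt(instruction, answer_a, answer_b):
    """Build an analyze-then-judge prompt (wording is illustrative only)."""
    return (
        "You are judging two answers to an image-based instruction.\n"
        f"Instruction: {instruction}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "First write a short analysis, then give a verdict.\n"
        'Reply with JSON only: {"analysis": "...", "verdict": "A" | "B" | "tie"}'
    )

def parse_verdict(raw):
    """Return "A", "B", or "tie" from a JSON reply, or None if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    verdict = data.get("verdict")
    if isinstance(verdict, str) and verdict in VALID_VERDICTS:
        return verdict
    return None
```

Asking for the analysis before the verdict keeps the reasoning in the same reply, and strict parsing avoids silently accepting free-text answers.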

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://mllm-judge.github.io/

Data URLs

https://mllm-judge.github.io/

Risks & Boundaries

Limitations

Benchmarks cover many datasets but human annotation bias remains (authors were annotators).

Proprietary models (GPT‑4V, Gemini) dominate best results; access cost affects adoption.

When Not To Use

As the sole method for numeric scoring in high‑stakes evaluations.

To fully automate ranked decisions without human sampling.
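One simple way to keep humans in the loop is to audit a fixed, reproducible fraction of the automated judgments. The sketch below is an assumption about how that sampling could be set up; the 10% default rate, the seed, and the function name are illustrative and do not come from the paper.

```python
import random

def pick_for_human_review(item_ids, rate=0.1, seed=0):
    """Select a reproducible random subset of judged items for human audit.

    `rate` is the fraction to audit (illustrative default); at least one
    item is chosen whenever the input is non-empty.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    ids = list(item_ids)
    if not ids:
        return []
    n = max(1, round(len(ids) * rate))
    rng = random.Random(seed)  # fixed seed so the audit set can be regenerated
    return sorted(rng.sample(ids, n))
```

If human and model verdicts disagree often on the audited subset, that is a signal to raise the audit rate or stop relying on the automated judge for that task.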

Failure Modes

Egocentric bias: models prefer their own outputs.

Position bias: preference for answers in a fixed prompt position.
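Position bias can be probed by asking the judge twice with the answers swapped and accepting the verdict only when both orders agree. This swap check is a common mitigation and is offered here as a sketch, not as the procedure the authors used. It does not address egocentric bias. The `judge` callable is hypothetical and is assumed to return "A", "B", or "tie" for the answers in the order given.

```python
def flip(verdict):
    """Map a verdict from swapped order back to the original order."""
    return {"A": "B", "B": "A"}.get(verdict, verdict)

def debiased_verdict(judge, first, second):
    """Query both orders; return the verdict only if the two runs agree."""
    forward = judge(first, second)
    backward = flip(judge(second, first))
    return forward if forward == backward else "inconsistent"

def always_first(a, b):
    """Toy judge with maximal position bias: always picks the first slot."""
    return "A"

def longer_wins(a, b):
    """Toy judge with no position bias: prefers the longer answer."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"
```

With these toy judges, `always_first` yields "inconsistent" for any pair, while `longer_wins` returns the same winner in both orders. Items flagged as inconsistent are natural candidates for human review.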

Core Entities

Models

GPT-4V, Gemini-Pro-Vision, LLaVA-1.5-13b, LLaVA-1.6-34b, LLaVA-1.6-13b, LLaVA-1.6-7b, Qwen-VL-Max, Qwen-VL-Plus, Qwen-VL-Chat, CogVLM, LLaMA-2-70b, Mixtral-8x7b, ChatUnivi, VideoChat/Video-LLM

Metrics

Pearson similarity (scoring), Accuracy, Normalized Levenshtein distance (batch ranking), Human agreement percentage, Mean Absolute Deviation (MAD), Majority Consistency Criterion (MCC)

Datasets

MS COCO, Conceptual Captions, DiffusionDB, ChartQA, InfographicVQA, MathVista, TextVQA, WIT, CC-3M (concept-balanced), VisIT-Bench, Mind2Web, AesBench, ScienceQA, MM-Vet, Mementos (sequential images)

Benchmarks

MLLM-as-a-Judge, MLLM-AS-A-JUDGE-HQ, MLLM-AS-A-JUDGE-HARD

Context Entities

Models

GPT-3.5, LLaMA, Mixtral

Metrics

BLEU/METEOR/CIDEr (mentioned as insufficient)

Datasets

VQA, ChartQA, InfographicVQA (context in related work)

Benchmarks

MM-Vet, Mementos (used for sequential tests)
