A public benchmark that tests whether multimodal LLMs can judge other model outputs across scoring, pairwise, and ranking tasks.

February 7, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides broad, repeatable experiments and human labels; results show clear strengths (pairwise) and weaknesses (scoring, ranking), so apply models conservatively with human checks.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

Why It Matters For Business

MLLMs can speed up and scale human-like pairwise evaluation, but current models are still unreliable at numeric scoring and list ranking. Use them to triage or pre-filter outputs, not to fully automate decisions.

Summary TLDR

The authors build MLLM-as-a-Judge, a multimodal benchmark (4,414 image–instruction pairs) and two curated datasets (HQ and HARD) to measure how well Multimodal LLMs (MLLMs) act as evaluators. They test 11 MLLMs (e.g., GPT‑4V, Gemini, LLaVA, Qwen) across three tasks: Scoring (1–5), Pair Comparison (A, B, or tie), and Batch Ranking (rank all responses). Key results: GPT‑4V closely matches humans on pairwise comparisons (avg ≈0.77 accuracy), but scoring and ranking are much weaker (Pearson ≈0.49 for scores; normalized Levenshtein distance ≈0.36 for ranks). Chain-of-Thought prompting reduces hallucinations but does not consistently improve alignment with humans. Feeding a textual vision description to an LLM can improve judging when the model lacks direct vision.

Problem Statement

There is no public, human‑annotated benchmark that measures how well multimodal LLMs can act as judges across image+instruction tasks. We need to know whether MLLMs can replace or assist humans when judging model outputs, and where they fail.

Main Contribution

A multimodal judge benchmark (MLLM‑as‑a‑Judge) covering 3 judging modes: Scoring, Pair Comparison, and Batch Ranking.

Two released datasets: MLLM‑AS‑A‑JUDGE‑HQ (high human agreement) and MLLM‑AS‑A‑JUDGE‑HARD (hallucinations and inconsistent cases).

Key Findings

MLLMs are reliable at pairwise comparisons but not at scoring or ranking.

Numbers: Pair (no tie) GPT‑4V avg = 0.773; Score Pearson GPT‑4V = 0.490; Batch Levenshtein GPT‑4V = 0.361

Practical Use: Use MLLMs (GPT‑4V best) for head‑to‑head checks (A vs B). Do not rely on them alone to give scalar quality scores or to produce full ranked lists without human checks.

Evidence Ref: Table 2; Table 9
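The three numbers in this finding come from three different metrics. As a rough illustration of what each one measures, here is a minimal standard-library sketch. The function names are hypothetical, and dividing the edit distance by the longer ranking's length is an assumed normalization, not necessarily the one the authors use.

```python
import math
from typing import Sequence


def pair_accuracy(model: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairwise verdicts ("A", "B", "tie") that match the human label."""
    if len(model) != len(human) or not model:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(m == h for m, h in zip(model, human)) / len(model)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between model scores and human scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def normalized_levenshtein(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance between two rankings, divided by the longer length.

    0.0 means identical rankings; larger values mean more disagreement.
    The max-length normalization is an assumption made for this sketch.
    """
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(a), len(b))
```

Note that accuracy and Pearson are "higher is better", while the distance is "lower is better", so the three figures are not directly comparable to each other.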

GPT‑4V agrees with human judgments far more often than the other MLLMs tested.

Numbers: Human agreement GPT‑4V ≈70% average; Pair agreement peaks ≈79%–92% on some datasets

Practical Use: If you have budget for only one automated judge, prioritize GPT‑4V for pairwise evaluation and keep human verification for other tasks.

Evidence Ref: Table 3; Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pair Comparison (no tie) - GPT‑4V average | 0.773 | n/a | n/a | Average across 10 datasets | High alignment with human pairwise decisions | Table 2
Scoring Evaluation Pearson - GPT‑4V average | 0.490 | n/a | n/a | Average across 10 datasets | Moderate correlation with human scores | Table 2

What To Try In 7 Days

Run pairwise A/B checks with GPT‑4V to shortlist best outputs before human review.

Generate a careful image description and feed it to a strong LLM if you must judge without vision models.

Add a short CoT step to your judge pipeline to reduce hallucinations, then sample human checks on failures.
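The first item above, shortlisting by pairwise checks, could be sketched as a round-robin over candidates. This is an illustration only: `shortlist` and `length_judge` are hypothetical names, and `length_judge` is a stand-in so the example runs offline. In practice the `judge` callable would wrap a call to whichever MLLM you use.

```python
from itertools import combinations
from typing import Callable, Sequence

Verdict = str  # expected values: "A", "B", or "tie"


def shortlist(
    candidates: Sequence[str],
    judge: Callable[[str, str], Verdict],
    keep: int = 2,
) -> list[str]:
    """Compare every pair once and return the `keep` candidates with the most wins."""
    wins = {i: 0.0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        verdict = judge(candidates[i], candidates[j])
        if verdict == "A":
            wins[i] += 1
        elif verdict == "B":
            wins[j] += 1
        else:  # tie (or anything unexpected): split the point
            wins[i] += 0.5
            wins[j] += 0.5
    # Sort by wins descending; break ties by original order.
    ranked = sorted(wins, key=lambda idx: (-wins[idx], idx))
    return [candidates[idx] for idx in ranked[:keep]]


def length_judge(a: str, b: str) -> Verdict:
    """Stand-in judge for the demo only: prefers the longer answer."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


if __name__ == "__main__":
    print(shortlist(["a", "ccc", "bb", "dddd"], length_judge, keep=2))
```

A round-robin needs n(n-1)/2 judge calls, so for large candidate sets a cheaper scheme (for example, a single-elimination bracket) may be preferable. The shortlisted outputs still go to human review.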

Agent Features

Planning
Analyze-then-Judge prompting, Chain-of-Thought prompting (CoT)
Tool Use
Vision expert description (textual proxy for image), JSON mode for structured outputs (GPT‑4V)
Frameworks
Analyze-then-Judge, Multi-step CoT
Architectures
Vision-Language Models (encoder/decoder fusion), Multimodal instruction‑tuned LLMs
Collaboration
Human-in-the-loop annotation and agreement checks
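
To make the features above concrete, here is a hedged sketch that combines a vision description, an analyze-then-judge prompt, and a structured JSON verdict. The prompt wording, the JSON field names (`analysis`, `verdict`), and both function names are assumptions made for illustration; they are not taken from the paper or from any vendor API.

```python
import json

ALLOWED_VERDICTS = {"A", "B", "tie"}


def build_prompt(instruction: str, image_description: str,
                 answer_a: str, answer_b: str) -> str:
    """Assemble an analyze-then-judge prompt (illustrative wording only)."""
    return (
        "You are judging two answers to an image-based instruction.\n"
        f"Image description: {image_description}\n"
        f"Instruction: {instruction}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "First analyze both answers, then decide.\n"
        'Reply with JSON only: {"analysis": "<text>", "verdict": "A" | "B" | "tie"}'
    )


def parse_verdict(raw: str) -> dict:
    """Validate the judge's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    analysis = data.get("analysis")
    verdict = data.get("verdict")
    if not isinstance(analysis, str) or not analysis.strip():
        raise ValueError("missing or empty 'analysis'")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return {"analysis": analysis.strip(), "verdict": verdict}
```

Rejecting malformed replies outright, rather than guessing a verdict, keeps parsing failures visible so they can be routed to a human check.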

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmarks cover many datasets but human annotation bias remains (authors were annotators).

Proprietary models (GPT‑4V, Gemini) dominate best results; access cost affects adoption.

When Not To Use

As the sole method for numeric scoring in high‑stakes evaluations.

To fully automate ranked decisions without human sampling.
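
One simple way to keep humans in the loop is to audit a random sample of automated verdicts and track how often they agree with reviewers. The sketch below assumes a 10% default sample rate and uses hypothetical function names; pick a rate that suits your risk tolerance.

```python
import random
from typing import Hashable, Iterable, Mapping


def audit_sample(item_ids: Iterable[Hashable], fraction: float = 0.1,
                 seed: int = 0) -> list:
    """Pick a reproducible random subset of judged items for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    items = list(item_ids)
    if not items:
        return []
    k = max(1, round(len(items) * fraction))  # always review at least one item
    return random.Random(seed).sample(items, k)


def agreement_rate(model_labels: Mapping, human_labels: Mapping) -> float:
    """Share of audited items where the model's label equals the human's."""
    shared = model_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    return sum(model_labels[i] == human_labels[i] for i in shared) / len(shared)
```

If the agreement rate on the audited sample drops below a threshold you set, that is a signal to widen human review rather than trust the automated scores or rankings.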

Failure Modes

Egocentric bias: models prefer their own outputs.

Position bias: preference for answers in a fixed prompt position.
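
A common mitigation for position bias is to ask the judge twice with the answers swapped and accept the verdict only if both runs agree. The following is a minimal sketch of that check, not a procedure described in the source; `debiased_verdict` and the two demo judges are hypothetical.

```python
from typing import Callable

FLIP = {"A": "B", "B": "A", "tie": "tie"}


def debiased_verdict(answer_1: str, answer_2: str,
                     judge: Callable[[str, str], str]) -> str:
    """Judge in both orders; return "A"/"B"/"tie" if consistent, else "inconsistent".

    The returned label refers to the original order (A = answer_1, B = answer_2).
    """
    first = judge(answer_1, answer_2)
    second = judge(answer_2, answer_1)
    # Map the swapped-order verdict back to the original order.
    second_in_original_order = FLIP.get(second)
    if first in FLIP and first == second_in_original_order:
        return first
    return "inconsistent"


def always_first(a: str, b: str) -> str:
    """Demo judge with pure position bias: always picks the first slot."""
    return "A"


def prefers_longer(a: str, b: str) -> str:
    """Demo judge with no position bias: picks the longer answer."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"
```

Items flagged as "inconsistent" are natural candidates for human review. This check costs two judge calls per pair and does not address egocentric bias, which needs a judge different from the model being evaluated.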

Core Entities

Models

GPT-4V, Gemini-Pro-Vision, LLaVA-1.5-13b, LLaVA-1.6-34b, LLaVA-1.6-13b, LLaVA-1.6-7b, Qwen-VL-Max, Qwen-VL-Plus, Qwen-VL-Chat, CogVLM, LLaMA-2-70b, Mixtral-8x7b, ChatUnivi, VideoChat/Video-LLM

Metrics

Pearson similarity (scoring), Accuracy, Normalized Levenshtein distance (batch ranking), Human agreement percentage, Mean Absolute Deviation (MAD), Majority Consistency Criterion (MCC)

Datasets

MS COCO, Conceptual Captions, DiffusionDB, ChartQA, InfographicVQA, MathVista, TextVQA, WIT, CC-3M (concept-balanced), VisIT-Bench, Mind2Web, AesBench, ScienceQA, MM-Vet, Mementos (sequential images)

Benchmarks

MLLM-as-a-Judge, MLLM-AS-A-JUDGE-HQ, MLLM-AS-A-JUDGE-HARD

Context Entities

Models

GPT-3.5, LLaMA, Mixtral

Metrics

BLEU/METEOR/CIDEr (mentioned as insufficient)

Datasets

VQA, ChartQA, InfographicVQA (context in related work)

Benchmarks

MM-Vet, Mementos (used for sequential tests)