A public benchmark that tests whether multimodal LLMs can judge other model outputs across scoring, pairwise, and ranking tasks.

February 7, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

5

Authors

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

Links

Abstract / PDF

Why It Matters For Business

MLLMs can speed up and scale human-like pairwise evaluation, but current models are still unreliable at numeric scoring and list ranking. Use them to triage or pre-filter outputs, not to fully automate decisions.

Summary TLDR

The authors build MLLM-as-a-Judge, a multimodal benchmark of 4,414 image–instruction pairs, plus two curated datasets (HQ and HARD), to measure how well Multimodal LLMs (MLLMs) act as evaluators. They test 11 MLLMs (e.g., GPT‑4V, Gemini, LLaVA, Qwen) on three tasks: Scoring (1–5), Pair Comparison (A vs. B vs. tie), and Batch Ranking (rank all responses). Key results: GPT‑4V closely matches humans on pairwise comparisons (average accuracy ≈0.77), but scoring and ranking are much weaker (Pearson ≈0.49 for scores; normalized Levenshtein distance ≈0.36 for rankings, where lower is better). Chain-of-Thought prompting reduces hallucinations but does not consistently improve alignment with humans. When a model has no direct vision, feeding it a detailed text description of the image improves its judging.

Problem Statement

There is no public, human‑annotated benchmark that measures how well multimodal LLMs can act as judges across image+instruction tasks. We need to know whether MLLMs can replace or assist humans when judging model outputs, and where they fail.

Main Contribution

A multimodal judge benchmark (MLLM‑as‑a‑Judge) covering 3 judging modes: Scoring, Pair Comparison, and Batch Ranking.

Two released datasets: MLLM‑AS‑A‑JUDGE‑HQ (high human agreement) and MLLM‑AS‑A‑JUDGE‑HARD (hallucinations and inconsistent cases).

A broad evaluation of 11 commercial and open MLLMs, plus analysis of biases, hallucinations, consistency, and mitigation with CoT and vision descriptions.
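The three judging modes return differently shaped answers: a 1–5 score, an A/B/tie verdict, and an ordering of all responses. A minimal Python sketch of those shapes follows. The class names and validation rules are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import Tuple

VALID_VERDICTS = ("A", "B", "tie")


@dataclass(frozen=True)
class ScoringJudgment:
    """Scoring mode: a single integer score from 1 to 5."""
    score: int

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")


@dataclass(frozen=True)
class PairJudgment:
    """Pair Comparison mode: which of two responses is better, or a tie."""
    verdict: str

    def __post_init__(self) -> None:
        if self.verdict not in VALID_VERDICTS:
            raise ValueError("verdict must be 'A', 'B', or 'tie'")


@dataclass(frozen=True)
class BatchRanking:
    """Batch Ranking mode: response labels ordered from best to worst."""
    order: Tuple[str, ...]

    def __post_init__(self) -> None:
        if len(set(self.order)) != len(self.order):
            raise ValueError("each response may appear only once")
```

Validating at construction time means a malformed judge reply is rejected before it reaches any metric computation.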

Key Findings

MLLMs are reliable at pairwise comparisons but not at scoring or ranking.

Numbers: Pair (no tie) GPT‑4V avg = 0.773; Score Pearson GPT‑4V = 0.490; Batch Levenshtein GPT‑4V = 0.361

GPT‑4V agrees with human judgments far more often than the other MLLMs tested.

Numbers: Human agreement GPT‑4V ≈70% average; Pair agreement peaks ≈79%–92% on some datasets

Chain‑of‑Thought (CoT) reduces hallucinations but rarely raises human alignment.

Numbers: Hallucination reduction examples: Score 46.15% reduction; Pair 28.21% reduction; Batch 43.59% reduction

Providing a detailed image description to an LLM improves judging over no vision.

Numbers: Score Pearson with vision description (GPT family) up to 0.435 vs. 0.299 with no vision

Common failure modes are systematic: egocentric, position, and length (verbosity) biases plus hallucinations.

Numbers: Length bias gave avg score increases of ≈0.6 (GPT‑4V) and ≈0.75 (Gemini) when answers were lengthened

Results

Pair Comparison (no tie) - GPT‑4V average

Value: 0.773

Scoring Evaluation Pearson - GPT‑4V average

Value: 0.490

Batch Ranking distance (lower better) - GPT‑4V average

Value: 0.361

Human agreement - GPT‑4V average

Value: ≈0.70

Hallucination reduction with extra CoT steps

Value: Score: 46.15% reduction (example)
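The pairwise and scoring figures above are agreement statistics between model judgments and human labels. A rough sketch of how such numbers could be computed is below. The function names are hypothetical, and treating "no tie" as dropping the items that humans labeled a tie is an assumption about the evaluation protocol, not something this summary confirms.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between model scores and human scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def pair_accuracy(model: Sequence[str], human: Sequence[str],
                  include_ties: bool = True) -> float:
    """Share of pairwise verdicts ("A", "B", "tie") that match the human label.

    Assumption: with include_ties=False, items the human labeled "tie"
    are dropped before accuracy is computed.
    """
    pairs = list(zip(model, human))
    if not include_ties:
        pairs = [(m, h) for m, h in pairs if h != "tie"]
    if not pairs:
        raise ValueError("no comparable items")
    return sum(m == h for m, h in pairs) / len(pairs)
```

For example, `pair_accuracy(["A", "B", "tie"], ["A", "A", "tie"], include_ties=False)` scores only the first two items, because the third was a human tie.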

Who Should Care

What To Try In 7 Days

Run pairwise A/B checks with GPT‑4V to shortlist best outputs before human review.

Generate a careful image description and feed it to a strong LLM if you must judge without vision models.

Add a short CoT step to your judge pipeline to reduce hallucinations, then sample human checks on failures.
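The first suggestion, pairwise triage before human review, can be sketched as a round-robin tournament. This is an illustration under stated assumptions: `judge` is a hypothetical callable that wraps whichever MLLM you use and returns "A", "B", or "tie", and `toy_judge` is a placeholder so the example runs without any API access.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# Hypothetical judge signature: (response_a, response_b) -> "A" | "B" | "tie".
Judge = Callable[[str, str], str]


def shortlist(responses: Sequence[str], judge: Judge, k: int) -> List[str]:
    """Compare every pair once and keep the k responses with the most wins."""
    wins = [0.0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        verdict = judge(responses[i], responses[j])
        if verdict == "A":
            wins[i] += 1
        elif verdict == "B":
            wins[j] += 1
        else:  # a tie gives half a win to each side
            wins[i] += 0.5
            wins[j] += 0.5
    # sorted() is stable, so equal win counts keep their original order.
    ranked = sorted(range(len(responses)), key=lambda idx: -wins[idx])
    return [responses[idx] for idx in ranked[:k]]


def toy_judge(a: str, b: str) -> str:
    """Placeholder judge that prefers the shorter response; NOT a real model."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) < len(b) else "B"


candidates = ["aaa", "a", "aa"]
top_two = shortlist(candidates, toy_judge, k=2)
```

Only the shortlisted responses would then go to human reviewers, which keeps the model in a pre-filtering role rather than a decision-making one.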

Agent Features

Planning

  • Analyze-then-Judge prompting
  • Chain-of-Thought prompting (CoT)
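As an illustration of the Analyze-then-Judge idea, the sketch below assembles a pairwise prompt that asks for a short analysis before the verdict. The wording, the function name, and the "Judgement:" output format are all assumptions made for this example and are not the prompts used in the paper.

```python
def build_pair_prompt(instruction: str, answer_a: str, answer_b: str,
                      analyze_first: bool = True) -> str:
    """Assemble an illustrative pairwise judging prompt (hypothetical wording)."""
    lines = [
        "You are judging two answers to an instruction about the attached image.",
        f"Instruction: {instruction}",
        f"Answer A: {answer_a}",
        f"Answer B: {answer_b}",
    ]
    if analyze_first:
        # Analyze-then-Judge: request reasoning before the final verdict.
        lines.append("First write a brief analysis of both answers.")
        lines.append("Then give your verdict on a new line.")
    lines.append('Verdict format: "Judgement: A", "Judgement: B", or "Judgement: tie".')
    return "\n".join(lines)
```

Keeping the analysis step behind a flag makes it easy to compare judgments with and without it on the same items.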

Tool Use

  • Vision expert description (textual proxy for image)
  • JSON mode for structured outputs (GPT‑4V)
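Structured output only helps if the reply is validated before use. The sketch below parses a scoring reply with Python's standard `json` module. The field names `analysis` and `score` are hypothetical and should be adapted to whatever schema your judge prompt requests.

```python
import json
from typing import Any, Dict


def parse_score_reply(raw: str) -> Dict[str, Any]:
    """Parse a JSON judge reply and check that the score is an integer in 1..5.

    Raises ValueError on malformed JSON (json.JSONDecodeError is a
    ValueError subclass) or on a missing or out-of-range score.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    return {"analysis": str(data.get("analysis", "")), "score": score}
```

Replies that fail validation can be retried or routed to human review instead of being silently coerced.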

Frameworks

  • Analyze-then-Judge
  • Multi-step CoT

Architectures

  • Vision-Language Models (encoder/decoder fusion)
  • Multimodal instruction‑tuned LLMs

Collaboration

  • Human-in-the-loop annotation and agreement checks

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The benchmark covers many datasets, but human annotation bias remains (the authors served as annotators).
  • Proprietary models (GPT‑4V, Gemini) dominate best results; access cost affects adoption.
  • Scoring and batch ranking remain unreliable across datasets and models.
  • CoT lowers hallucination frequency but often reduces alignment with human scores.

When Not To Use

  • As the sole method for numeric scoring in high‑stakes evaluations.
  • To fully automate ranked decisions without human sampling.
  • When hallucination or bias cannot be tolerated (safety‑critical systems).

Failure Modes

  • Egocentric bias: models prefer their own outputs.
  • Position bias: preference for answers in a fixed prompt position.
  • Length/verbosity bias: longer answers score higher.
  • Hallucinations in long chains or complex visual reasoning.
  • Inconsistency across repeated judgments (low MCC in some tasks).
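Position and length bias can be probed with simple checks on your own judge before trusting it. The sketch below is one possible way to do so, not the paper's protocol: `judge` is a hypothetical callable returning "A", "B", or "tie", and the two stub judges exist only to make the example runnable.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (first_answer, second_answer) -> "A" | "B" | "tie".
Judge = Callable[[str, str], str]


def position_flip_rate(pairs: Sequence[Tuple[str, str]], judge: Judge) -> float:
    """Share of pairs whose verdict changes when the two answers swap positions."""
    if not pairs:
        raise ValueError("need at least one pair")
    swap_back = {"A": "B", "B": "A", "tie": "tie"}
    flips = 0
    for first, second in pairs:
        original = judge(first, second)
        # Judge the swapped order, then map the verdict back to original labels.
        swapped = swap_back[judge(second, first)]
        if original != swapped:
            flips += 1
    return flips / len(pairs)


def mean_length_shift(scores_original: Sequence[float],
                      scores_lengthened: Sequence[float]) -> float:
    """Average score change after answers are padded without adding content."""
    if not scores_original or len(scores_original) != len(scores_lengthened):
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [long - short for short, long in zip(scores_original, scores_lengthened)]
    return sum(diffs) / len(diffs)


def always_first(a: str, b: str) -> str:
    """Stub judge with maximal position bias."""
    return "A"


def prefer_longer(a: str, b: str) -> str:
    """Stub judge that ignores position but favors length."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"
```

A flip rate near zero suggests position does not drive the verdict, and a mean length shift near zero suggests padding alone does not raise scores.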

Core Entities

Models

  • GPT-4V
  • Gemini-Pro-Vision
  • LLaVA-1.5-13b
  • LLaVA-1.6-34b
  • LLaVA-1.6-13b
  • LLaVA-1.6-7b
  • Qwen-VL-Max
  • Qwen-VL-Plus
  • Qwen-VL-Chat
  • CogVLM
  • LLaMA-2-70b
  • Mixtral-8x7b
  • ChatUnivi
  • VideoChat/Video-LLM

Metrics

  • Pearson similarity (scoring)
  • Accuracy
  • Normalized Levenshtein distance (batch ranking)
  • Human agreement percentage
  • Mean Absolute Deviation (MAD)
  • Majority Consistency Criterion (MCC)
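Two of the metrics above can be approximated in a few lines. The sketch below computes a normalized Levenshtein distance between a model ranking and a human ranking, and a simple majority share across repeated judgments. The normalization by the longer sequence and the majority-share formula are assumptions, and the paper's exact definitions (including its consistency threshold) may differ.

```python
from collections import Counter
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def normalized_levenshtein(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length; 0.0 means identical rankings."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def majority_share(judgments: Sequence[str]) -> float:
    """Fraction of repeated judgments that agree with the most common one."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return Counter(judgments).most_common(1)[0][1] / len(judgments)
```

For instance, the rankings "ABCD" and "ABDC" differ by two substitutions, so their normalized distance is 0.5.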

Datasets

  • MS COCO
  • Conceptual Captions
  • DiffusionDB
  • ChartQA
  • InfographicVQA
  • MathVista
  • TextVQA
  • WIT
  • CC-3M (concept-balanced)
  • VisIT-Bench
  • Mind2Web
  • AesBench
  • ScienceQA
  • MM-Vet
  • Mementos (sequential images)

Benchmarks

  • MLLM-as-a-Judge
  • MLLM-AS-A-JUDGE-HQ
  • MLLM-AS-A-JUDGE-HARD

Context Entities

Models

  • GPT-3.5
  • LLaMA
  • Mixtral

Metrics

  • BLEU/METEOR/CIDEr (mentioned as insufficient)

Datasets

  • VQA
  • ChartQA
  • InfographicVQA (context in related work)

Benchmarks

  • MM-Vet
  • Mementos (used for sequential tests)