SEED-Bench: a 19K, 12-dimension multiple-choice benchmark for testing image and video LLM comprehension

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible (code/data links), but model gaps and dataset biases mean results should guide targeted improvements, not final deployment decisions.

Citations: 52

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 45%

Authors

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SEED-Bench gives a large, objective test to reveal real weaknesses in multimodal models (OCR, spatial relations, temporal reasoning), so businesses should validate models on similar slices before deploying image/video features.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

SEED-Bench is a large objective benchmark for multimodal LLMs. It provides 19K human-verified multiple-choice questions across 12 spatial and temporal dimensions (images + videos). Questions are generated with foundation models and ChatGPT/GPT-4, automatically filtered (5.52% removed), and then human-checked. The authors evaluate 18 models and find that most average below 50% accuracy; InstructBLIP leads (~53% overall), but gaps remain on fine-grained spatial relations, OCR, and temporal reasoning. The repo and leaderboard are public.

Problem Statement

Existing multimodal evaluations are small, subjective, or rely on free-form scoring by humans/GPT, making comparisons noisy. SEED-Bench builds a large, objective multiple-choice test (image+video) to measure generative comprehension across targeted dimensions.

Main Contribution

A large-scale, human-verified multiple-choice benchmark of 19K questions spanning 12 spatial and temporal dimensions.

A generation pipeline that extracts visual text (captions, instance descriptions, OCR), uses ChatGPT/GPT-4 to draft questions, filters non-visual questions automatically, then applies human verification.

Key Findings

SEED-Bench contains 19K human-verified multiple-choice questions across 12 dimensions.

Numbers: 19,242 questions; 12 dimensions

Practical Use: Use this dataset for stronger, more stable comparisons of multimodal models than prior small benchmarks.

Evidence Ref: Abstract; Sec.3 (Fig.1, Fig.2)

5.52% of autogenerated questions were answerable without the image and were filtered out.

Numbers: 5.52% filtered

Practical Use: Expect and remove shortcut questions when auto-generating QA from model outputs to avoid overestimating language-only skill.

Evidence Ref: Sec.3.3 Automatic Filtering
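The filtering idea in this finding can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the text-only answerers are hypothetical callables standing in for language-only models, the record fields (`question`, `choices`, `answer`) are illustrative rather than the SEED-Bench schema, and the rule "drop a question when every text-only answerer gets it right" is an assumption about how "answerable without the image" might be operationalized.

```python
from typing import Callable, Dict, List, Sequence

# Illustrative record layout; not the actual SEED-Bench schema.
Question = Dict[str, object]
# Hypothetical text-only answerer: sees only the question text and the
# choices (no image) and returns the index of the choice it picks.
TextOnlyAnswerer = Callable[[str, Sequence[str]], int]


def filter_shortcut_questions(
    questions: List[Question],
    answerers: Sequence[TextOnlyAnswerer],
) -> List[Question]:
    """Keep questions that at least one text-only answerer gets wrong.

    Assumed rule: a question is a "shortcut" if every answerer picks the
    correct choice without seeing the image.
    """
    if not answerers:
        raise ValueError("at least one text-only answerer is required")
    kept = []
    for q in questions:
        solved_blind = all(
            fn(q["question"], q["choices"]) == q["answer"] for fn in answerers
        )
        if not solved_blind:
            kept.append(q)
    return kept


def removed_fraction(before: int, after: int) -> float:
    """Share of questions dropped by the filter."""
    return (before - after) / before if before else 0.0
```

Tracking `removed_fraction` on your own auto-generated questions gives a number to compare against the 5.52% reported here, though a different set of answerers or a different rule will give a different rate.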

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Benchmark size	19,242 multiple-choice questions	—	—	Overall (12 dims)	Sec.3 Abstract	Fig.1, Sec.3
Accuracy	53.37%	—	—	InstructBLIP (Vicuna) on SEED-Bench	Table 3; Sec.4.2	Table 3

What To Try In 7 Days

Run your model on SEED-Bench to find weak dimensions (OCR, relations, temporal).

Adopt likelihood-based ranking for multiple-choice evaluation to avoid label-formatting bugs.

If OCR matters, add specialized OCR preprocessing or fine-tune on text-rich image data and re-evaluate.
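The likelihood-based ranking suggested above can be sketched as below. The sketch assumes you can obtain, for each candidate answer, the log-likelihood your model assigns to that answer given the prompt; `loglik` is a hypothetical callable you would implement against your own model, not an API from the SEED-Bench repo. Because the prediction is an argmax over scores, no parsing of generated "A/B/C/D" text is needed.

```python
from typing import Callable, Sequence

# Hypothetical scorer: log-likelihood the model assigns to `choice` as a
# continuation of `prompt`. A real one would typically sum the token
# log-probabilities of the choice text under the model.
LogLikelihoodFn = Callable[[str, str], float]


def rank_choices(prompt: str, choices: Sequence[str], loglik: LogLikelihoodFn) -> int:
    """Return the index of the choice with the highest log-likelihood."""
    if not choices:
        raise ValueError("choices must not be empty")
    scores = [loglik(prompt, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of predicted indices that match the ground-truth indices."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("answers must not be empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Whether to length-normalize the scores (for example, dividing by the number of tokens in each choice) is a design choice this summary does not specify; it is worth checking both variants if your answer options differ a lot in length.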

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AILab-CVC/SEED-Bench

Data URLs

https://ai.google/research/CC3M (CC3M)

https://20bn.com/ (Something-Something-v2)

https://epic-kitchens.github.io/ (Epic-Kitchen 100)

https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/ (Breakfast)

Risks & Boundaries

Limitations

Temporal questions rely on dataset ground-truth rather than automatic video captioning, so automatic scalability for video QA is limited (Sec.3.3).

Text-recognition split is small (85 samples), so OCR conclusions have higher variance (Sec.3.3).
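To make the variance concern concrete, a confidence interval on accuracy can be computed per dimension. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the paper reports; the counts in the usage note are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

For a model near 50% accuracy, `wilson_interval(43, 85)` spans roughly ten percentage points on either side of the point estimate, so differences of a few points between models on the text-recognition split should not be over-read.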

When Not To Use

Do not use SEED-Bench as the only safety or fairness test for production systems.

Avoid using it as a proxy for domain-specific video tasks that need specialized sensors or long-range context.

Failure Modes

Models may guess common-sense answers without using image evidence; automatic filtering reduces but does not eliminate this.

Some dimensions have uneven sample sizes, which can skew overall averages if not weighted (e.g., 4,649 vs 85).
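The effect of uneven sample sizes on the overall average can be illustrated by comparing a question-weighted (micro) average with an unweighted per-dimension (macro) average. The helper names are introduced here for illustration, and any correct-answer counts you plug in are your own; only the dimension sizes 4,649 and 85 come from the text above.

```python
from typing import Dict, Tuple

# Maps dimension name -> (number correct, number of questions).
PerDimension = Dict[str, Tuple[int, int]]


def micro_average(per_dim: PerDimension) -> float:
    """Accuracy over all questions pooled; large dimensions dominate."""
    total_correct = sum(correct for correct, _ in per_dim.values())
    total = sum(n for _, n in per_dim.values())
    if total == 0:
        raise ValueError("no questions to average over")
    return total_correct / total


def macro_average(per_dim: PerDimension) -> float:
    """Mean of per-dimension accuracies; every dimension counts equally."""
    if not per_dim or any(n == 0 for _, n in per_dim.values()):
        raise ValueError("every dimension needs at least one question")
    return sum(correct / n for correct, n in per_dim.values()) / len(per_dim)
```

With a 4,649-question dimension next to an 85-question one, a model that is strong on the former and weak on the latter will show a micro average close to its large-dimension accuracy, while the macro average exposes the weak dimension. Reporting both, along with the per-dimension table, avoids hiding that gap.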

Core Entities

Models

Flan-T5, Vicuna, LLaMA, BLIP-2, InstructBLIP, InstructBLIP Vicuna, LLaVA, MiniGPT-4, VPGTrans, MultiModal-GPT, Otter, OpenFlamingo, LLaMA-Adapter V2, GVT, mPLUG-Owl, VideoChat, Video-ChatGPT, Valley

Metrics

Accuracy, Rank

Datasets

CC3M, Something-Something-v2, Epic-Kitchen 100, Breakfast

Benchmarks

MME, MMBench, LVLM-eHub, LAMM

