SEED-Bench: a 19K, 12-dimension multiple-choice benchmark for testing image and video LLM comprehension

July 30, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible (code/data links), but model gaps and dataset biases mean results should guide targeted improvements, not final deployment decisions.

Citations: 52

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 45%

Authors

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SEED-Bench gives a large, objective test to reveal real weaknesses in multimodal models (OCR, spatial relations, temporal reasoning), so businesses should validate models on similar slices before deploying image/video features.

Summary TLDR

SEED-Bench is a large, objective benchmark for multimodal LLMs. It provides 19K human-verified multiple-choice questions across 12 evaluation dimensions covering spatial and temporal understanding (images and videos). Questions are generated with foundation models and ChatGPT/GPT-4, then automatically filtered (5.52% removed) and human-checked. The authors evaluate 18 models and find that most average below 50% accuracy; InstructBLIP leads (~53% overall), but gaps remain on fine-grained spatial relations, OCR, and temporal reasoning. The repo and leaderboard are public.

Problem Statement

Existing multimodal evaluations are small, subjective, or rely on free-form scoring by humans/GPT, making comparisons noisy. SEED-Bench builds a large, objective multiple-choice test (image+video) to measure generative comprehension across targeted dimensions.

Main Contribution

A large-scale, human-verified multiple-choice benchmark of 19K questions spanning 12 spatial and temporal dimensions.

A generation pipeline that extracts visual text (captions, instance descriptions, OCR), uses ChatGPT/GPT-4 to draft questions, filters non-visual questions automatically, then applies human verification.

Key Findings

SEED-Bench contains 19K human-verified multiple-choice questions across 12 dimensions.

Numbers: 19,242 questions; 12 dimensions

Practical Use: Use this dataset for stronger, more stable comparisons of multimodal models than prior small benchmarks.

Evidence Ref: Abstract; Sec.3 (Fig.1, Fig.2)

5.52% of autogenerated questions were answerable without the image and were filtered out.

Numbers: 5.52% filtered

Practical Use: Expect and remove shortcut questions when auto-generating QA from model outputs to avoid overestimating language-only skill.

Evidence Ref: Sec.3.3 Automatic Filtering
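The filtering idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a question is dropped when every text-only model picks the correct choice without seeing the image. The record layout, the `BlindAnswerer` signature, and the two toy answerers are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical record layout: {"question": str, "choices": [str, ...], "answer": int}
Question = Dict[str, object]
# Hypothetical text-only model: sees the question and choices, never the image,
# and returns the index of the choice it picks.
BlindAnswerer = Callable[[str, Sequence[str]], int]


def filter_text_answerable(
    questions: Sequence[Question],
    answerers: Sequence[BlindAnswerer],
) -> Tuple[List[Question], List[Question]]:
    """Split questions into (kept, removed) using text-only models.

    Assumed rule: remove a question only if ALL answerers get it right blind.
    """
    if not answerers:
        raise ValueError("need at least one text-only answerer")
    kept, removed = [], []
    for q in questions:
        solved_blind = all(
            answer(q["question"], q["choices"]) == q["answer"] for answer in answerers
        )
        (removed if solved_blind else kept).append(q)
    return kept, removed


# Toy stand-ins for real language models (illustration only).
def always_first(question, choices):
    return 0


def picks_sky_is_blue(question, choices):
    return choices.index("blue") if "sky" in question and "blue" in choices else 0


questions = [
    {"question": "What color is the sky on a clear day?", "choices": ["blue", "green"], "answer": 0},
    {"question": "What is the person holding?", "choices": ["a cup", "a phone"], "answer": 1},
]
kept, removed = filter_text_answerable(questions, [always_first, picks_sky_is_blue])
removed_share = len(removed) / len(questions)  # share of shortcut questions in this toy set
```

Requiring agreement from all answerers is the conservative choice: it removes only questions that are clearly solvable from text, at the cost of letting some shortcut questions through.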

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Benchmark size | 19,242 multiple-choice questions | - | - | Overall (12 dims) | Sec.3, Abstract | Fig.1, Sec.3
Accuracy | 53.37% | - | - | InstructBLIP (Vicuna) on SEED-Bench | Table 3; Sec.4.2 | Table 3

What To Try In 7 Days

Run your model on SEED-Bench to find weak dimensions (OCR, relations, temporal).

Adopt likelihood-based ranking for multiple-choice evaluation to avoid label-formatting bugs.

If OCR matters, add specialized OCR preprocessing or fine-tune on text-rich image data and re-evaluate.
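For the likelihood-based ranking step, a minimal Python sketch follows. It assumes you can obtain a log-likelihood for each answer choice from your model; `log_likelihood` is a hypothetical callable standing in for that model call, and `toy_log_likelihood` with its score table is made up purely so the example runs.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer: returns log p(choice | image, question) from your model.
Scorer = Callable[[str, str], float]


def predict_choice(question: str, choices: Sequence[str], log_likelihood: Scorer) -> int:
    """Return the index of the choice the model finds most likely."""
    scores = [log_likelihood(question, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(samples: Sequence[Dict[str, object]], log_likelihood: Scorer) -> float:
    """Fraction of samples where the top-ranked choice matches the label."""
    correct = sum(
        predict_choice(s["question"], s["choices"], log_likelihood) == s["answer"]
        for s in samples
    )
    return correct / len(samples)


# Toy scores standing in for real model output (illustration only).
TOY_SCORES = {"a cat": -2.3, "a dog": -0.7, "a car": -4.1, "a tree": -3.0}


def toy_log_likelihood(question: str, choice: str) -> float:
    return TOY_SCORES[choice]


samples: List[Dict[str, object]] = [
    {"question": "What animal is shown?", "choices": ["a cat", "a dog", "a car", "a tree"], "answer": 1},
]
predicted = predict_choice(samples[0]["question"], samples[0]["choices"], toy_log_likelihood)
```

Because the prediction is an argmax over scores, the model never has to emit "A", "B", "(a)" or similar labels, so there is no answer-string parsing to get wrong.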

Risks & Boundaries

Limitations

Temporal questions rely on dataset ground-truth rather than automatic video captioning, so automatic scalability for video QA is limited (Sec.3.3).

Text-recognition split is small (85 samples), so OCR conclusions have higher variance (Sec.3.3).
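To make the variance point concrete, here is a small Python sketch of the binomial standard error of an accuracy estimate. The sample sizes 85 and 4,649 come from this summary; the 50% accuracy is a hypothetical worst case chosen for illustration, and treating questions as independent trials is a simplifying assumption.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Hypothetical 50% accuracy; the sample sizes are the ones quoted in this summary.
se_small_split = accuracy_standard_error(0.5, 85)     # about 0.054
se_large_split = accuracy_standard_error(0.5, 4649)   # about 0.007
```

At 85 samples, one standard error is roughly five percentage points, so two models a few points apart on text recognition may not be meaningfully different.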

When Not To Use

Do not use SEED-Bench as the only safety or fairness test for production systems.

Avoid using it as a proxy for domain-specific video tasks that need specialized sensors or long-range context.

Failure Modes

Models may guess common-sense answers without using image evidence; automatic filtering reduces but does not eliminate this.

Some dimensions have uneven sample sizes, which can skew overall averages if not weighted (e.g., 4,649 vs 85).
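The effect of uneven sample sizes on an overall score can be shown with a short Python sketch. The counts 4,649 and 85 are from this summary; the per-dimension accuracies and the dimension labels are hypothetical, picked only to show how the two averaging schemes diverge.

```python
from typing import Dict, Tuple

# dimension label -> (number of questions, accuracy)
# Counts are from this summary; accuracies and labels are hypothetical.
results: Dict[str, Tuple[int, float]] = {
    "largest_dimension": (4649, 0.60),
    "text_recognition": (85, 0.20),
}


def micro_average(res: Dict[str, Tuple[int, float]]) -> float:
    """Per-question average: every question counts equally."""
    total = sum(n for n, _ in res.values())
    return sum(n * acc for n, acc in res.values()) / total


def macro_average(res: Dict[str, Tuple[int, float]]) -> float:
    """Per-dimension average: every dimension counts equally."""
    return sum(acc for _, acc in res.values()) / len(res)


micro = micro_average(results)  # dominated by the 4,649-question dimension
macro = macro_average(results)  # the 85-question dimension has equal say
```

With these illustrative numbers the per-question average sits near 0.59 while the per-dimension average is 0.40, so it is worth stating which one an "overall" figure uses and reporting per-dimension scores alongside it.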

Core Entities

Models

Flan-T5, Vicuna, LLaMA, BLIP-2, InstructBLIP, InstructBLIP Vicuna, LLaVA, MiniGPT-4, VPGTrans, MultiModal-GPT, Otter, OpenFlamingo, LLaMA-Adapter V2, GVT, mPLUG-Owl, VideoChat, Video-ChatGPT, Valley

Metrics

Accuracy, Rank

Datasets

CC3M, Something-Something-v2, Epic-Kitchen 100, Breakfast

Benchmarks

MME, MMBench, LVLM-eHub, LAMM