SEED-Bench: a 19K-question, 12-dimension multiple-choice benchmark for testing image and video LLM comprehension

July 30, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.45

Cost Impact Score

0.3

Citation Count

52

Authors

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

Links

Abstract / PDF

Why It Matters For Business

SEED-Bench gives a large, objective test to reveal real weaknesses in multimodal models (OCR, spatial relations, temporal reasoning), so businesses should validate models on similar slices before deploying image/video features.

Summary TLDR

SEED-Bench is a large, objective benchmark for multimodal LLMs. It provides 19K human-verified multiple-choice questions across 12 spatial and temporal dimensions (images and videos). Questions are generated with foundation models and ChatGPT/GPT-4, filtered automatically (5.52% removed), and then checked by humans. The authors evaluate 18 models and find that most score below 50% on average; InstructBLIP leads (~53% overall), but gaps remain on fine-grained spatial relations, OCR, and temporal reasoning. The repo and leaderboard are public.

Problem Statement

Existing multimodal evaluations are small, subjective, or rely on free-form scoring by humans/GPT, making comparisons noisy. SEED-Bench builds a large, objective multiple-choice test (image+video) to measure generative comprehension across targeted dimensions.

Main Contribution

A large-scale, human-verified multiple-choice benchmark of 19K questions spanning 12 spatial and temporal dimensions.

A generation pipeline that extracts visual text (captions, instance descriptions, OCR), uses ChatGPT/GPT-4 to draft questions, filters non-visual questions automatically, then applies human verification.

An evaluation of 18 multimodal models using likelihood-based answer ranking and a public leaderboard to track progress.
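The likelihood-based answer ranking mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `ScoreFn` is a hypothetical interface for a model that returns per-token log-probabilities of a candidate answer, and `toy_score_fn` is a made-up scorer included only so the example runs.

```python
from typing import Callable, Sequence

# Hypothetical signature: (question, choice) -> per-token log-probabilities
# of the choice text. A real scorer would also condition on the image or video.
ScoreFn = Callable[[str, str], Sequence[float]]


def rank_choices(question: str, choices: Sequence[str], score_fn: ScoreFn) -> int:
    """Return the index of the choice with the highest total log-likelihood."""
    totals = [sum(score_fn(question, c)) for c in choices]
    return max(range(len(choices)), key=totals.__getitem__)


def toy_score_fn(question: str, choice: str) -> list[float]:
    """Made-up scorer: words that also appear in the question get a higher log-prob."""
    q_words = set(question.lower().split())
    return [-0.1 if w in q_words else -2.0 for w in choice.lower().split()]


if __name__ == "__main__":
    q = "what colour is the red car"
    options = ["blue", "red", "green", "yellow"]
    print(options[rank_choices(q, options, toy_score_fn)])
```

Because the model only scores fixed candidate strings, there is no free-form output to parse. Note that summing log-probabilities favors shorter choices; whether to length-normalize is a design decision this summary does not specify.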

Key Findings

SEED-Bench contains 19K human-verified multiple-choice questions across 12 dimensions.

Numbers: 19,242 questions; 12 dimensions

5.52% of autogenerated questions were answerable without the image and were filtered out.

Numbers: 5.52% filtered
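A rough sketch of how such a "answerable without the image" filter could work is shown below. The rule used here (drop a question only when every text-only answerer gets it right) is an assumption for illustration, and the answerers, item schema, and sample items are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical text-only answerer: (question, choices) -> predicted choice index.
TextOnlyAnswerer = Callable[[str, Sequence[str]], int]


def needs_image(item: dict, answerers: Sequence[TextOnlyAnswerer]) -> bool:
    """Keep the item unless every text-only answerer already gets it right (assumed rule)."""
    if not answerers:
        return True  # nothing to test against, so keep the question
    return not all(
        a(item["question"], item["choices"]) == item["answer"] for a in answerers
    )


def filter_items(items: list[dict], answerers: Sequence[TextOnlyAnswerer]):
    """Return the kept items and the fraction that was removed."""
    kept = [it for it in items if needs_image(it, answerers)]
    removed_rate = 1 - len(kept) / len(items) if items else 0.0
    return kept, removed_rate


# Made-up stand-ins for text-only language models.
def always_first(question: str, choices: Sequence[str]) -> int:
    return 0


def longest_choice(question: str, choices: Sequence[str]) -> int:
    return max(range(len(choices)), key=lambda i: len(choices[i]))


if __name__ == "__main__":
    items = [
        {"question": "Which is heavier?", "choices": ["elephant", "mouse"], "answer": 0},
        {"question": "What is the person holding?", "choices": ["cup", "umbrella"], "answer": 0},
    ]
    kept, rate = filter_items(items, [always_first, longest_choice])
    print(len(kept), rate)
```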

The top model, an InstructBLIP variant, reaches only about 53% overall accuracy on SEED-Bench.

Numbers: InstructBLIP (Vicuna) overall 53.37% accuracy

Most evaluated MLLMs have average accuracy below 50% across dimensions.

Numbers: Majority below 50% average accuracy

Image-based InstructBLIP outperforms VideoLLMs on temporal tasks in this benchmark.

Numbers: InstructBLIP temporal 38.31% vs VideoChat temporal 33.68%

OCR/text recognition is a weak spot: most models score under 40% on the text-recognition dimension.

Numbers: Text recognition accuracy <40% for most models; 85 samples in that split

Results

Benchmark size

Value: 19,242 multiple-choice questions

Accuracy (InstructBLIP Vicuna, overall)

Value: 53.37%

Accuracy

Value: ≈27% average

Automatic filtering rate

Value: 5.52%

Text-recognition split size

Value: 85 questions

Largest split (instance attributes)

Value: 4,649 questions

Temporal task example

Value: InstructBLIP temporal 38.31% vs VideoChat 33.68%

Who Should Care

What To Try In 7 Days

Run your model on SEED-Bench to find weak dimensions (OCR, relations, temporal).

Adopt likelihood-based ranking for multiple-choice evaluation to avoid label-formatting bugs.

If OCR matters, add specialized OCR preprocessing or fine-tune on text-rich image data and re-evaluate.
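As a starting point for the first step, per-dimension accuracy can be computed from a list of prediction records and the weak dimensions flagged. The record schema (`dimension`, `prediction`, `answer`), the 0.5 threshold, and the sample records are assumptions for illustration and are not taken from the SEED-Bench repo.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """records: iterable of dicts with 'dimension', 'prediction', 'answer' (assumed schema)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["prediction"] == r["answer"])
    return {d: correct[d] / total[d] for d in total}


def weak_dimensions(acc, threshold=0.5):
    """Dimensions whose accuracy is strictly below the threshold, weakest first."""
    return sorted((d for d, a in acc.items() if a < threshold), key=acc.get)


if __name__ == "__main__":
    # Made-up records; real ones would come from running a model on the benchmark.
    records = [
        {"dimension": "text_recognition", "prediction": "A", "answer": "B"},
        {"dimension": "text_recognition", "prediction": "C", "answer": "C"},
        {"dimension": "text_recognition", "prediction": "D", "answer": "A"},
        {"dimension": "scene_understanding", "prediction": "A", "answer": "A"},
        {"dimension": "scene_understanding", "prediction": "B", "answer": "B"},
    ]
    acc = accuracy_by_dimension(records)
    print(acc)
    print(weak_dimensions(acc))
```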

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Temporal questions rely on dataset ground-truth annotations rather than automatic video captioning, so the generation pipeline scales less readily to video QA (Sec. 3.3).
  • The text-recognition split is small (85 samples), so OCR conclusions have higher variance (Sec. 3.3).
  • Generated questions come from ChatGPT/GPT-4 and foundation models; any generation bias propagates into the benchmark.
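To make the variance point about the 85-sample split concrete, the sketch below computes a normal-approximation (Wald) 95% interval for an accuracy measured on n questions. The 40% accuracy is a hypothetical value chosen for illustration; the split sizes 85 and 4,649 come from the figures above.

```python
import math


def accuracy_interval(acc: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for an accuracy measured on n questions."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


if __name__ == "__main__":
    for n in (85, 4649):
        low, high = accuracy_interval(0.40, n)  # 0.40 is a hypothetical accuracy
        print(f"n={n}: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval is roughly ±10 percentage points at n=85 and roughly ±1.4 points at n=4,649, so small differences between models on the text-recognition split should be read with caution.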

When Not To Use

  • Do not use SEED-Bench as the only safety or fairness test for production systems.
  • Avoid using it as a proxy for domain-specific video tasks that need specialized sensors or long-range context.

Failure Modes

  • Models may guess common-sense answers without using image evidence; automatic filtering reduces but does not eliminate this.
  • Some dimensions have uneven sample sizes, which can skew overall averages if not weighted (e.g., 4,649 vs 85).
  • Benchmark generation tied to current foundation models may miss rare or adversarial visual phenomena.
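The effect of uneven split sizes on an overall score can be illustrated by comparing a question-weighted (micro) average with an unweighted per-dimension (macro) average. The split sizes 4,649 and 85 come from the figures above; the accuracies and the dimension keys are hypothetical.

```python
def micro_and_macro(acc_by_dim: dict[str, float], n_by_dim: dict[str, int]) -> tuple[float, float]:
    """Return (micro, macro): question-weighted vs per-dimension mean accuracy."""
    total = sum(n_by_dim[d] for d in acc_by_dim)
    micro = sum(acc_by_dim[d] * n_by_dim[d] for d in acc_by_dim) / total
    macro = sum(acc_by_dim.values()) / len(acc_by_dim)
    return micro, macro


if __name__ == "__main__":
    sizes = {"largest_split": 4649, "text_recognition": 85}
    accs = {"largest_split": 0.60, "text_recognition": 0.30}  # hypothetical accuracies
    micro, macro = micro_and_macro(accs, sizes)
    print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these made-up numbers the micro average stays close to the large split's accuracy while the macro average sits midway between the two, so it is worth stating which one a reported "overall" figure uses.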

Core Entities

Models

  • Flan-T5
  • Vicuna
  • LLaMA
  • BLIP-2
  • InstructBLIP
  • InstructBLIP Vicuna
  • LLaVA
  • MiniGPT-4
  • VPGTrans
  • MultiModal-GPT
  • Otter
  • OpenFlamingo
  • LLaMA-Adapter V2
  • GVT
  • mPLUG-Owl
  • VideoChat
  • Video-ChatGPT
  • Valley

Metrics

  • Accuracy
  • Rank

Datasets

  • CC3M
  • Something-Something-v2
  • EPIC-Kitchens 100
  • Breakfast

Benchmarks

  • MME
  • MMBench
  • LVLM-eHub
  • LAMM