Overview
Production Readiness
0.4
Novelty Score
0.45
Cost Impact Score
0.3
Citation Count
52
Why It Matters For Business
SEED-Bench is a large, objective test that exposes concrete weaknesses in multimodal models (OCR, spatial relations, temporal reasoning). Businesses should validate candidate models on comparable slices before shipping image or video features.
Summary TLDR
SEED-Bench is a large, objective benchmark for multimodal LLMs. It provides 19K human-verified multiple-choice questions across 12 spatial and temporal dimensions (images and videos). Questions are generated with foundation models and ChatGPT/GPT-4, filtered automatically (5.52% removed), and then checked by humans. The authors evaluate 18 models and find that most score below 50% on average. InstructBLIP leads (~53% overall), but gaps remain on fine-grained spatial relations, OCR, and temporal reasoning. The repo and leaderboard are public.
Problem Statement
Existing multimodal evaluations are small, subjective, or rely on free-form scoring by humans or GPT, which makes comparisons noisy. SEED-Bench builds a large, objective multiple-choice test (image and video) to measure generative comprehension across targeted dimensions.
Main Contribution
A large-scale, human-verified multiple-choice benchmark of 19K questions spanning 12 spatial and temporal dimensions.
A generation pipeline that extracts visual text (captions, instance descriptions, OCR), uses ChatGPT/GPT-4 to draft questions, filters non-visual questions automatically, then applies human verification.
An evaluation of 18 multimodal models using likelihood-based answer ranking and a public leaderboard to track progress.
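The automatic filtering step in the pipeline above can be illustrated with a minimal sketch. It assumes a question is dropped when every text-only answerer picks the correct option without seeing the image. The unanimity rule, the dictionary keys, and the `TextOnlyAnswerer` callables are assumptions for illustration, not details confirmed by this summary.

```python
from typing import Callable, Dict, List

# Hypothetical shapes: a question is a plain dict, and a text-only answerer is
# any callable that returns an option index given only the question text.
Question = Dict[str, object]
TextOnlyAnswerer = Callable[[str, List[str]], int]


def is_answerable_without_image(question: Question,
                                answerers: List[TextOnlyAnswerer]) -> bool:
    """Return True if every text-only answerer picks the correct option.

    Requiring unanimity is an assumption of this sketch.
    """
    if not answerers:
        # With no answerers there is no evidence, so keep the question.
        return False
    stem = question["question"]
    options = question["options"]
    correct = question["answer_index"]
    return all(answer(stem, options) == correct for answer in answerers)


def filter_questions(questions: List[Question],
                     answerers: List[TextOnlyAnswerer]) -> List[Question]:
    """Keep only the questions that appear to need the visual input."""
    return [q for q in questions
            if not is_answerable_without_image(q, answerers)]
```

In practice each answerer would wrap a language model that is prompted with the question and options but no image. Human verification still follows, because a filter of this kind only catches questions the chosen models happen to solve.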
Key Findings
SEED-Bench contains 19K human-verified multiple-choice questions across 12 dimensions.
5.52% of autogenerated questions were answerable without the image and were filtered out.
The top model family (InstructBLIP variants) reaches only about 53% overall accuracy on SEED-Bench.
Most evaluated MLLMs have average accuracy below 50% across dimensions.
Image-based InstructBLIP outperforms VideoLLMs on temporal tasks in this benchmark.
OCR/text-recognition is a weak spot: most models score under 40% on text understanding.
Results
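Benchmark size
19K human-verified multiple-choice questions
Accuracy
About 53% overall (top model, InstructBLIP)
Accuracy
Below 50% on average (most evaluated models)
Automatic filtering rate
5.52%
Text-recognition split size
85 samples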
Largest split (instance localization)
Temporal task example
Who Should Care
What To Try In 7 Days
Run your model on SEED-Bench to find weak dimensions (OCR, relations, temporal).
Adopt likelihood-based ranking for multiple-choice evaluation to avoid label-formatting bugs.
If OCR matters, add specialized OCR preprocessing or fine-tune on text-rich image data and re-evaluate.
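Likelihood-based ranking, as suggested in the steps above, can be sketched as follows. The model never has to emit a letter such as "A" or "B"; each option's text is scored and the highest-scoring option is taken as the prediction. The `log_likelihood` callable is a hypothetical stand-in for whatever scoring call your model exposes, and the dictionary keys are illustrative.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer: returns log p(option text | question [+ image]).
Scorer = Callable[[str, str], float]


def rank_options(question: str, options: Sequence[str],
                 log_likelihood: Scorer) -> List[int]:
    """Return option indices sorted from most to least likely."""
    scores = [log_likelihood(question, option) for option in options]
    return sorted(range(len(options)), key=lambda i: scores[i], reverse=True)


def predict(question: str, options: Sequence[str],
            log_likelihood: Scorer) -> int:
    """Pick the option whose text the model finds most likely."""
    return rank_options(question, options, log_likelihood)[0]


def accuracy(examples: Sequence[Dict[str, object]],
             log_likelihood: Scorer) -> float:
    """Fraction of examples where the top-ranked option is the answer."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        predict(ex["question"], ex["options"], log_likelihood) == ex["answer_index"]
        for ex in examples
    )
    return correct / len(examples)
```

Because the prediction is an index derived from scores, there is no answer string to parse, which is where formatting bugs usually enter. Whether to normalize scores by option length is a design choice this sketch leaves open.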
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Temporal questions are built from the source datasets' ground-truth annotations rather than automatic video captioning, so question generation for video QA does not scale automatically (Sec. 3.3).
- The text-recognition split is small (85 samples), so OCR conclusions have higher variance (Sec. 3.3).
- Generated questions come from ChatGPT/GPT-4 and foundation models; any generation bias propagates into the benchmark.
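To make the variance concern for the 85-sample text-recognition split concrete, here is a minimal sketch of the binomial standard error of an accuracy estimate. It assumes questions are independent and uses a normal approximation. The 40% accuracy used in the note below is illustrative, not a reported result.

```python
import math
from typing import Tuple


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def normal_interval(accuracy: float, n: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    half_width = z * accuracy_standard_error(accuracy, n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)
```

At an illustrative accuracy of 0.40, `accuracy_standard_error(0.40, 85)` is about 0.053, so the approximate 95% interval spans roughly ten points either side. With 4,649 samples the same accuracy gives a standard error of about 0.007. Small differences between models on the 85-sample split are therefore hard to distinguish from noise.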
When Not To Use
- Do not use SEED-Bench as the only safety or fairness test for production systems.
- Avoid using it as a proxy for domain-specific video tasks that need specialized sensors or long-range context.
Failure Modes
- Models may guess common-sense answers without using image evidence; automatic filtering reduces but does not eliminate this.
- Sample sizes vary widely across dimensions (e.g., 4,649 vs. 85 questions), so the overall average depends heavily on whether it is computed per question or per dimension.
- Benchmark generation tied to current foundation models may miss rare or adversarial visual phenomena.
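The effect of uneven split sizes on an overall score can be sketched by comparing a per-question (micro) average with a per-dimension (macro) average. The split sizes 4,649 and 85 come from the text above. The correct counts and the split name `largest_split` are hypothetical, chosen only to illustrate the gap.

```python
from typing import Dict, Tuple

# Maps split name -> (number correct, number of questions).
SplitResults = Dict[str, Tuple[int, int]]


def micro_average(results: SplitResults) -> float:
    """Pool all questions: large splits dominate the score."""
    total_correct = sum(correct for correct, _ in results.values())
    total_questions = sum(n for _, n in results.values())
    return total_correct / total_questions


def macro_average(results: SplitResults) -> float:
    """Average per-split accuracies: every split counts equally."""
    per_split = [correct / n for correct, n in results.values()]
    return sum(per_split) / len(per_split)


# Hypothetical correct counts; only the split sizes come from the text.
example: SplitResults = {
    "largest_split": (2789, 4649),    # about 60% correct
    "text_recognition": (17, 85),     # 20% correct
}
```

With these made-up counts, `micro_average(example)` is close to 0.59 while `macro_average(example)` is close to 0.40. Reporting both, along with per-dimension scores, keeps a weak small split from being hidden by a strong large one.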
Core Entities
Models
- Flan-T5
- Vicuna
- LLaMA
- BLIP-2
- InstructBLIP
- InstructBLIP Vicuna
- LLaVA
- MiniGPT-4
- VPGTrans
- MultiModal-GPT
- Otter
- OpenFlamingo
- LLaMA-Adapter V2
- GVT
- mPLUG-Owl
- VideoChat
- Video-ChatGPT
- Valley
Metrics
- Accuracy
- Rank
Datasets
- CC3M
- Something-Something-v2
- EPIC-KITCHENS-100
- Breakfast
Benchmarks
- MME
- MMBench
- LVLM-eHub
- LAMM

