Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
SimpleVQA surfaces concrete factual weaknesses in vision-capable LLMs. If you rely on multimodal outputs for decisions, running it before deployment shows which visual facts your model gets wrong, which reduces the risk of acting on incorrect answers.
Summary TLDR
SimpleVQA is a bilingual (Chinese/English) visual question-answering benchmark of 2,025 short, fact-seeking image Q&As designed to probe factuality in multimodal LLMs. It focuses on time-stable, verifiable facts across 9 tasks and 9 domains. The dataset was curated with LLM-assisted generation, multi-stage human verification, and difficulty filtering. The authors evaluate 18 multimodal LLMs and 8 text-only LLMs, show broad factual weaknesses (many models score in the 30–56% F-score range), and introduce atomic-fact probing to separate failures due to poor visual recognition from missing internal knowledge.
Problem Statement
Current benchmarks measure vision-language skills but miss short-form factuality that depends on both image understanding and stored knowledge. The field lacks a concise, verifiable VQA set that: (a) targets time-stable facts; (b) is bilingual; (c) separates visual-comprehension failures from internal-knowledge failures.
Main Contribution
SimpleVQA: a curated bilingual VQA benchmark with 2,025 short fact questions and static reference answers.
Nine task categories and nine domain topics to cover varied factual queries (e.g., object ID, time/event, text processing).
A multi-stage pipeline: seed collection, GPT-4o-assisted refinement, LLM screening, multi-annotator human checks, and difficulty filtering.
An atomic-fact probing protocol that breaks failed questions into smaller "atomic" questions to distinguish visual-recognition failures from knowledge gaps.
A large-scale evaluation: 18 MLLMs and 8 text-only LLMs scored with an LLM-as-a-judge setup and human verification.
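The difficulty-filtering step of the pipeline can be pictured with a small sketch. The summary only says that items were filtered for difficulty using other MLLMs; the rule below (keep an item when at most `max_correct` reference models answer it correctly) and all names are hypothetical, chosen for illustration rather than taken from the paper.

```python
def filter_hard_items(item_ids, reference_results, max_correct=1):
    """Keep items that few reference models answer correctly.

    item_ids: list of item identifiers.
    reference_results: dict mapping model name -> dict of item_id -> bool
        (True if that reference model answered the item correctly).
    max_correct: hypothetical threshold; not a value reported by the authors.
    """
    kept = []
    for item_id in item_ids:
        # Count how many reference models got this item right.
        n_correct = sum(
            1 for per_model in reference_results.values()
            if per_model.get(item_id, False)
        )
        if n_correct <= max_correct:
            kept.append(item_id)
    return kept


# Toy data: three reference models, three candidate items.
reference_results = {
    "model_a": {"q1": True, "q2": False, "q3": True},
    "model_b": {"q1": True, "q2": False, "q3": False},
    "model_c": {"q1": True, "q2": True, "q3": False},
}
hard_items = filter_hard_items(["q1", "q2", "q3"], reference_results)
print(hard_items)
```

A stricter threshold (`max_correct=0`) keeps only items that every reference model misses, which makes the benchmark harder but also smaller.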
Key Findings
SimpleVQA contains 2,025 high-quality Q&A samples across 9 tasks and 9 domains.
Most evaluated MLLMs show modest factual accuracy on this benchmark.
Supplying the atomic fact as a hint often raises final-answer accuracy, which helps separate visual-recognition failures from knowledge failures.
The dataset was purposely filtered for difficulty using other MLLMs.
Annotation costs and verification were non-trivial but affordable.
Results
SimpleVQA dataset size: 2,025 Q&A pairs
Models evaluated: 18 multimodal LLMs and 8 text-only LLMs
Top reported F-score (approx.)
CFQ (atomic-given) improvement
Who Should Care
What To Try In 7 Days
Run SimpleVQA on your model to get a quick factuality baseline.
Use atomic-fact probing: convert failing items into smaller visual facts and retest to locate failure cause.
Add focused SFT (supervised fine-tuning) data for visual labels found weak by SimpleVQA and re-evaluate performance delta.
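The atomic-fact probing step can be sketched as a simple triage over three graded answers per item: the original question, the atomic visual question, and the original question with the atomic fact supplied. This is a minimal sketch; the function names and category labels are illustrative assumptions, not the authors' protocol.

```python
from collections import Counter


def diagnose_failure(original_correct, atomic_correct, hinted_correct):
    """Assign a coarse failure category to one benchmark item.

    original_correct: model answered the full question correctly.
    atomic_correct: model answered the atomic visual question correctly
        (e.g. identifying the object or person shown).
    hinted_correct: model answered the full question correctly once the
        atomic fact was given as a hint.
    """
    if original_correct:
        return "no_failure"
    if atomic_correct:
        # The model recognized the image content but still answered wrongly.
        return "knowledge_gap"
    if hinted_correct:
        # The hint fixed the answer, so recognition was the blocker.
        return "visual_recognition"
    return "visual_and_knowledge"


def summarize(probe_rows):
    """Tally failure categories over (original, atomic, hinted) tuples."""
    return Counter(diagnose_failure(*row) for row in probe_rows)


rows = [
    (True, True, True),
    (False, False, True),
    (False, True, False),
    (False, False, False),
]
summary = summarize(rows)
print(dict(summary))
```

Items tagged `visual_recognition` are natural candidates for the focused SFT data mentioned above, while `knowledge_gap` items point to missing stored facts rather than perception.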
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relatively small: 2,025 examples may not cover all visual facts and long-tail cases.
- No public dataset URL or code in paper text; replicability may require contacting authors.
- Benchmark intentionally filters out easy items, so raw accuracy may understate general competence on everyday VQA.
- LLM-as-judge can introduce evaluator bias; the authors used GPT-4o as the judge, and GPT-4o also assisted in dataset construction, so the judge may favor models with similar behavior.
When Not To Use
- Not ideal when you need to evaluate broad multimodal reasoning or long-form answers.
- Not suitable for political, safety, or time-sensitive evaluations (excluded by design).
- Not a replacement for large-scale domain-specific factual audits due to size limits.
Failure Modes
- Visual recognition errors (object/person ID, text-in-image), detectable via atomic probes.
- Missing internalized knowledge even when the visual cue is provided (indicated by small atomic-given gains).
- Overconfidence and hallucination despite low factual accuracy (calibration issues shown).
Core Entities
Models
- GPT-4o
- GPT-4o-mini
- Gemini-2.0-flash
- Claude-3.5-Sonnet
- Qwen2.5-VL-72B-Instruct
- InternVL2.5-78B-MPO
- Doubao-vision-pro-128k
- ERNIE-VL
Metrics
- CO
- NA
- IN
- CGA
- F-score
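The abbreviations above are not expanded in this summary. Assuming they follow the SimpleQA convention (CO = correct, NA = not attempted, IN = incorrect, CGA = correct given attempted, F-score = harmonic mean of CO and CGA), they could be computed from per-item judge grades roughly as follows. The grade labels and function name are hypothetical.

```python
from collections import Counter

VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def simpleqa_style_metrics(grades):
    """Compute CO, NA, IN, CGA and F-score from a list of per-item grades.

    Assumes the SimpleQA-style definitions; this summary does not spell
    them out, so treat the formulas as an assumption.
    """
    grades = list(grades)
    if not grades:
        raise ValueError("no grades supplied")
    unknown = set(grades) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    co = counts["correct"] / total
    in_ = counts["incorrect"] / total
    na = counts["not_attempted"] / total

    # CGA: accuracy restricted to the items the model actually attempted.
    attempted = counts["correct"] + counts["incorrect"]
    cga = counts["correct"] / attempted if attempted else 0.0

    # F-score: harmonic mean of CO and CGA.
    f_score = 2 * co * cga / (co + cga) if (co + cga) else 0.0
    return {"CO": co, "NA": na, "IN": in_, "CGA": cga, "F-score": f_score}


example = simpleqa_style_metrics(
    ["correct"] * 4 + ["incorrect"] * 4 + ["not_attempted"] * 2
)
print(example)
```

Under these definitions a model that declines to answer when unsure can raise CGA without raising CO, which is why the F-score combines the two.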
Datasets
- SimpleVQA
- SimpleQA
- Chinese SimpleQA
- MMBench
- MM-Vet
- MME
- DynaMath
- MMBench_CN
- CCBench
Benchmarks
- SimpleVQA
- SimpleQA
- Chinese SimpleQA
- MMBench
- DynaMath
Context Entities
Models
- Qwen-Max
- InternVL2-Llama3-76B
- Janus-Pro-7B
- DeepSeek-R1
Metrics
- Accuracy
- LLM-as-a-Judge
Datasets
- MM-Vet
- MMMU
- MMMU-Pro
Benchmarks
- ChineseFactEval
- AGI-Eval
- C-Eval

