Overview
The benchmark is usable as a focused evaluation tool. Its results are supported by multi-model tables and human checks, but the paper provides no public code or dataset URLs.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
SimpleVQA surfaces concrete factual weaknesses in vision-capable LLMs, which reduces risk when decisions depend on multimodal outputs. Run it before deployment to find the visual facts a model gets wrong.
Summary TLDR
SimpleVQA is a bilingual (Chinese/English) visual question-answering benchmark of 2,025 short, fact-seeking image Q&As designed to probe factuality in multimodal LLMs. It focuses on time-stable, verifiable facts across 9 tasks and 9 domains. The dataset was curated with LLM-assisted generation, multi-stage human verification, and difficulty filtering. The authors evaluate 18 multimodal LLMs and 8 text-only LLMs, show broad factual weaknesses (many models score in the 30–56% F-score range), and introduce atomic-fact probing to separate failures due to poor visual recognition from missing internal knowledge.
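The summary reports F-scores without defining them. The sketch below assumes the convention used by short-form factuality benchmarks in the SimpleQA style: each answer is graded correct, incorrect, or not attempted, and the F-score is the harmonic mean of overall accuracy and accuracy on attempted questions. Whether SimpleVQA uses exactly this definition is an assumption, and the grade labels and function name are hypothetical.

```python
from collections import Counter


def f_score(grades):
    """Harmonic mean of overall accuracy and accuracy on attempted items.

    grades: iterable of 'correct', 'incorrect', or 'not_attempted'
    (hypothetical labels; the benchmark's own grader may name them differently).
    """
    counts = Counter(grades)
    total = sum(counts.values())
    attempted = counts["correct"] + counts["incorrect"]
    if total == 0 or attempted == 0:
        return 0.0
    overall_correct = counts["correct"] / total
    correct_given_attempted = counts["correct"] / attempted
    denom = overall_correct + correct_given_attempted
    if denom == 0:
        return 0.0
    return 2 * overall_correct * correct_given_attempted / denom


# Illustrative grades only, not results from the paper.
example = ["correct"] * 4 + ["incorrect"] * 4 + ["not_attempted"] * 2
example_score = f_score(example)
```

Under this definition a model that declines to answer when unsure is penalized less than one that answers incorrectly, which is why the score can differ from plain accuracy.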
Problem Statement
Current benchmarks measure vision-language skills but miss short-form factuality, which depends on both image understanding and stored knowledge. The field lacks a concise, verifiable VQA set that (a) targets time-stable facts, (b) is bilingual, and (c) separates visual-comprehension failures from internal-knowledge failures.
Main Contribution
SimpleVQA: a curated bilingual VQA benchmark with 2,025 short fact questions and static reference answers.
Nine task categories and nine domain topics to cover varied factual queries (e.g., object ID, time/event, text processing).
Key Findings
SimpleVQA contains 2,025 high-quality Q&A samples across 9 tasks and 9 domains.
Most evaluated MLLMs show modest factual accuracy on this benchmark.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| SimpleVQA dataset size | 2,025 samples | — | — | — | Dataset constructed and reported in Sec 2 and Table 2 | Section 2.1, Table 2 |
| Models evaluated | 18 multimodal LLMs; 8 text-only LLMs | — | — | — | Evaluation setup described in Sec 3.2 and Appendix F | Section 3.2, Appendix F |
What To Try In 7 Days
Run SimpleVQA on your model to get a quick factuality baseline.
Use atomic-fact probing: convert failing items into smaller visual facts and retest to locate failure cause.
Add focused SFT (supervised fine-tuning) data for the visual labels that SimpleVQA shows to be weak, then re-evaluate to measure the performance delta.
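The atomic-fact probing step can be sketched as a small triage routine. This is a minimal illustration, not the authors' implementation: it assumes you have already recorded, for each item, whether the model answered the full question correctly and whether it answered a simpler atomic visual-fact question about the same image correctly. The `ProbeResult` fields and the category names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProbeResult:
    item_id: str
    original_correct: bool  # full question answered correctly
    atomic_correct: bool    # atomic visual-fact question answered correctly


def triage(result):
    """Assign a likely failure cause to one probed item."""
    if result.original_correct:
        return "ok"
    if not result.atomic_correct:
        # The model could not identify what is in the image.
        return "visual_recognition"
    # The model saw the right thing but still lacked the fact.
    return "missing_knowledge"


def summarize(results):
    """Count items per triage category."""
    return Counter(triage(r) for r in results)


# Illustrative records only, not data from the paper.
probes = [
    ProbeResult("q1", True, True),
    ProbeResult("q2", False, False),
    ProbeResult("q3", False, True),
    ProbeResult("q4", False, False),
]
summary = summarize(probes)
```

A high `visual_recognition` count points toward more visual-label training data, while a high `missing_knowledge` count suggests the gap lies in what the model has stored rather than in what it sees.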
Risks & Boundaries
Limitations
Relatively small: 2,025 examples may not cover all visual facts and long-tail cases.
The paper text gives no public dataset URL or code; reproducing the results may require contacting the authors.
When Not To Use
Not ideal when you need broad multimodal reasoning tasks or long-form answers.
Not suitable for political, safety, or time-sensitive evaluations (excluded by design).
Failure Modes
Visual recognition errors (object/person ID, text-in-image), detectable via atomic probes.
Missing internalized knowledge even when the visual cue is provided (supplying the atomic fact yields only small gains).

