Multimodal Benchmarks Papers — Parsed & Scored for Practitioners

A broad third-party benchmark shows ChatGPT is a strong zero-shot performer but an unreliable reasoner and prone to hallucination

0.70

0.25

0.75

352

ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.

Key finding

ChatGPT often outperforms prior zero-shot LLMs.

Numbers: 9/13 evaluated datasets (zero-shot comparisons)

Two-stage multimodal chain-of-thought lets sub‑1B models reason with images and text

0.60

0.45

96

You can get near-state-of-the-art multimodal reasoning with lightweight models by fine-tuning in two stages and fusing image features—this reduces hallucination and lowers compute cost versus running large multimodal LLMs.

Key finding

Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.

Numbers: No-CoT 81.63% vs Reasoning 69.32% (↓12.31pp)

A concise roadmap to multimodal LLMs: architectures, training recipes, evaluation, hallucination, and extensions

0.60

0.40

0.60

85

MLLMs let products combine vision and language: build image-aware assistants, document parsers, or multimodal agents. Focus on data quality, connector design, and safe alignment to reduce hallucinations before shipping.

Key finding

MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.
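
The three-module layout can be sketched in a few lines. This is a toy illustration, not any specific model's design: the encoder, connector, and LLM are hypothetical stand-ins, the connector is assumed to be a bias-free linear projection, and visual tokens are assumed to be prepended to the text embeddings.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class ToyMLLM:
    """Toy composition of the three modules; every component is a stand-in."""
    encoder: Callable[[str], List[Vector]]   # image -> patch features
    connector: Callable[[Vector], Vector]    # patch feature -> LLM embedding space
    llm: Callable[[List[Vector]], str]       # embedding sequence -> text

    def generate(self, image: str, text_embeddings: List[Vector]) -> str:
        patch_features = self.encoder(image)
        visual_tokens = [self.connector(f) for f in patch_features]
        # Assumption: visual tokens are placed before the text embeddings.
        return self.llm(visual_tokens + text_embeddings)


def linear_connector(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """Build a bias-free linear projection from a (d_out x d_in) weight matrix."""
    def project(feature: Vector) -> Vector:
        return [sum(w * x for w, x in zip(row, feature)) for row in weights]
    return project


def fake_encoder(image: str) -> List[Vector]:
    # Hypothetical output: 2 patches, each a 3-dimensional feature.
    return [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]


def fake_llm(sequence: List[Vector]) -> str:
    # Stand-in for a frozen LLM: it only reports what it received.
    return f"LLM saw {len(sequence)} tokens of width {len(sequence[0])}"


# Project 3-dimensional visual features into a 2-dimensional "LLM" space.
connector = linear_connector([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
model = ToyMLLM(fake_encoder, connector, fake_llm)
summary = model.generate("cat.png", text_embeddings=[[0.5, 0.5]])
print(summary)
```

In this sketch only the connector's weight matrix would need training, since the encoder and LLM are treated as pre-trained black boxes.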

SEED-Bench: a 19K, 12-dimension multiple-choice benchmark for testing image and video LLM comprehension

0.40

0.45

0.30

52

SEED-Bench gives a large, objective test to reveal real weaknesses in multimodal models (OCR, spatial relations, temporal reasoning), so businesses should validate models on similar slices before deploying image/video features.

Key finding

SEED-Bench contains 19K human-verified multiple-choice questions across 12 dimensions.

Numbers: 19,242 questions; 12 dimensions
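
Validating a model on benchmark slices comes down to reporting multiple-choice accuracy per dimension instead of one overall number. A minimal sketch, assuming each record is a hypothetical (dimension, predicted choice, correct choice) tuple; the sample rows are made up for illustration and are not SEED-Bench data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (dimension, predicted_choice, correct_choice)


def accuracy_by_dimension(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of correct answers for each dimension."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for dimension, predicted, answer in records:
        total[dimension] += 1
        correct[dimension] += int(predicted == answer)
    return {d: correct[d] / total[d] for d in total}


# Illustrative records only.
sample = [
    ("spatial relations", "A", "A"),
    ("spatial relations", "B", "C"),
    ("OCR", "D", "D"),
    ("OCR", "A", "A"),
]
scores = accuracy_by_dimension(sample)
for dimension, acc in scores.items():
    print(f"{dimension}: {acc:.0%}")
```

A per-dimension breakdown like this makes a weak slice visible even when the overall average looks acceptable.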

GPT-4 can pass Japan's medical licensing exam but shows costly localization and safety gaps

0.50

0.40

0.70

50

LLMs can meet exam-level MCQ performance in non-English, specialized domains but need localization, safety filters, and higher budget due to tokenization and legal differences.

Key finding

GPT-4 passes all six years of the Japanese medical licensing exam (2018–2023) in closed-book multiple-choice format.

Numbers: 2018: required 161, general 221 (passing 160/208); Table 1

M3Exam: 12k official exam questions in 9 languages (23% with images) to stress-test LLMs' multilingual and multimodal skills

0.70

0.60

0.40

31

M3Exam reveals real-world gaps in multilingual and multimodal LLMs: expect failures on low-resource languages and complex images, so validate models on representative data before deployment.

Key finding

M3Exam totals 12,317 multiple-choice questions across 9 languages.

Numbers: 12,317 total questions; 9 languages

Fine-tuned small LLMs (HealthAlpaca) can match or beat much larger models on wearable-sensor health tasks

0.40

0.55

0.45

28

You can build cheaper, open LLM-based health prediction services by fine-tuning a modest-size model on combined wearable datasets and using richer prompts, avoiding reliance on expensive closed models for many consumer tasks.

Key finding

Fine-tuned HealthAlpaca achieves top performance on most tasks.

Numbers: Best result in 8 out of 10 tasks (reported across experiments).

A practical map of how knowledge graphs and multimodal AI fit together today and where to push next

0.60

0.50

0.60

28

Adding structured knowledge to multimodal systems improves accuracy, interpretability, and long-tail reasoning. That helps applications like search, recommendation, product QA, and compliance where factual grounding and rare facts matter.

Key finding

The survey covers more than 300 related papers.

Numbers: ‘over 300 articles’ (abstract)

ChartLlama: a multimodal LLM trained on GPT‑4‑synthesized chart data for chart understanding and generation

0.60

0.70

0.50

25

Companies that need automated reading, generation, or editing of charts can improve accuracy and add code-generation features by training multimodal models on synthetic, code‑paired chart datasets.

Key finding

ChartLlama improves ChartQA accuracy versus prior open models on evaluated splits.

Numbers: ChartQA average: ChartLlama 69.66 vs Unichart 66.24 (Table 2/5)

Woodpecker: a training-free post-hoc pipeline that finds and fixes image hallucinations with vision experts

0.60

0.70

22

You can reduce image-based hallucinations and raise trust without retraining models by adding a post-hoc verifier that extracts claims, checks them with detectors/VQA, and rewrites outputs with bounding-box evidence.

Key finding

Applying Woodpecker to MiniGPT-4 increased POPE object-existence accuracy from 54.67% to 85.33%.

Numbers: 54.67% → 85.33% (Δ +30.66 pp)
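
The check-and-rewrite step of such a post-hoc verifier can be sketched as follows. This is not Woodpecker's code: it assumes claim extraction has already produced a list of object names, and the detector is a hypothetical lookup standing in for a real vision expert.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def verify_and_rewrite(
    claimed_objects: List[str],
    detect: Callable[[str], Optional[Box]],
) -> str:
    """Keep supported claims with box evidence; negate unsupported ones."""
    sentences = []
    for name in claimed_objects:
        box = detect(name)
        if box is not None:
            sentences.append(f"There is a {name} at {list(box)}.")
        else:
            sentences.append(f"No {name} was detected.")
    return " ".join(sentences)


# Hypothetical detector results for one image.
DETECTIONS: Dict[str, Box] = {"dog": (12, 40, 200, 310)}


def toy_detector(name: str) -> Optional[Box]:
    """Stand-in for an object detector: returns a box or None."""
    return DETECTIONS.get(name)


corrected = verify_and_rewrite(["dog", "frisbee"], toy_detector)
print(corrected)
```

Because the verifier only wraps the model's output, the underlying multimodal model stays untouched, which is what makes the approach training-free.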

A better visual tokenizer lets language models match or beat diffusion models on ImageNet and video tasks

0.60

0.70

0.60

21

A better visual tokenizer can make language-model pipelines produce higher-quality images/videos with fewer inference steps and offer a new compressed token format that speeds downstream generation and saves bandwidth.

Key finding

On ImageNet 512×512 class-conditional generation, MLM + MAGVIT-v2 achieved FID 1.91 with guidance versus diffusion baseline VDM++ FID 2.65.

Numbers: FID 1.91 vs 2.65 (512×512); 28% relative improvement

LVLM-eHub: a practical benchmark and human arena to measure large vision-language models across six multimodal capabilities

0.50

0.55

0.45

20

Benchmark scores can be misleading: high in-domain numbers often mean overfitting and worse open-world behavior; evaluate models with human-in-the-loop tests and targeted hallucination probes before deploying.

Key finding

Instruction-tuned models trained on massive in-domain data (InstructBLIP) score highest on many standard benchmarks but generalize poorly in open-world human evaluations.

Numbers: InstructBLIP avg. scores: Visual Knowledge 0.967 (Table 3); Perception avg. 0.928 (Table 2); Arena rank lower in open-world evaluation

Q-Bench: a focused benchmark that tests multimodal LLMs on low-level image perception, description, and human-aligned quality scoring

0.50

0.45

0.30

20

MLLMs already detect many low-level image attributes and can produce human-correlated quality scores with a simple softmax trick; businesses can use them for scalable, early-stage quality triage and content moderation but should not replace specialist QA for fine-grained tasks.

Key finding

MLLMs show non-random perception ability but lag behind expert humans.

Numbers: InternLM-XComposer-VL overall accuracy 64.35%; GPT-4V 73.36%; Senior human 81.74% (LLVisionQA test)
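
The summary above only mentions "a simple softmax trick" for quality scoring. One plausible reading, shown here as an assumption, is a two-way softmax over the model's logits for a pair of opposing answer tokens such as "good" and "poor". The function name and the logit values below are hypothetical.

```python
import math


def quality_score(logit_good: float, logit_poor: float) -> float:
    """Two-way softmax: probability mass on 'good' relative to 'poor'."""
    # Subtract the max logit first for numerical stability.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


# Hypothetical logits read from the model's next-token distribution.
sharp_photo = quality_score(logit_good=3.0, logit_poor=1.0)
blurry_photo = quality_score(logit_good=1.0, logit_poor=3.0)
print(f"sharp: {sharp_photo:.3f}, blurry: {blurry_photo:.3f}")
```

The result is a continuous score in (0, 1) that can be rank-correlated with human opinion scores, rather than a hard good/poor label.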

Survey: where multimodal LLMs stand on reasoning, benchmarks, training recipes, and gaps

0.40

0.30

0.40

19

If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), current models vary widely; instruction tuning and careful training stages materially improve results, but proprietary models still lead.

Key finding

Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.

Numbers: InfiMM-Eval overall: GPT-4V 74.44 vs InfiMM-LLaMA-13B 40.7

ControlBench: evaluate GPT-4, Claude 3 Opus, Gemini on 147 undergraduate control problems

0.40

0.60

0.50

19

Text LLMs can help generate control designs and explanations quickly, but they commonly make calculation and plot-reading errors, so use them for drafts and human-in-the-loop workflows, not final safety-critical designs.

Key finding

Claude 3 Opus outperforms GPT-4 and Gemini on ControlBench.

Numbers: ACC 58.5% (86/147), ACC-s 68.7% (101/147)

AgentClinic: interactive, multimodal simulations that stress-test LLMs on real-style clinical decision making

0.30

0.70

0.40

18

Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.

Key finding

Interactive, sequential format is harder than static QA.

Numbers: Diagnostic accuracy can fall to below a tenth of the static-QA baseline (paper statement).

Train a vision-language model to read and reason across many images in one prompt

0.60

0.70

0.50

18

If your product must reason over multiple images together (multi-photo chat, visual QA over albums, video snapshots), MMICL-style models reduce hallucinations and improve multi-image reasoning by adding explicit image tokens and multi-image instruction tuning.

Key finding

MMICL improves matching of captions to images on compositional image/text puzzles (Winoground).

Numbers: Text 45 / Image 45 / Group 43 (MMICL FLAN-T5-XXL, Table 2)
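
For readers unfamiliar with the Text / Image / Group columns, the sketch below shows how Winoground-style scores are computed from a caption-image matching function. Each example has two captions and two images; the scorer and its values are hypothetical stand-ins for a real model.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, str, str]  # (caption0, image0, caption1, image1)


def winoground_scores(
    examples: List[Example],
    score: Callable[[str, str], float],
) -> Tuple[float, float, float]:
    """Return (text, image, group) scores as percentages."""
    text = image = group = 0
    for c0, i0, c1, i1 in examples:
        # Text score: for each image, the matching caption must score higher.
        text_ok = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
        # Image score: for each caption, the matching image must score higher.
        image_ok = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
        text += int(text_ok)
        image += int(image_ok)
        group += int(text_ok and image_ok)
    n = len(examples)
    return (100 * text / n, 100 * image / n, 100 * group / n)


# Hypothetical similarity table keyed by (caption, image).
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("c0", "i0"): 0.9, ("c1", "i0"): 0.2, ("c1", "i1"): 0.8, ("c0", "i1"): 0.1,
    ("d0", "j0"): 0.6, ("d1", "j0"): 0.5, ("d1", "j1"): 0.9, ("d0", "j1"): 0.7,
}


def toy_score(caption: str, image: str) -> float:
    return SIMILARITY[(caption, image)]


results = winoground_scores(
    [("c0", "i0", "c1", "i1"), ("d0", "j0", "d1", "j1")], toy_score
)
print(results)
```

The group score is the strictest of the three because an example counts only when both the text and image conditions hold.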

Systematic evaluation of GPT-4V and LLaVA on 1000+ vision+text engineering design tasks

0.30

0.60

0.50

17

VLMs like GPT-4V can speed up low-value, repetitive visual tasks (sketch similarity, captioning with handwriting) and help populate searchable design catalogs, but they currently cannot replace engineering checks that need precise spatial, numeric, or manufacturability guarantees.

Key finding

GPT-4V matches or exceeds human raters on sketch-similarity triplet tests.

Numbers: Self-consistency 94%; transitive violations = 5 (best human = 5).

MiniGPT-5: fuse an LLM with Stable Diffusion using 'generative vokens' for interleaved image+text outputs

0.60

0.70

0.60

16

MiniGPT-5 lets a single system produce coherent text and images together, cutting the need for separate caption→image pipelines and reducing integration overhead, while training only a small set of parameters.

Key finding

Humans prefer MiniGPT-5 outputs over a two-stage baseline on multimodal story tasks.

Numbers: Language continuity 55.22% vs 34.89%; image quality 52.43% vs 37.79%; multimodal coherence 56.9% vs 28.88%

Survey of 126 multimodal LLMs: architectures, training recipes, benchmarks, and next steps

0.70

0.45

0.65

15

You can add vision, audio, or other modalities to existing LLMs cheaply by training small projectors or PEFT adapters, unlocking richer user interactions without retraining huge models.

Key finding

Most MM-LLMs add small adapters while keeping the core LLM frozen.

Numbers: Trainable params typically ≈2% (projectors only); PEFT can be <0.1%

Comprehensive eval finds Gemini close to GPT‑3.5 on language commonsense, behind GPT‑4 and GPT‑4V on multimodal tasks

0.65

0.25

0.40

14

Gemini Pro is close to GPT‑3.5 for language commonsense but behind GPT‑4; pick models based on accuracy needs and multimodal complexity.

Key finding

Gemini Pro's language-only accuracy is similar to GPT‑3.5 Turbo.

Numbers: Avg acc Gemini Pro 79.2% vs GPT‑3.5 78.2% on 11 language datasets

Early comparison shows Google Gemini Pro is a close challenger to GPT-4V on multimodal understanding, with different strengths and common ML

0.65

0.45

0.60

13

Gemini Pro is a practical, competitive alternative to GPT-4V for many multimodal products; choose the model that matches task needs (cognition/code vs concise multi-domain answers) and test spatial/OCR edge cases before deployment.

Key finding

Gemini narrowly outscored GPT-4V on the MME benchmark overall.

Numbers: Gemini 1933.4 vs GPT-4V 1926.6 overall (MME, higher better)

MAIRA-2: a multimodal chest X‑ray model that generates grounded findings and RadFact, an LLM-based sentence-level evaluator

0.40

0.60

0.35

13

MAIRA-2 produces editable, locally grounded draft radiology findings, and RadFact provides LLM-based sentence-level evaluation of them; together they reduce reviewer effort and support rapid prototyping of clinical draft-assist tools, but human oversight remains mandatory.

Key finding

MAIRA-2 achieves strong lexical and clinical gains on MIMIC-CXR compared to earlier systems.

Numbers: ROUGE-L 38.4; BLEU-4 23.1; RadGraph-F1 34.6 (Table D.1)

Fine-tune a vision+LLM for medical VQA and report writing using LoRA and one projection layer

0.50

0.60

0.70

11

You can adapt large vision+LLM models to medical VQA and report generation cheaply by tuning small adapter layers; use GPT-4 for scalable semantic QA evaluation instead of brittle lexical metrics.

Key finding

PEFT keeps trainable footprint tiny: only projection + LoRA updated.

Numbers: 56.63M trainable vs 7B full LLM
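
The size of that footprint is easy to check with one division. A minimal sketch, assuming the "7B" figure is a rounded parameter count for the language model alone (the vision encoder is not included in the denominator here).

```python
def trainable_fraction(trainable_params: float, total_params: float) -> float:
    """Share of parameters that receive gradient updates."""
    return trainable_params / total_params


# Figures from the entry above: 56.63M trainable vs a 7B-parameter LLM.
fraction = trainable_fraction(56.63e6, 7e9)
print(f"{fraction:.2%} of the LLM's parameters are trained")
```

Under these assumptions the trained share is a little under 1% of the LLM's parameters.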