Survey: where multimodal LLMs stand on reasoning, benchmarks, training recipes, and gaps

Overview

Decision Snapshot: Needs Validation

The paper compiles practical, evidence-backed recipes (instruction tuning, selective unfreezing of the LLM, stronger visual encoders, and a multi-task supervised stage) and shows clear dataset-driven gaps; apply these recipes cautiously and validate them on reasoning-step benchmarks.

Citations: 19

Evidence Strength: 0.65

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 30%

Authors

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

Links

Abstract / PDF

Why It Matters For Business

If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), expect current models to vary widely in capability; instruction tuning and careful training stages materially improve results, but proprietary models still lead.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

This is a focused survey of how multimodal large language models (MLLMs) are evaluated and trained for reasoning. The authors define multimodal reasoning, review existing benchmarks (many of them not reasoning-focused), compare model recipes and results (GPT-4V leads by a large margin), and list practical training choices that help reasoning: instruction tuning, optionally unfreezing the LLM, a multi-task supervised stage, and stronger visual encoders. The paper flags open gaps in benchmark design, hallucination, catastrophic forgetting, and long-context evaluation.

Problem Statement

Current MLLMs produce fluent multimodal output, but their true reasoning ability is unclear. Benchmarks and training recipes vary and often do not measure reasoning steps. A clear evaluation standard and practical training guidelines are needed to improve multimodal reasoning.

Main Contribution

Define multimodal reasoning and categorize common reasoning types used in MLLM work (deductive, abductive, analogical).

Survey MLLM architectures, training stages, and connectors (visual encoder + connector + LLM).

Key Findings

Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.

Numbers: InfiMM-Eval overall: GPT-4V 74.44 vs InfiMM-LLaMA-13B 40.7

Practical Use: Expect a large capability gap; use GPT-4V (or similar proprietary models) for high-stakes multimodal reasoning today, or adopt the training recipes below to narrow the gap.

Evidence Ref: Table 5, Sec. 6

Instruction tuning significantly improves multimodal reasoning scores.

Numbers: Qwen-VL-7B: 21.32 -> Qwen-VL-7B-Chat: 33.44 (InfiMM-Eval overall)

Practical Use: Add a final instruction-finetuning stage (mixed high-quality multimodal instruction data) to boost open-set reasoning performance quickly.

Evidence Ref: Table 6, Sec. 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
InfiMM-Eval overall score	74.44	—	—	InfiMM-Eval	Table 5: GPT-4V overall 74.44	Table 5
InfiMM-Eval overall score (open-source)	40.7	GPT-4V 74.44	-33.74	InfiMM-Eval	Table 5: InfiMM-LLaMA-13B overall 40.7	Table 5
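The Delta column can be re-derived from the reported scores. A minimal Python check, using only the numbers quoted in this summary (the helper name `delta` is illustrative and not from the paper):

```python
def delta(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute difference, relative difference in percent) vs. baseline."""
    absolute = round(value - baseline, 2)
    relative_pct = round(100.0 * (value - baseline) / baseline, 1)
    return absolute, relative_pct


# Scores as reported in this summary (InfiMM-Eval overall).
gap_abs, gap_rel = delta(40.7, 74.44)      # InfiMM-LLaMA-13B vs GPT-4V
gain_abs, gain_rel = delta(33.44, 21.32)   # Qwen-VL-7B-Chat vs Qwen-VL-7B

print(f"open-source gap: {gap_abs} points ({gap_rel}%)")
print(f"instruction-tuning gain: {gain_abs} points ({gain_rel}%)")
```

The first result reproduces the -33.74 in the table; the second gives a gain of 12.12 points for the instruction-tuning finding, which the table does not list as a row.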

What To Try In 7 Days

Run InfiMM-Eval (or a reasoning-step subset) on your model to measure true multimodal reasoning.

Add a small instruction-finetuning pass using public multimodal instruction mixes (MIC, MIMIC-IT) and re-evaluate.

Experiment with unfreezing the LLM for a small, controlled number of steps at a low learning rate to boost cross-modal integration, while monitoring language-only tasks for forgetting.
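The unfreezing step pairs training with a regression check on language-only tasks. One way to make that check explicit is sketched below; the task names, scores, and the 1-point tolerance are placeholder assumptions, not values from the paper.

```python
def forgetting_report(before: dict[str, float],
                      after: dict[str, float],
                      tolerance: float = 1.0) -> dict[str, float]:
    """Return language-only tasks whose score dropped by more than `tolerance` points.

    `before` / `after` map task name -> score measured before and after
    the unfreezing run. Only tasks present in both are compared.
    """
    regressions = {}
    for task, old_score in before.items():
        if task not in after:
            continue
        drop = old_score - after[task]
        if drop > tolerance:
            regressions[task] = round(drop, 2)
    return regressions


# Hypothetical scores, for illustration only.
before = {"text_qa": 62.0, "summarization": 48.5, "commonsense": 71.2}
after = {"text_qa": 61.6, "summarization": 44.0, "commonsense": 71.9}

regressed = forgetting_report(before, after)
if regressed:
    print("possible forgetting, consider a lower LR or fewer unfrozen steps:", regressed)
else:
    print("no language-only regression beyond tolerance")
```

Running the same language-only suite before and after each unfreezing experiment keeps the comparison fair; the tolerance should reflect the run-to-run noise of your own evaluation.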

Agent Features

Memory

world-state memory (reader/writer)

Planning

Planner-Actor-Reporter, ReAct, Chain-of-Thought

Tool Use

Visual ChatGPT-style tool chaining, program-based tool orchestration (VISPROG)

Frameworks

MIMIC-IT, MIC, InstructBLIP/visual instruction tuning

Architectures

visual-encoder + connector + LLM, query-based connector (Q-Former), cross-attention connector (perceiver resampler)
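The visual-encoder + connector + LLM layout can be pictured at the level of shapes. The sketch below swaps in a plain linear projection as the connector (simpler than the Q-Former or perceiver resampler named above) and uses tiny made-up dimensions and random weights; it only shows how projected visual tokens end up in the same sequence as text embeddings.

```python
import random


def linear_connector(visual_feats, weight):
    """Project each visual feature vector (dim d_v) into the LLM embedding space (dim d_llm).

    `weight` is a d_v x d_llm matrix stored as a list of rows.
    """
    d_llm = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_llm)]
        for v in visual_feats
    ]


random.seed(0)
D_V, D_LLM = 4, 6             # assumed toy dimensions
NUM_PATCHES, NUM_TEXT = 3, 5  # assumed toy sequence lengths

# Random stand-ins for visual-encoder outputs and LLM text-token embeddings.
visual_feats = [[random.random() for _ in range(D_V)] for _ in range(NUM_PATCHES)]
text_embeds = [[random.random() for _ in range(D_LLM)] for _ in range(NUM_TEXT)]
weight = [[random.random() for _ in range(D_LLM)] for _ in range(D_V)]

visual_tokens = linear_connector(visual_feats, weight)
llm_input = visual_tokens + text_embeds  # visual tokens placed before the text tokens

print(len(llm_input), len(llm_input[0]))
```

Query-based and cross-attention connectors differ mainly in producing a fixed number of visual tokens regardless of how many patches the encoder emits, whereas this projection keeps one token per patch.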

Collaboration

Socratic Models (model composition)

Optimization Features

Training Optimization

unfreeze LLM selectively, multi-task supervised stage
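A staged recipe with selective unfreezing can be written down as a small schedule that states which component trains in each stage. The stage names, their ordering, and the learning rates below are assumptions chosen for illustration; this summary only names the ingredients.

```python
from dataclasses import dataclass

COMPONENTS = {"visual_encoder", "connector", "llm"}


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # subset of COMPONENTS
    learning_rate: float


# Hypothetical schedule; values are placeholders, not reported settings.
SCHEDULE = [
    Stage("alignment_pretraining", frozenset({"connector"}), 1e-3),
    Stage("multi_task_supervised", frozenset({"connector", "visual_encoder"}), 1e-4),
    Stage("instruction_tuning", frozenset({"connector", "llm"}), 1e-5),
]


def frozen_components(stage: Stage) -> set:
    """Components that stay frozen during the given stage."""
    return COMPONENTS - stage.trainable


for stage in SCHEDULE:
    print(f"{stage.name}: train={sorted(stage.trainable)} "
          f"frozen={sorted(frozen_components(stage))} lr={stage.learning_rate}")
```

Keeping the LLM frozen until the last stage, and then training it at the lowest learning rate, is one conservative way to limit forgetting of language-only capabilities.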

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Survey depends on public papers and reported leaderboards; direct head-to-head runs are limited.

Many benchmarks summarized are not reasoning-step annotated, limiting causal claims about model reasoning.

When Not To Use

For tasks needing formal, provable logical reasoning where correctness must be guaranteed.

When you require long-context multimodal reasoning beyond current short-context MLLM windows.

Failure Modes

Hallucination from visual or language modules leading to wrong but plausible answers.

Catastrophic forgetting of language-only capabilities after aggressive visual instruction fine-tuning.

Core Entities

Models

GPT-4V, Qwen-VL-Chat, InfiMM-LLaMA-13B, SPHINX-v2, CogVLM-Chat, MiniGPT-4, BLIP-2, LLaVA-1.5, InstructBLIP, Otter, mPLUG-Owl2

Metrics

Accuracy, GPT-4 evaluation, Caption Score, Elo score

Datasets

InfiMM-Eval, MMMU, MM-Vet, ScienceQA, VQAv2, GQA, OK-VQA, MMBench, LLM-eHub, SparklesEval, HallusionBench, MathVista

Benchmarks

InfiMM-Eval, MMMU, MM-Vet, HallusionBench, MathVista, SparklesEval

Context Entities

Models

Flamingo, BLIP-2, LLaMA, PaLM-E, RT-2

Metrics

BLEU, CIDEr, ROUGE

Datasets

COCO caption, Flickr30K, Visual Genome, LAION, CC3M
