Survey: where multimodal LLMs stand on reasoning, benchmarks, training recipes, and gaps

January 10, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper compiles practical, evidence-backed recipes (instruction tuning, selective unfreezing of the LLM, stronger visual encoders, a multi-task supervised stage) and shows clear dataset-driven gaps. Use these recipes cautiously and validate them on reasoning-step benchmarks.

Citations: 19

Evidence Strength: 0.65

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 30%

Authors

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

Links

Abstract / PDF

Why It Matters For Business

If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), current models vary widely; instruction tuning and careful training stages materially improve results, but proprietary models still lead.

Who Should Care

Summary TLDR

This is a focused survey of how multimodal large language models (MLLMs) are evaluated and trained for reasoning. The authors define multimodal reasoning, review existing benchmarks (many of which are not reasoning-focused), and compare model recipes and results; GPT-4V leads by a large margin, scoring 74.44 on InfiMM-Eval versus 40.7 for InfiMM-LLaMA-13B. They list practical training choices that help reasoning: instruction tuning, optionally unfreezing the LLM, a multi-task supervised stage, and stronger visual encoders. The paper flags four gaps: benchmark design, hallucination, catastrophic forgetting, and limited long-context evaluation.

Problem Statement

Current MLLMs produce fluent multimodal output, but their true reasoning ability is unclear. Benchmarks and training recipes vary, and many benchmarks do not measure reasoning steps. The field needs a clear evaluation standard and practical training guidelines to improve multimodal reasoning.

Main Contribution

Define multimodal reasoning and categorize common reasoning types used in MLLM work (deductive, abductive, analogical).

Survey MLLM architectures, training stages, and connectors (visual encoder + connector + LLM).
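The reasoning types named in the first contribution (deductive, abductive, analogical) suggest reporting scores per type rather than as a single number. The following is a minimal sketch of that breakdown. The record layout `(reasoning_type, score)` and the function name are assumptions made for illustration, not a format defined by the paper or by any benchmark release, and the example scores are placeholders rather than reported results.

```python
from collections import defaultdict

# Reasoning types named in the survey's definition of multimodal reasoning.
REASONING_TYPES = ("deductive", "abductive", "analogical")


def scores_by_reasoning_type(records):
    """Average per-item scores within each reasoning type.

    `records` is an iterable of (reasoning_type, score) pairs. This layout
    is hypothetical and chosen only to keep the sketch small.
    """
    buckets = defaultdict(list)
    for reasoning_type, score in records:
        if reasoning_type not in REASONING_TYPES:
            raise ValueError(f"unknown reasoning type: {reasoning_type}")
        buckets[reasoning_type].append(score)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


# Placeholder scores for illustration only, not results from the paper.
example_records = [
    ("deductive", 1.0),
    ("deductive", 0.5),
    ("abductive", 0.0),
    ("analogical", 1.0),
]
report = scores_by_reasoning_type(example_records)
for name in REASONING_TYPES:
    print(f"{name}: {report[name]:.2f}")
```

A per-type view makes it easier to see whether a training change helps one kind of reasoning while leaving the others flat.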

Key Findings

Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.

Numbers: InfiMM-Eval overall: GPT-4V 74.44 vs InfiMM-LLaMA-13B 40.7

Practical Use: Expect a large capability gap; use GPT-4V (or similar proprietary models) for high-stakes multimodal reasoning today or adopt the training recipes below to narrow the gap.

Evidence Ref: Table 5, Sec. 6

Instruction tuning significantly improves multimodal reasoning scores.

Numbers: Qwen-VL-7B: 21.32 -> Qwen-VL-7B-Chat: 33.44 (InfiMM-Eval overall)

Practical Use: Add a final instruction-finetuning stage (mixed high-quality multimodal instruction data) to boost open-set reasoning performance quickly.

Evidence Ref: Table 6, Sec. 6

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| InfiMM-Eval overall score | 74.44 | | | InfiMM-Eval | Table 5: GPT-4V overall 74.44 | Table 5 |
| InfiMM-Eval overall score (open-source) | 40.7 | GPT-4V 74.44 | -33.74 | InfiMM-Eval | Table 5: InfiMM-LLaMA-13B overall 40.7 | Table 5 |
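The deltas implied by these numbers can be checked with a few lines of arithmetic. The sketch below recomputes the open-source gap from the Results rows and the instruction-tuning gain from the Qwen-VL figures quoted under Key Findings. The helper name `delta` is an illustrative choice, and the relative percentages are derived here rather than quoted from the paper.

```python
def delta(value, baseline):
    """Return (absolute, relative) difference of `value` against `baseline`."""
    absolute = value - baseline
    return absolute, absolute / baseline


# InfiMM-LLaMA-13B (40.7) against GPT-4V (74.44), InfiMM-Eval overall.
gap_abs, gap_rel = delta(40.7, 74.44)

# Qwen-VL-7B-Chat (33.44) against Qwen-VL-7B (21.32), InfiMM-Eval overall.
gain_abs, gain_rel = delta(33.44, 21.32)

print(f"open-source vs GPT-4V: {gap_abs:+.2f} ({gap_rel:+.1%})")
print(f"instruction tuning:    {gain_abs:+.2f} ({gain_rel:+.1%})")
```

Read together, the two lines show that the instruction-tuning gain of about 12 points is real but covers only part of the roughly 34-point gap to GPT-4V.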

What To Try In 7 Days

Run InfiMM-Eval (or a reasoning-step subset) on your model to measure true multimodal reasoning.

Add a small instruction-finetuning pass using public multimodal instruction mixes (MIC, MIMIC-IT) and re-evaluate.

Experiment with unfreezing the LLM for a small, controlled number of steps at a low learning rate to boost cross-modal integration, while monitoring language-only tasks for forgetting.
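The third step, unfreezing while watching for forgetting, can be sketched as a simple regression guard. Everything named here is hypothetical: the schedule fields, the learning rates and step counts, the task names, and the tolerance are placeholders for illustration, not values recommended by the paper.

```python
# Hypothetical two-phase schedule: train the connector alone first, then
# unfreeze the LLM briefly at a much lower learning rate. All numbers are
# placeholders, not recommendations from the survey.
SCHEDULE = [
    {"phase": "connector_only", "train_llm": False, "lr": 1e-4, "steps": 1000},
    {"phase": "llm_unfrozen", "train_llm": True, "lr": 1e-6, "steps": 200},
]


def forgotten_tasks(baseline, current, tolerance=0.01):
    """Return language-only tasks whose score dropped by more than `tolerance`.

    Both arguments map task name -> score. A task missing from `current`
    is treated as a score of 0.0, so it is always flagged.
    """
    return sorted(
        task
        for task, base_score in baseline.items()
        if base_score - current.get(task, 0.0) > tolerance
    )


def should_stop_unfreezing(baseline, current, tolerance=0.01):
    """Stop the unfrozen phase as soon as any language task regresses."""
    return bool(forgotten_tasks(baseline, current, tolerance))


# Placeholder scores for two made-up language-only tasks.
before = {"task_a": 0.70, "task_b": 0.55}
after = {"task_a": 0.695, "task_b": 0.50}

print(forgotten_tasks(before, after))
print(should_stop_unfreezing(before, after))
```

Running the guard at fixed checkpoints during the unfrozen phase keeps the experiment cheap to abort, in line with the "few controlled steps" framing above.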

Agent Features

Memory
world-state memory (reader/writer)
Planning
Planner-Actor-Reporter, ReAct, Chain-of-Thought
Tool Use
Visual ChatGPT-style tool chaining, Program-based tool orchestration (VISPROG)
Frameworks
MIMIC-IT, MIC, InstructBLIP/visual instruction tuning
Architectures
visual-encoder + connector + LLM, query-based connector (Q-Former), cross-attention connector (perceiver resampler)
Collaboration
Socratic Models (model composition)
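
The query-based and cross-attention connectors listed under Architectures share one idea: a fixed set of queries attends over a variable number of visual features, so the LLM always receives the same number of visual tokens. The sketch below illustrates only that idea in plain Python. It is a simplification under stated assumptions: single-head attention, no learned projection matrices, and random vectors standing in for learned queries and encoder outputs. It is not the Q-Former or the perceiver resampler as published.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, patch_feats):
    """Scaled dot-product weights of each query over all patch features."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*patch_feats)]  # dim x n_patches
    scores = matmul(queries, keys_t)                   # n_queries x n_patches
    return [softmax([s / math.sqrt(dim) for s in row]) for row in scores]


def query_connector(patch_feats, queries):
    """Return one visual token per query, whatever the number of patches."""
    return matmul(attention_weights(queries, patch_feats), patch_feats)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8
queries = random_matrix(4, DIM, rng)    # stands in for learned queries
low_res = random_matrix(16, DIM, rng)   # stands in for 16 patch features
high_res = random_matrix(64, DIM, rng)  # stands in for 64 patch features

tokens_low = query_connector(low_res, queries)
tokens_high = query_connector(high_res, queries)
print(len(tokens_low), len(tokens_high))  # both equal len(queries)
```

The fixed output length is the practical appeal: the cost of the LLM's context does not grow with image resolution, at the price of compressing the visual features.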

Optimization Features

Training Optimization
unfreeze LLM selectively, multi-task supervised stage

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

The survey depends on public papers and reported leaderboards; direct head-to-head runs are limited.

Many benchmarks summarized are not reasoning-step annotated, limiting causal claims about model reasoning.

When Not To Use

For tasks needing formal, provable logical reasoning where correctness must be guaranteed.

When you require long-context multimodal reasoning beyond current short-context MLLM windows.

Failure Modes

Hallucination from visual or language modules leading to wrong but plausible answers.

Catastrophic forgetting of language-only capabilities after aggressive visual instruction fine-tuning.

Core Entities

Models

GPT-4V, Qwen-VL-Chat, InfiMM-LLaMA-13B, SPHINX-v2, CogVLM-Chat, MiniGPT-4, BLIP-2, LLaVA-1.5, InstructBLIP, Otter, mPLUG-Owl2

Metrics

Accuracy, GPT-4 evaluation, Caption Score, Elo score

Datasets

InfiMM-Eval, MMMU, MM-Vet, ScienceQA, VQAv2, GQA, OK-VQA, MMBench, LLM-eHub, SparklesEval, HallusionBench, MathVista

Benchmarks

InfiMM-Eval, MMMU, MM-Vet, HallusionBench, MathVista, SparklesEval

Context Entities

Models

Flamingo, BLIP-2, LLaMA, PaLM-E, RT-2

Metrics

BLEU, CIDEr, ROUGE

Datasets

COCO caption, Flickr30K, Visual Genome, LAION, CC3M