Survey: where multimodal LLMs stand on reasoning, benchmarks, training recipes, and gaps

January 10, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.4

Citation Count

19

Authors

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

Why It Matters For Business

If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), current models vary widely; instruction tuning and careful training stages materially improve results, but proprietary models still lead.

Summary TLDR

This is a focused survey of how multimodal large language models (MLLMs) are evaluated and trained for reasoning. The authors define multimodal reasoning, review existing benchmarks (many of which are not reasoning-focused), compare model recipes and results (GPT-4V leads by a large margin), and list practical training choices that help reasoning: instruction tuning, optionally unfreezing the LLM, a multi-task supervised stage, and stronger visual encoders. The paper flags gaps in benchmark design, hallucination, catastrophic forgetting, and long-context evaluation.

Problem Statement

Current MLLMs produce fluent multimodal output, but their true reasoning ability is unclear. Training recipes vary widely, and benchmarks often do not measure intermediate reasoning steps. The field needs a clear evaluation standard and practical training guidelines to improve multimodal reasoning.

Main Contribution

Define multimodal reasoning and categorize common reasoning types used in MLLM work (deductive, abductive, analogical).

Survey MLLM architectures, training stages, and connectors (visual encoder + connector + LLM).

Review instruction tuning and multimodal prompting methods that target reasoning and in-context learning.

Compare models on a subset of multimodal reasoning benchmarks and extract practical recipes and failure modes.

Outline open problems and future directions such as benchmark design, long-context support, and RLHF for multimodal models.

Key Findings

Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.

Numbers: InfiMM-Eval overall: GPT-4V 74.44 vs InfiMM-LLaMA-13B 40.7

Instruction tuning significantly improves multimodal reasoning scores.

Numbers: Qwen-VL-7B: 21.32 -> Qwen-VL-7B-Chat: 33.44 (InfiMM-Eval overall)

A three-stage training recipe and unfreezing the LLM correlate with top open-source performance.

Numbers: Top open-source InfiMM-Eval scores: SPHINX-v2 39.48, InfiMM-LLaMA-13B 40.7, Qwen-VL-Chat 37.39

Most multimodal benchmarks lack step-level reasoning annotations and are not designed specifically for reasoning.

Numbers: InfiMM-Eval has 279 step-annotated samples; many other datasets lack full reasoning steps

Multimodal instruction fine-tuning can cause loss of pure text reasoning (catastrophic forgetting) if done improperly.

Numbers: MMMU shows proprietary models (GPT-4V 55.7) outperform many open-source MLLMs that may have reduced language-only recall

Results

InfiMM-Eval overall score

Value: 74.44 (GPT-4V)

InfiMM-Eval overall score (open-source)

Value: 40.7 (InfiMM-LLaMA-13B)

Baseline: GPT-4V 74.44

InfiMM-Eval instruction tuning effect

Value: 21.32 -> 33.44 (Qwen-VL-7B -> Qwen-VL-7B-Chat)
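
To put the reported scores in perspective, the short sketch below works out the absolute and relative differences. It uses only the InfiMM-Eval numbers quoted above; the helper names are illustrative and not from the survey.

```python
# InfiMM-Eval overall scores as quoted in this summary.
scores = {
    "GPT-4V": 74.44,
    "InfiMM-LLaMA-13B": 40.7,
    "Qwen-VL-7B": 21.32,
    "Qwen-VL-7B-Chat": 33.44,
}


def absolute_gain(baseline: float, improved: float) -> float:
    """Difference in score points, rounded to two decimals."""
    return round(improved - baseline, 2)


def relative_gain_pct(baseline: float, improved: float) -> float:
    """Improvement as a percentage of the baseline score."""
    return round(100 * (improved - baseline) / baseline, 1)


# Gap between the proprietary leader and the best open-source model.
proprietary_gap = absolute_gain(scores["InfiMM-LLaMA-13B"], scores["GPT-4V"])

# Effect of instruction tuning on the same base model.
tuning_gain = absolute_gain(scores["Qwen-VL-7B"], scores["Qwen-VL-7B-Chat"])
tuning_gain_pct = relative_gain_pct(scores["Qwen-VL-7B"], scores["Qwen-VL-7B-Chat"])

print(proprietary_gap)   # 33.74
print(tuning_gain)       # 12.12
print(tuning_gain_pct)   # 56.8
```

In other words, instruction tuning adds roughly 12 points (about 57% relative) for Qwen-VL-7B, while the best open-source score still trails GPT-4V by almost 34 points.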

Who Should Care

What To Try In 7 Days

Run InfiMM-Eval (or a reasoning-step subset) on your model to measure true multimodal reasoning.

Add a small instruction-finetuning pass using public multimodal instruction mixes (MIC, MIMIC-IT) and re-evaluate.

Experiment with unfreezing the LLM for a small, controlled number of steps at a low learning rate to boost cross-modal integration, while monitoring language-only tasks for forgetting.
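
One way to make the "monitor for forgetting" step concrete is to compare language-only scores before and after the fine-tuning pass and flag any task that drops by more than a chosen tolerance. The sketch below is a minimal illustration: the function name, the task names, the tolerance, and all scores are hypothetical placeholders, not values from the survey.

```python
from typing import Dict


def forgetting_report(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 1.0,
) -> Dict[str, float]:
    """Return {task: drop} for tasks whose score fell by more than `tolerance` points."""
    regressions = {}
    for task, old_score in before.items():
        drop = old_score - after[task]
        if drop > tolerance:
            regressions[task] = round(drop, 2)
    return regressions


# Placeholder numbers for illustration only; they are not from the survey.
before = {"text_qa": 60.0, "text_math": 30.0}
after = {"text_qa": 59.5, "text_math": 25.0}

report = forgetting_report(before, after)
print(report)  # {'text_math': 5.0}
```

A non-empty report would be a signal to shorten the unfrozen phase, lower the learning rate further, or mix text-only data back into training.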

Agent Features

Memory

  • world-state memory (reader/writer)

Planning

  • Planner-Actor-Reporter
  • ReAct
  • Chain-of-Thought

Tool Use

  • Visual ChatGPT-style tool chaining
  • Program-based tool orchestration (VISPROG)

Frameworks

  • MIMIC-IT
  • MIC
  • InstructBLIP/visual instruction tuning

Architectures

  • visual-encoder + connector + LLM
  • query-based connector (Q-Former)
  • cross-attention connector (perceiver resampler)
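
To illustrate the visual-encoder + connector + LLM data flow, the sketch below shows how visual features can be mapped into the LLM's embedding space and placed in front of the text tokens. It uses a plain linear projection as the connector, swapped in for simplicity; a Q-Former or perceiver resampler would instead compress the image into a fixed number of query tokens. All names, shapes, and values are toy assumptions, and pure-Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def project(features: Matrix, weight: Matrix) -> Matrix:
    """Multiply an (n x d_in) feature matrix by a (d_in x d_out) weight matrix."""
    columns = list(zip(*weight))  # columns of the weight matrix
    return [
        [sum(f * w for f, w in zip(row, col)) for col in columns]
        for row in features
    ]


def build_llm_input(visual_features: Matrix, weight: Matrix, text_embeddings: Matrix) -> Matrix:
    """Project visual features to the LLM width and prepend them to the text embeddings."""
    visual_tokens = project(visual_features, weight)
    return visual_tokens + text_embeddings


# Toy example: 2 image patches of width 2, projected to an LLM width of 3.
visual_features = [[1.0, 2.0], [0.0, 1.0]]      # output of a (hypothetical) visual encoder
connector_weight = [[1.0, 0.0, 1.0],             # learned in practice; fixed here
                    [0.0, 1.0, 1.0]]
text_embeddings = [[0.5, 0.5, 0.5]]             # one text token already at LLM width

llm_input = build_llm_input(visual_features, connector_weight, text_embeddings)
print(llm_input)  # [[1.0, 2.0, 3.0], [0.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
```

The LLM then attends over the combined sequence as if the projected patches were ordinary tokens.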

Collaboration

  • Socratic Models (model composition)

Optimization Features

Training Optimization

  • unfreeze LLM selectively
  • multi-task supervised stage

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Survey depends on public papers and reported leaderboards; direct head-to-head runs are limited.
  • Many benchmarks summarized are not reasoning-step annotated, limiting causal claims about model reasoning.
  • Quantitative comparisons mix models of different compute budgets and undisclosed proprietary training, reducing attribution precision.

When Not To Use

  • For tasks needing formal, provable logical reasoning where correctness must be guaranteed.
  • When you require long-context multimodal reasoning beyond current short-context MLLM windows.
  • If auditability of intermediate reasoning steps is required but your chosen benchmark lacks step annotations.

Failure Modes

  • Hallucination from visual or language modules leading to wrong but plausible answers.
  • Catastrophic forgetting of language-only capabilities after aggressive visual instruction fine-tuning.
  • Sensitivity to prompt format and answer permutations in multiple-choice setups.
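
The sensitivity to answer permutations can be probed with a simple consistency check: ask the same multiple-choice question under every rotation of the options and count it as correct only if the model picks the right answer each time. The sketch below assumes a hypothetical model interface (a callable that takes a question and an ordered option list and returns the index of its choice); it is an illustration, not the evaluation protocol of any benchmark named here.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: returns the index of the chosen option.
Model = Callable[[str, Sequence[str]], int]


def rotations(options: Sequence[str]) -> List[List[str]]:
    """All cyclic shifts of the option list."""
    return [list(options[i:]) + list(options[:i]) for i in range(len(options))]


def consistent_correct(model: Model, question: str, options: Sequence[str], answer: str) -> bool:
    """True only if the model selects `answer` under every rotation of the options."""
    for ordering in rotations(options):
        picked = ordering[model(question, ordering)]
        if picked != answer:
            return False
    return True


def always_first(question: str, options: Sequence[str]) -> int:
    """A position-biased stand-in model that always picks the first option."""
    return 0


options = ["cat", "dog", "bird"]
# Under the original ordering alone, the biased model looks correct ...
single_pass = options[always_first("What animal is shown?", options)] == "cat"
# ... but it fails once the options are rotated.
robust_pass = consistent_correct(always_first, "What animal is shown?", options, "cat")
print(single_pass, robust_pass)  # True False
```

A large gap between single-ordering accuracy and rotation-consistent accuracy suggests the model is keying on option position rather than content.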

Core Entities

Models

  • GPT-4V
  • Qwen-VL-Chat
  • InfiMM-LLaMA-13B
  • SPHINX-v2
  • CogVLM-Chat
  • MiniGPT-4
  • BLIP-2
  • LLaVA-1.5
  • InstructBLIP
  • Otter
  • mPLUG-Owl2

Metrics

  • Accuracy
  • GPT-4 evaluation
  • Caption Score
  • Elo score

Datasets

  • InfiMM-Eval
  • MMMU
  • MM-Vet
  • ScienceQA
  • VQAv2
  • GQA
  • OK-VQA
  • MMBench
  • LVLM-eHub
  • SparklesEval
  • HallusionBench
  • MathVista

Benchmarks

  • InfiMM-Eval
  • MMMU
  • MM-Vet
  • HallusionBench
  • MathVista
  • SparklesEval

Context Entities

Models

  • Flamingo
  • BLIP-2
  • LLaMA
  • PaLM-E
  • RT-2

Metrics

  • BLEU
  • CIDEr
  • ROUGE

Datasets

  • COCO caption
  • Flickr30K
  • Visual Genome
  • LAION
  • CC3M