Overview
Production Readiness
0.4
Novelty Score
0.3
Cost Impact Score
0.4
Citation Count
19
Why It Matters For Business
If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), expect wide variation in quality across current models. Instruction tuning and careful training stages materially improve results, but proprietary models still lead.
Summary TLDR
This is a focused survey of how multimodal large language models (MLLMs) are evaluated and trained for reasoning. The authors define multimodal reasoning, review existing benchmarks (many not reasoning-focused), compare model recipes and results (GPT-4V leads by a large margin), and list practical training choices that help reasoning: instruction tuning, optionally unfreezing the LLM, a multi-task supervised stage, and stronger visual encoders. The paper flags open gaps: benchmark design, hallucination, catastrophic forgetting, and limited long-context evaluation.
Problem Statement
Current MLLMs show fluent multimodal output but their true reasoning ability is unclear. Benchmarks and training recipes vary and often do not measure reasoning steps. We need a clear evaluation standard and practical training guidelines to improve multimodal reasoning.
Main Contribution
Define multimodal reasoning and categorize common reasoning types used in MLLM work (deductive, abductive, analogical).
Survey MLLM architectures, training stages, and connectors (visual encoder + connector + LLM).
Review instruction tuning and multimodal prompting methods that target reasoning and in-context learning.
Compare models on a subset of multimodal reasoning benchmarks and extract practical recipes and failure modes.
Outline open problems and future directions such as benchmark design, long-context support, and RLHF for multimodal models.
Key Findings
Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.
Instruction tuning significantly improves multimodal reasoning scores.
A three-stage training recipe and unfreezing the LLM correlate with top open-source performance.
Most multimodal benchmarks lack step-level reasoning annotations and are not designed specifically for reasoning.
Multimodal instruction fine-tuning can cause loss of pure text reasoning (catastrophic forgetting) if done improperly.
Results
InfiMM-Eval overall score
InfiMM-Eval overall score (open-source)
InfiMM-Eval instruction tuning effect
Who Should Care
What To Try In 7 Days
Run InfiMM-Eval (or a reasoning-step subset) on your model to measure true multimodal reasoning.
Add a small instruction-finetuning pass using public multimodal instruction mixes (MIC, MIMIC-IT) and re-evaluate.
Experiment with unfreezing the LLM for a small, controlled number of steps at a low learning rate to improve cross-modal integration, and monitor language-only tasks for forgetting.
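The forgetting check in the last step can be sketched as a before/after comparison of text-only scores. This is a minimal illustration, not a procedure from the survey: the function name, the task names, and the 2% relative-drop tolerance are all assumptions to adapt to your own evaluation suite.

```python
from dataclasses import dataclass


@dataclass
class ForgettingReport:
    task: str
    baseline: float
    current: float
    relative_drop: float
    regressed: bool


def check_forgetting(baseline, current, max_relative_drop=0.02):
    """Compare text-only scores before and after multimodal fine-tuning.

    baseline, current: dicts mapping task name -> accuracy in [0, 1].
    max_relative_drop: hypothetical tolerance; tune it per project.
    """
    reports = []
    for task, base in baseline.items():
        if task not in current:
            raise KeyError(f"missing post-finetune score for {task!r}")
        cur = current[task]
        # Relative drop; a task with a zero baseline cannot regress further.
        drop = (base - cur) / base if base > 0 else 0.0
        reports.append(
            ForgettingReport(task, base, cur, drop, drop > max_relative_drop)
        )
    return reports


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    before = {"text_qa": 0.80, "arithmetic": 0.50}
    after = {"text_qa": 0.79, "arithmetic": 0.40}
    for report in check_forgetting(before, after):
        status = "REGRESSED" if report.regressed else "ok"
        print(f"{report.task}: {report.relative_drop:.1%} drop ({status})")
```

Running the check after every unfreezing experiment makes it easier to stop early when language-only capability starts to slip.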
Agent Features
Memory
- world-state memory (reader/writer)
Planning
- Planner-Actor-Reporter
- ReAct
- Chain-of-Thought
Tool Use
- Visual ChatGPT-style tool chaining
- Program-based tool orchestration (VISPROG)
Frameworks
- MIMIC-IT
- MIC
- InstructBLIP/visual instruction tuning
Architectures
- visual-encoder + connector + LLM
- query-based connector (Q-Former)
- cross-attention connector (perceiver resampler)
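The visual-encoder + connector + LLM pattern listed above can be illustrated with a toy data-flow sketch. It assumes a plain linear projection as the connector, which is a simpler stand-in than the Q-Former or perceiver resampler named in the list. All function names are hypothetical, and vectors are plain Python lists so the example runs without any framework.

```python
def project(features, weight):
    """Linear connector: map each visual feature vector into the LLM embedding space.

    features: list of N vectors, each of length d_vis.
    weight: d_vis x d_llm matrix given as a list of rows.
    """
    d_llm = len(weight[0])
    return [
        [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(d_llm)]
        for vec in features
    ]


def build_llm_input(visual_features, weight, text_embeddings):
    """Prepend projected visual tokens to the text token embeddings."""
    return project(visual_features, weight) + text_embeddings


if __name__ == "__main__":
    visual = [[1.0, 2.0]]                 # one visual token, d_vis = 2
    weight = [[1.0, 0.0, 1.0],            # d_vis x d_llm = 2 x 3
              [0.0, 1.0, 1.0]]
    text = [[0.5, 0.5, 0.5]]              # one text token, d_llm = 3
    print(build_llm_input(visual, weight, text))
```

The point of the sketch is the interface: whatever the connector is, it must emit vectors with the LLM's embedding width so visual and text tokens can share one input sequence.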
Collaboration
- Socratic Models (model composition)
Optimization Features
Training Optimization
- unfreeze LLM selectively
- multi-task supervised stage
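One way to express selective unfreezing is as a per-parameter plan built from parameter names. The sketch below is framework-agnostic and rests on assumptions: the naming scheme (`vision.*`, `connector.*`, `llm.layers.<i>.*`) and the policy (train the connector and only the last few LLM layers) are hypothetical choices, not a recipe reported by the survey.

```python
import re

# Assumed naming scheme for LLM transformer blocks.
LLM_LAYER = re.compile(r"^llm\.layers\.(\d+)\.")


def plan_trainable(param_names, num_llm_layers, unfreeze_last_n):
    """Map each parameter name to True (trainable) or False (frozen).

    Policy: the connector always trains, the vision encoder stays frozen,
    and only the last `unfreeze_last_n` LLM layers train.
    """
    first_trainable = num_llm_layers - unfreeze_last_n
    plan = {}
    for name in param_names:
        match = LLM_LAYER.match(name)
        if name.startswith("connector."):
            plan[name] = True
        elif match:
            plan[name] = int(match.group(1)) >= first_trainable
        else:
            # Vision encoder, embeddings, and anything unrecognized stay frozen.
            plan[name] = False
    return plan


if __name__ == "__main__":
    names = [
        "vision.patch_embed.weight",
        "connector.proj.weight",
        "llm.layers.0.attn.weight",
        "llm.layers.3.attn.weight",
    ]
    for name, trainable in plan_trainable(names, 4, 1).items():
        print(name, "trainable" if trainable else "frozen")
```

In a PyTorch setup, a plan like this could be applied by iterating over `model.named_parameters()` and setting each parameter's `requires_grad` flag accordingly.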
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Survey depends on public papers and reported leaderboards; direct head-to-head runs are limited.
- Many benchmarks summarized are not reasoning-step annotated, limiting causal claims about model reasoning.
- Quantitative comparisons mix models of different compute budgets and undisclosed proprietary training, reducing attribution precision.
When Not To Use
- For tasks needing formal, provable logical reasoning where correctness must be guaranteed.
- When you require long-context multimodal reasoning beyond current short-context MLLM windows.
- If auditability of intermediate reasoning steps is required but your chosen benchmark lacks step annotations.
Failure Modes
- Hallucination from visual or language modules leading to wrong but plausible answers.
- Catastrophic forgetting of language-only capabilities after aggressive visual instruction fine-tuning.
- Sensitivity to prompt format and answer permutations in multiple-choice setups.
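Sensitivity to answer order in multiple-choice setups can be probed by re-asking each question under every cyclic rotation of its options and counting it correct only if the model is right every time. The sketch below is an illustration in that spirit, not the survey's own protocol; `predict` is a hypothetical stand-in for a real model call that returns the index of the chosen option.

```python
def rotations(options):
    """Yield every cyclic rotation of the option list."""
    for shift in range(len(options)):
        yield options[shift:] + options[:shift]


def circular_correct(question, options, answer, predict):
    """Return True only if `predict` picks `answer` under every rotation.

    predict(question, options) -> index of the chosen option (hypothetical).
    """
    for rotated in rotations(options):
        choice = predict(question, rotated)
        if rotated[choice] != answer:
            return False
    return True


def circular_accuracy(items, predict):
    """items: iterable of (question, options, answer) tuples."""
    items = list(items)
    if not items:
        return 0.0
    hits = sum(circular_correct(q, opts, ans, predict) for q, opts, ans in items)
    return hits / len(items)


if __name__ == "__main__":
    # A position-biased "model" that always picks the first option.
    always_first = lambda question, options: 0
    data = [("Which is red?", ["apple", "sky", "grass"], "apple")]
    print(circular_accuracy(data, always_first))
```

A model that keys on option position rather than content passes a single ordering by luck but fails once the options rotate, which is the failure this check is meant to expose.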
Core Entities
Models
- GPT-4V
- Qwen-VL-Chat
- InfiMM-LLaMA-13B
- SPHINX-v2
- CogVLM-Chat
- MiniGPT-4
- BLIP-2
- LLaVA-1.5
- InstructBLIP
- Otter
- mPLUG-Owl2
Metrics
- Accuracy
- GPT-4 evaluation
- Caption Score
- Elo score
Datasets
- InfiMM-Eval
- MMMU
- MM-Vet
- ScienceQA
- VQAv2
- GQA
- OK-VQA
- MMBench
- LVLM-eHub
- SparklesEval
- HallusionBench
- MathVista
Benchmarks
- InfiMM-Eval
- MMMU
- MM-Vet
- HallusionBench
- MathVista
- SparklesEval
Context Entities
Models
- Flamingo
- BLIP-2
- LLaMA
- PaLM-E
- RT-2
Metrics
- BLEU
- CIDEr
- ROUGE
Datasets
- COCO caption
- Flickr30K
- Visual Genome
- LAION
- CC3M

