Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
85
Why It Matters For Business
MLLMs let products combine vision and language in image-aware assistants, document parsers, and multimodal agents. Before shipping, focus on data quality, connector design, and safe alignment to reduce hallucinations.
Summary TLDR
This is a practical survey of Multimodal Large Language Models (MLLMs). It explains the common architecture (encoder + connector + LLM), the three-stage training recipe (pre-training, instruction tuning, alignment), datasets and benchmarks, common failure modes (especially multimodal hallucination), and three key techniques (Multimodal In-Context Learning, Multimodal Chain-of-Thought, and LLM-aided visual reasoning). The paper collects references and points to a GitHub resource for up-to-date papers.
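The encoder + connector + LLM layout described above can be illustrated with a toy data flow. This is a minimal sketch, not any specific model's implementation: the dimensions, the random weights, and the helper names (`make_linear`, `project`, `build_llm_input`) are assumptions made for illustration, and the connector here is a single linear projection.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Random weight matrix standing in for a trained connector (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(features, weight):
    """Apply the linear map to every visual feature vector (one per patch)."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in features]

def build_llm_input(visual_features, text_embeddings, weight):
    """Map visual features into the LLM embedding space and prepend them to the text."""
    visual_tokens = project(visual_features, weight)
    return visual_tokens + text_embeddings

VISION_DIM, LLM_DIM = 4, 6  # toy sizes, far smaller than real models

# Stand-in for the output of a frozen image encoder: 3 patch features.
patches = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(3)]
# Stand-in for the LLM's embeddings of 2 text tokens.
text = [[0.0] * LLM_DIM for _ in range(2)]

weight = make_linear(VISION_DIM, LLM_DIM)
sequence = build_llm_input(patches, text, weight)
print(len(sequence), len(sequence[0]))  # 5 6
```

Only `weight` would be trained in a frozen-encoder, frozen-LLM prototype; the two large modules are treated as fixed feature providers.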
Problem Statement
MLLMs aim to combine visual (and other) perception with LLM reasoning, but building reliable, general, and safe multimodal systems raises practical questions: which architecture and connectors work best, what data and tuning strategies matter, how to evaluate broad capabilities, and how to reduce hallucinations and safety risks.
Main Contribution
Clear modular abstraction of MLLMs: pre-trained modality encoder, pre-trained LLM, and modality interface (connector).
A compact training recipe: pre-training for modality alignment, instruction-tuning for instruction following, and alignment tuning (RLHF/DPO) for human preferences.
A review of datasets, benchmarks, evaluation methods, and mitigation techniques for multimodal hallucination.
A survey of extensions: finer-granularity inputs, more modalities and languages, task-specific and embodied agents, and practical techniques (M-ICL, M-CoT, LLM-aided reasoning).
A public GitHub repository collecting MLLM papers and resources.
Key Findings
MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.
Training is usually staged: pre-training (modality alignment on image-text pairs), instruction-tuning (teaching instruction following), and alignment tuning (matching human preferences via RLHF/DPO).
On some benchmarks, raising the input image resolution, and with it the number of visual tokens, improves performance more than increasing the encoder's parameter count.
Scaling up the LLM helps MLLM performance: the authors report gains when moving from 7B to 13B, and emergent abilities (e.g., cross-lingual zero-shot capability) at 34B and above.
Instruction-tuning data quality and prompt diversity matter more than raw quantity for downstream performance.
Multimodal hallucination appears in three types: existence, attribute, and relationship errors; mitigation approaches fall into pre-correction (data/tuning), in-process (architectural/decoding), and post-correction (revisors or expert checks).
Interface modules are tiny relative to the encoder and LLM: in the Qwen-VL example, the Q-Former has about 0.08B parameters, under 1% of the total.
Benchmarks and evaluation for MLLMs must combine closed-set methods (task metrics) and open-set methods (human or GPT scoring); GPT-4V improves open-set scoring because the scorer can see the images.
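The link between input resolution and visual token count in the findings above is simple arithmetic for a ViT-style encoder. The sketch below assumes non-overlapping square patches, no extra class token, and a patch size of 14; all three are assumptions for illustration, not details taken from the survey.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side

for size in (224, 336, 448):
    print(size, num_visual_tokens(size, 14))
# 224 256
# 336 576
# 448 1024
```

Doubling the side length quadruples the token count, which is why high-resolution inputs raise cost quickly and motivate token compression or patch-division strategies.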
Who Should Care
What To Try In 7 Days
Prototype by freezing a strong image encoder and off-the-shelf LLM; implement a small connector (Q-Former or MLP) to test tasks quickly.
Run an evaluation checklist: closed-set task metrics and sample open-set queries scored by GPT-4V or human raters to surface hallucinations.
Create a short, high-quality instruction-tuning set (100–1k examples) with diverse prompts rather than collecting large noisy corpora.
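For the closed-set part of the evaluation checklist, a POPE-style probe asks yes/no questions about object presence and scores the answers. The following is a minimal sketch under the assumption that model answers have already been normalized to the strings "yes" and "no"; the function name and the sample data are made up for illustration.

```python
def yes_no_metrics(predictions, labels):
    """Accuracy, precision, recall, F1, and yes-ratio for yes/no answers."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes-ratio far above the true share of "yes" labels suggests
        # the model tends to affirm objects that are not present.
        "yes_ratio": (tp + fp) / len(pairs),
    }

preds = ["yes", "yes", "no", "yes", "no", "no"]
golds = ["yes", "no", "no", "yes", "yes", "no"]
metrics = yes_no_metrics(preds, golds)
print(metrics)
```

Open-set queries still need human or GPT-4V scoring; this covers only the questions with a fixed ground truth.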
Agent Features
Memory
- Short-term: in-context examples (M-ICL)
- No standardized long-term memory yet
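Short-term memory through M-ICL amounts to placing a few solved examples in the prompt ahead of the real query. The sketch below shows one way to assemble such a prompt. The `<image>` placeholder, the `Example` class, and the question/answer template are hypothetical; real models each define their own image token and chat format.

```python
from dataclasses import dataclass

IMAGE_TOKEN = "<image>"  # hypothetical placeholder, model-specific in practice

@dataclass
class Example:
    image_path: str
    question: str
    answer: str

def build_icl_prompt(demos, query_image, query_question):
    """Interleave demonstration triples, then append the unanswered query."""
    parts, images = [], []
    for demo in demos:
        parts.append(f"{IMAGE_TOKEN}\nQuestion: {demo.question}\nAnswer: {demo.answer}")
        images.append(demo.image_path)
    parts.append(f"{IMAGE_TOKEN}\nQuestion: {query_question}\nAnswer:")
    images.append(query_image)
    return "\n\n".join(parts), images

demos = [
    Example("cat.jpg", "What animal is shown?", "A cat."),
    Example("bus.jpg", "What vehicle is shown?", "A bus."),
]
prompt, images = build_icl_prompt(demos, "query.jpg", "What object is shown?")
print(prompt)
```

The returned image list stays in the same order as the placeholders, so a model wrapper can pair each placeholder with its image.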
Planning
- LLM generates step sequences/programs (VisProg)
- Chain-of-Thought planning for subtask decomposition
Tool Use
- Invoke vision experts (segmentation, OCR, detectors)
- Call external tools via generated programs (GPT4Tools style)
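The tool-use pattern above, where an LLM emits a plan and a controller dispatches each step to a vision expert, can be sketched with a small registry. Everything here is a stand-in: the tool functions return canned strings, the plan is hard-coded rather than generated by an LLM, and the plan format is invented for illustration and is not the syntax used by VisProg or GPT4Tools.

```python
from typing import Callable, Dict, List

def fake_ocr(image: str) -> str:
    """Stub for an OCR expert."""
    return f"text found in {image}"

def fake_detector(image: str) -> str:
    """Stub for an object detector."""
    return f"objects found in {image}"

# Registry mapping tool names (as the LLM would emit them) to callables.
TOOLS: Dict[str, Callable[[str], str]] = {
    "ocr": fake_ocr,
    "detect": fake_detector,
}

def run_plan(plan: List[str], image: str) -> List[str]:
    """Execute each named step in order; reject tools outside the registry."""
    results = []
    for step in plan:
        tool = TOOLS.get(step)
        if tool is None:
            raise KeyError(f"unknown tool: {step}")
        results.append(tool(image))
    return results

# In a real agent this list would be parsed from LLM output.
outputs = run_plan(["detect", "ocr"], "receipt.png")
print(outputs)
```

Rejecting unknown tool names keeps a hallucinated step from silently passing through the pipeline.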
Frameworks
- VisProg / MM-REACT / HuggingGPT
Is Agentic
true
Architectures
- LLM-centered controller
- multi-module pipelines (LLM + vision experts)
- MoE
Collaboration
- LLM as decision maker coordinating modules
- Iterative multi-round decision workflows
Optimization Features
Token Efficiency
- Compress visual features into fewer visual tokens (Q-Former)
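The idea of compressing many visual features into a fixed, smaller set of tokens can be illustrated with one cross-attention step driven by a handful of query vectors. This is a heavily simplified sketch: a real Q-Former stacks transformer layers with learned projections, whereas here the queries are fixed toy values and the visual tokens serve directly as keys and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compress(queries, visual_tokens):
    """One output token per query: an attention-weighted mix of visual tokens."""
    scale = math.sqrt(len(queries[0]))
    dim = len(visual_tokens[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, v) / scale for v in visual_tokens])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visual_tokens)) for d in range(dim)]
        )
    return outputs

# 8 toy visual tokens of dimension 4, compressed by 2 toy queries.
visual_tokens = [[float(i + d) for d in range(4)] for i in range(8)]
queries = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1]]
compressed = compress(queries, visual_tokens)
print(len(visual_tokens), "->", len(compressed))  # 8 -> 2
```

The output length depends only on the number of queries, so the LLM sees the same number of visual tokens regardless of how many patches the encoder produced.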
Infra Optimization
- Deploy smaller LLMs or quantized models on edge devices
Model Optimization
- MoE
- Use of compact connectors to limit retraining
System Optimization
- Freeze large pre-trained modules and train small adapters for fast iteration
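A quick calculation shows why freezing the large modules makes iteration cheap. In the sketch below, the 0.08B connector size is the figure quoted in the key findings; the encoder and LLM sizes are illustrative placeholders, not values taken from this summary, and the function name is made up.

```python
def trainable_share(param_counts, trainable):
    """Fraction of all parameters that belong to the trainable modules."""
    total = sum(param_counts.values())
    return sum(param_counts[name] for name in trainable) / total

# Sizes in billions of parameters.
params_b = {
    "vision_encoder": 1.9,  # assumed placeholder
    "connector": 0.08,      # figure quoted in the key findings
    "llm": 7.7,             # assumed placeholder
}

share = trainable_share(params_b, {"connector"})
print(f"{share:.2%}")  # 0.83%
```

With these assumed sizes, gradients and optimizer state are needed for under 1% of the parameters, which matches the "<1% of total" claim.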
Training Optimization
- Visual instruction tuning (task-formatted data)
- Self-instruction using GPT/GPT-4V to generate fine-grained instruction data
Inference Optimization
- Quantization and smaller LLM variants for mobile (MobileVLM)
- Dual-encoder or patch-division strategies for high-res images
Reproducibility
Data Urls
- LAION-5B; CC-12M; COYO-700M (datasets cited in paper)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey format: no original experimental results or new benchmarks.
- The field changes rapidly, so some model and dataset specifics age quickly; the authors' GitHub list is needed to track updates.
- Focus mainly on vision+language; less depth on audio, 3D, and some deployment details.
When Not To Use
- Do not deploy vanilla MLLMs for high-stakes decisions without additional verification because of hallucination risk.
- Avoid relying on small noisy caption corpora alone when strong grounding is required.
- Not ideal for very long multimodal contexts (long video/document chains) without specialized long-context methods.
Failure Modes
- Existence hallucination: claiming objects that are not present.
- Attribute hallucination: misreporting colors, counts, or attributes.
- Relationship hallucination: incorrect spatial or interaction descriptions.
- Context hijacking and adversarial prompts that mislead the model.
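Existence hallucination, the first failure mode above, can be quantified in the spirit of CHAIR by comparing the objects a caption mentions with the objects actually present. The sketch assumes object mentions have already been extracted from each caption and mapped to the annotation vocabulary, which is the hard part in practice. It also uses sets, so a repeated mention counts once. The function name and data are illustrative.

```python
def chair_scores(mentioned_per_caption, present_per_image):
    """Return (per-object rate, per-caption rate) of hallucinated mentions."""
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        hallucinated = mentioned - present  # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        captions_with_hallucination += 1 if hallucinated else 0
    n = len(mentioned_per_caption)
    per_object = hallucinated_mentions / total_mentions if total_mentions else 0.0
    per_caption = captions_with_hallucination / n if n else 0.0
    return per_object, per_caption

mentioned = [{"dog", "frisbee"}, {"cat", "sofa", "lamp"}]
present = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}]
per_object, per_caption = chair_scores(mentioned, present)
print(per_object, per_caption)  # 0.2 0.5
```

Attribute and relationship hallucinations need richer annotations than object sets, so this check covers only the existence type.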
Core Entities
Models
- GPT-4V
- LLaVA
- MiniGPT-4
- BLIP-2
- Flamingo
- Qwen-VL
- MM1
- MoE-LLaVA
- NExT-GPT
- ImageBind-LLM
- Shikra
- Osprey
- Ferret
- LLaMA
- Vicuna
- Qwen
- Flan-T5
Metrics
- Accuracy
- CIDEr (captioning)
- hallucination rates (CHAIR/POPE/AMBER/FaithScore)
- GPT scoring (GPT-4/GPT-4V)
Datasets
- LAION-5B
- LAION-2B
- CC-3M
- CC-12M
- COYO-700M
- MSR-VTT
- ShareGPT4V-PT
- LVIS-Instruct4V
- ALLaVA
- WavCaps
Benchmarks
- MME
- MMBench
- Video-Bench
- POPE
- MM-Vet
- ScienceQA
- HallusionBench
- MMMU
- MathVista
Context Entities
Models
- Flamingo
- BLIP
- CLIP
- EVA-CLIP
- OpenCLIP
Metrics
- CIDEr
- BLEU
- cosine similarity (CLIP filtering)
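Cosine-similarity filtering of image-text pairs, as listed above, keeps a pair only when its image and caption embeddings point in similar directions. The sketch below uses hand-written toy vectors in place of real CLIP embeddings, and the 0.3 threshold is an arbitrary assumption, not a value from the survey.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm

def filter_pairs(pairs, threshold):
    """Keep the ids of pairs whose image/text similarity meets the threshold."""
    return [
        pair_id
        for pair_id, image_vec, text_vec in pairs
        if cosine_similarity(image_vec, text_vec) >= threshold
    ]

# (id, image embedding, caption embedding) with toy 2-d vectors.
pairs = [
    ("well_matched", [1.0, 0.0], [0.9, 0.1]),
    ("mismatched", [1.0, 0.0], [0.0, 1.0]),
]
kept = filter_pairs(pairs, threshold=0.3)
print(kept)  # ['well_matched']
```

In a real pipeline the vectors would come from a pre-trained image-text model, and the threshold would be tuned by inspecting samples near the cutoff.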
Datasets
- MS-COCO
- Flickr30K
- NoCaps
- SBU Captions
Benchmarks
- POPE
- CHAIR
- AMBER

