A concise roadmap to multimodal LLMs: architectures, training recipes, evaluation, hallucination, and extensions

Overview

Decision Snapshot: Needs Validation

The paper synthesizes many practical, community-tested ideas and datasets. It is a useful blueprint for prototyping MLLMs but does not present new experimental results.

Citations: 85

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/8

Findings with evidence refs: 8/8

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MLLMs let products combine vision and language: build image-aware assistants, document parsers, or multimodal agents. Focus on data quality, connector design, and safe alignment to reduce hallucinations before shipping.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This is a practical survey of Multimodal Large Language Models (MLLMs). It explains the common architecture (encoder + connector + LLM), the three-stage training recipe (pre-training, instruction tuning, alignment), datasets and benchmarks, common failure modes (especially multimodal hallucination), and three key techniques (Multimodal In-Context Learning, Multimodal Chain-of-Thought, and LLM-aided visual reasoning). The paper collects references and points to a GitHub resource for up-to-date papers.

Problem Statement

MLLMs aim to combine visual (and other) perception with LLM reasoning, but building reliable, general, and safe multimodal systems raises practical questions: which architecture and connectors work best, what data and tuning strategies matter, how to evaluate broad capabilities, and how to reduce hallucinations and safety risks.

Main Contribution

Clear modular abstraction of MLLMs: pre-trained modality encoder, pre-trained LLM, and modality interface (connector).

A compact training recipe: pre-training for alignment, instruction-tuning for instruction following, and alignment tuning (RLHF/DPO) for human preferences.

Key Findings

MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.

Practical Use: When building an MLLM, reuse strong pre-trained encoders and LLMs and focus engineering effort on the connector to align features.

Evidence Ref: §2 (Architecture)

Training is usually staged: pre-training (alignment on image-text pairs), instruction-tuning (teach instruction following), and alignment tuning (human-preference via RLHF/DPO).

Practical Use: Follow a three-stage pipeline: align modalities first, then instruction-tune, then apply preference tuning only if you need safer/less-hallucinated outputs.

Evidence Ref: §3 (Training strategy and data)
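The staged recipe above can be written down as a small schedule. This is a minimal sketch, not the survey's specification: the `Stage` class, the `PIPELINE` list, and in particular the choice of which modules are trainable in each stage are illustrative assumptions, and published recipes differ on these choices.

```python
from dataclasses import dataclass

# The three modules named in the architecture finding.
MODULES = ("encoder", "connector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset  # modules updated in this stage (assumed, not from the paper)


# Hypothetical freeze/unfreeze plan; adjust to the recipe you actually follow.
PIPELINE = [
    Stage("pre-training", "image-text pairs", frozenset({"connector"})),
    Stage("instruction-tuning", "instruction-formatted samples",
          frozenset({"connector", "llm"})),
    Stage("alignment-tuning", "human-preference pairs (RLHF/DPO)",
          frozenset({"llm"})),
]


def frozen_modules(stage):
    """Return the modules kept frozen in a stage, in a fixed order."""
    return [m for m in MODULES if m not in stage.trainable]


def describe(pipeline):
    """Render one summary line per stage."""
    lines = []
    for i, s in enumerate(pipeline, 1):
        lines.append(
            f"{i}. {s.name}: train {sorted(s.trainable)}, "
            f"freeze {frozen_modules(s)}, data = {s.data}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(describe(PIPELINE)))
```

Keeping the plan as data makes it easy to drop the third stage when preference tuning is not needed.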

What To Try In 7 Days

Prototype by freezing a strong image encoder and off-the-shelf LLM; implement a small connector (Q-Former or MLP) to test tasks quickly.

Run an evaluation checklist: closed-set task metrics and sample open-set queries scored by GPT-4V or human raters to surface hallucinations.

Create a short, high-quality instruction-tuning set (100–1k examples) with diverse prompts rather than collecting large noisy corpora.
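For the closed-set part of the evaluation checklist above, one option is POPE-style yes/no probing ("Is there a dog in the image?"). The sketch below scores such answers; the function name `score_yes_no` and the answer format are assumptions made for illustration, and the model's raw output is assumed to have been reduced to "yes" or "no" beforehand.

```python
def score_yes_no(predictions, labels):
    """Score yes/no answers against yes/no ground truth.

    Both arguments are equal-length lists of "yes"/"no" strings.
    "yes" is treated as the positive class.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    preds = [p.strip().lower() for p in predictions]
    golds = [g.strip().lower() for g in labels]

    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, golds))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, golds))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, golds))
    tn = sum(p == "no" and g == "no" for p, g in zip(preds, golds))

    total = len(golds)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # A high yes-ratio suggests the model tends to affirm objects that are absent.
    yes_ratio = (tp + fp) / total if total else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": yes_ratio,
    }
```

Open-set queries still need a GPT-4V or human rater; this function only covers the part that can be checked mechanically.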

Agent Features

Memory

Short-term: in-context examples (M-ICL)

No standardized long-term memory yet
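In-context examples act as short-term memory because they are simply placed in the prompt ahead of the query. A minimal sketch of assembling such a prompt follows; the `<image>` placeholder, the Question/Answer template, and the function name are hypothetical, since each model defines its own image token and chat format.

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder; real models define their own


def build_micl_prompt(demos, query_question, instruction=""):
    """Build a few-shot multimodal prompt.

    demos: list of (question, answer) pairs, each assumed to come with one image.
    The images themselves are passed to the model separately, in the same order
    as the placeholders appear in the returned text.
    """
    parts = []
    if instruction:
        parts.append(instruction)
    for question, answer in demos:
        parts.append(f"{IMAGE_TOKEN}\nQuestion: {question}\nAnswer: {answer}")
    # The query repeats the demo format but leaves the answer open.
    parts.append(f"{IMAGE_TOKEN}\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [("What animal is shown?", "A cat."),
             ("What color is the car?", "Red.")]
    print(build_micl_prompt(demos, "How many people are visible?"))
```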

Planning

LLM generates step sequences/programs (VisualProg/VisProg)

Chain-of-Thought planning for subtask decomposition

Tool Use

Invoke vision experts (segmentation, OCR, detectors)

Call external tools via generated programs (GPT4Tools style)

Frameworks

VisProg / MM-REACT / HuggingGPT

Is Agentic

Yes

Architectures

LLM-centered controller

Multi-module pipelines (LLM + vision experts)

MoE

Collaboration

LLM as decision maker coordinating modules

Iterative multi-round decision workflows

Optimization Features

Token Efficiency

Compress visual features into fewer visual tokens (Q-Former)
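The saving from token compression is simple arithmetic. The sketch below counts patch tokens for a ViT-style encoder and compares them with a fixed number of learned query tokens. The concrete values (patch size 14, 32 queries) are illustrative assumptions and not figures taken from this summary; class tokens are ignored.

```python
def patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def compression_ratio(image_size, patch_size, num_queries):
    """How many patch tokens are summarized by each query token."""
    return patch_tokens(image_size, patch_size) / num_queries


if __name__ == "__main__":
    for size in (224, 336):
        n = patch_tokens(size, 14)
        print(size, n, compression_ratio(size, 14, 32))
```

With these assumed settings a 224-pixel image yields 256 patch tokens and a 336-pixel image yields 576, so a 32-query connector passes 8x and 18x fewer visual tokens to the LLM, respectively.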

Infra Optimization

Deploy smaller LLMs or quantized models on edge devices

Model Optimization

MoE

Use of compact connectors to limit retraining

System Optimization

Freeze large pre-trained modules and train small adapters for fast iteration

Training Optimization

Visual instruction tuning (task-formatted data)

Self-instruction using GPT/GPT-4V to generate fine-grained instruction data

Inference Optimization

Quantization and smaller LLM variants for mobile (MobileVLM)

Dual-encoder or patch-division strategies for high-res images
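Patch division means cutting a high-resolution image into tiles that each fit the encoder's native input size, then encoding the tiles separately. A minimal sketch of computing the tile coordinates is shown below; the function name and the clipping of edge tiles are assumptions, and real systems often also add a downscaled global view or pad the edges instead.

```python
def tile_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes covering the image.

    Boxes are listed row by row; tiles on the right and bottom edges are
    clipped to the image bounds rather than padded.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes


if __name__ == "__main__":
    # A 672x448 image with 336-pixel tiles gives a 2x2 grid, bottom row clipped.
    for box in tile_boxes(672, 448, 336):
        print(box)
```

Each tile adds its own set of visual tokens, so tiling trades resolution for sequence length.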

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

Data URLs

LAION-5B; CC-12M; COYO-700M (datasets cited in paper)

Risks & Boundaries

Limitations

Survey format: no original experimental results or new benchmarks.

The field changes rapidly, so some model and dataset specifics age quickly; the paper's GitHub resource is needed to track updates.

When Not To Use

Do not deploy vanilla MLLMs for high-stakes decisions without additional verification because of hallucination risk.

Avoid relying on small noisy caption corpora alone when strong grounding is required.

Failure Modes

Existence hallucination: claiming objects that are not present.

Attribute hallucination: misreporting colors, counts, or attributes.
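Existence hallucination can be quantified with a CHAIR-style ratio: the share of mentioned objects that are not in the image, and the share of captions containing at least one such object. The sketch below assumes object mentions have already been extracted from each caption and mapped to the same vocabulary as the ground-truth annotations; that extraction step and the function name are not part of this summary.

```python
def chair(mentions_per_caption, truth_per_image):
    """Compute CHAIR-style existence hallucination rates.

    mentions_per_caption: list of lists of object names mentioned in each caption.
    truth_per_image: list of sets of object names actually present in each image.
    Returns (chair_i, chair_s): instance-level and caption-level rates.
    """
    if len(mentions_per_caption) != len(truth_per_image):
        raise ValueError("need one ground-truth set per caption")

    mentioned = 0
    hallucinated = 0
    captions_with_hallucination = 0
    for mentions, truth in zip(mentions_per_caption, truth_per_image):
        absent = [m for m in mentions if m not in truth]
        mentioned += len(mentions)
        hallucinated += len(absent)
        if absent:
            captions_with_hallucination += 1

    n = len(mentions_per_caption)
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = captions_with_hallucination / n if n else 0.0
    return chair_i, chair_s


if __name__ == "__main__":
    mentions = [["dog", "frisbee"], ["cat", "sofa", "lamp"]]
    truth = [{"dog", "frisbee"}, {"cat", "sofa"}]
    print(chair(mentions, truth))
```

Attribute hallucination (wrong colors or counts) is not captured by this measure and needs attribute-level annotations.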

Core Entities

Models

GPT-4V, LLaVA, MiniGPT-4, BLIP-2, Flamingo, Qwen-VL, MM1, MoE-LLaVA, NExT-GPT, ImageBind-LLM, Shikra, Osprey, Ferret, LLaMA, Vicuna, Qwen, Flan-T5

Metrics

Accuracy, CIDEr (captioning), hallucination rates (CHAIR/POPE/AMBER/FaithScore), GPT scoring (GPT-4/GPT-4V)

Datasets

LAION-5B, LAION-2B, CC-3M, CC-12M, COYO-700M, MSR-VTT, ShareGPT4V-PT, LVIS-Instruct4V, ALLaVA, WavCaps

Benchmarks

MME, MMBench, Video-Bench, POPE, MM-VET, ScienceQA, HallusionBench, MMMU, MathVista

Context Entities

Models

Flamingo, BLIP, CLIP, EVA-CLIP, OpenCLIP

Metrics

CIDEr, BLEU, cosine similarity (CLIP filtering)

Datasets

MS-COCO, Flickr30K, NoCaps, SBU Captions

Benchmarks

POPE, CHAIR, AMBER
