A concise roadmap to multimodal LLMs: architectures, training recipes, evaluation, hallucination, and extensions

June 23, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

The paper synthesizes many practical, community-tested ideas and datasets. It is a useful blueprint for prototyping MLLMs but does not present new experimental results.

Citations: 85

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/8

Findings with evidence refs: 8/8

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen

Why It Matters For Business

MLLMs let products combine vision and language: build image-aware assistants, document parsers, or multimodal agents. Focus on data quality, connector design, and safe alignment to reduce hallucinations before shipping.

Who Should Care

Summary (TL;DR)

This is a practical survey of Multimodal Large Language Models (MLLMs). It explains the common architecture (encoder + connector + LLM), the three-stage training recipe (pre-training, instruction tuning, alignment), datasets and benchmarks, common failure modes (especially multimodal hallucination), and three key techniques (Multimodal In-Context Learning, Multimodal Chain-of-Thought, and LLM-aided visual reasoning). The paper collects references and points to a GitHub resource for up-to-date papers.

Problem Statement

MLLMs aim to combine visual (and other) perception with LLM reasoning, but building reliable, general, and safe multimodal systems raises practical questions: which architecture and connectors work best, what data and tuning strategies matter, how to evaluate broad capabilities, and how to reduce hallucinations and safety risks.

Main Contribution

Clear modular abstraction of MLLMs: pre-trained modality encoder, pre-trained LLM, and modality interface (connector).

A compact training recipe: pre-training for alignment, instruction-tuning for instruction following, and alignment tuning (RLHF/DPO) for human preferences.

Key Findings

MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.

Practical Use: When building an MLLM, reuse strong pre-trained encoders and LLMs and focus engineering effort on the connector to align features.

Evidence Ref: §2 (Architecture)
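
A minimal sketch of the encoder + connector + LLM data flow, in plain Python with no ML framework. The dimensions, the `linear_connector` function, and the toy weights are illustrative assumptions, not details from the survey; a real connector would be a learned MLP or Q-Former operating on tensors.

```python
from typing import List

Vector = List[float]


def linear_connector(visual_feats: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Project each visual feature (dim d_v) into the LLM embedding space (dim d_llm).

    `weight` has shape [d_v][d_llm]; in practice this is the trainable part.
    """
    d_llm = len(weight[0])
    projected = []
    for feat in visual_feats:
        if len(feat) != len(weight):
            raise ValueError("feature dim does not match connector input dim")
        projected.append(
            [sum(feat[i] * weight[i][j] for i in range(len(feat))) for j in range(d_llm)]
        )
    return projected


def build_llm_input(visual_tokens: List[Vector], text_embeds: List[Vector]) -> List[Vector]:
    """Prepend projected visual tokens to the text embeddings (one common layout)."""
    return visual_tokens + text_embeds


# Toy example: 2 patch features of dim 3, LLM embedding dim 2 (hypothetical sizes).
patch_feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [[0.1, 0.2]]

sequence = build_llm_input(linear_connector(patch_feats, W), text)
print(len(sequence), sequence[0])
```

Only `W` would be trained in this layout; the encoder output (`patch_feats`) and the LLM's text embeddings come from frozen pre-trained modules.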

Training is usually staged: pre-training (alignment on image-text pairs), instruction-tuning (teach instruction following), and alignment tuning (human-preference via RLHF/DPO).

Practical Use: Follow a three-stage pipeline: align modalities first, then instruction-tune, then apply preference tuning only if you need safer outputs with fewer hallucinations.

Evidence Ref: §3 (Training strategy and data)
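
The staged recipe can be written down as a small schedule of which modules receive gradient updates at each stage. This is a sketch under assumptions: the freeze/unfreeze choices below (connector-only pre-training, then also updating the LLM) are one common configuration rather than a rule stated by the survey, and `Stage` and `trainable_modules` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    name: str
    data: str
    trainable: Dict[str, bool]  # module name -> receives gradient updates?


# Assumed configuration; adjust per model and compute budget.
PIPELINE: List[Stage] = [
    Stage("pre-training", "image-text pairs",
          {"encoder": False, "connector": True, "llm": False}),
    Stage("instruction-tuning", "instruction-formatted samples",
          {"encoder": False, "connector": True, "llm": True}),
    Stage("alignment-tuning", "human preference data (RLHF/DPO)",
          {"encoder": False, "connector": True, "llm": True}),
]


def trainable_modules(stage: Stage) -> List[str]:
    """List the modules that are updated in a given stage, sorted by name."""
    return sorted(name for name, on in stage.trainable.items() if on)


for stage in PIPELINE:
    print(f"{stage.name}: train {trainable_modules(stage)} on {stage.data}")
```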

What To Try In 7 Days

Prototype by freezing a strong image encoder and off-the-shelf LLM; implement a small connector (Q-Former or MLP) to test tasks quickly.

Run an evaluation checklist: closed-set task metrics and sample open-set queries scored by GPT-4V or human raters to surface hallucinations.

Create a short, high-quality instruction-tuning set (100–1k examples) with diverse prompts rather than collecting large noisy corpora.
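
The evaluation step in the list above can start with a POPE-style probe: ask yes/no questions about whether an object is present and score the answers. The sketch below computes accuracy, precision, recall, F1, and the fraction of "yes" answers; the function name and the toy data are assumptions for illustration.

```python
from typing import Dict, List


def score_yes_no(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Score yes/no object-existence answers; 'yes' is the positive class.

    Answers other than 'yes'/'no' count toward the total but toward no cell.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    pairs = [(p.strip().lower(), l.strip().lower()) for p, l in zip(predictions, labels)]
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }


# Toy model answers vs. ground truth for four "Is there a <object>?" questions.
preds = ["yes", "yes", "no", "yes"]
golds = ["yes", "no", "no", "yes"]
report = score_yes_no(preds, golds)
print(report)
```

A high `yes_ratio` combined with low precision is a quick signal that the model tends to affirm objects that are not in the image.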

Agent Features

Memory
Short-term: in-context examples (M-ICL); no standardized long-term memory yet
Planning
LLM generates step sequences/programs (VisualProg/VisProg); Chain-of-Thought planning for subtask decomposition
Tool Use
Invoke vision experts (segmentation, OCR, detectors); call external tools via generated programs (GPT4Tools style)
Frameworks
VisProg / MMREACT / HuggingGPT
Is Agentic

Yes

Architectures
LLM-centered controller; multi-module pipelines (LLM + vision experts); MoE
Collaboration
LLM as decision maker coordinating modules; iterative multi-round decision workflows
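
A sketch of the LLM-as-controller pattern described in this block: the LLM emits a plan (hard-coded here), and a dispatcher runs each step against a registry of vision experts. The tools are stubs returning canned values and the plan format is hypothetical; it does not reproduce the actual interfaces of VisProg, HuggingGPT, or similar frameworks.

```python
from typing import Any, Callable, Dict, List


# Stub "vision experts"; real ones would wrap OCR or detection models.
def run_ocr(image: str) -> str:
    return f"text found in {image}"


def run_detector(image: str) -> List[str]:
    return ["cat", "sofa"]


TOOLS: Dict[str, Callable[[str], Any]] = {"ocr": run_ocr, "detect": run_detector}


def execute_plan(plan: List[Dict[str, str]], image: str) -> Dict[str, Any]:
    """Run each planned step and collect results keyed by the step's output name."""
    results: Dict[str, Any] = {}
    for step in plan:
        tool = TOOLS.get(step["tool"])
        if tool is None:
            raise KeyError(f"unknown tool: {step['tool']}")
        results[step["save_as"]] = tool(image)
    return results


# In a real system this plan would be generated by the LLM from the user query.
plan = [{"tool": "detect", "save_as": "objects"}, {"tool": "ocr", "save_as": "text"}]
observations = execute_plan(plan, "photo.jpg")
print(observations)
```

The collected `observations` would then be passed back to the LLM, which either answers or requests another round of tool calls.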

Optimization Features

Token Efficiency
Compress visual features into fewer visual tokens (Q-Former)
Infra Optimization
Deploy smaller LLMs or quantized models on edge devices
Model Optimization
MoE; use of compact connectors to limit retraining
System Optimization
Freeze large pre-trained modules and train small adapters for fast iteration
Training Optimization
Visual instruction tuning (task-formatted data); self-instruction using GPT/GPT-4V to generate fine-grained instruction data
Inference Optimization
Quantization and smaller LLM variants for mobile (MobileVLM); dual-encoder or patch-division strategies for high-res images
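
To make the token-efficiency item concrete, the arithmetic below compares the number of visual tokens from raw ViT patches with a fixed set of learned queries. The image size, patch size, and query count are assumptions chosen for illustration (32 queries matches the BLIP-2 Q-Former configuration).

```python
def patch_token_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT-style encoder produces."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


raw_tokens = patch_token_count(image_size=224, patch_size=14)  # 16 x 16 patches
query_tokens = 32  # assumed fixed number of learned queries

print(raw_tokens, query_tokens, raw_tokens / query_tokens)
```

Under these assumptions the LLM sees 32 visual tokens instead of 256, an 8x reduction in visual context length per image.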

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Data URLs

LAION-5B; CC-12M; COYO-700M (datasets cited in paper)

Risks & Boundaries

Limitations

Survey format: no original experimental results or new benchmarks.

The field changes rapidly, so some model and dataset specifics age quickly; follow the paper's GitHub resource to track updates.

When Not To Use

Do not deploy vanilla MLLMs for high-stakes decisions without additional verification because of hallucination risk.

Avoid relying on small noisy caption corpora alone when strong grounding is required.

Failure Modes

Existence hallucination: claiming objects that are not present.

Attribute hallucination: misreporting colors, counts, or attributes.
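
Existence hallucination in captions is commonly quantified with CHAIR. A minimal sketch, assuming object mentions have already been extracted from each caption and deduplicated into sets (the extraction step and the toy data are not from the source): CHAIR_i is the fraction of mentioned objects that are absent from the image, and CHAIR_s is the fraction of captions containing at least one such object.

```python
from typing import List, Set, Tuple


def chair(mentioned: List[Set[str]], present: List[Set[str]]) -> Tuple[float, float]:
    """Return (CHAIR_i, CHAIR_s) for per-caption mentioned vs. ground-truth objects."""
    if len(mentioned) != len(present) or not mentioned:
        raise ValueError("inputs must be non-empty and equal length")
    total_mentions = sum(len(m) for m in mentioned)
    # Objects named in the caption but not in the image's ground-truth set.
    hallucinated = [m - p for m, p in zip(mentioned, present)]
    chair_i = sum(len(h) for h in hallucinated) / total_mentions if total_mentions else 0.0
    chair_s = sum(1 for h in hallucinated if h) / len(mentioned)
    return chair_i, chair_s


captions = [{"dog", "frisbee"}, {"cat", "sofa", "lamp"}]
truth = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}]
chair_i, chair_s = chair(captions, truth)
print(chair_i, chair_s)
```

This covers existence hallucination only; attribute errors such as wrong colors or counts need a separate check.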

Core Entities

Models

GPT-4V, LLaVA, MiniGPT-4, BLIP-2, Flamingo, Qwen-VL, MM1, MoE-LLaVA, NExT-GPT, ImageBind-LLM, Shikra, Osprey, Ferret, LLaMA, Vicuna, Qwen, Flan-T5

Metrics

Accuracy; CIDEr (captioning); hallucination rates (CHAIR/POPE/AMBER/FaithScore); GPT scoring (GPT-4/GPT-4V)

Datasets

LAION-5B, LAION-2B, CC-3M, CC-12M, COYO-700M, MSR-VTT, ShareGPT4V-PT, LVIS-Instruct4V, ALLaVA, WavCaps

Benchmarks

MME, MMBench, Video-Bench, POPE, MM-VET, ScienceQA, HallusionBench, MMMU, MathVista

Context Entities

Models

Flamingo, BLIP, CLIP, EVA-CLIP, OpenCLIP

Metrics

CIDEr, BLEU, cosine similarity (CLIP filtering)

Datasets

MS-COCO, Flickr30K, NoCaps, SBU Captions

Benchmarks

POPE, CHAIR, AMBER