Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
3
Why It Matters For Business
MLLMs let products understand and generate images and language together, enabling visual assistants, grounded search, and image editing workflows. Expect high compute costs, hallucination risk, and evaluation blind spots.
Summary TLDR
This paper surveys visual-focused Multimodal Large Language Models (MLLMs). It explains common designs (frozen or trainable visual encoder + LLM + adapter), training recipes (single-stage vs two-stage and visual instruction tuning), key datasets and benchmarks, and tasks (VQA, captioning, grounding, image generation/editing, video and 3D). It compiles model architectures, dataset sizes, compute needs, and evaluation results, and highlights practical gaps: hallucinations, evaluation biases, heavy compute costs, and limited RAG for visual tasks.
Problem Statement
MLLM research is fast-moving and fragmented. Practitioners need a compact map of how systems are built, trained, and measured, and of where they fail, especially for visual grounding, image generation, hallucination risk, and compute cost.
Main Contribution
Catalogs recent visual MLLMs and their three core parts: visual encoder, LLM backbone, and vision-to-language adapter.
Explains common training flows: frozen vs trainable encoders, single-stage and two-stage visual instruction tuning, and PEFT use.
Summarizes datasets, benchmarks, hardware costs, and quantitative results across VQA, captioning, grounding, and image generation.
Discusses open problems: hallucinations, safety/bias, multimodal RAG, and compute-efficiency strategies.
Key Findings
Typical MLLM design is three parts: visual encoder, LLM backbone, and adapter.
Freezing the visual encoder is common but can limit fine-grained alignment.
Visual instruction tuning is effective and widely used to teach LLMs multimodal dialogue.
Top-performing MLLMs reach about 80–85% accuracy on VQAv2 in the reported evaluations.
Some models achieve very high region grounding scores (RefCOCO testA ~90–95%).
Image generation with MLLM-conditioned diffusion can approach or beat base diffusion models on FID.
Training MLLMs often requires heavy hardware: many works use 8 A100 GPUs, while flagship models use hundreds or thousands of TPUs.
Hallucinations are common, especially for long captions and complex queries.
Results
Accuracy
RefCOCO referring (testA)
COCO image generation (FID)
Training dataset scale examples
Hardware used (examples)
Who Should Care
What To Try In 7 Days
Prototype a visual QA demo using a frozen CLIP encoder + LLaMA family LLM + linear adapter to test domain fit.
Run instruction tuning with a small in-domain visual instruction set (1k–10k examples) to reduce hallucinations.
Evaluate candidate models on a small, human-labeled subset of your target tasks to check grounding and hallucination before scaling.
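The evaluation step above can be sketched as a small scoring script. This is a minimal sketch, assuming each human-labeled example records which objects are really in the image and which objects the model's answer mentions. The `LabeledExample` layout and the `hallucination_rate` helper are hypothetical and not taken from the survey.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    # Hypothetical record layout for a small human-labeled subset.
    present_objects: set    # objects a human confirmed are in the image
    mentioned_objects: set  # objects the model's answer mentions


def hallucination_rate(examples):
    """Fraction of mentioned objects that are not actually in the image."""
    mentioned = 0
    hallucinated = 0
    for ex in examples:
        mentioned += len(ex.mentioned_objects)
        hallucinated += len(ex.mentioned_objects - ex.present_objects)
    return hallucinated / mentioned if mentioned else 0.0


# Toy data: one answer invents a "frisbee", the other is fully grounded.
examples = [
    LabeledExample({"dog", "ball"}, {"dog", "ball", "frisbee"}),
    LabeledExample({"car"}, {"car"}),
]
rate = hallucination_rate(examples)
print(f"hallucination rate: {rate:.2f}")
```

Tracking this number before and after instruction tuning gives a rough check on whether tuning actually reduced hallucinations on the target domain.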
Agent Features
Memory
- RAG for visual tasks is noted as under-explored; retrieval could add external facts
Tool Use
- Integration with external detectors (SAM, Grounding-DINO) for grounding
- Connecting to diffusion models (Stable Diffusion) for image generation
Frameworks
- BLIP-2 style pipelines
- Flamingo cross-attention designs
- Perceiver resamplers
Architectures
- LLM backbone (LLaMA family, Vicuna, Alpaca)
- Vision encoders (CLIP ViT, EVA-CLIP, ViT variants)
- Adapters (Linear/MLP, Q-Former, cross-attention, Perceiver)
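The three parts listed under Architectures can be wired together roughly as follows. This is a toy sketch in plain Python, assuming a linear adapter and assuming the projected visual tokens are prepended to the text embeddings. Random vectors stand in for the outputs of a frozen visual encoder and for the LLM's text embeddings, and all sizes and function names are illustrative.

```python
import random


def linear_adapter(visual_tokens, weight, bias):
    """Project each visual token (length d_vis) to the LLM width (length d_llm)."""
    projected = []
    for tok in visual_tokens:
        out = [
            sum(w * x for w, x in zip(row, tok)) + b
            for row, b in zip(weight, bias)
        ]
        projected.append(out)
    return projected


def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Prepend projected visual tokens to the text token embeddings."""
    return linear_adapter(visual_tokens, weight, bias) + text_embeddings


# Illustrative sizes only; real encoders and LLMs are far wider.
D_VIS, D_LLM = 4, 6
rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(D_VIS)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

# Stand-ins for frozen-encoder patch features and LLM text embeddings.
visual_tokens = [[rng.random() for _ in range(D_VIS)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(D_LLM)] for _ in range(5)]

llm_input = build_llm_input(visual_tokens, text_embeddings, weight, bias)
print(len(llm_input), len(llm_input[0]))
```

In this setup only `weight` and `bias` would be trained when both the encoder and the LLM are frozen, which is why a linear adapter is a cheap first prototype.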
Optimization Features
Token Efficiency
- Project multiple visual tokens into one LLM token (token pooling)
- Learnable queries to create fixed-length visual summaries
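Token pooling as described above can be illustrated with a short sketch. It assumes mean pooling over fixed-size groups of consecutive visual tokens, which is one simple choice and not necessarily what any surveyed model does. The `pool_visual_tokens` helper is hypothetical.

```python
def pool_visual_tokens(tokens, group_size):
    """Average each run of `group_size` consecutive tokens into one token."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(t[i] for t in group) / len(group) for i in range(dim)])
    return pooled


# Eight 2-dimensional toy tokens reduced to two pooled tokens.
tokens = [[float(i), float(i) * 2] for i in range(8)]
pooled = pool_visual_tokens(tokens, 4)
print(len(tokens), "->", len(pooled))
```

With a group size of 4 the LLM sees a quarter as many visual tokens, at the cost of losing some spatial detail inside each group.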
Infra Optimization
- Most works train on clusters of A100 GPUs; some use TPU fleets for large runs
- Recommend PEFT + frozen backbones for resource-limited setups
Model Optimization
- Freeze visual encoder to save compute
- Ensembling multiple frozen backbones for robustness
System Optimization
- LoRA
- Perceiver-based resampling to shorten token sequences
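The saving from LoRA can be made concrete with a little arithmetic. LoRA learns two low-rank factors in place of a full update to a weight matrix, so a layer of shape d_out x d_in trains rank * (d_in + d_out) parameters instead of d_in * d_out. The layer size and rank below are illustrative values, not figures from the survey.

```python
def full_params(d_in, d_out):
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors (rank x d_in and d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative: one 4096 x 4096 projection with rank 8.
d_in = d_out = 4096
rank = 8
full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```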
Training Optimization
- Two-stage training: align visual features then instruction tune
- Visual instruction tuning with synthetic GPT-generated data
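The two-stage recipe above can be written down as a small schedule that states which modules are trainable in each stage. This is a sketch under an assumption: the split shown here (adapter only in stage 1, adapter plus LLM in stage 2, visual encoder frozen throughout) is one common choice, and individual models differ. The stage and module names are hypothetical.

```python
# Assumed split: True means the module's weights are updated in that stage.
STAGES = {
    "stage1_alignment": {
        "visual_encoder": False,
        "adapter": True,
        "llm": False,
    },
    "stage2_instruction_tuning": {
        "visual_encoder": False,
        "adapter": True,
        "llm": True,
    },
}


def trainable_modules(stage):
    """Return the sorted names of modules that are trained in `stage`."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)


for stage in STAGES:
    print(stage, "->", trainable_modules(stage))
```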
Inference Optimization
- Compress visual tokens via learnable queries to reduce LLM token load
- Mixture-of-resolution and sub-image slicing to handle high-res inputs
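Sub-image slicing can be sketched as computing a grid of crop boxes that covers a high-resolution input, so that each crop fits the encoder's input size. This is a minimal sketch with non-overlapping tiles. The `slice_boxes` helper and the tile size of 448 are illustrative assumptions.

```python
def slice_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes that tile the image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append(
                (left, top, min(left + tile, width), min(top + tile, height))
            )
    return boxes


# A 1000 x 700 image cut into tiles of at most 448 x 448 pixels.
boxes = slice_boxes(1000, 700, 448)
print(len(boxes), "tiles")
```

Each crop would then be encoded separately, often alongside a downscaled copy of the full image, and the resulting tokens concatenated before the LLM.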
Reproducibility
Data Urls
- LAION: https://laion.ai
- COYO: https://github.com/coyo-dataset
- COCO: http://cocodataset.org
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The survey may have missed minor or very recent works, and it does not cover non-visual modalities.
- Space limits forced concise descriptions; check original papers for implementation details.
- Evaluation relies on published benchmarks that may use LLM judges, introducing bias.
When Not To Use
- In high-stakes domains without robust hallucination checks and verification.
- When compute budget cannot support required fine-tuning or inference costs.
- If you need guaranteed pixel-perfect localization without specialized grounding components.
Failure Modes
- Hallucinated objects or facts, especially on long or ambiguous captions.
- Poor performance on fine-grained or small-object grounding when encoder is frozen.
- Evaluation bias from automated LLM judges (GPT-4) masking real errors.
- Domain mismatch when training data is web-scale and uncurated.
Core Entities
Models
- Flamingo
- BLIP-2
- LLaVA
- MiniGPT-4
- Kosmos-2
- Qwen-VL
- Ferret
- SPHINX
- Emu/Emu2
- CogVLM
- GILL
- LaVIT
- MiniGPT-5
- Kosmos-G
Metrics
- Accuracy
- CIDEr (captioning)
- METEOR
- RefCOCO Acc@0.5
- FID (image generation)
- CLIP-I, CLIP-T, DINO similarity
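The RefCOCO Acc@0.5 metric above counts a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below assumes boxes in (x1, y1, x2, y2) format and treats IoU at or above the threshold as a hit. Some evaluation scripts use a strict inequality, so check the benchmark code before comparing numbers.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds, gts, threshold=0.5):
    """Share of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return hits / len(gts) if gts else 0.0


# Toy data: the first prediction overlaps well, the second barely.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 8), (5, 5, 15, 15)]
score = acc_at_iou(preds, gts)
print(f"Acc@0.5 = {score:.2f}")
```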
Datasets
- LAION-400M
- COYO-700M (747M pairs)
- WebLI (10B)
- DataComp (12.8B)
- COCO
- CC3M (3M)
- MMC4
- OBELICS (141M docs, 353M images)
- LLaVA-Instruct (158k)
- LRV-Instruction (400k+ updates)
Benchmarks
- VQAv2
- GQA
- VizWiz
- TextVQA
- ScienceQA
- COCO captioning (CIDEr)
- RefCOCO/RefCOCO+/RefCOCOg
- POPE
- MMBench
- SEED-Bench
- MM-Vet
- MathVista
- MagicBrush
- DreamBench
Context Entities
Models
- GPT-4V
- Gemini
- LLaMA family (LLaMA, LLaMA-2, Vicuna, Alpaca)
- Mamba/VL-Mamba (state-space)
- OPT
- MPT
Metrics
- ChatGPT/GPT-4 automatic scoring (LLM-as-judge) concerns
Datasets
- Web interleaved corpora (MMC4, WebLI)
- GranD, GRIT (grounded datasets)
- VIST (interleaved generation)
Benchmarks
- POPE (hallucination)
- MME, MMBench
- TouchStone, Tiny LVLM, SEED-Bench

