Overview
MLLMs are production-ready for non-safety-critical visual assistants and prototypes, but expect substantial compute, careful evaluation for hallucinations, and task-specific tuning.
Citations: 3
Evidence Strength: 0.85
Confidence: 0.86
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/8
Findings with evidence refs: 8/8
Results with explicit delta: 1/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
MLLMs let products understand and generate images and language together, enabling visual assistants, grounded search, and image editing workflows — but expect high compute, hallucination risk, and evaluation blind spots.
Summary TLDR
This paper surveys vision-focused Multimodal Large Language Models (MLLMs). It explains the common design (a frozen or trainable visual encoder, an LLM, and an adapter), training recipes (single-stage vs two-stage training and visual instruction tuning), key datasets and benchmarks, and tasks (VQA, captioning, grounding, image generation and editing, video, and 3D). It compiles model architectures, dataset sizes, compute needs, and evaluation results, and highlights practical gaps: hallucinations, evaluation biases, heavy compute costs, and limited use of retrieval-augmented generation (RAG) for visual tasks.
Problem Statement
MLLM research is fast-moving and fragmented. Practitioners need a compact map of how systems are built, trained, and measured, and of where they fail, especially for visual grounding, image generation, hallucination risk, and compute cost.
Main Contribution
Catalogs recent visual MLLMs and their three core parts: visual encoder, LLM backbone, and vision-to-language adapter.
Explains common training flows: frozen vs trainable encoders, single-stage and two-stage visual instruction tuning, and PEFT use.
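The training flows listed above differ mainly in which of the three components receive gradient updates at each stage. The sketch below makes that concrete under stated assumptions: the stage names, the choice to keep the encoder frozen in the two-stage recipe, and every parameter count in `PARAMS` are hypothetical illustrations, not figures from the survey.

```python
from dataclasses import dataclass

# Hypothetical parameter counts, chosen only to make the arithmetic concrete.
PARAMS = {
    "encoder": 300_000_000,      # visual encoder
    "adapter": 4_000_000,        # vision-to-language adapter
    "llm_full": 7_000_000_000,   # all LLM weights
    "llm_peft": 20_000_000,      # small PEFT modules attached to the LLM
}


@dataclass(frozen=True)
class Stage:
    name: str
    train_encoder: bool
    train_adapter: bool
    llm_mode: str  # "frozen", "peft", or "full"


def trainable_params(stage: Stage) -> int:
    """Count the parameters that receive gradient updates in a stage."""
    if stage.llm_mode not in ("frozen", "peft", "full"):
        raise ValueError(f"unknown llm_mode: {stage.llm_mode}")
    total = 0
    if stage.train_encoder:
        total += PARAMS["encoder"]
    if stage.train_adapter:
        total += PARAMS["adapter"]
    if stage.llm_mode == "full":
        total += PARAMS["llm_full"]
    elif stage.llm_mode == "peft":
        total += PARAMS["llm_peft"]
    return total


# Assumed two-stage recipe: align the adapter first, then instruction-tune
# the adapter together with PEFT modules on the LLM.
TWO_STAGE = [
    Stage("alignment", train_encoder=False, train_adapter=True, llm_mode="frozen"),
    Stage("visual_instruction_tuning", train_encoder=False, train_adapter=True, llm_mode="peft"),
]

# Assumed single-stage recipe: everything is trained jointly.
SINGLE_STAGE = [
    Stage("joint", train_encoder=True, train_adapter=True, llm_mode="full"),
]

for stage in TWO_STAGE + SINGLE_STAGE:
    print(f"{stage.name}: {trainable_params(stage):,} trainable parameters")
```

With these made-up sizes, the frozen-encoder PEFT stages update a small fraction of the parameters that joint training does, which is the usual motivation for combining frozen components with PEFT when compute is limited.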
Key Findings
A typical MLLM has three parts: a visual encoder, an LLM backbone, and a vision-to-language adapter.
Freezing the visual encoder is common but can limit fine-grained alignment.
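To show how the adapter connects the other two parts, here is a minimal pure-Python sketch of a linear adapter that maps patch features from the visual encoder into the LLM's embedding space. The function names, the tiny dimensions, the random initialization, and the choice to place visual tokens before the text tokens are all assumptions made for illustration; real systems use tensor libraries and may use other adapter types.

```python
import random


def make_linear_adapter(d_vis, d_llm, seed=0):
    """Create a (weight, bias) pair for a d_vis -> d_llm linear map."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_vis)] for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weight, bias


def project(patches, weight, bias):
    """Apply the linear map to each patch feature vector."""
    return [
        [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]


def build_llm_input(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with text token embeddings."""
    return visual_tokens + text_embeddings


def adapter_param_count(d_vis, d_llm):
    """Number of weights plus biases in the linear adapter."""
    return d_vis * d_llm + d_llm


# Toy sizes: 4 image patches of width 8, LLM embedding width 16, 3 text tokens.
d_vis, d_llm = 8, 16
patches = [[0.1 * (i + j) for j in range(d_vis)] for i in range(4)]
text_embeddings = [[0.0] * d_llm for _ in range(3)]

weight, bias = make_linear_adapter(d_vis, d_llm)
visual_tokens = project(patches, weight, bias)
sequence = build_llm_input(visual_tokens, text_embeddings)

print(len(sequence), len(sequence[0]))
print(adapter_param_count(d_vis, d_llm))
```

The adapter is the only component here with new parameters, which is why it is often the cheapest part to train when the encoder and the LLM are kept fixed.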
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| VQA accuracy | Emu2 84.9; LLaVA-1.5 ~80.0; CogVLM 82.3 | — | — | VQAv2 (Table 4) | Top reported VQA numbers across models | Table 4 |
| Referring expression accuracy | CogVLM 94.8; Qwen-VL 92.3; Ferret 92.4 | — | — | RefCOCO testA (Table 5) | High accuracy for grounding-capable MLLMs | Table 5 |
What To Try In 7 Days
Prototype a visual QA demo using a frozen CLIP encoder, a LLaMA-family LLM, and a linear adapter to test domain fit.
Run instruction tuning with a small in-domain visual instruction set (1k–10k examples) to reduce hallucinations.
Evaluate candidate models on a small, human-labeled subset of your target tasks to check grounding and hallucination before scaling.
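The evaluation step can start as a short script run over the human-labeled subset. The sketch below assumes each sample already carries the model's answer, the human answer, the set of objects the model mentioned, and the set of objects annotators confirmed in the image. The field names and both metrics (normalized exact-match accuracy and a simple object-level hallucination rate) are illustrative choices, not the survey's evaluation protocol.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def evaluate(samples):
    """Return answer accuracy and object-level hallucination rate.

    Each sample is a dict with hypothetical keys:
      pred_answer, gold_answer: strings
      pred_objects: objects the model mentioned (extracted beforehand)
      gold_objects: objects human annotators confirmed in the image
    """
    if not samples:
        raise ValueError("need at least one sample")

    correct = 0
    mentioned = 0
    hallucinated = 0
    for sample in samples:
        if normalize(sample["pred_answer"]) == normalize(sample["gold_answer"]):
            correct += 1
        pred = {normalize(obj) for obj in sample["pred_objects"]}
        gold = {normalize(obj) for obj in sample["gold_objects"]}
        mentioned += len(pred)
        hallucinated += len(pred - gold)  # mentioned but not in the image

    return {
        "accuracy": correct / len(samples),
        "hallucination_rate": hallucinated / mentioned if mentioned else 0.0,
    }


samples = [
    {
        "pred_answer": "Two",
        "gold_answer": "two",
        "pred_objects": ["dog", "ball"],
        "gold_objects": ["dog", "ball", "grass"],
    },
    {
        "pred_answer": "red",
        "gold_answer": "blue",
        "pred_objects": ["car", "tree"],
        "gold_objects": ["car"],
    },
]

print(evaluate(samples))
```

Tracking the two numbers separately is useful because a model can answer correctly while still describing objects that are not in the image.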
Risks & Boundaries
Limitations
The survey may have missed minor or very recent works, and it does not cover non-visual modalities.
Space limits forced concise descriptions; check original papers for implementation details.
When Not To Use
In high-stakes domains without robust hallucination checks and verification.
When compute budget cannot support required fine-tuning or inference costs.
Failure Modes
Hallucinated objects or facts, especially on long or ambiguous captions.
Poor performance on fine-grained or small-object grounding when encoder is frozen.

