Survey of visual-focused multimodal LLMs: architectures, training, tasks, datasets, and open problems

Overview

Decision SnapshotReady For Pilot

MLLMs are production-ready for non-safety-critical visual assistants and prototypes, but expect substantial compute, careful evaluation for hallucinations, and task-specific tuning.

Citations3

Evidence Strength0.85

Confidence0.86

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/8

Findings with evidence refs: 8/8

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Links

Abstract / PDF / Data

Why It Matters For Business

MLLMs let products understand and generate images and language together, enabling visual assistants, grounded search, and image editing workflows — but expect high compute, hallucination risk, and evaluation blind spots.

Who Should Care

Product Manager ML Engineer CTO Founder Data Scientist

Summary TLDR

This paper surveys visual-focused Multimodal Large Language Models (MLLMs). It explains common designs (frozen or trainable visual encoder + LLM + adapter), training recipes (single-stage vs two-stage and visual instruction tuning), key datasets and benchmarks, and tasks (VQA, captioning, grounding, image generation/editing, video and 3D). It compiles model architectures, dataset sizes, compute needs, and evaluation results, and highlights practical gaps: hallucinations, evaluation biases, heavy compute costs, and limited RAG for visual tasks.

Problem Statement

MLLM research is fast and fragmented. Practitioners need a compact map of how systems are built, trained, measured, and where they fail — especially for visual grounding, image generation, hallucination risk, and compute cost.

Main Contribution

Catalogs recent visual MLLMs and their three core parts: visual encoder, LLM backbone, and vision-to-language adapter.

Explains common training flows: frozen vs trainable encoders, single-stage and two-stage visual instruction tuning, and PEFT use.

Key Findings

Typical MLLM design is three parts: visual encoder, LLM backbone, and adapter.

Practical UseWhen building an MLLM, pick an off-the-shelf visual encoder, a chat-capable LLM, and a lightweight adapter to connect them; this is the standard, low-effort path.

Evidence RefSec.2, Fig.1, Table 1

Freezing the visual encoder is common but can limit fine-grained alignment.

Practical UseFreeze encoders to save compute. If you need fine localization or VQA accuracy, consider a two-stage or partially trainable vision backbone.

Evidence RefSec.2.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Emu2 84.9; LLaVA-1.5 ~80.0; CogVLM 82.3	—	—	VQAv2 (Table 4)	Top reported VQA numbers across models	Table 4
RefCOCO referring (testA)	CogVLM 94.8; Qwen-VL 92.3; Ferret 92.4	—	—	RefCOCO testA (Table 5)	High accuracy for grounding-capable MLLMs	Table 5

What To Try In 7 Days

Prototype a visual QA demo using a frozen CLIP encoder + LLaMA family LLM + linear adapter to test domain fit.

Run instruction tuning with a small in-domain visual instruction set (1k–10k examples) to reduce hallucinations.

Evaluate candidate models on a small, human-labeled subset of your target tasks to check grounding and hallucination before scaling.

Agent Features

Memory

RAG for visual tasks is noted as under-explored; retrieval could add external facts

Tool Use

Integration with external detectors (SAM, Grounding-DINO) for groundingConnecting to diffusion models (Stable Diffusion) for image generation

Frameworks

BLIP-2 style pipelinesFlamingo cross-attention designsPerceiver resamplers

Architectures

LLM backbone (LLaMA family, Vicuna, Alpaca)Vision encoders (CLIP ViT, EVA-CLIP, ViT variants)Adapters (Linear/MLP, Q-Former, cross-attention, Perceiver)

Optimization Features

Token Efficiency

Project multiple visual tokens into one LLM token (token pooling)Learnable queries to create fixed-length visual summaries

Infra Optimization

Most works train on clusters of A100 GPUs; some use TPU fleets for large runsRecommend PEFT + frozen backbones for resource-limited setups

Model Optimization

Freeze visual encoder to save computeEnsembling multiple frozen backbones for robustness

System Optimization

LoRAPerceiver-based resampling to shorten token sequences

Training Optimization

Two-stage training: align visual features then instruction tuneVisual instruction tuning with synthetic GPT-generated data

Inference Optimization

Compress visual tokens via learnable queries to reduce LLM token loadMixture-of-resolution and sub-image slicing to handle high-res inputs

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Data URLs

LAION: https://laion.aiCOYO: https://github.com/coyo-datasetCOCO: http://cocodataset.org

Risks & Boundaries

Limitations

May have missed minor or very recent works and non-visual modalities.

Space limits forced concise descriptions; check original papers for implementation details.

When Not To Use

In high-stakes domains without robust hallucination checks and verification.

When compute budget cannot support required fine-tuning or inference costs.

Failure Modes

Hallucinated objects or facts, especially on long or ambiguous captions.

Poor performance on fine-grained or small-object grounding when encoder is frozen.

Core Entities

Models

FlamingoBLIP-2LLaVAMiniGPT-4Kosmos-2Qwen-VLFerretSPHINXEmu/Emu2CogVLMGILLLaVITMiniGPT-5Kosmos-G

Metrics

AccuracyCIDEr (captioning)METEORRefCOCO Acc@0.5FID (image generation)CLIP-I, CLIP-T, DINO similarity

Datasets

LAION-400MCOYO-700M (747M pairs)WebLI (10B)DataComp (12.8B)COCOCC3M (3M)MMC4OBELICS (141M docs, 353M images)LLaVA-Instruct (158k)LRVInstruction (400k+ updates)

Benchmarks

VQAv2GQAVizWizTextVQAScienceQACOCO captioning (CIDEr)RefCOCO/RefCOCO+/RefCOCOgPOPEMMBenchSEED-BenchMM-VetMathVistaMagicBrushDreamBench

Context Entities

Models

GPT-4VGeminiLLaMA family (LLaMA, LLaMA-2, Vicuna, Alpaca)Mamba/VL-Mamba (state-space)OPTMPT

Metrics

ChatGPT/GPT-4 automatic scoring (LLM-as-judge) concerns

Datasets

Web interleaved corpora (MMC4, WebLI)GranD, GRIT (grounded datasets)VIST (interleaved generation)

Benchmarks

POPE (hallucination)MME, MMBenchTouchStone, Tiny LVLM, SEED-Bench

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Typical MLLM design is three parts: visual encoder, LLM backbone, and adapter.

Freezing the visual encoder is common but can limit fine-grained alignment.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Fine-tuning LLaVA VLMs on 50k biomedical image-text pairs cuts hallucinations and improves VQA on LDRT literature

Key finding

HOPE: search image-specific, highly misleading distractors to better expose object hallucinations in LVLMs

Key finding

Survey of multimodal RAG: methods, datasets, benchmarks, and open problems

Key finding

Practical guide: which design choices help when adding image input to LLMs

Key finding

HaELM: an LLM-based, low-cost evaluator to detect and analyze hallucinations in vision-language models

Key finding