Survey of visual-focused multimodal LLMs: architectures, training, tasks, datasets, and open problems

February 19, 2024, 9 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

3

Authors

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Why It Matters For Business

MLLMs let products understand and generate images and language together, enabling visual assistants, grounded search, and image editing workflows — but expect high compute, hallucination risk, and evaluation blind spots.

Summary TLDR

This paper surveys visual-focused Multimodal Large Language Models (MLLMs). It explains common designs (frozen or trainable visual encoder + LLM + adapter), training recipes (single-stage vs two-stage and visual instruction tuning), key datasets and benchmarks, and tasks (VQA, captioning, grounding, image generation/editing, video and 3D). It compiles model architectures, dataset sizes, compute needs, and evaluation results, and highlights practical gaps: hallucinations, evaluation biases, heavy compute costs, and limited use of retrieval-augmented generation (RAG) for visual tasks.

Problem Statement

MLLM research is fast-moving and fragmented. Practitioners need a compact map of how systems are built, trained, and measured, and of where they fail, especially for visual grounding, image generation, hallucination risk, and compute cost.

Main Contribution

Catalogs recent visual MLLMs and their three core parts: visual encoder, LLM backbone, and vision-to-language adapter.

Explains common training flows: frozen vs trainable encoders, single-stage and two-stage visual instruction tuning, and PEFT use.

Summarizes datasets, benchmarks, hardware costs, and quantitative results across VQA, captioning, grounding, and image generation.

Discusses open problems: hallucinations, safety/bias, multimodal RAG, and compute-efficiency strategies.
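
The three-part design listed above can be sketched in a few lines. This is a toy illustration, not the implementation of any surveyed model: `frozen_visual_encoder`, `LinearAdapter`, and `build_llm_input` are hypothetical names, the encoder is a stand-in that simply passes patch features through, and the adapter weights are random rather than trained.

```python
import random


def frozen_visual_encoder(image_patches):
    """Stand-in for a frozen encoder (e.g. a CLIP ViT): one feature vector per patch.

    Assumption: patches are already feature vectors; a real encoder would compute them.
    """
    return [list(patch) for patch in image_patches]


class LinearAdapter:
    """Maps visual features of size d_vis into the LLM embedding size d_llm."""

    def __init__(self, d_vis, d_llm, seed=0):
        rng = random.Random(seed)
        # Random weights for illustration; in practice these are the trained parameters.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_vis)] for _ in range(d_llm)]
        self.bias = [0.0] * d_llm

    def __call__(self, features):
        return [
            [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(self.weight, self.bias)]
            for feat in features
        ]


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the text embeddings fed to the LLM."""
    return visual_tokens + text_embeddings


patches = [[0.1, 0.2, 0.3, 0.4] for _ in range(6)]  # 6 patches, d_vis = 4
text = [[0.0] * 8 for _ in range(3)]                # 3 text tokens, d_llm = 8
adapter = LinearAdapter(d_vis=4, d_llm=8)
seq = build_llm_input(adapter(frozen_visual_encoder(patches)), text)
print(len(seq), len(seq[0]))
```

The LLM backbone would consume `seq` as an ordinary embedding sequence; only the adapter needs training when both the encoder and the LLM stay frozen.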

Key Findings

Typical MLLM design is three parts: visual encoder, LLM backbone, and adapter.

Freezing the visual encoder is common but can limit fine-grained alignment.

Visual instruction tuning is effective and widely used to teach LLMs multimodal dialogue.

Numbers: LLaVA-Instruct: 158k multimodal instructions

Top-performing MLLMs reach about 80–85% accuracy on VQAv2 in the reported evaluations.

Numbers: Best VQA scores: Emu2 84.9, CogVLM 82.3 (Table 4)

Some models achieve very high region grounding scores (RefCOCO testA ~90–95%).

Numbers: CogVLM RefCOCO testA 94.8; Qwen-VL 92.3 (Table 5)

Image generation with MLLM-conditioned diffusion can approach or beat base diffusion models on FID.

Numbers: LaVIT FID 7.40 vs Stable Diffusion FID 9.22 (Table 8)

Training MLLMs often requires heavy hardware; many works use 8 A100s, while flagship models used hundreds or thousands of TPUs.

Numbers: Common training: 8 A100s; Flamingo used 1,535 TPUv4; PaLI used 1,024 TPUv4 (Table 11)

Hallucinations are common, especially for long captions and complex queries.

Results

VQAv2 accuracy

Value: Emu2 84.9; LLaVA-1.5 ~80.0; CogVLM 82.3
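
For context on how VQAv2 accuracy is typically scored, here is a minimal sketch of the commonly cited simplified formula: an answer counts as fully correct if at least three annotators gave it. The function name `vqa_accuracy` is hypothetical, and the official evaluation script additionally normalizes answer strings and averages over annotator subsets, which this sketch omits.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(number of matching annotators / 3, 1)."""
    pred = predicted.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Toy example: 2 of 10 annotators answered "red".
answers = ["red", "red"] + ["blue"] * 8
print(vqa_accuracy("red", answers))
print(vqa_accuracy("blue", answers))
```

A benchmark score is the mean of this per-question value over the evaluation split.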

RefCOCO referring (testA)

Value: CogVLM 94.8; Qwen-VL 92.3; Ferret 92.4

COCO image generation (FID)

Value: LaVIT FID 7.40; Stable Diffusion 9.22; GILL 12.20

Baseline: Stable Diffusion

Training dataset scale examples

Value: COYO 747M pairs; WebLI 10B images; DataComp 12.8B pairs

Hardware used (examples)

Value: Flamingo 1,535 TPUv4; PaLI 1,024 TPUv4; many models use 8 A100

Who Should Care

What To Try In 7 Days

Prototype a visual QA demo using a frozen CLIP encoder + LLaMA family LLM + linear adapter to test domain fit.

Run instruction tuning with a small in-domain visual instruction set (1k–10k examples) to reduce hallucinations.

Evaluate candidate models on a small, human-labeled subset of your target tasks to check grounding and hallucination before scaling.
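
For the human-labeled evaluation step, one lightweight option is a POPE-style probe: ask yes/no questions of the form "Is there a <object> in the image?" and compare the model's answers with your labels. The sketch below assumes answers have already been normalized to "yes" or "no"; `pope_style_metrics` is a hypothetical helper, not part of any benchmark toolkit.

```python
def pope_style_metrics(predictions, labels):
    """Compute yes/no probe metrics from parallel lists of 'yes'/'no' strings."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    total = len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes_ratio far above the true share of "yes" labels suggests
        # the model tends to affirm objects that are not present.
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


preds = ["yes", "yes", "no", "yes"]
gold = ["yes", "no", "no", "no"]
metrics = pope_style_metrics(preds, gold)
print(metrics)
```

Low precision combined with a high yes-ratio is the pattern to watch for, since it indicates hallucinated objects rather than missed ones.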

Agent Features

Memory

  • RAG for visual tasks is noted as under-explored; retrieval could add external facts
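
To make the retrieval idea concrete, here is a toy sketch of how external facts could be attached to a visual query. It is an assumption-laden illustration rather than a method from the survey: the embeddings are hand-written vectors, the index is an in-memory list, and `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the texts of the k index entries most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, facts):
    """Place retrieved facts ahead of the question in the LLM prompt."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Context:\n{context}\nQuestion: {question}"


# Toy index of (text, embedding) pairs; the vectors are made up for illustration.
index = [
    ("note about landmark A (height)", [0.9, 0.1, 0.0]),
    ("note about dog breeds", [0.0, 0.2, 0.9]),
    ("note about landmark A (history)", [0.8, 0.3, 0.1]),
]
image_query_vec = [1.0, 0.2, 0.0]  # stand-in for an image embedding
facts = retrieve(image_query_vec, index, k=2)
print(build_prompt("What is shown in the image?", facts))
```

In a real system the query vector would come from the visual encoder and the index from a vector store; both are outside the scope of this sketch.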

Tool Use

  • Integration with external detectors (SAM, Grounding-DINO) for grounding
  • Connecting to diffusion models (Stable Diffusion) for image generation

Frameworks

  • BLIP-2 style pipelines
  • Flamingo cross-attention designs
  • Perceiver resamplers

Architectures

  • LLM backbone (LLaMA family, Vicuna, Alpaca)
  • Vision encoders (CLIP ViT, EVA-CLIP, ViT variants)
  • Adapters (Linear/MLP, Q-Former, cross-attention, Perceiver)

Optimization Features

Token Efficiency

  • Project multiple visual tokens into one LLM token (token pooling)
  • Learnable queries to create fixed-length visual summaries
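
Token pooling can be illustrated with simple averaging over groups of adjacent visual tokens. This is a simplification: the surveyed models generally use a learned projection rather than a plain mean, and `pool_visual_tokens` is a hypothetical name.

```python
def pool_visual_tokens(tokens, group_size):
    """Average each run of `group_size` adjacent tokens into a single token.

    A trailing group shorter than `group_size` is averaged over its own length.
    """
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(tok[i] for tok in group) / len(group) for i in range(dim)])
    return pooled


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(pool_visual_tokens(tokens, group_size=2))
```

With a group size of 4, a grid of 576 visual tokens would shrink to 144 tokens before reaching the LLM.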

Infra Optimization

  • Most works train on clusters of A100 GPUs; some use TPU fleets for large runs
  • Recommend PEFT + frozen backbones for resource-limited setups

Model Optimization

  • Freeze visual encoder to save compute
  • Ensembling multiple frozen backbones for robustness

System Optimization

  • LoRA
  • Perceiver-based resampling to shorten token sequences
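
The appeal of LoRA is easiest to see by counting parameters. LoRA keeps a weight matrix of shape d_out x d_in frozen and trains two low-rank factors of shapes d_out x r and r x d_in. The dimensions below (4096 and rank 8) are illustrative assumptions, not values taken from the survey.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for the two low-rank LoRA factors."""
    return rank * (d_in + d_out)


d_in = d_out = 4096
rank = 8
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(full, lora, lora / full)
```

For this single layer the low-rank update trains well under 1% of the parameters that full fine-tuning would touch.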

Training Optimization

  • Two-stage training: align visual features then instruction tune
  • Visual instruction tuning with synthetic GPT-generated data
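
The two-stage recipe can be written down as a small configuration that states which modules receive gradients in each stage. The split shown here (adapter only, then adapter plus LLM) is one common arrangement and is an assumption; individual models differ, and `StageConfig` and `freeze_plan` are hypothetical names.

```python
from dataclasses import dataclass

MODULES = ("visual_encoder", "adapter", "llm")


@dataclass(frozen=True)
class StageConfig:
    name: str
    data: str
    trainable: frozenset


STAGES = [
    # Stage 1: align visual features with the LLM embedding space.
    StageConfig("alignment", "image-caption pairs", frozenset({"adapter"})),
    # Stage 2: visual instruction tuning (the LLM may be updated fully or via PEFT).
    StageConfig("instruction_tuning", "visual instruction data", frozenset({"adapter", "llm"})),
]


def freeze_plan(stage):
    """Map each module name to True if it is trained in this stage."""
    return {module: module in stage.trainable for module in MODULES}


for stage in STAGES:
    print(stage.name, freeze_plan(stage))
```

In a training framework, the boolean plan would be applied by toggling gradient tracking on each module's parameters before the stage starts.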

Inference Optimization

  • Compress visual tokens via learnable queries to reduce LLM token load
  • Mixture-of-resolution and sub-image slicing to handle high-res inputs
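
Sub-image slicing amounts to cutting a high-resolution image into encoder-sized tiles and encoding each tile separately. The sketch below only computes the crop boxes; the 224-pixel tile size is an assumed example and `slice_boxes` is a hypothetical helper.

```python
def slice_boxes(width, height, tile):
    """Return (left, top, right, bottom) crop boxes covering the image row by row.

    Edge tiles are clipped to the image bounds, so they may be smaller than `tile`.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


boxes = slice_boxes(width=672, height=448, tile=224)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can be cropped and passed through the visual encoder; some designs also add a downscaled copy of the full image so that global context is kept.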

Reproducibility

Data Urls

  • LAION: https://laion.ai
  • COYO: https://github.com/coyo-dataset
  • COCO: http://cocodataset.org

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The survey may have missed minor or very recent works, as well as non-visual modalities.
  • Space limits forced concise descriptions; check original papers for implementation details.
  • Evaluation relies on published benchmarks that may use LLM judges, introducing bias.

When Not To Use

  • In high-stakes domains without robust hallucination checks and verification.
  • When compute budget cannot support required fine-tuning or inference costs.
  • If you need guaranteed pixel-perfect localization without specialized grounding components.

Failure Modes

  • Hallucinated objects or facts, especially on long or ambiguous captions.
  • Poor performance on fine-grained or small-object grounding when encoder is frozen.
  • Evaluation bias from automated LLM judges (GPT-4) masking real errors.
  • Domain mismatch when training data is web-scale and uncurated.

Core Entities

Models

  • Flamingo
  • BLIP-2
  • LLaVA
  • MiniGPT-4
  • Kosmos-2
  • Qwen-VL
  • Ferret
  • SPHINX
  • Emu/Emu2
  • CogVLM
  • GILL
  • LaVIT
  • MiniGPT-5
  • Kosmos-G

Metrics

  • Accuracy
  • CIDEr (captioning)
  • METEOR
  • RefCOCO Acc@0.5
  • FID (image generation)
  • CLIP-I, CLIP-T, DINO similarity
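
As a reference for the grounding metric, Acc@0.5 counts a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch assumes boxes in (x1, y1, x2, y2) format and treats an IoU of exactly 0.5 as a hit; some evaluation scripts use a strict inequality instead.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the matching ground truth meets the threshold."""
    if not ground_truths:
        return 0.0
    hits = sum(1 for p, g in zip(predictions, ground_truths) if iou(p, g) >= threshold)
    return hits / len(ground_truths)


preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
gts = [(0, 0, 2, 2), (1, 1, 3, 3)]
print(iou(preds[1], gts[1]), acc_at_iou(preds, gts))
```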

Datasets

  • LAION-400M
  • COYO-700M (747M pairs)
  • WebLI (10B)
  • DataComp (12.8B)
  • COCO
  • CC3M (3M)
  • MMC4
  • OBELICS (141M docs, 353M images)
  • LLaVA-Instruct (158k)
  • LRV-Instruction (400k+ instructions)

Benchmarks

  • VQAv2
  • GQA
  • VizWiz
  • TextVQA
  • ScienceQA
  • COCO captioning (CIDEr)
  • RefCOCO/RefCOCO+/RefCOCOg
  • POPE
  • MMBench
  • SEED-Bench
  • MM-Vet
  • MathVista
  • MagicBrush
  • DreamBench

Context Entities

Models

  • GPT-4V
  • Gemini
  • LLaMA family (LLaMA, LLaMA-2, Vicuna, Alpaca)
  • Mamba/VL-Mamba (state-space)
  • OPT
  • MPT

Metrics

  • ChatGPT/GPT-4 automatic scoring (LLM-as-judge) concerns

Datasets

  • Web interleaved corpora (MMC4, WebLI)
  • GranD, GRIT (grounded datasets)
  • VIST (interleaved generation)

Benchmarks

  • POPE (hallucination)
  • MME, MMBench
  • TouchStone, Tiny LVLM, SEED-Bench