Survey of 126 multimodal LLMs: architectures, training recipes, benchmarks, and next steps

Overview

Decision Snapshot: Needs Validation

The survey compiles many published results and practical recipes but notes dataset overlap and benchmark leakage; apply findings with dataset-leakage checks and small-scale validation.

Citations: 15

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 45%

Authors

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, Dong Yu

Why It Matters For Business

You can add vision, audio, or other modalities to existing LLMs cheaply by training small projectors or PEFT adapters, unlocking richer user interactions without retraining huge models.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This paper is a focused, up-to-date survey of multimodal large language models (MM-LLMs). It defines a simple five-part architecture (modality encoder, input projector, LLM backbone, output projector, modality generator), catalogs 126 recent MM-LLMs, compares 43 models across architectures and datasets, summarizes performance on common vision-language benchmarks (OKVQA, IconVQA, VQA v2, GQA), and distills practical training recipes (higher image resolution, interleaved image-text data, PEFT). The survey highlights open problems: better benchmarks, lightweight/mobile deployment, continual learning, hallucination reduction, and bias evaluation. The authors host a live tracking website: https:/

Problem Statement

How to cheaply and effectively extend text-only LLMs to handle multiple input and output modalities, and how recent MM-LLMs compare in architecture, training data, benchmarks, and practical recipes.

Main Contribution

A unified five-component architecture for MM-LLMs, clarifying where to add lightweight adapters.

A taxonomy and catalog of 126 MM-LLMs with a focused comparison table for 43 mainstream models.
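The five-component layout can be pictured as a chain of stages, with training concentrated in the two projectors. The sketch below is only an illustration of that idea, not code from the survey: the `Stage` class, the stub lambdas, and the `trainable` flags are hypothetical, and treating the modality generator as frozen is an assumption about the typical setup.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stage:
    """One of the five components; `trainable` marks where tuning happens."""
    name: str
    fn: Callable[[Any], Any]
    trainable: bool


def build_pipeline() -> List[Stage]:
    # Stub functions stand in for real networks (all hypothetical).
    return [
        Stage("modality_encoder", lambda x: f"enc({x})", trainable=False),
        Stage("input_projector", lambda x: f"proj_in({x})", trainable=True),
        Stage("llm_backbone", lambda x: f"llm({x})", trainable=False),
        Stage("output_projector", lambda x: f"proj_out({x})", trainable=True),
        # Assumed frozen here; the source only states this for encoder and LLM.
        Stage("modality_generator", lambda x: f"gen({x})", trainable=False),
    ]


def run(stages: List[Stage], x: Any) -> Any:
    """Pass the input through every stage in order."""
    for stage in stages:
        x = stage.fn(x)
    return x


stages = build_pipeline()
print(run(stages, "image"))                       # gen(proj_out(llm(proj_in(enc(image)))))
print([s.name for s in stages if s.trainable])    # ['input_projector', 'output_projector']
```

Understanding-only models would stop after the LLM backbone; the last two stages matter only when the system must also generate non-text output.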

Key Findings

Most MM-LLMs add small adapters while keeping the core LLM frozen.

Numbers: Trainable params typically ≈2% (projectors only); PEFT can be <0.1%

Practical Use: You can enable multimodality cheaply: add small input/output projectors or LoRA instead of full finetuning.

Evidence Ref: Sec.2 intro; Sec.2.3 PEFT
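A rough way to sanity-check trainable-parameter shares for your own setup is to count them directly. The sketch below is a back-of-the-envelope calculation, not the survey's method: the backbone size, hidden width, layer count, encoder width, LoRA rank, and the choice of two adapted matrices per layer are all assumed for illustration.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of one dense layer, e.g. a linear input projector."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable: int, frozen: int) -> float:
    """Share of trainable parameters in the combined model."""
    return trainable / (trainable + frozen)


# Assumed, illustrative sizes (not taken from the survey).
FROZEN_LLM = 7_000_000_000   # a 7B-parameter backbone
HIDDEN = 4096                # LLM hidden width
LAYERS = 32                  # transformer layers
VISION_DIM = 1024            # visual encoder output width

projector = linear_params(VISION_DIM, HIDDEN)
# Assumption: LoRA is applied to two square weight matrices per layer.
lora_total = LAYERS * 2 * lora_params(HIDDEN, HIDDEN, rank=8)

print(f"projector params: {projector:,}")
print(f"LoRA params:      {lora_total:,}")
print(f"LoRA fraction:    {trainable_fraction(lora_total, FROZEN_LLM):.4%}")
```

Under these assumed sizes the rank-8 LoRA adds about 4.2M parameters, below 0.1% of the backbone. Heavier projectors (for example, query-transformer modules) would raise the share, so rerun the count with your actual dimensions.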

Top MM-LLMs reach about 80% on VQA v2, but performance varies by task and dataset overlap.

Numbers: VQA v2: top models ≈80.0–80.8% on reported test sets

Practical Use: Expect strong but not perfect visual question answering; check dataset overlap before claiming generalization.

Evidence Ref: Table 2 (VQA v2 rows for LLaVA-1.5, VILA-13B, etc.)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	≈80.0–80.8%	—	—	VQA v2 (reported test/bench)	Table 2 rows for LLaVA-1.5, VILA-13B, +ShareGPT4V	Table 2
Accuracy	MiniGPT-v2: 56.9%	—	—	OKVQA	Table 2 OKVQA row	Table 2

What To Try In 7 Days

Prototype a proof-of-concept MM assistant by freezing an LLM and training a small linear projector on a 10k image-text SFT set

Measure dataset overlap before evaluating model claims on benchmarks to avoid leakage effects

Test higher visual encoder resolution (e.g., 336→448) on a dev set and track compute vs accuracy trade-off
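For the resolution experiment, it helps to estimate in advance how many visual tokens each setting produces, since that drives sequence length and compute. The helper below is a small sketch assuming a ViT-style encoder that splits the image into non-overlapping square patches; the function name and the default patch size of 14 are assumptions, so substitute your encoder's actual value.

```python
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Patch tokens for a square image, ignoring any class or special tokens."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side * side


low = num_visual_tokens(336)    # 24 x 24 patches
high = num_visual_tokens(448)   # 32 x 32 patches
print(low, high)                # 576 1024
print(f"token growth: {high / low:.2f}x")
```

Under these assumptions, moving from 336 to 448 pixels raises the visual token count by roughly 1.78x, which is the cost to weigh against any accuracy gain on the dev set.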

Agent Features

Tool Use

LLM orchestrates external expert tools (VisualChatGPT style)

Frameworks

VisualChatGPT, HuggingGPT, NExT-GPT

Architectures

tool-using (black-box LLM + external experts)

end-to-end any-to-any multimodal

Optimization Features

Token Efficiency

Visual token concatenation

Multi-scale MQ-Former compression

Infra Optimization

Use frozen LLM + small adapters to avoid full retrain compute

Model Optimization

LoRA

System Optimization

Keep modality encoders frozen; only train small projectors

Training Optimization

Interleaved image-text pretraining

SFT

Inference Optimization

Concatenating visual tokens to reduce sequence length (MiniGPT-v2)

Lightweight downsample projectors for mobile
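The token-concatenation idea attributed to MiniGPT-v2 above can be illustrated with plain lists: adjacent visual tokens are merged along the feature dimension, so the sequence gets shorter while each token gets wider. This is a minimal sketch, not MiniGPT-v2's implementation; the function name and the default group size of 4 are assumptions for illustration.

```python
from typing import List


def concat_adjacent_tokens(
    tokens: List[List[float]], group: int = 4
) -> List[List[float]]:
    """Merge every `group` consecutive tokens into one wider token."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by group size")
    merged = []
    for start in range(0, len(tokens), group):
        combined: List[float] = []
        for tok in tokens[start:start + group]:
            combined.extend(tok)  # stack features end to end
        merged.append(combined)
    return merged


# Toy input: 8 tokens with 2 features each.
tokens = [[float(i), float(i) + 0.5] for i in range(8)]
merged = concat_adjacent_tokens(tokens, group=4)
print(len(tokens), "->", len(merged))   # 8 -> 2
print(len(merged[0]))                   # 8
```

Because the merged tokens are wider, the projector that follows must accept the larger input dimension (here, 4x the original width).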

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://mm-llms.github.io (tracking website)

Data URLs

Datasets listed (LAION, COCO, WebLI, WebVid) are publicly referenced

Risks & Boundaries

Limitations

Survey may miss the latest models; authors maintain a live website for updates.

Many benchmarks overlap with training data, so reported scores may overestimate real generalization.
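A first-pass check for the overlap concern is to fingerprint normalized benchmark questions or captions and look them up in the training text. The sketch below is one simple approach under stated assumptions, not a procedure from the survey: exact matching after normalization misses paraphrases and image-level duplicates, the regex drops non-ASCII characters, and all function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def fingerprints(texts: Iterable[str]) -> Set[str]:
    return {fingerprint(t) for t in texts}


def overlap_rate(train_texts: Iterable[str], bench_texts: Iterable[str]) -> float:
    """Fraction of benchmark items whose normalized text appears in training."""
    train_fp = fingerprints(train_texts)
    bench = list(bench_texts)
    if not bench:
        return 0.0
    hits = sum(1 for t in bench if fingerprint(t) in train_fp)
    return hits / len(bench)


train = ["What color is the bus?", "A dog on a beach."]
bench = ["what color is the bus", "How many cats are there?"]
print(overlap_rate(train, bench))   # 0.5
```

A nonzero rate is a signal to report scores on the deduplicated subset as well; a zero rate under exact matching does not rule out leakage.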

When Not To Use

When strict factual grounding is mandatory and retrieval/verification is required

When operating on devices with very tight memory and no support for lightweight adapters

Failure Modes

Modal hallucination: describing objects not present in the input

Bias amplification from skewed multimodal training data

Core Entities

Models

BLIP-2, LLaVA, MiniGPT-4, MiniGPT-5, MiniGPT-v2, InstructBLIP, VILA, LLaVA-1.5, Qwen-VL, NExT-GPT, CoDi-2, Emu, Flamingo, OpenFlamingo, GILL, PaLI-X, PandaGPT

Metrics

Accuracy, benchmark score (aggregate)

Datasets

LAION-5B, COCO, WebLI, M3W (Interleaved), MMC4, Obelics, WebVid, MSRVTT, ALIGN, DataComp

Benchmarks

OKVQA, IconVQA, VQA v2, GQA, MMBench, MM-Vet, QBench, HatefulMemes

Context Entities

Models

GPT-4V, Gemini, PaLM-E, Vicuna, LLaMA-2, Flan-T5, Chinchilla, Qwen

Metrics

Accuracy, MM-perception and cognition scores (MME P/C)

Datasets

LAION-en, CC3M, CC12M, Visual Genome, MSRVTT, TextVQA, DocVQA

Benchmarks

MMBench-Chinese, SEED-Bench, VizWiz
