Survey of 126 multimodal LLMs: architectures, training recipes, benchmarks, and next steps

January 24, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The survey compiles many published results and practical recipes but notes dataset overlap and benchmark leakage; apply findings with dataset-leakage checks and small-scale validation.

Citations: 15

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 45%

Authors

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, Dong Yu

Why It Matters For Business

You can add vision, audio, or other modalities to existing LLMs cheaply by training small projectors or PEFT adapters, unlocking richer user interactions without retraining huge models.

Summary TLDR

This paper is a focused, up-to-date survey of multimodal large language models (MM-LLMs). It defines a simple five-part architecture (modality encoder, input projector, LLM backbone, output projector, modality generator), catalogs 126 recent MM-LLMs, compares 43 models across architectures and datasets, summarizes performance on common vision-language benchmarks (OKVQA, IconVQA, VQA v2, GQA), and distills practical training recipes (higher image resolution, interleaved image-text data, PEFT). The survey highlights open problems: better benchmarks, lightweight/mobile deployment, continual learning, hallucination reduction, and bias evaluation. The authors host a live tracking website: https:/

Problem Statement

How to cheaply and effectively extend text-only LLMs to handle multiple input and output modalities, and how recent MM-LLMs compare in architecture, training data, benchmarks, and practical recipes.

Main Contribution

A unified five-component architecture for MM-LLMs, clarifying where to add lightweight adapters.

A taxonomy and catalog of 126 MM-LLMs with a focused comparison table for 43 mainstream models.
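The five-component layout can be sketched as a chain of callables. This is a minimal illustration, not the survey's code: the class name `MMLLMPipeline` and all `toy_*` functions are hypothetical stand-ins, and the "frozen" versus "trainable" comments reflect the typical setup the survey describes (projectors trained, the rest usually kept frozen).

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class MMLLMPipeline:
    """Toy wiring of the five components named in the survey (hypothetical class)."""
    modality_encoder: Callable[[object], List[Vector]]          # usually frozen
    input_projector: Callable[[List[Vector]], List[Vector]]     # small, trainable
    llm_backbone: Callable[[List[Vector], str], List[Vector]]   # usually frozen
    output_projector: Callable[[List[Vector]], List[Vector]]    # small, trainable
    modality_generator: Callable[[List[Vector]], object]        # usually frozen

    def run(self, raw_input: object, prompt: str) -> object:
        feats = self.modality_encoder(raw_input)    # raw input -> features
        tokens = self.input_projector(feats)        # features -> LLM token space
        hidden = self.llm_backbone(tokens, prompt)  # reasoning over tokens + text
        cond = self.output_projector(hidden)        # LLM space -> generator space
        return self.modality_generator(cond)        # produce the output modality


# Toy stand-ins so the data flow can be executed end to end.
def toy_encoder(image):
    return [[float(p)] for p in image]

def toy_input_projector(feats):
    return [[2.0 * f[0]] for f in feats]

def toy_llm(tokens, prompt):
    return [[t[0] + len(prompt)] for t in tokens]

def toy_output_projector(hidden):
    return [[h[0] / 2.0] for h in hidden]

def toy_generator(cond):
    return [c[0] for c in cond]


pipeline = MMLLMPipeline(toy_encoder, toy_input_projector, toy_llm,
                         toy_output_projector, toy_generator)
result = pipeline.run([1, 2, 3], "hi")
print(result)
```

In a real system only the two projector slots would receive gradient updates; the other three would be pretrained models loaded as-is.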

Key Findings

Most MM-LLMs add small adapters while keeping the core LLM frozen.

Numbers: Trainable params typically ≈2% (projectors only); PEFT can be <0.1%

Practical Use: You can enable multimodality cheaply: add small input/output projectors or LoRA instead of full finetuning.

Evidence Ref: Sec. 2 intro; Sec. 2.3 PEFT
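To see how a PEFT method can stay below 0.1% trainable parameters, the arithmetic can be worked through for LoRA. The sketch below assumes a hypothetical backbone (32 layers, hidden size 4096, about 7 billion frozen parameters) with rank-8 LoRA applied to two square projection matrices per layer; none of these values come from the survey.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable: int, frozen: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)


# All values below are assumptions for illustration, not figures from the survey.
N_LAYERS = 32                    # hypothetical transformer depth
HIDDEN = 4096                    # hypothetical hidden size
BACKBONE_PARAMS = 7_000_000_000  # hypothetical frozen backbone size
RANK = 8                         # LoRA rank
ADAPTED_PER_LAYER = 2            # assumed: two adapted matrices per layer

lora_total = N_LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
lora_fraction = trainable_fraction(lora_total, BACKBONE_PARAMS)

print(f"LoRA trainable parameters: {lora_total:,}")
print(f"Trainable fraction: {lora_fraction:.4%}")
```

Under these assumptions the adapter adds about 4.2 million parameters, well under 0.1% of the total. Passing a projector's parameter count to `trainable_fraction` gives the corresponding projector-only share.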

Top MM-LLMs reach about 80% on VQA v2, but performance varies by task and dataset overlap.

Numbers: VQA v2: top models ≈80.0–80.8% on reported test sets

Practical Use: Expect strong but not perfect visual question answering; check dataset overlap before claiming generalization.

Evidence Ref: Table 2 (VQA v2 rows for LLaVA-1.5, VILA-13B, etc.)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ≈80.0–80.8% | - | - | VQA v2 (reported test/bench) | Table 2 rows for LLaVA-1.5, VILA-13B, +ShareGPT4V | Table 2
Accuracy | MiniGPT-v2: 56.9% | - | - | OKVQA | Table 2 OKVQA row | Table 2

What To Try In 7 Days

Prototype a proof-of-concept MM assistant by freezing an LLM and training a small linear projector on a 10k image-text SFT set

Measure dataset overlap before evaluating model claims on benchmarks to avoid leakage effects

Test higher visual encoder resolution (e.g., 336→448) on a dev set and track compute vs accuracy trade-off
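For the resolution experiment, it helps to estimate up front how many visual tokens each setting produces. The sketch below assumes a ViT-style encoder that cuts the image into square patches of 14 pixels and emits one token per patch; the patch size and the function name `visual_tokens` are assumptions, so substitute the values for your encoder.

```python
def visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (assumes one token per patch)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side * side


for res in (336, 448):
    print(f"{res}px -> {visual_tokens(res)} visual tokens")

# Relative growth in visual sequence length when moving from 336 to 448.
growth = visual_tokens(448) / visual_tokens(336)
print(f"token growth: {growth:.2f}x")
```

Under this assumption the step from 336 to 448 raises the visual token count from 576 to 1024, so the accuracy gain on the dev set should be weighed against roughly 1.8 times more visual tokens per image.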

Agent Features

Tool Use
LLM orchestrates external expert tools (VisualChatGPT style)
Frameworks
VisualChatGPT, HuggingGPT, NExT-GPT
Architectures
tool-using (black-box LLM + external experts); end-to-end any-to-any multimodal

Optimization Features

Token Efficiency
Visual token concatenation; multi-scale MQ-Former compression
Infra Optimization
Use frozen LLM + small adapters to avoid full retrain compute
Model Optimization
LoRA
System Optimization
Keep modality encoders frozen; only train small projectors
Training Optimization
Interleaved image-text pretraining; SFT
Inference Optimization
Concatenating visual tokens to reduce sequence length (MiniGPT-v2); lightweight downsample projectors for mobile
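The token-concatenation idea listed above can be illustrated with plain lists: adjacent visual tokens are merged into one wider token, so the sequence gets shorter while each token gets wider. This is a minimal sketch; the function name `concat_adjacent_tokens` and the group size of 4 are illustrative assumptions, and a real model would follow the merge with a projection into the LLM's embedding width.

```python
from typing import List


def concat_adjacent_tokens(tokens: List[List[float]], group: int = 4) -> List[List[float]]:
    """Merge every `group` neighbouring tokens into one wider token."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    merged = []
    for start in range(0, len(tokens), group):
        wide: List[float] = []
        for tok in tokens[start:start + group]:
            wide.extend(tok)  # widths add up; order is preserved
        merged.append(wide)
    return merged


# 8 toy tokens of width 2 become 2 tokens of width 8.
tokens = [[float(i), float(i) + 0.5] for i in range(8)]
merged = concat_adjacent_tokens(tokens, group=4)
print(len(tokens), "->", len(merged), "tokens; width", len(merged[0]))
```

The sequence length seen by the LLM drops by the group factor, which is where the inference saving comes from.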

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Datasets listed (LAION, COCO, WebLI, WebVid) are publicly referenced

Risks & Boundaries

Limitations

Survey may miss the latest models; authors maintain a live website for updates.

Many benchmarks overlap with training data, so reported scores may overestimate real generalization.
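A first-pass overlap check can be done by fingerprinting normalized text on both sides and counting matches. The sketch below is one simple approach, not a method from the survey: the function names and example questions are hypothetical, and it only catches matches that are exact after normalization. Image-level leakage would need a separate check, for example on image identifiers.

```python
import hashlib
import re
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def overlap_rate(train_texts: Iterable[str], bench_texts: Iterable[str]) -> float:
    """Fraction of benchmark items whose normalized text also occurs in training data."""
    train_hashes: Set[str] = {fingerprint(t) for t in train_texts}
    bench = list(bench_texts)
    if not bench:
        return 0.0
    hits = sum(1 for t in bench if fingerprint(t) in train_hashes)
    return hits / len(bench)


# Hypothetical example data.
train = ["What color is the bus?", "How many dogs are there?"]
bench = [
    "what color is the bus",
    "Is the man wearing a hat?",
    "How many dogs are there ?",
    "What is on the table?",
]
rate = overlap_rate(train, bench)
print(f"benchmark items also found in training data: {rate:.0%}")
```

A non-trivial rate on a benchmark is a signal to report scores on the non-overlapping subset as well.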

When Not To Use

When strict factual grounding is mandatory and retrieval/verification is required

When operating on devices with very tight memory and no support for lightweight adapters

Failure Modes

Modal hallucination: describing objects not present in the input

Bias amplification from skewed multimodal training data

Core Entities

Models

BLIP-2, LLaVA, MiniGPT-4, MiniGPT-5, MiniGPT-v2, InstructBLIP, VILA, LLaVA-1.5, Qwen-VL, NExT-GPT, CoDi-2, Emu, Flamingo, OpenFlamingo, GILL, PaLI-X, PandaGPT

Metrics

Accuracy, benchmark score (aggregate)

Datasets

LAION-5B, COCO, WebLI, M3W (Interleaved), MMC4, Obelics, WebVid, MSRVTT, ALIGN, DataComp

Benchmarks

OKVQA, IconVQA, VQA v2, GQA, MMBench, MM-Vet, QBench, HatefulMemes

Context Entities

Models

GPT-4V, Gemini, PaLM-E, Vicuna, LLaMA-2, Flan-T5, Chinchilla, Qwen

Metrics

Accuracy, MM-perception and cognition scores (MME P/C)

Datasets

LAION-en, CC3M, CC12M, Visual Genome, MSRVTT, TextVQA, DocVQA

Benchmarks

MMBench-Chinese, SEED-Bench, VizWiz