Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
Better data curation reduces compute and improves multimodal model reliability; selective filtering and high-quality instruction data can cut costs while keeping most performance.
Summary TLDR
This paper surveys multimodal large language models (MLLMs) with a focus on the data side: where multimodal data comes from, how to filter, deduplicate, and augment it, how to mix modalities and domains during pre-training, and how to build and select instruction-tuning and RLHF datasets. It compiles common datasets, concrete processing methods, evaluation metrics, and open problems such as the lack of multimodal data-quality metrics, the absence of practical end-to-end data pipelines, and unclear scaling laws for multimodal mixtures.
Problem Statement
MLLM progress is often attributed to model and architecture work, but data collection, cleaning, mixing, and alignment across modalities are equally decisive. Practitioners lack a consolidated view of multimodal data pipelines, selection rules, and evaluation metrics tailored to MLLMs.
Main Contribution
A unified, data-centric taxonomy and pipeline for MLLMs covering collection, processing, pre-training mixing, and adaptation.
A practical catalog of data sources, filtering/deduplication/augmentation methods, and common multimodal datasets.
A review of data-driven selection methods and supervised / RL human-alignment data pipelines.
A summary of dataset evaluation metrics and benchmarks, plus concrete open problems and research directions.
Key Findings
Mixing image-caption, interleaved image-text, and text-only data at a 5:5:1 ratio gave the best overall vision-language pre-training results in a referenced study.
Large crawled corpora contain extreme duplicates (e.g., C4 had a 61-word sentence repeated >600k times).
Carefully selected small coresets can nearly match full-data fine-tuning: a coreset of 0.5% of the data gave only a 1–2% performance drop on a cited task.
Ranking image-text pairs by CLIP score and keeping the top 30% can substantially improve results on large-scale datasets.
High-resolution visual inputs improve detail-sensitive tasks: recent models increased input resolution from ~224px to 336–896px with gains in fine recognition.
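The 5:5:1 mixing finding above can be illustrated with a small sampling sketch. This is only a minimal illustration, assuming that the ratio is applied as per-example sampling probabilities over three data sources; the source names, the `mixture_weights` and `sample_sources` helpers, and the sample size are hypothetical and do not come from the referenced study.

```python
import random
from collections import Counter

# Ratio from the finding above: image-caption : interleaved : text-only = 5:5:1.
# The dictionary keys are hypothetical labels chosen for this sketch.
MIX_RATIO = {"image_caption": 5, "interleaved": 5, "text_only": 1}

def mixture_weights(ratio):
    """Normalize integer ratios into sampling probabilities that sum to 1."""
    total = sum(ratio.values())
    return {name: share / total for name, share in ratio.items()}

def sample_sources(ratio, n, seed=0):
    """Draw n source names with probability proportional to the ratio."""
    rng = random.Random(seed)
    names = list(ratio)
    return rng.choices(names, weights=[ratio[k] for k in names], k=n)

weights = mixture_weights(MIX_RATIO)
counts = Counter(sample_sources(MIX_RATIO, n=11_000))
print(weights)
print(counts)
```

In a real pipeline each drawn source name would select which dataset the next training example comes from; whether the ratio should count examples or tokens is a design choice this sketch leaves open.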
Results
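Image/Interleaved/Text mix: 5:5:1
Duplicate example frequency: one 61-word sentence repeated >600k times in C4
Coreset selection efficiency: 0.5% of the data with only a 1–2% performance drop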
Who Should Care
What To Try In 7 Days
Run simple CLIP-score filtering on your image-text pool and inspect top 30% vs bottom 30%
Deduplicate your text and image sets (exact string/hash filter) and measure storage savings
Collect 1–5k high-quality multimodal instruction pairs (detailed captions) and run a small SFT test
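The CLIP-score inspection step in the first item above can be sketched as follows. This is a minimal sketch, assuming CLIP similarity scores have already been computed for each image-text pair by whatever model you use; the `split_by_score` helper, the record fields, and the toy scores are hypothetical.

```python
def split_by_score(pairs, fraction=0.3):
    """Rank pairs by score and return (top, bottom) slices of the given fraction."""
    ranked = sorted(pairs, key=lambda p: p["clip_score"], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

# Toy pool with precomputed scores (illustrative values, not real measurements).
scores = [0.31, 0.12, 0.27, 0.05, 0.22, 0.35, 0.18, 0.09, 0.29, 0.15]
pool = [
    {"id": f"img{i}", "caption": f"caption {i}", "clip_score": s}
    for i, s in enumerate(scores)
]

top, bottom = split_by_score(pool, fraction=0.3)
print("top:", [p["id"] for p in top])
print("bottom:", [p["id"] for p in bottom])
```

Reading a few dozen examples from each slice side by side is usually enough to judge whether the score separates good captions from noise on your data before committing to a threshold.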
Agent Features
Tool Use
- LLMs (GPT-4/ChatGPT) used to rewrite captions and score/select data
Frameworks
- CiT
- DataComp
- DoReMi
Architectures
- modality encoder + projector + LLM backbone
Optimization Features
Infra Optimization
- data deduplication reduces wasted compute and storage
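As a rough illustration of the deduplication point, exact-match deduplication can be sketched with a content hash. This is a minimal sketch, assuming that lowercasing and whitespace collapsing are an acceptable normalization for your corpus; the `normalize` and `dedup_exact` helpers and the toy corpus are hypothetical, and this approach does not catch semantic or near-duplicates.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())

def dedup_exact(texts):
    """Keep the first occurrence of each normalized text.

    Returns the kept items and the number of UTF-8 bytes dropped as duplicates.
    """
    seen = set()
    kept = []
    saved_bytes = 0
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            saved_bytes += len(text.encode("utf-8"))
        else:
            seen.add(digest)
            kept.append(text)
    return kept, saved_bytes

corpus = ["A cat on a mat.", "a cat  on a mat.", "A dog in the park.", "A cat on a mat."]
kept, saved = dedup_exact(corpus)
print(kept, saved)
```

The same pattern applies to image files by hashing the raw bytes instead of normalized text, which gives a quick estimate of storage savings before trying fuzzier methods.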
System Optimization
- use proxy models for data-mix tuning to reduce compute
Training Optimization
- domain mixture regression / proxy-model optimization
- modality mixture tuning (image/video/text ratios)
- selective freezing of components to save compute
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- No primary experiments: conclusions synthesize prior work rather than new empirical tests
- Many recommendations depend on external studies with different setups
- The field lacks a single agreed-upon multimodal data-quality metric and an end-to-end MLLM data pipeline implementation
When Not To Use
- If you need new model architectures or core algorithmic innovations (paper focuses on data)
- If your application is single-modality text-only and doesn't need multimodal data guidance
Failure Modes
- Overfiltering can remove valuable long-tail or domain-specific examples
- Poor deduplication settings may miss semantic duplicates or remove legitimate near-duplicates
- Using LLMs as judges can bake in judge bias and reduce interpretability
- Imbalanced modality mixtures can bias the model toward static or temporal features
Core Entities
Models
- GPT-4
- Flamingo
- LLaVA
- BLIP-2
- Cambrian-1
- X-InstructBLIP
- Vicuna
- LLaMA2
- ViT
- CLIP-ViT
- Qwen-VL
- OtterHD
- Monkey
- MiniGPT-4
Metrics
- CLIP score
- Vendi score
- MAUVE
- Wasserstein distance
- CORAL
- MMD
- CHAIR (object-hallucination)
- FaithScore
- TRUE (factual consistency)
Datasets
- LAION-5B
- CC3M
- CC12M
- COCO
- Wukong
- WebVid
- Panda-70M
- WavCaps
- MSR-VTT
- ScanNet
- MIMIC-CXR
- TextVQA
Benchmarks
- VQAv2
- GQA
- TextVQA
- MS-COCO
- MMBench
- MVBench
- MME
- VSR
- RefCOCO
Context Entities
Models
- BLiVA
- LLaVA-1.5
- LLaVA-UHD
- UReader
- OtterHD
Metrics
- precision/recall for generative models
- Task2Vec diversity coefficient
- perplexity (for selection)
- EL2N (influence)
Datasets
- DataComp/COMMONPOOL
- RedPajama
- The Pile
- Books3
- LAION-400M
- COYO-700M
Benchmarks
- Nocaps
- MSR-VTT
- VATEX
- ActivityNet-QA
- DocVQA

