Overview
This is a comprehensive, literature-based survey; it compiles empirical findings from many sources but does not present new experiments, so its practical guidance is strong but relies on cited work.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Better data curation reduces compute cost and improves multimodal model reliability; selective filtering and high-quality instruction data can cut costs while preserving most of the performance.
Summary TLDR
This paper surveys multimodal large language models (MLLMs) with a focus on the data side: where multimodal data comes from, how to filter, deduplicate, and augment it, how to mix modalities and domains during pre-training, and how to build and select instruction-tuning and RLHF datasets. It compiles common datasets, concrete processing methods, evaluation metrics, and open problems such as missing multimodal data metrics, immature practical data pipelines, and unclear scaling laws for multimodal mixtures.
Problem Statement
MLLM progress is often driven by model and architecture work, but data (collection, cleaning, mixing, and alignment across modalities) is equally decisive. Practitioners lack a consolidated view of multimodal data pipelines, selection rules, and evaluation metrics tailored to MLLMs.
Main Contribution
A unified, data-centric taxonomy and pipeline for MLLMs covering collection, processing, pre-training mixing, and adaptation.
A practical catalog of data sources, filtering/deduplication/augmentation methods, and common multimodal datasets.
Key Findings
Mixing image-caption, interleaved image-text, and text-only data at a 5:5:1 ratio gave the best overall vision-language pretraining results in the MM1 study cited by the survey (Section 4.2).
Large crawled corpora contain extreme duplicates (e.g., C4 had a 61-word sentence repeated >600k times).
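The 5:5:1 mixture can be made concrete as sampling probabilities. The sketch below is illustrative only: it assumes the ratio is applied per example (the summary does not say whether MM1 mixes by example, token, or batch), and the source names and helper functions are hypothetical.

```python
import random
from collections import Counter

# Ratio reported for MM1: image-caption : interleaved image-text : text-only.
# The dictionary keys are hypothetical labels chosen for this sketch.
MIX_RATIO = {"image_caption": 5, "interleaved": 5, "text_only": 1}


def mixture_weights(ratio):
    """Normalize integer ratio parts into sampling probabilities."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n source names, each chosen with probability proportional to its ratio part."""
    rng = random.Random(seed)
    names = list(ratio)
    parts = [ratio[name] for name in names]
    return rng.choices(names, weights=parts, k=n)


weights = mixture_weights(MIX_RATIO)                    # 5/11, 5/11, 1/11
counts = Counter(sample_sources(MIX_RATIO, n=11_000))   # empirical draw counts
```

Under this per-example assumption, text-only data makes up about 9% of the draws (1/11), with the two image-bearing sources at roughly 45% each.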
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Image-caption : interleaved : text-only mix | best at 5:5:1 | — | — | vision-language pretraining (MM1 study) | MM1 found that a 5:5:1 ratio gave the best overall results | Section 4.2 |
| Duplicate example frequency | 61-word sentence repeated >600k times | — | — | C4 web crawl | C4 dataset contained extreme repetition leading to memorization risks | Section 3.2.2 |
What To Try In 7 Days
Run simple CLIP-score filtering on your image-text pool and compare the top-scoring 30% of pairs against the bottom-scoring 30%
Deduplicate your text and image sets (exact string/hash filter) and measure storage savings
Collect 1–5k high-quality multimodal instruction pairs (detailed captions) and run a small SFT test
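The first two steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the image-text similarity scores are assumed to be precomputed elsewhere (no CLIP model is loaded here), and deduplication is exact-match only, using SHA-256 digests of the raw text.

```python
import hashlib


def exact_dedup(texts):
    """Keep the first occurrence of each exact string, compared by SHA-256 digest."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def fraction_bytes_saved(original, deduped):
    """Fraction of UTF-8 bytes removed by deduplication (0.0 for empty input)."""
    def size(items):
        return sum(len(t.encode("utf-8")) for t in items)
    before = size(original)
    return 0.0 if before == 0 else 1 - size(deduped) / before


def top_and_bottom(scored, fraction=0.3):
    """Split (item_id, score) pairs into the highest- and lowest-scoring slices.

    Scores are assumed to be precomputed image-text similarities.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    k = int(len(ranked) * fraction)
    return ranked[:k], ranked[len(ranked) - k:]


captions = ["a dog on a beach", "a cat", "a dog on a beach"]
unique_captions = exact_dedup(captions)
saved = fraction_bytes_saved(captions, unique_captions)

# Hypothetical scores for ten pairs; real values would come from a CLIP-style model.
scored_pairs = [(i, i / 10) for i in range(10)]
top, bottom = top_and_bottom(scored_pairs)
```

Inspecting `top` and `bottom` by hand before choosing a cutoff is the point of the exercise: it shows what a given threshold would discard. For images, the same hashing idea applies to the raw file bytes, though it only catches byte-identical copies.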
Risks & Boundaries
Limitations
No primary experiments: conclusions synthesize prior work rather than report new empirical tests
Many recommendations depend on external studies with different setups
When Not To Use
If you need new model architectures or core algorithmic innovations (the paper focuses on data)
If your application is text-only and doesn't need multimodal data guidance
Failure Modes
Overfiltering can remove valuable long-tail or domain-specific examples
Poor deduplication settings may miss semantic duplicates or remove legitimate near-duplicates
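The deduplication trade-off can be seen with a small lexical example. The sketch below uses Jaccard similarity over word 3-grams, which is one common near-duplicate heuristic and not necessarily what the survey recommends; the function names, sample texts, and thresholds are all hypothetical. It does not detect semantic duplicates such as paraphrases, which would typically need embedding-based comparison.

```python
def shingles(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(texts, threshold, n=3):
    """Return index pairs whose shingle sets have Jaccard similarity >= threshold."""
    sets = [shingles(t, n) for t in texts]
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


docs = [
    "a brown dog runs on the beach at sunset",
    "a brown dog runs on the beach at dawn",
    "a chest x-ray showing a small nodule in the left lung",
    "a chest x-ray showing a small nodule in the right lung",
    "a red car parked outside",
]

strict = near_duplicate_pairs(docs, threshold=0.9)    # flags nothing
medium = near_duplicate_pairs(docs, threshold=0.7)    # flags the two dog captions
loose = near_duplicate_pairs(docs, threshold=0.6)     # also flags the two x-ray captions
```

At the loosest setting the left-lung and right-lung captions are treated as duplicates even though they describe different findings, which is the kind of legitimate near-duplicate that domain-specific data tends to contain. At the strictest setting nothing is caught at all.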

