Survey: how data choices shape multimodal LLMs — pipelines, filters, and open gaps

May 26, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

This is a comprehensive, literature-based survey. It compiles empirical findings from many sources but presents no new experiments, so its practical guidance, while strong, depends on the cited work.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang

Links

Abstract / PDF

Why It Matters For Business

Better data curation reduces compute and improves multimodal model reliability; selective filtering and high-quality instruction data can cut costs while keeping most performance.

Who Should Care

Summary TLDR

This paper surveys multimodal large language models (MLLMs) with a focus on the data side: where multimodal data comes from, how to filter, deduplicate, and augment it, how to mix modalities and domains during pre-training, and how to build and select instruction-tuning and RLHF datasets. It compiles common datasets, concrete processing methods, evaluation metrics, and open problems such as missing multimodal data metrics, a lack of practical data pipelines, and unclear scaling laws for multimodal mixtures.

Problem Statement

MLLM progress is often driven by model and architecture work, but data — collection, cleaning, mixing, and alignment across modalities — is equally decisive. Practitioners lack a consolidated view of multimodal data pipelines, selection rules, and evaluation metrics tailored to MLLMs.

Main Contribution

A unified, data-centric taxonomy and pipeline for MLLMs covering collection, processing, pre-training mixing, and adaptation.

A practical catalog of data sources, filtering/deduplication/augmentation methods, and common multimodal datasets.

Key Findings

Mixing image-caption, interleaved image-text, and text-only data at a 5:5:1 ratio gave the best overall vision-language pretraining results in a referenced study (MM1).

Numbers: ratio 5:5:1 reported by MM1

Practical Use: When pretraining vision-language MLLMs, start with roughly equal amounts of image-caption and interleaved image-text data plus a smaller text-only corpus; tune from that baseline.

Evidence Ref: MM1 / Section 4.2
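As a rough illustration of treating 5:5:1 as a starting baseline, the sketch below turns the ratio into sampling probabilities and draws a data source for each training example. The source names and both helper functions are hypothetical and are not taken from MM1 or the survey.

```python
import random
from collections import Counter

# Starting ratio from the finding above (image-caption : interleaved : text-only).
# The source names are hypothetical labels, not identifiers from MM1.
MIX_RATIO = {"image_caption": 5, "interleaved": 5, "text_only": 1}


def mixture_weights(ratio):
    """Normalize integer ratio parts into sampling probabilities."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n source names with probability proportional to the ratio."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    print(mixture_weights(MIX_RATIO))
    print(Counter(sample_sources(MIX_RATIO, 11_000)))
```

Keeping the ratio in one dictionary makes it easy to change a single part (for example, the text-only share) when tuning away from the baseline.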

Large crawled corpora contain extreme duplicates (e.g., C4 had a 61-word sentence repeated >600k times).

Numbers: 61-word sentence repeated >600k times (C4)

Practical Use: Run deduplication early. Remove exact/near duplicates to reduce memorization risk and training waste.

Evidence Ref: Section 3.2.2, cites C4
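A minimal sketch of the exact-duplicate case, assuming documents fit in memory as Python strings. It hashes a lightly normalized form of each document and keeps the first occurrence. The normalization rule (lowercasing and collapsing whitespace) is an assumption for illustration, and near-duplicates are not handled here.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedup_exact(docs):
    """Keep the first occurrence of each normalized document, preserving order."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def bytes_saved(docs, kept):
    """Rough storage saving, in UTF-8 bytes, from dropping duplicates."""
    def size(items):
        return sum(len(d.encode("utf-8")) for d in items)
    return size(docs) - size(kept)


if __name__ == "__main__":
    corpus = ["A cat on a mat.", "a cat  on a mat.", "A dog in the fog."]
    unique = dedup_exact(corpus)
    print(len(corpus), "->", len(unique), "docs;",
          bytes_saved(corpus, unique), "bytes saved")
```

The same pattern applies to images if the file bytes are hashed instead of normalized text, although that only catches byte-identical copies.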

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Image/Interleaved/Text mix | best at 5:5:1 | - | - | vision-language pretraining (MM1 study) | MM1 found 5:5:1 ratio gave best overall results | Section 4.2
Duplicate example frequency | 61-word sentence repeated >600k times | - | - | C4 web crawl | C4 dataset contained extreme repetition leading to memorization risks | Section 3.2.2

What To Try In 7 Days

Run simple CLIP-score filtering on your image-text pool and inspect top 30% vs bottom 30%

Deduplicate your text and image sets (exact string/hash filter) and measure storage savings

Collect 1–5k high-quality multimodal instruction pairs (detailed captions) and run a small SFT test
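For the first item above, the inspection step can be sketched as follows, assuming a CLIP similarity score has already been computed for each image-text pair with whatever CLIP implementation is in use. The pair IDs and scores in the demo are made up, and `top_and_bottom` is a hypothetical helper.

```python
def top_and_bottom(scored, fraction=0.3):
    """Return (top, bottom) slices of (id, score) items, ranked highest first."""
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    # Made-up (pair_id, clip_score) values purely for demonstration.
    pairs = [(f"img_{i}", i / 10) for i in range(10)]
    top, bottom = top_and_bottom(pairs)
    print("top 30%:   ", [pair_id for pair_id, _ in top])
    print("bottom 30%:", [pair_id for pair_id, _ in bottom])
```

Reviewing both slices by hand before choosing a cutoff helps show whether low scores correspond to noise or to valuable long-tail examples.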

Agent Features

Tool Use
LLMs (GPT-4/ChatGPT) used to rewrite captions and score/select data
Frameworks
CiT, DataComp, DoReMi
Architectures
modality encoder + projector + LLM backbone

Optimization Features

Infra Optimization
data deduplication reduces wasted compute and storage
System Optimization
use proxy models for data-mix tuning to reduce compute
Training Optimization
domain mixture regression / proxy-model optimization; modality mixture tuning (image/video/text ratios); selective freezing of components to save compute

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No primary experiments — conclusions synthesize prior work rather than new empirical tests

Many recommendations depend on external studies with different setups

When Not To Use

If you need new model architectures or core algorithmic innovations (paper focuses on data)

If your application is single-modality text-only and doesn't need multimodal data guidance

Failure Modes

Overfiltering can remove valuable long-tail or domain-specific examples

Poor deduplication settings may miss semantic duplicates or remove legitimate near-duplicates
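To make the deduplication failure mode concrete, the sketch below uses Jaccard similarity over word 3-shingles, one common near-duplicate measure and not necessarily the one used in the works the survey cites. The thresholds and example sentences are illustrative assumptions.

```python
def shingles(text, n=3):
    """Set of n-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold):
    """Flag a pair as near-duplicate when similarity reaches the threshold."""
    return jaccard(a, b) >= threshold


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    edited = "the quick brown fox jumps over the sleepy dog"
    paraphrase = "a fast brown fox leaps above a sleepy hound"
    for threshold in (0.5, 0.8):
        print(threshold, is_near_duplicate(original, edited, threshold))
    # A paraphrase shares no 3-word shingles, so surface matching misses it.
    print(jaccard(original, paraphrase))
```

A loose threshold flags the one-word edit as a duplicate, a strict one keeps it, and the paraphrase scores zero either way, which is why surface-level settings can both over-remove and miss semantic duplicates.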

Core Entities

Models

GPT-4, Flamingo, LLaVA, BLIP2, Cambrian-1, X-InstructBLIP, Vicuna, LLaMA2, ViT, CLIP-ViT, Qwen-VL, OtterHD, Monkey, MiniGPT-4

Metrics

CLIP score, Vendi score, MAUVE, Wasserstein distance, CORAL, MMD, CHAIR (object-hallucination), FAITH SCORE, TRUE (factual consistency)

Datasets

LAION-5B, CC3M, CC12M, COCO, Wukong, WebVid, Panda-70M, WavCaps, MSRVTT, ScanNet, MIMIC-CXR, TextVQA

Benchmarks

VQAv2, GQA, TextVQA, MS-COCO, MMBench, MVBench, MME, VSR, RefCOCO

Context Entities

Models

BLiVA, LLaVA-1.5, LLAVA UHD, Ureader, OtterHD

Metrics

precision/recall for generative models, Task2Vec diversity coefficient, perplexity (for selection), EL2U (influence)

Datasets

DataComp/COMMONPOOL, RedPajama, The Pile, Books3, LAION-400M, COYO-700M

Benchmarks

Nocaps, MSRVTT, VATEX, ActivityNet-QA, DocVQA