Survey: how data choices shape multimodal LLMs — pipelines, filters, and open gaps

May 26, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

4

Authors

Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang

Links

Abstract / PDF

Why It Matters For Business

Better data curation reduces compute and improves multimodal model reliability; selective filtering and high-quality instruction data can cut costs while keeping most performance.

Summary TLDR

This paper surveys multimodal large language models (MLLMs) with a focus on the data side: where multimodal data comes from, how to filter, deduplicate, and augment it, how to mix modalities and domains during pre-training, and how to build and select instruction-tuning and RLHF datasets. It compiles common datasets, concrete processing methods, evaluation metrics, and open problems such as missing multimodal data metrics, practical data pipelines, and unclear scaling laws for multimodal mixtures.

Problem Statement

MLLM progress is often driven by model and architecture work, but data — collection, cleaning, mixing, and alignment across modalities — is equally decisive. Practitioners lack a consolidated view of multimodal data pipelines, selection rules, and evaluation metrics tailored to MLLMs.

Main Contribution

A unified, data-centric taxonomy and pipeline for MLLMs covering collection, processing, pre-training mixing, and adaptation.

A practical catalog of data sources, filtering/deduplication/augmentation methods, and common multimodal datasets.

A review of data-driven selection methods and supervised / RL human-alignment data pipelines.

A summary of dataset evaluation metrics and benchmarks, plus concrete open problems and research directions.

Key Findings

Mixing image-caption, interleaved image-text, and text-only data at a 5:5:1 ratio gave the best overall vision-language pre-training results in a referenced study.

Numbers: ratio 5:5:1 reported by MM1
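One way to apply such a ratio is to turn it into per-source sampling probabilities. The following is a minimal sketch under stated assumptions: the source names and helper functions are hypothetical, and the summary does not say whether the 5:5:1 ratio is measured in examples, tokens, or batches, so the sketch simply samples per example.

```python
import random
from collections import Counter

# Ratio from the text: caption : interleaved : text-only = 5 : 5 : 1.
# The key names are illustrative, not taken from MM1.
MIX_RATIO = {"caption": 5, "interleaved": 5, "text_only": 1}


def mix_probabilities(ratio):
    """Normalize integer ratio parts into sampling probabilities."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n source names, each chosen in proportion to its ratio part."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


probs = mix_probabilities(MIX_RATIO)
counts = Counter(sample_sources(MIX_RATIO, n=11_000))
```

Here `probs` assigns 5/11 to each image-bearing source and 1/11 to text-only data, and `counts` shows how many of the 11,000 draws landed on each source.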

Large crawled corpora contain extreme duplicates (e.g., C4 had a 61-word sentence repeated >600k times).

Numbers: 61-word sentence repeated >600k times (C4)
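Verbatim repeats like this can be caught with an exact-match hash filter. The sketch below is illustrative only: the normalization rule (lowercasing and collapsing whitespace) and the function names are assumptions, and it will not catch paraphrases or other semantic near-duplicates.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedup_exact(records):
    """Keep the first occurrence of each normalized string.

    Returns (kept_records, number_removed).
    """
    seen = set()
    kept = []
    for text in records:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept, len(records) - len(kept)


captions = ["A dog on a beach.", "a dog  on a beach.", "Two cats sleeping."]
kept, removed = dedup_exact(captions)
```

The `removed` count gives a quick estimate of how much of a pool is exact repetition before any heavier near-duplicate method is tried.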

Carefully selected small coresets can nearly match full-data fine-tuning: training on a 0.5% core subset cost only a 1–2% performance drop on a cited task.

Numbers: 0.5% core → 1–2% lower performance (Chen et al.)

Ranking image-text pairs by CLIP score and keeping the top 30% can substantially improve results on large-scale datasets.

Numbers: top 30% by CLIP score improves results (DataComp reference)
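Once each image-text pair has a similarity score, the top-30% filter reduces to a rank-and-cut step. This sketch assumes the scores were already computed elsewhere (for example with a CLIP model). The function name, the toy scores, and the rounding rule for the cut-off are illustrative choices, not details from the referenced work.

```python
def keep_top_fraction(pairs, scores, fraction=0.3):
    """Return the pairs whose scores rank in the top `fraction`, best first."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one pair; rounding of the cut-off is an arbitrary choice.
    n_keep = max(1, round(len(pairs) * fraction))
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in order[:n_keep]]


# Toy data: ten hypothetical pairs with made-up similarity scores.
pairs = [f"pair_{i}" for i in range(10)]
scores = [0.31, 0.12, 0.27, 0.05, 0.22, 0.35, 0.18, 0.09, 0.29, 0.15]
top = keep_top_fraction(pairs, scores)
```

Calling the same function on negated scores returns the bottom slice, which makes it easy to inspect what the filter would discard.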

High-resolution visual inputs improve detail-sensitive tasks: recent models increased input resolution from ~224 px to 336–896 px, with gains in fine-grained recognition.

Numbers: resolutions cited: 224 → 336/448/896 px (multiple models)
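Higher resolution also raises the number of visual tokens the model must process. As a rough illustration, the sketch below counts patch tokens for a plain ViT-style encoder. The 14-pixel patch size is an assumption (it matches CLIP ViT-L/14), and models that tile images or compress tokens with a resampler will produce different counts.

```python
def num_patch_tokens(resolution_px, patch_px=14):
    """Patch tokens for a square image split into non-overlapping square patches."""
    if resolution_px % patch_px != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution_px // patch_px
    return side * side


# Resolutions cited in the text; patch size of 14 is an assumption.
token_counts = {res: num_patch_tokens(res) for res in (224, 336, 448, 896)}
```

Under these assumptions, token count grows with the square of the resolution, so moving from 224 px to 896 px multiplies the visual tokens by 16.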

Results

Image/Interleaved/Text mix

Value: best at 5:5:1

Duplicate example frequency

Value: 61-word sentence repeated >600k times

Coreset selection efficiency

Value: 0.5% data → 1–2% lower performance

Baseline: full dataset

Who Should Care

What To Try In 7 Days

Run simple CLIP-score filtering on your image-text pool and inspect top 30% vs bottom 30%

Deduplicate your text and image sets (exact string/hash filter) and measure storage savings

Collect 1–5k high-quality multimodal instruction pairs (detailed captions) and run a small SFT test

Agent Features

Tool Use

  • LLMs (GPT-4/ChatGPT) used to rewrite captions and score/select data

Frameworks

  • CiT
  • DataComp
  • DoReMi

Architectures

  • modality encoder + projector + LLM backbone

Optimization Features

Infra Optimization

  • data deduplication reduces wasted compute and storage

System Optimization

  • use proxy models for data-mix tuning to reduce compute

Training Optimization

  • domain mixture regression / proxy-model optimization
  • modality mixture tuning (image/video/text ratios)
  • selective freezing of components to save compute

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No primary experiments — conclusions synthesize prior work rather than new empirical tests
  • Many recommendations depend on external studies with different setups
  • Lacks a single agreed-upon multimodal data-quality metric and an end-to-end MLLM data pipeline implementation

When Not To Use

  • If you need new model architectures or core algorithmic innovations (paper focuses on data)
  • If your application is single-modality text-only and doesn't need multimodal data guidance

Failure Modes

  • Overfiltering can remove valuable long-tail or domain-specific examples
  • Poor deduplication settings may miss semantic duplicates or remove legitimate near-duplicates
  • Using LLMs as judges can bake in judge bias and reduce interpretability
  • Imbalanced modality mixtures can bias the model toward static or temporal features

Core Entities

Models

  • GPT-4
  • Flamingo
  • LLaVA
  • BLIP-2
  • Cambrian-1
  • X-InstructBLIP
  • Vicuna
  • LLaMA2
  • ViT
  • CLIP-ViT
  • Qwen-VL
  • OtterHD
  • Monkey
  • MiniGPT-4

Metrics

  • CLIP score
  • Vendi score
  • MAUVE
  • Wasserstein distance
  • CORAL
  • MMD
  • CHAIR (object-hallucination)
  • FaithScore
  • TRUE (factual consistency)

Datasets

  • LAION-5B
  • CC3M
  • CC12M
  • COCO
  • Wukong
  • WebVid
  • Panda-70M
  • WavCaps
  • MSR-VTT
  • ScanNet
  • MIMIC-CXR
  • TextVQA

Benchmarks

  • VQAv2
  • GQA
  • TextVQA
  • MS-COCO
  • MMBench
  • MVBench
  • MME
  • VSR
  • RefCOCO

Context Entities

Models

  • BLIVA
  • LLaVA-1.5
  • LLaVA-UHD
  • UReader
  • OtterHD

Metrics

  • precision/recall for generative models
  • Task2Vec diversity coefficient
  • perplexity (for selection)
  • EL2U (influence)

Datasets

  • DataComp/COMMONPOOL
  • RedPajama
  • The Pile
  • Books3
  • LAION-400M
  • COYO-700M

Benchmarks

  • Nocaps
  • MSR-VTT
  • VATEX
  • ActivityNet-QA
  • DocVQA