100K synthetic image–caption pairs (SynthVLM-100K) give SOTA VLM results while using far less real data

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and open-sourced, and experiments show consistent benchmark gains. However, the results are reported on specific benchmarks and depend on SDXL and curated captions, so expect engineering work to adapt the pipeline to other domains.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui


Why It Matters For Business

High-quality synthetic image–caption data can cut data storage and pretraining cost dramatically while preserving or improving VLM performance on vision and language tasks.

Who Should Care

ML Engineer, Data Scientist, CTO, Engineering Lead, Product Manager

Summary TLDR

SynthVLM builds a two-stage pipeline. The generation stage curates a pool of 1M high-quality captions and uses Stable Diffusion XL (SDXL) to synthesize a 1024×1024 image from each caption. The selection stage keeps the top 100K image–caption pairs, ranked by a weighted CLIPScore+SSIM metric (λ=0.5). The resulting SynthVLM-100K dataset scores higher on alignment and image fidelity than competing datasets (CLIP 0.36, SSIM 0.86, weighted 0.79), and the Vicuna-based 7B/13B VLMs trained on it outperform LLaVA baselines on VQA benchmarks and MMLU while using only ~18% of the LLaVA pretraining data. Ablations show that both the generation and selection modules matter. Code and data are released on GitHub.
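As a rough illustration of the selection step, the sketch below ranks pairs by a weighted score and keeps the top k. This summary does not give the exact formula or any normalization, so the min-max rescaling, the assignment of λ to the CLIP term, and the helper names `min_max`, `weighted_scores`, and `select_top_k` are all assumptions. CLIPScore and SSIM values are treated as precomputed inputs.

```python
from typing import List, Sequence, Tuple


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_scores(
    clip: Sequence[float], ssim: Sequence[float], lam: float = 0.5
) -> List[float]:
    """Assumed form: lam * normalized CLIPScore + (1 - lam) * normalized SSIM."""
    if len(clip) != len(ssim):
        raise ValueError("clip and ssim must have the same length")
    clip_n, ssim_n = min_max(clip), min_max(ssim)
    return [lam * c + (1.0 - lam) * s for c, s in zip(clip_n, ssim_n)]


def select_top_k(
    pairs: Sequence[str],
    clip: Sequence[float],
    ssim: Sequence[float],
    k: int,
    lam: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (pair_id, score) tuples, best first."""
    scores = weighted_scores(clip, ssim, lam)
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy values for illustration only; they are not from the paper.
    ids = ["a", "b", "c"]
    top = select_top_k(ids, clip=[0.2, 0.4, 0.3], ssim=[0.9, 0.8, 0.5], k=2)
    print(top)
```

With λ=0.5 both terms contribute equally after rescaling. A different normalization would change the ranking, so it is worth checking the released code before relying on this form.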

Problem Statement

High-quality, precisely aligned image–caption pairs are scarce. Web data often contains low-resolution images, watermarks, and noisy captions. Existing generation approaches either produce captions for images (not images) or require expensive human/GPT-4 labeling. This leads to poor alignment, wasted compute, and limited dataset utility for training VLMs.

Main Contribution

A two-stage pipeline, generation (caption curation → image synthesis) followed by CLIPScore+SSIM selection, that produces aligned synthetic image–caption data.

A curated SynthVLM-100K dataset of 100K high-resolution (1024×1024) synthetic image–caption pairs.

Key Findings

Curated synthetic data shows higher alignment and image fidelity than competing datasets.

Numbers: SynthVLM-100K: CLIP 0.36, SSIM 0.86, weighted 0.79 (Table 4)

Practical Use: Use CLIPScore+SSIM selection to pick synthetic pairs; this yields measurably better alignment than common web datasets on evaluated samples.

Evidence Ref: Table 4

Models trained on SynthVLM-100K outperform LLaVA baselines on vision and language benchmarks.

Numbers: SynthVLM-7B MMLU 41.2 vs LLaVA-7B 36.3; SQA 70.4 vs 69.3 (Tables 2 and 3)

Practical Use: A well-curated 100K synthetic dataset can replace much larger web-scraped pretraining sets in VLM pipelines for similar or better benchmark performance.

Evidence Ref: Tables 2 and 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
SQA (7B)	70.4 (SynthVLM-7B)	69.3 (LLaVA-7B)	+1.1	SQA benchmark	Table 2 shows SynthVLM-7B SQA 70.4 vs LLaVA-7B 69.3	Table 2
MMLU (7B avg)	41.2 (SynthVLM-7B)	36.3 (LLaVA-7B)	+4.9	MMLU benchmark	Table 3 reports MMLU avg 41.2 vs 36.3 for 7B	Table 3
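The deltas in the table can be checked with a few lines of arithmetic. The sketch below recomputes them from the reported values and also derives a relative gain, which is not a figure reported in the source; the helper names `delta` and `relative_gain` are illustrative.

```python
# (metric, SynthVLM-7B value, LLaVA-7B baseline), copied from the table above.
ROWS = [
    ("SQA (7B)", 70.4, 69.3),
    ("MMLU (7B avg)", 41.2, 36.3),
]


def delta(value: float, baseline: float) -> float:
    """Absolute improvement in benchmark points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_gain(value: float, baseline: float) -> float:
    """Improvement as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * (value - baseline) / baseline, 1)


if __name__ == "__main__":
    for name, value, baseline in ROWS:
        print(f"{name}: {delta(value, baseline):+.1f} points, "
              f"{relative_gain(value, baseline):+.1f}% relative")
```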

What To Try In 7 Days

Filter an internal caption pool by CLIPScore and simple heuristics, then generate a small synthetic set with SDXL to test alignment.

Compute CLIPScore and SSIM on 1k generated images and pick the top 10–20% for quick model fine-tuning.

Run a small ablation: fine-tune a VLM on 10–100k curated synthetic pairs to compare against existing web data.
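For the first step above, a heuristic caption filter might look like the sketch below. The specific rules (word-count bounds, dropping captions that contain URLs, case-insensitive deduplication) and the name `filter_captions` are assumptions chosen for illustration, not the filters used by the authors.

```python
import re
from typing import Iterable, List

# Matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def filter_captions(
    captions: Iterable[str], min_words: int = 5, max_words: int = 60
) -> List[str]:
    """Keep captions that pass simple length, URL, and duplicate checks."""
    seen = set()
    kept: List[str] = []
    for raw in captions:
        text = " ".join(raw.split())  # collapse runs of whitespace
        if URL_RE.search(text):
            continue  # likely scraped ad or navigation text
        n_words = len(text.split())
        if n_words < min_words or n_words > max_words:
            continue  # too short to describe a scene, or too long
        key = text.lower()
        if key in seen:
            continue  # case-insensitive duplicate
        seen.add(key)
        kept.append(text)
    return kept


if __name__ == "__main__":
    pool = [
        "A red bicycle leaning against a brick wall",
        "a red bicycle  leaning against a brick wall",
        "photo",
        "Buy now at http://example.com great deals today friends",
        "Two dogs play with a ball on the beach",
    ]
    for caption in filter_captions(pool):
        print(caption)
```

Cheap rules like these shrink the pool before any model-based scoring runs, which keeps the later CLIPScore pass and SDXL generation smaller.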

Agent Features

Tool Use

Diffusion image synthesis (SDXL), LLM filtering (LLaMA3-70B-Instruct), CLIP for scoring

Architectures

Vision-Language Model (VLM)

Optimization Features

Token Efficiency

Reduce pretraining data size to ~18–19% of baseline
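The data-reduction figure can be sanity-checked with simple arithmetic. The sketch below assumes the baseline is the LLaVA-558K pretraining set listed under Datasets; if the comparison in the paper uses a different baseline, the ratio changes accordingly.

```python
def data_fraction(subset_size: int, baseline_size: int) -> float:
    """Fraction of the baseline dataset that the subset represents."""
    if baseline_size <= 0:
        raise ValueError("baseline_size must be positive")
    return subset_size / baseline_size


SYNTHVLM_PAIRS = 100_000   # SynthVLM-100K
LLAVA_PRETRAIN = 558_000   # assumption: LLaVA-558K is the baseline

if __name__ == "__main__":
    print(f"{data_fraction(SYNTHVLM_PAIRS, LLAVA_PRETRAIN):.1%}")  # prints 17.9%
```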

Infra Optimization

Lower storage and data I/O costs for dataset construction

System Optimization

Store only the captions for the candidate pool, not the images, to reduce the storage footprint

Training Optimization

Use small, high-quality synthetic pretraining set instead of large noisy corpus

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/starriver030515/SynthVLM

Data URLs

https://github.com/starriver030515/SynthVLM

Risks & Boundaries

Limitations

Distribution gap: generated images may miss rare real-world details.

Dependence on SDXL quality; different generators may change outcomes.

When Not To Use

When tasks require authentic, real-person photos or exact real-world identifiers (legal/forensics).

When visual fine details tied to specific camera artifacts are critical.

Failure Modes

Generated images mismatch captions in ways not detectable by CLIP (semantic errors).

Selection may overfit to CLIP-style alignment and miss language nuance.

Core Entities

Models

Stable Diffusion XL (SDXL), CLIP (336px variant), Vicuna-1.5-7B, Vicuna-1.5-13B, LLaVA 1.5, LLaMA3-70B-Instruct, GPT-4 Vision, Intern-VL2

Metrics

CLIPScore, SSIM, Weighted CLIP+SSIM, MMLU score, SQA score, MME cognition/perception scores

Datasets

SynthVLM-100K, Synth Dataset (1M candidates), LLaVA-558K, LLaVA-665K, COCO-Caption, BLIP-LCS, ShareGPT4V, CC12M, LAION / CC / SBU (caption sources)

Benchmarks

SQA, SQA_I, MMVet, VizWiz, VQAv2, GQA, MME (MME benchmark), PoPE, MMLU
