100K synthetic image–caption pairs (SynthVLM-100K) give SOTA VLM results while using far less real data

July 30, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and open-sourced, and experiments show consistent benchmark gains. However, the results cover specific benchmarks and depend on SDXL and curated captions, so expect engineering work to adapt the pipeline to other domains.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui

Links

Abstract / PDF / Code / Data

Why It Matters For Business

High-quality synthetic image–caption data can cut data storage and pretraining cost dramatically while preserving or improving VLM performance on vision and language tasks.

Who Should Care

Summary TLDR

SynthVLM builds a pipeline that (1) filters a caption pool down to 1M high-quality captions, (2) uses Stable Diffusion XL (SDXL) to synthesize 1024×1024 images from those captions, and (3) selects the top 100K image–caption pairs by a weighted CLIPScore+SSIM metric (λ=0.5). The resulting SynthVLM-100K dataset scores higher on CLIP/SSIM alignment than competing datasets (curated: CLIP 0.36, SSIM 0.86, weighted 0.79) and trains Vicuna-based 7B/13B VLMs that outperform LLaVA baselines on VQA benchmarks and MMLU while using only ~18% of the LLaVA pretraining data. Ablations show that both the generation and selection modules matter. Code and data are released on GitHub.
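The selection step can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes the weighted metric is the plain convex combination λ·CLIPScore + (1−λ)·SSIM, and the candidate IDs and scores are made up. Applied to the reported means (0.36 and 0.86), that plain form gives 0.61 rather than the reported 0.79, so the paper's metric presumably includes a normalization that this summary does not describe.

```python
def weighted_score(clip_score, ssim_score, lam=0.5):
    """Assumed form of the metric: lam * CLIPScore + (1 - lam) * SSIM."""
    return lam * clip_score + (1.0 - lam) * ssim_score


def select_top_k(pairs, k, lam=0.5):
    """Rank (pair_id, clip_score, ssim_score) tuples and return the k best IDs."""
    ranked = sorted(
        pairs,
        key=lambda p: weighted_score(p[1], p[2], lam),
        reverse=True,
    )
    return [pair_id for pair_id, _, _ in ranked[:k]]


# Hypothetical candidates: (pair_id, CLIPScore, SSIM). Values are invented.
candidates = [
    ("a", 0.31, 0.80),
    ("b", 0.36, 0.86),
    ("c", 0.22, 0.90),
    ("d", 0.40, 0.60),
    ("e", 0.28, 0.75),
]

top2 = select_top_k(candidates, k=2)
print(top2)
```

In the paper's setting the same ranking would run over the 1M generated candidates with k = 100,000.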

Problem Statement

High-quality, precisely aligned image–caption pairs are scarce. Web data often contains low-resolution images, watermarks, and noisy captions. Existing generation approaches either produce captions for existing images rather than generating new images, or require expensive human or GPT-4 labeling. The result is poor alignment, wasted compute, and limited dataset utility for training VLMs.

Main Contribution

A two-stage pipeline (caption curation → image synthesis → CLIPScore+SSIM selection) to produce aligned synthetic image–caption data.

A curated SynthVLM-100K dataset of 100K high-resolution (1024×1024) synthetic image–caption pairs.
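The image synthesis step might look like the sketch below. It assumes the Hugging Face `diffusers` library and the public `stabilityai/stable-diffusion-xl-base-1.0` checkpoint; the exact checkpoint and sampler settings used by the authors are not stated in this summary. The helper names `batched`, `load_sdxl`, and `synthesize` are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def load_sdxl(model_id="stabilityai/stable-diffusion-xl-base-1.0"):
    """Load SDXL through diffusers. Needs a CUDA GPU; imports are deferred
    so the rest of this sketch runs without torch or diffusers installed."""
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    return pipe.to("cuda")


def synthesize(captions, pipe, batch_size=4, size=1024):
    """Generate one size x size image per caption.

    `pipe` is any callable that accepts prompt/height/width keyword arguments
    and returns an object with an `.images` list, as diffusers pipelines do.
    Returns a list of (caption, image) pairs in input order.
    """
    pairs = []
    for batch in batched(captions, batch_size):
        images = pipe(prompt=list(batch), height=size, width=size).images
        pairs.extend(zip(batch, images))
    return pairs
```

Typical usage would be `pairs = synthesize(curated_captions, load_sdxl())`, where `curated_captions` is the filtered caption list. Passing the pipeline in as an argument keeps the generator swappable, which matters given the dependence on SDXL quality noted under Limitations.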

Key Findings

Curated synthetic data shows higher alignment and image fidelity than competing datasets.

Numbers: SynthVLM-100K: CLIP 0.36, SSIM 0.86, weighted 0.79 (Table 4)

Practical Use: Use CLIPScore+SSIM selection to pick synthetic pairs; this yields measurably better alignment than common web datasets on evaluated samples.

Evidence Ref: Table 4
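For readers unfamiliar with SSIM, here is a simplified sketch of the index. It computes a single global SSIM over two flattened grayscale images, whereas the standard definition averages SSIM over local windows (as `skimage.metrics.structural_similarity` does). This summary does not say which reference image the paper compares against, so the inputs below are arbitrary toy pixel lists.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equal-length flat pixel sequences.

    Simplification: real SSIM slides a local window over the image and
    averages the per-window values; here the whole image is one window.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    n = len(x)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 2x2 "images" flattened to lists, pixel values in [0, 1].
image = [0.1, 0.5, 0.9, 0.3]
inverted = [1.0 - p for p in image]

same = global_ssim(image, image)        # identical images score 1
opposite = global_ssim(image, inverted)  # inverted structure scores below 0
```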

Models trained on SynthVLM-100K outperform LLaVA baselines on vision and language benchmarks.

Numbers: SynthVLM-7B MMLU 41.2 vs LLaVA-7B 36.3; SQA 70.4 vs 69.3 (Tables 2 and 3)

Practical Use: A well-curated 100K synthetic dataset can replace much larger web-scraped pretraining sets in VLM pipelines for similar or better benchmark performance.

Evidence Ref: Tables 2 and 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| SQA (7B) | 70.4 (SynthVLM-7B) | 69.3 (LLaVA-7B) | +1.1 | SQA benchmark | Table 2 shows SynthVLM-7B SQA 70.4 vs LLaVA-7B 69.3 | Table 2 |
| MMLU (7B avg) | 41.2 (SynthVLM-7B) | 36.3 (LLaVA-7B) | +4.9 | MMLU benchmark | Table 3 reports MMLU avg 41.2 vs 36.3 for 7B | Table 3 |
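The deltas in the results can be checked with a few lines of arithmetic. The sketch below uses only the values and baselines reported above; the relative-gain figure is an extra derived from those same numbers, not something the paper reports.

```python
results = [
    {"metric": "SQA (7B)", "value": 70.4, "baseline": 69.3},
    {"metric": "MMLU (7B avg)", "value": 41.2, "baseline": 36.3},
]


def absolute_deltas(rows):
    """Value minus baseline, rounded to one decimal like the source table."""
    return {r["metric"]: round(r["value"] - r["baseline"], 1) for r in rows}


def relative_gains(rows):
    """Gain as a percentage of the baseline score, rounded to one decimal."""
    return {
        r["metric"]: round(100 * (r["value"] - r["baseline"]) / r["baseline"], 1)
        for r in rows
    }


deltas = absolute_deltas(results)
gains = relative_gains(results)
print(deltas)  # {'SQA (7B)': 1.1, 'MMLU (7B avg)': 4.9}
print(gains)   # {'SQA (7B)': 1.6, 'MMLU (7B avg)': 13.5}
```

The relative view shows that the SQA gain is small (about 1.6% of baseline) while the MMLU gain is more substantial (about 13.5%).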

What To Try In 7 Days

Filter an internal caption pool by CLIPScore and simple heuristics, then generate a small synthetic set with SDXL to test alignment.

Compute CLIPScore and SSIM on 1k generated images and pick the top 10–20% for quick model fine-tuning.

Run a small ablation: fine-tune a VLM on 10–100k curated synthetic pairs to compare against existing web data.
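A quick trial of the second step could start from a sketch like this one. It assumes image and text embeddings have already been extracted with a CLIP model (for example through `CLIPModel.get_image_features` and `get_text_features` in Hugging Face `transformers`) and scores each pair by raw cosine similarity. Whether the paper rescales that similarity is not stated here, and the embeddings below are invented three-dimensional toys.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_top_fraction(scored, fraction=0.2):
    """Keep the best-scoring `fraction` of (item_id, score) pairs, at least one."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.floor(len(scored) * fraction))
    ranked = sorted(scored, key=lambda s: s[1], reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical (image_embedding, text_embedding) pairs keyed by ID.
embeddings = {
    "img0": ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # perfectly aligned
    "img1": ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal
    "img2": ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # partially aligned
    "img3": ([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]),  # partially aligned
    "img4": ([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]),  # opposed
}

scored = [(k, cosine_similarity(img, txt)) for k, (img, txt) in embeddings.items()]
kept = keep_top_fraction(scored, fraction=0.2)
print(kept)
```

With 1k generated images, `fraction=0.1` to `fraction=0.2` keeps 100 to 200 pairs, matching the 10–20% suggested above.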

Agent Features

Tool Use
Diffusion image synthesis (SDXL), LLM filtering (LLaMA3-70B-Instruct), CLIP for scoring
Architectures
Vision-Language Model (VLM)

Optimization Features

Token Efficiency
Reduce pretraining data size to ~18–19% of baseline
Infra Optimization
Lower storage and data I/O costs for dataset construction
System Optimization
Store captions only for candidate pool to reduce storage footprint
Training Optimization
Use small, high-quality synthetic pretraining set instead of large noisy corpus
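The data-reduction figure follows from simple arithmetic. The sketch below assumes the baseline is the LLaVA-558K pretraining set listed under Datasets and reads its size from the dataset name, which is an approximation rather than an exact count.

```python
def data_fraction(subset_size, baseline_size):
    """Share of the baseline dataset that the subset represents."""
    if baseline_size <= 0:
        raise ValueError("baseline_size must be positive")
    return subset_size / baseline_size


synth_pairs = 100_000           # SynthVLM-100K
llava_pretrain_pairs = 558_000  # assumed from the name LLaVA-558K

fraction = data_fraction(synth_pairs, llava_pretrain_pairs)
print(f"{fraction:.1%}")  # 17.9%
```

Under that assumption the ratio is about 17.9%, consistent with the "~18%" quoted in the summary.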

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Distribution gap: generated images may miss rare real-world details.

Dependence on SDXL quality; different generators may change outcomes.

When Not To Use

When tasks require authentic, real-person photos or exact real-world identifiers (legal/forensics).

When visual fine details tied to specific camera artifacts are critical.

Failure Modes

Generated images mismatch captions in ways not detectable by CLIP (semantic errors).

Selection may overfit to CLIP-style alignment and miss language nuance.

Core Entities

Models

Stable Diffusion XL (SDXL), CLIP (336px variant), Vicuna-1.5-7B, Vicuna-1.5-13B, LLaVA 1.5, LLaMA3-70B-Instruct, GPT-4 Vision, Intern-VL2

Metrics

CLIPScore, SSIM, Weighted CLIP+SSIM, MMLU score, SQA score, MME cognition/perception scores

Datasets

SynthVLM-100K, Synth Dataset (1M candidates), LLaVA-558K, LLaVA-665K, COCO-Caption, BLIP-LCS, ShareGPT4V, CC12M, LAION / CC / SBU (caption sources)

Benchmarks

SQA, SQA_I, MMVet, VizWiz, VQAv2, GQA, MME (MME benchmark), PoPE, MMLU