100K synthetic image–caption pairs (SynthVLM-100K) give SOTA VLM results while using far less real data

July 30, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

1

Authors

Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui

Why It Matters For Business

High-quality synthetic image–caption data can cut data storage and pretraining cost dramatically while preserving or improving VLM performance on vision and language tasks.

Summary TLDR

SynthVLM builds a two-stage pipeline: a construction stage that curates a pool of 1M high-quality captions and uses Stable Diffusion XL (SDXL) to synthesize a 1024×1024 image for each, and a selection stage that keeps the top 100K image–caption pairs ranked by a weighted CLIPScore+SSIM metric (λ=0.5). The resulting SynthVLM-100K dataset scores higher on alignment and fidelity than competing datasets (CLIP 0.36, SSIM 0.86, weighted 0.79) and trains Vicuna-based 7B/13B VLMs that outperform LLaVA baselines on VQA and MMLU using only ~18% of LLaVA's pretraining data. Ablations show that both the generation and selection modules matter. Code and data are released on GitHub.
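The selection step can be sketched as follows. This is a minimal illustration, not the authors' code: `Pair`, `weighted_score`, and `select_top` are hypothetical names, the scores are assumed to be precomputed and scaled to [0, 1], and the convex combination with λ is an assumed form of the weighted metric. A plain average of the reported raw values (0.36 and 0.86) gives 0.61 rather than 0.79, so the paper's metric presumably rescales its inputs; that step is not described here and is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    caption: str
    clip_score: float  # assumption: precomputed and scaled to [0, 1]
    ssim: float        # assumption: precomputed and scaled to [0, 1]


def weighted_score(pair, lam=0.5):
    """Assumed form of the weighted metric: lam * CLIP + (1 - lam) * SSIM."""
    return lam * pair.clip_score + (1.0 - lam) * pair.ssim


def select_top(pairs, k, lam=0.5):
    """Return the k pairs with the highest weighted score."""
    ranked = sorted(pairs, key=lambda p: weighted_score(p, lam), reverse=True)
    return ranked[:k]


# Toy scores for illustration only; these are not values from the paper.
pairs = [
    Pair("a red bicycle leaning on a wall", 0.9, 0.8),
    Pair("blurry stock photo", 0.3, 0.4),
    Pair("two dogs playing in snow", 0.7, 0.9),
]
top = select_top(pairs, k=2)
print([p.caption for p in top])
```

With k set to 100,000 over a 1M candidate pool, the same ranking would keep the top 10% of pairs.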

Problem Statement

High-quality, precisely aligned image–caption pairs are scarce. Web data often contains low-resolution images, watermarks, and noisy captions. Existing generation approaches either generate captions for existing images rather than generating the images themselves, or require expensive human or GPT-4 labeling. The result is poor alignment, wasted compute, and limited dataset utility for training VLMs.

Main Contribution

A two-stage pipeline (caption curation → image synthesis → CLIPScore+SSIM selection) to produce aligned synthetic image–caption data.

A curated SynthVLM-100K dataset of 100K high-resolution (1024×1024) synthetic image–caption pairs.

Demonstration that models pretrained on SynthVLM-100K (SynthVLM-7B/13B) outperform LLaVA baselines on multiple VQA and MMLU benchmarks while using far less pretraining data.

Open release of data-generation and curation code and the dataset link.

Key Findings

Curated synthetic data shows higher alignment and image fidelity than competing datasets.

Numbers: SynthVLM-100K: CLIP 0.36, SSIM 0.86, weighted 0.79 (Table 4)

Models trained on SynthVLM-100K outperform LLaVA baselines on vision and language benchmarks.

Numbers: SynthVLM-7B MMLU 41.2 vs LLaVA-7B 36.3; SQA 70.4 vs 69.3 (Tables 2 & 3)

SynthVLM is far more storage- and compute-efficient than baseline pipelines.

Numbers: Data usage reported: SynthVLM 33MB vs LLaVA 27GB (Table 7); authors claim models use 18–19% of LLaVA data
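As a rough check on these figures, the arithmetic can be worked through directly. The sketch below rests on two assumptions that this summary does not confirm: that the ~18% figure compares 100K pairs against the LLaVA-558K pretraining set, and that 1 GB is counted as 1024 MB.

```python
SYNTH_PAIRS = 100_000
LLAVA_PRETRAIN_PAIRS = 558_000  # assumption: LLaVA-558K is the comparison set

# Share of the LLaVA pretraining pair count used by SynthVLM-100K.
pair_fraction = SYNTH_PAIRS / LLAVA_PRETRAIN_PAIRS

SYNTH_MB = 33   # reported data usage for SynthVLM
LLAVA_GB = 27   # reported data usage for LLaVA

# Assumption: binary units (1 GB = 1024 MB).
storage_ratio = (LLAVA_GB * 1024) / SYNTH_MB

print(f"pair fraction: {pair_fraction:.1%}")
print(f"storage ratio: about {storage_ratio:.0f}x")
```

Under these assumptions the pair count works out to just under 18% and the reported storage footprint to roughly 800 times smaller, which is consistent with the claims above.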

Both image generation and selection modules materially improve final model quality.

Numbers: Ablation drops: SynthVLM-7B MMLU avg 41.2 → 39.1 without generation, → 40.6 without selection (Table 9)

Synthetic images were preferred by judges over web images in sampled tests.

Numbers: On 1K samples, GPT-4 Vision: synthetic 633 vs web 367; InternVL2: 692 vs 308; humans: 758 vs 242 (Table 6)
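The preference counts translate into win rates with simple arithmetic. The snippet below is only a convenience calculation over the numbers quoted above; `synthetic_share` is a hypothetical helper name.

```python
# (synthetic votes, web votes) per judge, as quoted from Table 6.
votes = {
    "GPT-4 Vision": (633, 367),
    "InternVL2": (692, 308),
    "humans": (758, 242),
}


def synthetic_share(synth, web):
    """Fraction of comparisons in which the synthetic image was preferred."""
    return synth / (synth + web)


shares = {
    judge: round(100 * synthetic_share(synth, web), 1)
    for judge, (synth, web) in votes.items()
}
for judge, pct in shares.items():
    print(f"{judge}: synthetic preferred in {pct}% of comparisons")
```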

Results

SQA (7B)

Value: 70.4 (SynthVLM-7B)

Baseline: 69.3 (LLaVA-7B)

MMLU (7B avg)

Value: 41.2 (SynthVLM-7B)

Baseline: 36.3 (LLaVA-7B)

SQA (13B)

Value: 74.9 (SynthVLM-13B)

Baseline: 74.2 (LLaVA-13B)

MMLU (13B avg)

Value: 54.6 (SynthVLM-13B)

Baseline: 52.4 (LLaVA-13B)

CLIPScore / SSIM / Weighted

Value: 0.36 / 0.86 / 0.79 (SynthVLM-100K curated)

Baseline: ShareGPT4V 0.32 / 0.79 / 0.71; COCO/BLIP-LCS ~0.31–0.32 / 0.73–0.75

Data usage for image–caption generation

Value: 33MB (SynthVLM)

Baseline: 27GB (LLaVA)

Who Should Care

What To Try In 7 Days

Filter an internal caption pool by CLIPScore and simple heuristics, then generate a small synthetic set with SDXL to test alignment.

Compute CLIPScore and SSIM on 1k generated images and pick the top 10–20% for quick model fine-tuning.

Run a small ablation: fine-tune a VLM on 10–100k curated synthetic pairs to compare against existing web data.
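The first two steps could start from something like the sketch below. It is a hedged illustration, not the paper's filtering procedure: `keep_caption` and `top_fraction` are hypothetical helpers, the word-count limits, URL check, and 60% letter ratio are assumed heuristics, and the scores passed to `top_fraction` stand in for weighted CLIPScore+SSIM values computed elsewhere.

```python
import re


def keep_caption(caption, min_words=5, max_words=60):
    """Assumed heuristics: reasonable length, no URLs, mostly alphabetic."""
    text = caption.strip()
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    if re.search(r"https?://|www\.", text, flags=re.IGNORECASE):
        return False
    letters = sum(ch.isalpha() for ch in text)
    return letters / len(text) >= 0.6


def top_fraction(scored, fraction=0.2):
    """Keep the highest-scoring fraction of (item, score) tuples."""
    k = max(1, int(len(scored) * fraction))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


captions = [
    "A brown dog runs across a grassy field",
    "IMG_0042.jpg",
    "Buy now at http://example.com best price ever today",
    "A chef plates pasta in a busy restaurant kitchen",
]
kept = [c for c in captions if keep_caption(c)]

# Placeholder scores; in practice these would come from CLIP and SSIM.
scored = [("a", 0.9), ("b", 0.5), ("c", 0.7), ("d", 0.1), ("e", 0.3)]
best = top_fraction(scored, fraction=0.2)
print(kept, best)
```

Filtering captions before generation keeps diffusion compute for prompts that are likely to yield usable pairs, and the score-based cut is applied only afterwards.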

Agent Features

Tool Use

  • Diffusion image synthesis (SDXL)
  • LLM filtering (LLaMA3-70B-Instruct)
  • CLIP for scoring

Architectures

  • Vision-Language Model (VLM)

Optimization Features

Token Efficiency

  • Reduce pretraining data size to ~18–19% of baseline

Infra Optimization

  • Lower storage and data I/O costs for dataset construction

System Optimization

  • Store captions only for candidate pool to reduce storage footprint
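One way to picture captions-only storage is a JSON Lines file that holds just the text needed to regenerate each image. This is an assumed layout, not the authors' format: the `id`, `caption`, and `seed` fields and both helper functions are hypothetical, and storing a seed only helps reproducibility if the generator setup is otherwise fixed.

```python
import io
import json


def write_candidates(records, fh):
    """Write one JSON object per line (JSON Lines)."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_candidates(fh):
    """Read the records back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


# Hypothetical records: captions plus a generation seed, no image bytes.
candidates = [
    {"id": 0, "caption": "a red bicycle leaning on a wall", "seed": 1234},
    {"id": 1, "caption": "two dogs playing in snow", "seed": 5678},
]

buffer = io.StringIO()  # stands in for a file on disk
write_candidates(candidates, buffer)
size_bytes = len(buffer.getvalue().encode("utf-8"))

buffer.seek(0)
restored = read_candidates(buffer)
print(size_bytes, restored == candidates)
```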

Training Optimization

  • Use small, high-quality synthetic pretraining set instead of large noisy corpus

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Distribution gap: generated images may miss rare real-world details.
  • Dependence on SDXL quality; different generators may change outcomes.
  • Selection uses CLIP+SSIM which can miss semantic or factual errors not captured by these scores.
  • Evaluation covers many benchmarks but not every real-world domain or downstream task.
  • Potential caption bias: curated caption pool determines what is learned.

When Not To Use

  • When tasks require authentic, real-person photos or exact real-world identifiers (legal/forensics).
  • When visual fine details tied to specific camera artifacts are critical.
  • If you need a dataset with unfiltered real-world distribution for bias auditing.

Failure Modes

  • Generated images mismatch captions in ways not detectable by CLIP (semantic errors).
  • Selection may overfit to CLIP-style alignment and miss language nuance.
  • Synthetic images lacking real-world artifacts cause distribution shift in deployment.
  • Caption pool quality limits diversity and may bias model behavior.

Core Entities

Models

  • Stable Diffusion XL (SDXL)
  • CLIP (336px variant)
  • Vicuna-1.5-7B
  • Vicuna-1.5-13B
  • LLaVA 1.5
  • LLaMA3-70B-Instruct
  • GPT-4 Vision
  • InternVL2

Metrics

  • CLIPScore
  • SSIM
  • Weighted CLIP+SSIM
  • MMLU score
  • SQA score
  • MME cognition/perception scores

Datasets

  • SynthVLM-100K
  • Synth Dataset (1M candidates)
  • LLaVA-558K
  • LLaVA-665K
  • COCO-Caption
  • BLIP-LCS
  • ShareGPT4V
  • CC12M
  • LAION / CC / SBU (caption sources)

Benchmarks

  • SQA
  • SQA_I
  • MMVet
  • VizWiz
  • VQAv2
  • GQA
  • MME
  • POPE
  • MMLU