Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
High-quality synthetic image–caption data can cut data storage and pretraining cost dramatically while preserving or improving VLM performance on vision and language tasks.
Summary TLDR
SynthVLM builds a two-stage, generate-then-select pipeline. The generation stage curates a pool of 1M high-quality captions and uses Stable Diffusion XL (SDXL) to synthesize a 1024×1024 image from each caption. The selection stage keeps the top 100K image–caption pairs ranked by a weighted CLIPScore+SSIM metric (λ=0.5). The resulting SynthVLM-100K dataset scores higher on alignment and image fidelity than competing datasets (curated set: CLIPScore 0.36, SSIM 0.86, weighted 0.79), and the Vicuna-based 7B/13B VLMs pretrained on it outperform LLaVA baselines on VQA and MMLU benchmarks while using only ~18% of the LLaVA pretraining data. Ablations show that both the generation and selection modules matter. Code and data are released on GitHub.
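The selection step can be illustrated with a minimal sketch. The summary gives only the weight (λ=0.5), not the exact formula, so everything below is an assumption: the function names are hypothetical, λ is assumed to weight the CLIPScore term, and both scores are min-max normalized before mixing. A plain average of the reported 0.36 and 0.86 would give 0.61 rather than 0.79, which suggests the source rescales at least one score, but the normalization used here is a guess, not the authors' method.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_scores(
    clip_scores: Sequence[float],
    ssim_scores: Sequence[float],
    lam: float = 0.5,
) -> List[float]:
    """Combine normalized scores as lam * CLIP + (1 - lam) * SSIM.

    Assumption: lam weights the CLIP term and both inputs are min-max
    normalized first. The source does not specify either detail.
    """
    if len(clip_scores) != len(ssim_scores):
        raise ValueError("score lists must have the same length")
    clip_n = min_max(clip_scores)
    ssim_n = min_max(ssim_scores)
    return [lam * c + (1.0 - lam) * s for c, s in zip(clip_n, ssim_n)]


def select_top_k(
    pairs: Sequence[str],
    clip_scores: Sequence[float],
    ssim_scores: Sequence[float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Return the k pairs with the highest weighted score, best first."""
    scores = weighted_scores(clip_scores, ssim_scores, lam)
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in ranked[:k]]


if __name__ == "__main__":
    ids = ["a", "b", "c", "d"]  # stand-ins for image-caption pair IDs
    clip = [0.20, 0.40, 0.30, 0.10]
    ssim = [0.50, 0.90, 0.70, 0.60]
    print(select_top_k(ids, clip, ssim, k=2))
```

In the pipeline described above, `pairs` would hold the 1M generated candidates and `k` would be 100,000; the CLIPScore and SSIM values themselves would come from whatever scoring tools are in use.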
Problem Statement
High-quality, precisely aligned image–caption pairs are scarce. Web data often contains low-resolution images, watermarks, and noisy captions. Existing generation approaches either produce captions for images (not images) or require expensive human/GPT-4 labeling. This leads to poor alignment, wasted compute, and limited dataset utility for training VLMs.
Main Contribution
A two-stage pipeline, generation (caption curation → image synthesis) followed by CLIPScore+SSIM selection, that produces aligned synthetic image–caption data.
A curated SynthVLM-100K dataset of 100K high-resolution (1024×1024) synthetic image–caption pairs.
Demonstration that models pretrained on SynthVLM-100K (SynthVLM-7B/13B) outperform LLaVA baselines on multiple VQA and MMLU benchmarks while using far less pretraining data.
Open release of data-generation and curation code and the dataset link.
Key Findings
Curated synthetic data shows higher alignment and image fidelity than competing datasets.
Models trained on SynthVLM-100K outperform LLaVA baselines on vision and language benchmarks.
SynthVLM is far more storage- and compute-efficient than baseline pipelines.
Both image generation and selection modules materially improve final model quality.
Synthetic images were preferred by judges over web images in sampled tests.
Results
SQA (7B)
MMLU (7B avg)
SQA (13B)
MMLU (13B avg)
CLIPScore / SSIM / Weighted
Data usage for image–caption generation
Who Should Care
What To Try In 7 Days
Filter an internal caption pool by CLIPScore and simple heuristics, then generate a small synthetic set with SDXL to test alignment.
Compute CLIPScore and SSIM on 1k generated images and pick the top 10–20% for quick model fine-tuning.
Run a small ablation: fine-tune a VLM on 10–100k curated synthetic pairs to compare against existing web data.
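For the first two steps above, a rough starting point might look like the sketch below. The "simple heuristics" are not specified in the source, so the word-count bounds, the URL check, the deduplication rule, and all function names are hypothetical placeholders to tune on your own caption pool. The scores passed to `top_fraction` are assumed to be precomputed (for example, a weighted CLIPScore+SSIM value per generated image).

```python
import re
from typing import List, Sequence

# Hypothetical heuristic: captions containing links are treated as noise.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def keep_caption(caption: str, min_words: int = 5, max_words: int = 60) -> bool:
    """Return True if the caption passes the length and URL checks."""
    text = caption.strip()
    n_words = len(text.split())
    if n_words < min_words or n_words > max_words:
        return False
    if URL_RE.search(text):
        return False
    return True


def filter_captions(captions: Sequence[str]) -> List[str]:
    """Drop near-verbatim duplicates and captions that fail keep_caption."""
    seen = set()
    kept = []
    for caption in captions:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(caption.lower().split())
        if key in seen:
            continue
        if keep_caption(caption):
            seen.add(key)
            kept.append(caption.strip())
    return kept


def top_fraction(
    items: Sequence[str], scores: Sequence[float], fraction: float = 0.2
) -> List[str]:
    """Return the highest-scoring fraction of items (at least one), best first."""
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    k = max(1, int(len(items) * fraction))
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in ranked[:k]]


if __name__ == "__main__":
    pool = [
        "A brown dog runs across a grassy field",
        "a brown dog runs across a  grassy field",  # duplicate after normalizing
        "dog",  # too short
        "Buy now at http://example.com best price today ok",  # contains a URL
    ]
    print(filter_captions(pool))
```

With 1k generated images, `top_fraction(images, scores, fraction=0.1)` through `fraction=0.2` would yield the 100 to 200 pairs suggested for quick fine-tuning.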
Agent Features
Tool Use
- Diffusion image synthesis (SDXL)
- LLM filtering (LLaMA3-70B-Instruct)
- CLIP for scoring
Architectures
- Vision-Language Model (VLM)
Optimization Features
Token Efficiency
- Reduce pretraining data size to ~18–19% of baseline
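As a quick sanity check on this figure, the ratio can be worked out under the assumption that the baseline is the LLaVA-558K pretraining set listed under Datasets (the summary does not name the baseline set explicitly):

```latex
\frac{|\text{SynthVLM-100K}|}{|\text{LLaVA-558K}|}
  = \frac{100{,}000}{558{,}000}
  \approx 0.179
  \quad (\text{about } 18\%)
```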
Infra Optimization
- Lower storage and data I/O costs for dataset construction
System Optimization
- Store captions only for candidate pool to reduce storage footprint
Training Optimization
- Use small, high-quality synthetic pretraining set instead of large noisy corpus
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Distribution gap: generated images may miss rare real-world details.
- Dependence on SDXL quality; different generators may change outcomes.
- Selection relies on CLIP+SSIM, which can miss semantic or factual errors that these scores do not capture.
- Evaluation covers many benchmarks but not every real-world domain or downstream task.
- Potential caption bias: curated caption pool determines what is learned.
When Not To Use
- When tasks require authentic, real-person photos or exact real-world identifiers (legal/forensics).
- When visual fine details tied to specific camera artifacts are critical.
- If you need a dataset with unfiltered real-world distribution for bias auditing.
Failure Modes
- Generated images mismatch captions in ways not detectable by CLIP (semantic errors).
- Selection may overfit to CLIP-style alignment and miss language nuance.
- Synthetic images lacking real-world artifacts cause distribution shift in deployment.
- Caption pool quality limits diversity and may bias model behavior.
Core Entities
Models
- Stable Diffusion XL (SDXL)
- CLIP (336px variant)
- Vicuna-1.5-7B
- Vicuna-1.5-13B
- LLaVA 1.5
- LLaMA3-70B-Instruct
- GPT-4 Vision
- InternVL2
Metrics
- CLIPScore
- SSIM
- Weighted CLIP+SSIM
- MMLU score
- SQA score
- MME cognition/perception scores
Datasets
- SynthVLM-100K
- Synth Dataset (1M candidates)
- LLaVA-558K
- LLaVA-665K
- COCO-Caption
- BLIP-LCS
- ShareGPT4V
- CC12M
- LAION / CC / SBU (caption sources)
Benchmarks
- SQA
- SQA_I
- MMVet
- VizWiz
- VQAv2
- GQA
- MME
- POPE
- MMLU

