Synthetic Data Papers — Parsed & Scored for Practitioners

A simple, compute-efficient loop that generates model outputs, filters them by a learned reward, and fine-tunes the model offline to align L

0.60

0.70

18

ReST boosts model alignment with human preferences using offline data reuse, cutting the compute cost of repeated online RL sampling while improving translation quality and human-rated outputs.

Key finding

Each additional Improve step raises the model's average reward on validation.

Numbers: Figure 3: steady increases across IWSLT, WMT, Web Domain (rewards normalized 0–100)
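
The loop summarized in this entry (generate outputs, filter them by reward, fine-tune offline) can be sketched with a toy categorical policy. This is a minimal illustration, not the paper's implementation: `reward`, `grow`, `improve`, the four candidate outputs, and the rising threshold schedule are hypothetical stand-ins, and reusing one sampled batch across several filter steps is an assumption about how the offline reuse works.

```python
import random
from collections import Counter

def reward(output: str) -> float:
    # Hypothetical stand-in for a learned reward model: longer outputs score higher.
    return len(output) / 10.0

def grow(policy: dict, n: int, rng: random.Random) -> list:
    """Sample n outputs from the current policy (the generation step)."""
    outputs = list(policy)
    weights = [policy[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=n)

def improve(samples: list, threshold: float) -> dict:
    """Keep samples whose reward clears the threshold, then refit the policy
    to the survivors (a stand-in for offline fine-tuning)."""
    kept = [s for s in samples if reward(s) >= threshold]
    if not kept:
        return {}  # nothing passed the filter; the caller keeps the old policy
    counts = Counter(kept)
    total = sum(counts.values())
    return {o: c / total for o, c in counts.items()}

def expected_reward(policy: dict) -> float:
    return sum(p * reward(o) for o, p in policy.items())

rng = random.Random(0)
policy = {"ok": 0.25, "good": 0.25, "better": 0.25, "excellent": 0.25}
history = [expected_reward(policy)]

samples = grow(policy, 1000, rng)      # sampled once, reused by every filter step
for threshold in (0.3, 0.5, 0.8):      # assumed schedule: stricter each step
    new_policy = improve(samples, threshold)
    if new_policy:
        policy = new_policy
    history.append(expected_reward(policy))
```

In this toy setting `history` rises after each filter step because every step discards the lowest-reward outputs still in the pool. A real setup would replace the categorical policy with a language model and the length heuristic with a trained reward model.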

Use ChatGPT to generate paraphrases and improve open-intent detection on compositionally different test sets

0.40

0.50

6

Adding LLM-generated paraphrases can cheaply raise intent-detection performance under realistic language variation, reducing missed or misrouted user requests and improving conversational UX.

Key finding

Adding ChatGPT paraphrases to BERT+ADB improves overall F1 on the compositionally challenging Banking_CG.

Numbers: F1-All: 54.87 -> 58.90 (+4.03)

Train a usable clinical LLM from 158k synthetic discharge summaries and share it publicly

0.20

0.65

0.70

5

You can train and host a capable clinical LLM without private patient notes, lowering legal barriers and API costs while keeping models runnable inside hospitals or on-prem.

Key finding

Synthetic notes reach realistic language statistics after conversion.

Numbers: Perplexity: synthetic 4.816 vs real hospital range 2.186–5.178

PeFAD: parameter-efficient federated anomaly detection using pre-trained language models

0.70

0.60

0.70

4

PeFAD lets organizations detect anomalies across distributed sensors without sharing raw data, lowering privacy risk and network cost while improving detection accuracy on real datasets.

Key finding

PeFAD outperforms federated baselines on four real datasets.

Numbers: F1 gains vs federated baselines: 3.83%–28.74% (evaluated datasets)

High-quality, LLM-distilled training data + Qwen2-0.5B yields top multilingual embeddings under 0.5B params

0.70

0.40

0.80

3

You can get strong multilingual retrieval and RAG performance with a compact 0.5B model by improving training data quality and using LLM-distilled synthetic examples, lowering cost vs larger models while keeping competitive accuracy.

Key finding

KaLM-embedding-mini-instruct is state-of-the-art for multilingual embeddings under 1B parameters on MTEB.

Numbers: MTEB avg 62.3; zh 64.13; en 64.94; fr 63.08; pl 57.05

TarGEN: generate balanced, diverse labeled NLP datasets from task descriptions (no seed examples)

0.60

0.55

0.45

3

TarGEN can create labeled training data from task descriptions without human seeds, reducing annotation cost and enabling model training for niche or proprietary tasks where examples don't exist.

Key finding

Models trained on TarGEN synthetic SuperGLUE match or improve over original-data-trained models.

Numbers: Avg accuracy uplift: original→synthetic ≈ +1.1 to +2.8 percentage points across models (Table 6, Table 3)

Generative AI can synthesize virtual IMU data to augment and pretrain HAR models

0.30

0.60

0.70

3

Synthetic IMU data can cut labeling costs and accelerate development of wearable activity features, but synthetic-to-real gaps require small calibration sets and validation for product safety.

Key finding

A text→motion→IMU pipeline can produce labeled virtual IMU data and boost HAR performance on standard datasets.

LLM-created training data hides biases and artifacts that can degrade models and amplify majority views

0.40

0.55

0.70

2

Synthetic LLM data can cut labeling costs but risks amplifying majority views, injecting errors, and reducing downstream accuracy (~10% in some tests). Validate and human-check synthetic data before production use.

Key finding

Models trained on LLM-generated preferences perform worse on human preference tests.

Numbers: ≈10% lower accuracy on human test sets (Table 8)

DistilDP: use a DP-finetuned teacher to generate private synthetic text and distill a compact student without applying DP twice

0.60

2

DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.

Key finding

DistilDP substantially reduces perplexity on Big Patent versus private fine-tuning baselines.

Numbers: Big Patent: DistilDP PPL 32.43 vs DP-SGD student 41.8 (−9.37 PPL)

SynEval: a compact framework to measure fidelity, utility and privacy of LLM-generated tabular reviews

0.50

0.60

0.50

2

SynEval helps teams judge if synthetic data is usable: it flags fidelity gaps, estimates downstream model performance, and surfaces privacy risk before data sharing.

Key finding

All three models preserved table columns and ordering exactly.

Numbers: Structure Preserving Score = 100% (Table 1)

Have a small instruction-tuned LLM? Make it a task expert by letting it synthesize its own training data and finetune on it.

0.60

0.70

2

You can boost a deployed 7B instruction-tuned LLM for a target task without buying a stronger teacher model or large labeled sets, cutting data costs and legal dependency while improving task accuracy.

Key finding

SELF-GUIDE improves classification Exact Match by ~14.5 absolute points over prompting on evaluated held-out tasks.

Numbers: Exact Match: baseline 33.2 → SELF-GUIDE 47.7; ∆=+14.5

Use small synthetic QA datasets and a PPL curriculum to boost Chinese and scientific reasoning in Llama‑3 with ~100B CPT tokens

0.60

2

You can adapt a large English-centric LLM to Chinese and science tasks with modest additional pretraining (~100B tokens) and targeted synthetic QA, improving domain value without a full retrain.

Key finding

C-Eval (Chinese) improved by 8.81 points after CPT.

Numbers: C‑Eval: 49.43 → 58.24 (+8.81)

WELLA: fine-tuned LLM agents that generate dynamic workload estimates for multi‑operator nuclear control rooms

0.40

0.60

0.50

1

Automating realistic workload data reduces expert labor and enables faster, cheaper safety testing and training for multi-operator control rooms.

Key finding

WELLA predicts per-role workload with very high fit for RO3.

Numbers: RO3 R²=0.9628, RMSE=3.5327, MAE=1.92

Teach code models to be secure by synthesizing vuln/fix pairs and using two-step generation that adds needed libraries

0.65

0.70

0.60

1

HexaCoder gives a practical, automatable path to reduce insecure code generation from LLMs by synthesizing repair data and fine-tuning models, lowering security risk in AI-assisted coding without harming productivity.

Key finding

The synthesis pipeline repaired 1,776 out of 2,042 vulnerable samples.

Numbers: fixed 1776/2042 (≈87.0%)

SimSUM — 10K simulated EHR encounters linking tabular features and LLM-written clinical notes for multimodal CIE

0.20

0.60

0.30

1

SimSUM provides a safe, fast playground to build and test multimodal clinical extraction methods without patient data; integrating text with tabular EHR features improves extraction accuracy for subtle symptoms.

Key finding

Dataset size and design

Numbers: 10,000 records; 16 tabular features per record

EmojiLM: a seq2seq English↔Emoji translator trained on a 503K synthetic parallel corpus

0.60

1

Emoji-aware models enable richer user-facing features (emoji translation, emoji-labeled classification, UI localization) and improve low-data performance; the synthetic corpus offers a low-cost way to build such models.

Key finding

Built the Text2Emoji corpus with about half a million parallel examples.

Numbers: 503.7K instances; 2.3K emoji vocab; avg text len 15.18

IQC + MMIQC: generate diverse math word problems to raise open LLM math accuracy

0.60

1

If you need better math reasoning from open models, combine cleaned web QAs with focused synthetic augmentation (IQC) to get consistent, low-effort accuracy gains without external tools.

Key finding

Fine-tuning on MMIQC raises MATH accuracy for models of multiple sizes.

Numbers: Qwen-72B-MMIQC 45.0% (MATH); Qwen-72B baseline 35.2%

FinLLMs: Use formulas + LLMs to auto-generate QA datasets for financial numerical reasoning

0.60

0.70

1

You can scale financial QA training data cheaply by programmatically generating tables, text, and formula-backed answers; this lowers reliance on costly expert annotation while improving model accuracy on financial numerical tasks.

Key finding

Training with FinLLMs synthetic data improves model accuracy versus FinQA.

Numbers: EA +2.01% and PA +3.09% (FinQANet BERT: 53.01 vs 50.00 EA; 51.09 vs 48.00 PA)

Train a small code LLM to write and refine programs using hindsight relabeling and prioritized replay

0.40

0.70

0.60

1

CodeIt shows small, open code LMs can be iteratively improved to solve nontrivial program synthesis tasks, lowering dependency on expensive huge-LM APIs and enabling automated DSL-based tools for constrained domains.

Key finding

CodeIt solves more ARC tasks than prior methods.

Numbers: 59/400 tasks solved (pass@3, ARC Eval)
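
Hindsight relabeling, as named in this entry's title, can be illustrated with a toy DSL: a sampled program that misses the target is still a correct solution for the output it actually produced, so it can be stored as a valid training example. The sketch below is an assumption-laden illustration, not CodeIt's code. `PROGRAMS`, `hindsight_relabel`, and the list task are hypothetical, and the "solved tasks first" ordering is only a simplified stand-in for prioritized replay.

```python
# Toy DSL: each program is a named function over a list of ints (hypothetical).
PROGRAMS = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
}

def hindsight_relabel(task_input, target_output, sampled_program):
    """Run the sampled program and store the output it really produced as the
    new goal, so the (input, output, program) triple is always consistent."""
    actual_output = PROGRAMS[sampled_program](task_input)
    return {
        "input": task_input,
        "output": actual_output,  # relabeled goal
        "program": sampled_program,
        "solves_original": actual_output == target_output,
    }

task_input, target_output = [3, 1, 2], [1, 2, 3]
buffer = [
    hindsight_relabel(task_input, target_output, name)
    for name in ("reverse", "sort")
]

# Simplified priority: examples that solve the original task are replayed first.
buffer.sort(key=lambda ex: ex["solves_original"], reverse=True)
```

Here the `reverse` sample fails the original task but still yields a usable example whose goal is `[2, 1, 3]`, while the `sort` sample solves the task and moves to the front of the buffer.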

AUGCON: automatic pipeline to generate diverse, multi-granularity SFT pairs from any corpus

0.60

0.70

0.60

1

AUGCON creates high-quality, diverse SFT pairs automatically, lowering annotation costs and improving domain-adapted LLM performance for productized assistants and search/chat features.

Key finding

AUGCON improves accuracy on reading QA benchmarks compared to prior context-driven SFT methods.

Numbers: SQuAD1.1 Acc 0.336 vs 0.314 (best baseline); TriviaQA 0.849 vs 0.825; DROP 0.350 vs 0.334; WebGLM-QA BS 0.924 vs 0.903

Finetune LLMs on synthetic key-value tasks to improve long-context retrieval and reasoning without adding factual hallucinations

0.60

0.50

0.60

1

A small synthetic finetuning set can materially improve long-document retrieval and reasoning without adding factual hallucinations or hurting general abilities, making it a low-risk, low-cost upgrade for LLM products that handle long inputs.

Key finding

Finetuning on synthetic key-value tasks improves long-context retrieval accuracy.

Numbers: GPT-3.5 Turbo: +10.5% on 20-doc MDQA at position 10 (reported)
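
A synthetic key-value retrieval example of the kind this entry describes might be generated as follows. This is a rough sketch under stated assumptions: the prompt wording, the token format, and the helper names `random_token` and `make_kv_example` are hypothetical and not taken from the paper. Because keys and values are random strings, the examples carry no real-world facts.

```python
import random
import string

def random_token(rng: random.Random, length: int = 8) -> str:
    """Random lowercase/digit string used for both keys and values."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choices(alphabet, k=length))

def make_kv_example(num_pairs: int, rng: random.Random) -> dict:
    """Build one retrieval example: a block of random key-value pairs,
    a question about one key, and the value the model should return."""
    pairs = {}
    while len(pairs) < num_pairs:  # loop guards against rare key collisions
        pairs[random_token(rng)] = random_token(rng)
    target_key = rng.choice(list(pairs))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n\nWhat is the value for key {target_key}?"
    return {"prompt": prompt, "answer": pairs[target_key]}

rng = random.Random(0)
dataset = [make_kv_example(num_pairs=20, rng=rng) for _ in range(5)]
```

Each item in `dataset` is a prompt/answer pair that could be written out as a finetuning record. Varying `num_pairs` changes how long the context is and where the target key falls within it.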

Iteratively prompt an LLM to produce filtered, diverse ABSA training data that rivals manual labels

0.60

0.45

0.65

1

IDG can produce usable labeled ABSA data from unlabeled text, lowering annotation cost and quickly bootstrapping sentiment models in new domains.

Key finding

IDG-generated data can match or exceed manual training data on ABSA models.

Numbers: R-GAT: Laptop14 F1 73.92→76.18 (+2.26); Rest14 F1 80.74→82.04 (+1.30)

100K synthetic image–caption pairs (SynthVLM-100K) give SOTA VLM results while using far less real data

0.60

0.50

0.70

1

High-quality synthetic image–caption data can cut data storage and pretraining cost dramatically while preserving or improving VLM performance on vision and language tasks.

Key finding

Curated synthetic data shows higher alignment and image fidelity than competing datasets.

Numbers: SynthVLM-100K: CLIP 0.36, SSIM 0.86, weighted 0.79 (Table 4)

GPT-4 can generate synthetic paragraphs and QA to improve some low-resource extractive QA datasets, but results depend on dataset size and c

0.50

0.40

0.60

1

LLMs can cheaply expand labeled training sets and reduce manual annotation for moderate low-resource QA domains, yielding measurable accuracy gains; but gains are dataset-dependent and fragile when labeled data is very scarce.

Key finding

On CovidQA, one-shot generation plus round-trip filtration improved RoBERTa EM and F1 over the original training set.

Numbers: EM 25.81 -> 31.90 (+6.09); F1 50.91 -> 58.66 (+7.75)
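
The round-trip filtration step mentioned above can be sketched as a simple consistency check. The sketch assumes it means keeping a generated pair only when a QA model, given the context and the generated question, reproduces the generated answer. `round_trip_filter`, `normalize`, and `toy_answer_fn` are hypothetical names, the two examples are made-up toy data, and the stub stands in for a real extractive QA model.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())

def round_trip_filter(examples, answer_fn):
    """Keep only examples whose generated answer matches what the QA model
    returns for the generated question."""
    kept = []
    for ex in examples:
        predicted = answer_fn(ex["context"], ex["question"])
        if normalize(predicted) == normalize(ex["answer"]):
            kept.append(ex)
    return kept

def toy_answer_fn(context: str, question: str) -> str:
    # Hypothetical stub for a QA model: return whatever follows " is " in the
    # context. The question is ignored here; a real model would use it.
    return context.rstrip(".").split(" is ", 1)[1]

examples = [
    {"context": "The capital of France is Paris.",
     "question": "What is the capital of France?",
     "answer": "Paris"},
    {"context": "The boiling point of water is 100 C.",
     "question": "What is the boiling point of water?",
     "answer": "90 C"},  # inconsistent synthetic answer, should be dropped
]

kept = round_trip_filter(examples, toy_answer_fn)
```

Only the first example survives, because its generated answer agrees with what the stub extracts from the context. Exact match after normalization is the strictest choice. A softer criterion such as token-level F1 above a cutoff would be another reasonable assumption.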