TarGEN: generate balanced, diverse labeled NLP datasets from task descriptions (no seed examples)

Overview

Decision Snapshot: Ready For Pilot

The pipeline is simple to run with ChatGPT and shows consistent multi-model gains on SuperGLUE; cost depends on LM API use and human review budget for validation.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 55%

Authors

Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra

Why It Matters For Business

TarGEN can create labeled training data from task descriptions without human seeds, reducing annotation cost and enabling model training for niche or proprietary tasks where examples don't exist.

Who Should Care

ML Engineer, Data Scientist, Product Manager, Founder

Summary TLDR

TarGEN is a four-step, seedless prompting pipeline that uses large LMs (ChatGPT and others) to synthesize labeled datasets from task descriptions. It adds a single-step self-correction pass to relabel noisy outputs. On eight SuperGLUE tasks, models fine-tuned on TarGEN's synthetic data match or outperform models trained on original data (typical gains 1–5% accuracy). Synthetic data shows higher lexical and semantic diversity and a broader range of difficulty while exhibiting similar named-entity bias to originals.

Problem Statement

High-quality benchmarks are costly to create and many tasks have no labeled seed examples. Existing LM-driven data generation often depends on example seeds and produces low diversity. The paper asks: can a seedless, multi-step prompting pipeline plus an LM-based self-correction step synthesize labeled datasets that train competitive models?

Main Contribution

TarGEN: a 4-step, seedless prompting pipeline (contexts → instance seeds → label-constrained generation → self-correction) for targeted dataset generation.

Self-correction: one meta-prompt evaluation pass with an LM to relabel noisy instances and reduce mislabels.
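The four steps and the self-correction pass can be sketched as below. This is a minimal illustration, not the authors' implementation: the `llm` callable, the function name `targen_sketch`, and all prompt wording are assumptions made for the example, and the paper's actual prompts are task-specific.

```python
from typing import Callable, Dict, List

# Hypothetical LM interface (an assumption): prompt string in, completion text out.
LLM = Callable[[str], str]

def targen_sketch(llm: LLM, task: str, labels: List[str],
                  n_contexts: int = 2) -> List[Dict[str, str]]:
    """Illustrative four-step flow; the prompts are placeholders."""
    # Step 1: ask for task-relevant contexts (topics or domains).
    raw = llm(f"List {n_contexts} contexts for the task: {task}")
    contexts = [c.strip() for c in raw.splitlines() if c.strip()][:n_contexts]

    dataset: List[Dict[str, str]] = []
    for context in contexts:
        # Step 2: turn each context into an instance seed.
        seed = llm(f"Write a short passage about: {context}")
        for label in labels:
            # Step 3: label-constrained generation, one instance per label,
            # so the label distribution is balanced by construction.
            text = llm(
                f"Task: {task}\nPassage: {seed}\n"
                f"Write an instance whose label is '{label}'."
            )
            dataset.append({"text": text, "label": label})

    # Step 4: one self-correction pass; relabel only when the LM
    # answers with a label from the allowed set.
    for row in dataset:
        verdict = llm(
            f"Task: {task}\nInstance: {row['text']}\nLabels: {labels}\n"
            "Answer with the correct label only."
        ).strip()
        if verdict in labels:
            row["label"] = verdict
    return dataset
```

Passing the LM as a plain callable keeps the sketch independent of any particular API client, so the same flow can be tried with a different generator.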

Key Findings

Models trained on TarGEN synthetic SuperGLUE match or improve over original-data-trained models.

Numbers: Avg accuracy uplift, Og→Syn: ≈ +1.1 to +2.8 percentage points across models (Table 6, Table 3)

Practical Use: You can generate training data without human seeds and still get equal or better fine-tuning results on standard tasks; try TarGEN when original labels are missing.

Evidence Ref: Table 6, Table 3

Instruction tuning on synthetic data further improves performance.

Numbers: Instruction tuning gains: Flan T5 +3.42%, Pythia GPT +3.24% (reported)

Practical Use: If you plan to instruction-tune, apply instruction-tuning steps on the synthetic set to gain a ~3% boost.

Evidence Ref: §4.1, Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Og avg → Syn avg: ≈ +1.1 to +2.1 pp across model families	Original (Og)	≈ +1 to +3 pp	Average across eight SuperGLUE tasks (Table 6)	Table 6 shows Og vs Syn and Og-I vs Syn-I averages	Table 6
Instruction tuning improvement	Flan T5: +3.42 pp; Pythia GPT: +3.24 pp	Non-instruction-tuned variants	+3.24 to +3.42 pp	Single-task settings (Table 3, §4.1)	§4.1, Table 3	Table 3
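Deltas in this section are quoted in percentage points (pp), which is not the same as a relative percent change. The short sketch below shows the difference; the function names and the two accuracy values are made up for illustration and are not results from the paper.

```python
def delta_pp(baseline_acc: float, new_acc: float) -> float:
    """Absolute difference in percentage points (inputs are accuracies in percent)."""
    return new_acc - baseline_acc

def delta_relative(baseline_acc: float, new_acc: float) -> float:
    """Relative change, expressed as a percent of the baseline."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc

# Illustrative accuracies only (assumed values, not taken from the paper).
og, syn = 80.0, 82.0
print(delta_pp(og, syn))        # 2.0
print(delta_relative(og, syn))  # 2.5
```

When comparing your own original-data and synthetic-data runs, reporting the pp difference keeps the numbers comparable with the table.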

What To Try In 7 Days

Run TarGEN on one low-data classification or NLI task you care about using ChatGPT and the provided prompts.

Add the one-step LM self-correction pass to relabel noisy outputs before fine-tuning.

Fine-tune a small target model (e.g., T5-large or Llama2-7B) on the synthetic set and compare against any available real data baseline.
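Before fine-tuning, it can help to check that the corrected synthetic set is still deduplicated and label-balanced, since relabeling may shift the label counts. The sketch below is one simple way to do that, assuming each instance is a dict with `text` and `label` keys; that record layout and the function name `prepare_for_finetuning` are assumptions for the example, not part of TarGEN.

```python
from collections import Counter
from typing import Dict, List

def prepare_for_finetuning(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop exact duplicate texts, then cap every label at the rarest label's count."""
    seen, unique = set(), []
    for row in rows:
        key = row["text"].strip().lower()  # case- and whitespace-insensitive match
        if key not in seen:
            seen.add(key)
            unique.append(row)
    if not unique:
        return []

    cap = min(Counter(r["label"] for r in unique).values())
    kept, used = [], Counter()
    for row in unique:
        # Keep the first `cap` instances of each label, in original order.
        if used[row["label"]] < cap:
            used[row["label"]] += 1
            kept.append(row)
    return kept
```

Downsampling to the rarest label is the bluntest option; generating a few more instances for the under-represented label is an alternative when API budget allows.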

Agent Features

Planning

multi-step prompt pipeline (contexts → seeds → label-constrained gen → self-correction)

Tool Use

ChatGPT for generation and evaluation

Llama-3 / Claude Sonnet used as alternative generators in ablation

Frameworks

TarGEN pipeline, self-correction meta-prompt

Optimization Features

Training Optimization

instruction tuning improves results

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/kevinscaria/TarGEN

Risks & Boundaries

Limitations

TarGEN depends on a closed-source LM (ChatGPT) in experiments; generalization to other LMs was tested but not extensively quantified.

Some datasets were truncated due to generation cost (ReCoRD, BoolQ), so scale effects are not fully measured.

When Not To Use

When legal or safety-critical labels require expert human annotation.

When you cannot afford LM API costs for large-scale generation and self-correction.

Failure Modes

LLM hallucination producing unrealistic instances that pass automatic checks but mislead models.

Self-correction relying on the same LM could reinforce shared model biases instead of correcting them.

Core Entities

Models

ChatGPT, Llama2-7B, T5-3B, Mistral-7B, RoBERTa-large, Pythia (410M), Cerebras GPT, Flan T5 Large, Llama-3 (70B), Claude Sonnet

Metrics

Accuracy, ROUGE-L, V-usable information (dataset difficulty), cosine similarity (semantic diversity), lexical diversity (vocabulary counts)
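The two diversity metrics can be approximated with a few lines of standard-library code. The sketch below is a rough stand-in, not the paper's measurement setup: it assumes whitespace tokenization and bag-of-words count vectors, whereas a semantic diversity measure would normally use sentence embeddings. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List

def vocabulary_size(texts: List[str]) -> int:
    """Lexical diversity proxy: number of distinct lowercase whitespace tokens."""
    return len({tok for t in texts for tok in t.lower().split()})

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def mean_pairwise_cosine(texts: List[str]) -> float:
    """Average similarity over all pairs; lower values suggest more varied instances."""
    vecs = [Counter(t.lower().split()) for t in texts]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
```

Running both functions on the original and synthetic versions of the same task gives a quick side-by-side comparison; pairwise similarity is quadratic in the number of instances, so sample a subset for large sets.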

Datasets

Synthetic SuperGLUE (TarGEN), SuperGLUE (original), OpenLLM benchmark

Benchmarks

SuperGLUE, OpenLLM
