TarGEN: generate balanced, diverse labeled NLP datasets from task descriptions (no seed examples)

October 27, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The pipeline is simple to run with ChatGPT and shows consistent multi-model gains on SuperGLUE; cost depends on LM API use and human review budget for validation.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 55%

Authors

Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra

Links

Abstract / PDF / Code

Why It Matters For Business

TarGEN can create labeled training data from task descriptions without human seeds, reducing annotation cost and enabling model training for niche or proprietary tasks where examples don't exist.

Who Should Care

Summary TLDR

TarGEN is a four-step, seedless prompting pipeline that uses large LMs (ChatGPT and others) to synthesize labeled datasets from task descriptions. It adds a single-step self-correction pass to relabel noisy outputs. On eight SuperGLUE tasks, models fine-tuned on TarGEN's synthetic data match or outperform models trained on original data (typical gains 1–5% accuracy). Synthetic data shows higher lexical and semantic diversity and a broader range of difficulty while exhibiting similar named-entity bias to originals.

Problem Statement

High-quality benchmarks are costly to create and many tasks have no labeled seed examples. Existing LM-driven data generation often depends on example seeds and produces low diversity. The paper asks: can a seedless, multi-step prompting pipeline plus an LM-based self-correction step synthesize labeled datasets that train competitive models?

Main Contribution

TarGEN: a 4-step, seedless prompting pipeline (contexts → instance seeds → label-constrained generation → self-correction) for targeted dataset generation.

Self-correction: one meta-prompt evaluation pass with an LM to relabel noisy instances and reduce mislabels.
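The four steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `LM` is any callable that maps a prompt string to a response string, the prompt wording is invented for illustration, and `toy_lm` is a hypothetical offline stand-in so the sketch runs without an API.

```python
from typing import Callable, Dict, List

LM = Callable[[str], str]  # assumption: any prompt -> text function


def targen_sketch(task: str, labels: List[str], lm: LM,
                  n_contexts: int = 2, per_label: int = 1) -> List[Dict[str, str]]:
    """Contexts -> instance seeds -> label-constrained generation ->
    one self-correction pass. Prompts are illustrative only."""
    dataset = []
    # Step 1: ask the LM for diverse contexts (domains, topics).
    raw = lm(f"List {n_contexts} distinct contexts for the task: {task}")
    contexts = [c.strip() for c in raw.split("\n") if c.strip()][:n_contexts]
    for context in contexts:
        # Step 2: turn each context into an instance seed (e.g. a passage).
        seed = lm(f"Write a short passage about: {context}")
        # Step 3: generate the same number of instances per label,
        # which keeps the dataset balanced by construction.
        for label in labels:
            for _ in range(per_label):
                text = lm(f"Task: {task}\nPassage: {seed}\n"
                          f"Write an input whose correct label is '{label}'.")
                dataset.append({"seed": seed, "input": text, "label": label})
    # Step 4: single self-correction pass; relabel when the LM disagrees.
    for row in dataset:
        verdict = lm(f"Task: {task}\nPassage: {row['seed']}\n"
                     f"Input: {row['input']}\n"
                     f"Answer with one of {labels}.").strip()
        if verdict in labels:
            row["label"] = verdict
    return dataset


def toy_lm(prompt: str) -> str:
    """Hypothetical deterministic stand-in for a real LM call."""
    if prompt.startswith("List"):
        return "cooking\nastronomy"
    if prompt.startswith("Write a short passage"):
        return "A passage about " + prompt.split(": ", 1)[1]
    if "Answer with one of" in prompt:
        return "no" if "(no)" in prompt else "yes"
    label = prompt.rsplit("'", 2)[1]  # label requested in step 3
    return f"Example input ({label})"


rows = targen_sketch("yes/no question answering", ["yes", "no"], toy_lm)
print(len(rows), [r["label"] for r in rows])
```

Swapping `toy_lm` for a function that calls a real model is the only change needed to try the flow end to end; generating a fixed number of instances per label is one simple way to keep the output balanced.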

Key Findings

Models trained on TarGEN synthetic SuperGLUE match or improve over original-data-trained models.

Numbers: average accuracy uplift, Og→Syn ≈ +1.1 to +2.8 percentage points across models (Table 6, Table 3)

Practical Use: You can generate training data without human seeds and still get equal or better fine-tuning results on standard tasks; try TarGEN when original labels are missing.

Evidence Ref: Table 6, Table 3

Instruction tuning on synthetic data further improves performance.

Numbers: instruction tuning gains of +3.42% for Flan T5 and +3.24% for Pythia GPT (reported)

Practical Use: If you plan to instruction-tune, do it on the synthetic set; the reported boost is roughly 3%.

Evidence Ref: §4.1, Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Og avg → Syn avg: +~1.1–2.1 pp across model families | Original (Og) | ≈ +1–3 pp | Average across eight SuperGLUE tasks (Table 6) | Table 6 shows Og vs Syn and Og-I vs Syn-I averages | Table 6
Instruction tuning improvement | Flan T5: +3.42 pp; Pythia GPT: +3.24 pp | Non-instruction-tuned variants | +3.24 to +3.42 pp | Single-task settings (Table 3, §4.1) | §4.1, Table 3 | Table 3

What To Try In 7 Days

Run TarGEN on one low-data classification or NLI task you care about using ChatGPT and the provided prompts.

Add the one-step LM self-correction pass to relabel noisy outputs before fine-tuning.

Fine-tune a small target model (e.g., T5-large or Llama2-7B) on the synthetic set and compare against any available real data baseline.
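Before and after the fine-tuning run, a few sanity checks help keep the pilot honest. The sketch below is an assumption-laden helper, not part of TarGEN: the row fields (`input`, `label`) and all sample values are hypothetical placeholders, not reported results. It removes duplicate generations, checks label balance, and expresses the synthetic-vs-real comparison in percentage points.

```python
from collections import Counter
from typing import Dict, List, Sequence


def dedupe(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop repeated inputs (ignoring case and spacing), keeping the first."""
    seen, kept = set(), []
    for row in rows:
        key = " ".join(row["input"].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def label_shares(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of rows carrying each label."""
    counts = Counter(row["label"] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Share of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


# Placeholder synthetic rows (hypothetical, for illustration only).
synthetic = [
    {"input": "Is water wet?", "label": "yes"},
    {"input": "is water   wet?", "label": "yes"},  # near-verbatim repeat
    {"input": "Is fire cold?", "label": "no"},
]
clean = dedupe(synthetic)
shares = label_shares(clean)

# Placeholder predictions from two fine-tuned models on the same test set.
gold = ["yes", "no", "yes", "no"]
preds_real_trained = ["yes", "no", "no", "no"]
preds_syn_trained = ["yes", "no", "yes", "no"]
delta_pp = 100 * (accuracy(preds_syn_trained, gold)
                  - accuracy(preds_real_trained, gold))
print(len(clean), shares, delta_pp)
```

Evaluating both models on the same held-out real test set keeps the delta comparable to the Og vs Syn comparison reported in the paper.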

Agent Features

Planning
multi-step prompt pipeline (contexts → seeds → label-constrained gen → self-correction)
Tool Use
ChatGPT for generation and evaluation; Llama-3 / Claude Sonnet used as alternative generators in ablation
Frameworks
TarGEN pipeline; self-correction meta-prompt

Optimization Features

Training Optimization
instruction tuning improves results

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

TarGEN depends on a closed-source LM (ChatGPT) in experiments; generalization to other LMs was tested but not extensively quantified.

Some datasets were truncated due to generation cost (ReCoRD, BoolQ), so scale effects are not fully measured.

When Not To Use

When legal or safety-critical labels require expert human annotation.

When you cannot afford LM API costs for large-scale generation and self-correction.

Failure Modes

LLM hallucination producing unrealistic instances that pass automatic checks but mislead models.

Self-correction relying on the same LM could reinforce shared model biases instead of correcting them.
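One way to keep these failure modes visible is to audit what the self-correction pass changed and to spend the human review budget there first. The sketch below is a suggestion under assumptions, not something the paper describes: the fields `label_before` and `label_after` are hypothetical names for the label before and after self-correction.

```python
import random
from typing import Dict, List


def relabel_rate(rows: List[Dict[str, str]]) -> float:
    """Share of rows whose label the self-correction pass changed."""
    changed = sum(r["label_before"] != r["label_after"] for r in rows)
    return changed / len(rows)


def review_queue(rows: List[Dict[str, str]], budget: int,
                 seed: int = 0) -> List[Dict[str, str]]:
    """Pick up to `budget` rows for human review, relabeled rows first."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    changed = [r for r in rows if r["label_before"] != r["label_after"]]
    unchanged = [r for r in rows if r["label_before"] == r["label_after"]]
    rng.shuffle(changed)
    rng.shuffle(unchanged)
    return (changed + unchanged)[:budget]


# Placeholder rows (hypothetical, for illustration only).
audit_rows = [
    {"id": "a", "label_before": "yes", "label_after": "yes"},
    {"id": "b", "label_before": "no", "label_after": "yes"},
    {"id": "c", "label_before": "no", "label_after": "no"},
    {"id": "d", "label_before": "yes", "label_after": "yes"},
]
rate = relabel_rate(audit_rows)
queue = review_queue(audit_rows, budget=2)
print(rate, [r["id"] for r in queue])
```

A very high or very low relabel rate is a prompt to look closer; reviewing a random slice of unchanged rows as well helps catch errors that generator and corrector share.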

Core Entities

Models

ChatGPT, Llama2-7B, T5-3B, Mistral-7B, RoBERTa-large, Pythia (410M), Cerebras GPT, Flan T5 Large, Llama-3 (70B), Claude Sonnet

Metrics

Accuracy, Rouge-L, V-usable information (dataset difficulty), cosine similarity (semantic diversity), lexical diversity (vocabulary counts)
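The two diversity metrics can be approximated with a short script. This is a rough sketch under assumptions: tokenization is plain whitespace splitting, and cosine similarity is computed over bag-of-words counts as a stand-in, since this summary does not say which text representation the paper uses. Lower mean pairwise similarity and a larger vocabulary both suggest a more diverse dataset.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption, kept simple)."""
    return text.lower().split()


def vocabulary_size(texts: List[str]) -> int:
    """Lexical diversity proxy: number of distinct tokens in the dataset."""
    return len({tok for text in texts for tok in tokenize(text)})


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_cosine(texts: List[str]) -> float:
    """Semantic diversity proxy: average similarity over all text pairs."""
    vectors = [Counter(tokenize(t)) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Placeholder texts (hypothetical, for illustration only).
sample = ["the cat sat", "the cat sat", "dogs bark loudly"]
vocab = vocabulary_size(sample)
similarity = mean_pairwise_cosine(sample)
print(vocab, round(similarity, 3))
```

Running the same functions on the original and the synthetic split of a task gives a quick side-by-side check; replacing the count vectors with sentence embeddings would bring the similarity measure closer to a semantic one.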

Datasets

Synthetic SuperGLUE (TarGEN), SuperGLUE (original), OpenLLM benchmark

Benchmarks

SuperGLUE, OpenLLM