TarGEN: generate balanced, diverse labeled NLP datasets from task descriptions (no seed examples)

October 27, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.45

Citation Count

3

Authors

Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra

Links

Abstract / PDF

Why It Matters For Business

TarGEN can create labeled training data from task descriptions without human seeds, reducing annotation cost and enabling model training for niche or proprietary tasks where examples don't exist.

Summary TLDR

TarGEN is a four-step, seedless prompting pipeline that uses large LMs (ChatGPT and others) to synthesize labeled datasets from task descriptions. Its final step is a single self-correction pass that relabels noisy outputs. On eight SuperGLUE tasks, models fine-tuned on TarGEN's synthetic data match or outperform models trained on the original data (typical gains 1–5% accuracy). The synthetic data shows higher lexical and semantic diversity and a broader range of difficulty, while exhibiting named-entity bias similar to that of the originals.

Problem Statement

High-quality benchmarks are costly to create and many tasks have no labeled seed examples. Existing LM-driven data generation often depends on example seeds and produces low diversity. The paper asks: can a seedless, multi-step prompting pipeline plus an LM-based self-correction step synthesize labeled datasets that train competitive models?

Main Contribution

TarGEN: a 4-step, seedless prompting pipeline (contexts → instance seeds → label-constrained generation → self-correction) for targeted dataset generation.

Self-correction: one meta-prompt evaluation pass with an LM to relabel noisy instances and reduce mislabels.

Empirical study: synthetic SuperGLUE (8 tasks) shows models trained on synthetic data match or beat originals across model families and settings.

Analysis: synthetic data has higher lexical/semantic diversity, wider difficulty distribution (V-usable information), and similar named-entity bias to original data.
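
The four steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prompt wording, the `targen_sketch` and `fake_llm` names, and the choice to cycle through labels for class balance are all assumptions made for the example. `llm` stands for any function that maps a prompt string to a completion string; `fake_llm` is a deterministic stand-in so the sketch runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def _lines(text: str, limit: int) -> List[str]:
    """Split a completion into non-empty lines and keep at most `limit`."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()][:limit]


def targen_sketch(task: str, labels: List[str], llm: LLM,
                  n_contexts: int = 3, seeds_per_context: int = 2) -> List[Dict[str, str]]:
    """Illustrative four-step flow; every prompt below is an assumed wording."""
    dataset: List[Dict[str, str]] = []
    # Step 1: ask for diverse contexts, one per line.
    contexts = _lines(llm(f"List {n_contexts} distinct contexts for the task: {task}"), n_contexts)
    for context in contexts:
        # Step 2: generate instance seeds within each context.
        seeds = _lines(llm(f"Write {seeds_per_context} short seed passages about: {context}"),
                       seeds_per_context)
        for seed in seeds:
            # Step 3: label-constrained generation. The label is fixed first and the
            # instance is generated to fit it; cycling labels (an assumption here)
            # keeps the classes balanced.
            label = labels[len(dataset) % len(labels)]
            text = llm(f"Task: {task}\nSeed: {seed}\nWrite one instance whose label is '{label}'.")
            dataset.append({"text": text.strip(), "label": label})
    # Step 4: one self-correction pass. Relabel only if the LM returns a valid label.
    for row in dataset:
        verdict = llm(f"Task: {task}\nInstance: {row['text']}\n"
                      f"Proposed label: {row['label']}\nAnswer with the correct label only.").strip()
        if verdict in labels:
            row["label"] = verdict
    return dataset


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LM call, used only to run the sketch."""
    if prompt.startswith("List"):
        return "news\nreviews\nforums"
    if prompt.startswith("Write"):
        return "seed a\nseed b"
    if "Proposed label:" in prompt:
        # This fake judge simply agrees with the proposed label.
        return prompt.split("Proposed label:")[1].split()[0]
    return "generated instance"


data = targen_sketch("textual entailment", ["entailment", "not_entailment"], fake_llm)
print(len(data), data[0])
```

Swapping `fake_llm` for a real API client is the only change needed to try the flow on a real task; the prompts would then need tuning per task.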

Key Findings

Models trained on TarGEN synthetic SuperGLUE match or improve over original-data-trained models.

Numbers: average accuracy uplift (Og→Syn) of about +1.1 to +2.8 percentage points across models (Table 6, Table 3)

Instruction tuning on synthetic data further improves performance.

Numbers: instruction tuning gains of +3.42 pp for Flan T5 and +3.24 pp for Pythia GPT (reported)

Self-correction meaningfully reduces label noise and boosts final accuracy.

Numbers: T5-3B multi-task, average gain of +5.9 pp with self-correction (Table 5)

Llama2 (7B) pre-finetuned on TarGEN data beats a Self-Instruct pre-finetuned model on OpenLLM tasks.

Numbers: OpenLLM average, L2SSG 49.2 vs L2SI 46.58, a gain of +2.62 percentage points (Table 8)

Synthetic data is more lexically and semantically diverse and spans more difficulty levels.

Numbers: average lexical diversity +25%; within-dataset cosine similarity is lower for synthetic than for original data (Fig. 3, §5)
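
To make the two diversity measures concrete, here is a small sketch that computes a vocabulary count and a mean within-dataset cosine similarity. It is an approximation under stated assumptions: it uses whitespace tokens and bag-of-words count vectors, whereas the paper's semantic measure may rely on learned sentence embeddings. The function names and the toy texts are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercased whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def vocab_size(texts: List[str]) -> int:
    """Lexical diversity proxy: number of distinct tokens in the dataset."""
    return len({tok for text in texts for tok in tokens(text)})


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_cosine(texts: List[str]) -> float:
    """Average similarity over all instance pairs; lower means more diverse."""
    vectors = [Counter(tokens(text)) for text in texts]
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


# Toy data, invented for illustration only.
repetitive = ["the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran"]
print(vocab_size(repetitive), vocab_size(varied))
print(mean_pairwise_cosine(repetitive), mean_pairwise_cosine(varied))
```

On real data the same two numbers can be compared between the original and synthetic versions of a task; the pairwise loop is quadratic, so sample a subset for large datasets.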

Results

Accuracy

Value: Og avg → Syn avg, about +1.1 to +2.1 pp across model families

Baseline: Original (Og)

Instruction tuning improvement

Value: Flan T5 +3.42 pp; Pythia GPT +3.24 pp

Baseline: non-instruction-tuned variants

Multi-task finetuning gains (synthetic vs original)

Value: T5-3B +4.73 pp; Llama2-7B +3.21 pp; Mistral-7B +2.94 pp (avg over tasks)

Baseline: multi-task finetuning on original data

Self-correction effect

Value: avg +5.9 pp improvement with self-correction (T5-3B multi-task)

Baseline: synthetic data without self-correction

Pre-finetune OpenLLM benchmark avg

Value: Llama2 (7B) pre-finetuned on TarGEN 49.2 vs Self-Instruct 46.58

Baseline: Self-Instruct pre-finetuned Llama2-7B
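
Because the gains above are reported in percentage points (pp), it can help to see how a pp difference relates to a relative change. The sketch below works through the OpenLLM averages quoted above; the `uplift` helper is a hypothetical name introduced for this example.

```python
from typing import Tuple


def uplift(baseline: float, new: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    pp = new - baseline
    relative = 100.0 * pp / baseline
    return round(pp, 2), round(relative, 2)


# OpenLLM averages from the results above: Self-Instruct 46.58, TarGEN 49.2.
pp_gain, relative_gain = uplift(46.58, 49.2)
print(pp_gain, relative_gain)  # → 2.62 5.62
```

The absolute gain of 2.62 pp corresponds to a relative improvement of roughly 5.6% over the Self-Instruct baseline, which is why the two units should not be used interchangeably.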

Who Should Care

What To Try In 7 Days

Run TarGEN on one low-data classification or NLI task you care about using ChatGPT and the provided prompts.

Add the one-step LM self-correction pass to relabel noisy outputs before fine-tuning.

Fine-tune a small target model (e.g., T5-large or Llama2-7B) on the synthetic set and compare against any available real data baseline.

Agent Features

Planning

  • multi-step prompt pipeline (contexts → seeds → label-constrained gen → self-correction)

Tool Use

  • ChatGPT for generation and evaluation
  • Llama-3 / Claude Sonnet used as alternative generators in ablation

Frameworks

  • TarGEN pipeline
  • self-correction meta-prompt

Optimization Features

Training Optimization

  • instruction tuning improves results

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • TarGEN depends on a closed-source LM (ChatGPT) in experiments; generalization to other LMs was tested but not extensively quantified.
  • Some datasets were truncated due to generation cost (ReCoRD, BoolQ), so scale effects are not fully measured.
  • Human evaluation was internal (authors); external, blind annotation was not performed.

When Not To Use

  • When legal or safety-critical labels require expert human annotation.
  • When you cannot afford LM API costs for large-scale generation and self-correction.
  • When strict provenance of source examples is required (TarGEN generates novel text).

Failure Modes

  • LLM hallucination producing unrealistic instances that pass automatic checks but mislead models.
  • Self-correction relying on the same LM could reinforce shared model biases instead of correcting them.
  • A context list or prompt design that is too narrow lowers diversity and reduces downstream gains.

Core Entities

Models

  • ChatGPT
  • Llama2-7B
  • T5-3B
  • Mistral-7B
  • RoBERTa-large
  • Pythia (410M)
  • Cerebras GPT
  • Flan T5 Large
  • Llama-3 (70B)
  • Claude Sonnet

Metrics

  • Accuracy
  • Rouge-L
  • V-usable information (dataset difficulty)
  • cosine similarity (semantic diversity)
  • lexical diversity (vocabulary counts)

Datasets

  • Synthetic SuperGLUE (TarGEN)
  • SuperGLUE (original)
  • OpenLLM benchmark

Benchmarks

  • SuperGLUE
  • OpenLLM