GPT-4 can generate synthetic paragraphs and QA to improve some low-resource extractive QA datasets, but results depend on dataset size and c

September 21, 2023 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

1

Authors

Vinay Samuel, Houda Aynaou, Arijit Ghosh Chowdhury, Karthik Venkat Ramanan, Aman Chadha

Why It Matters For Business

LLMs can cheaply expand labeled training sets and reduce manual annotation in moderately low-resource QA domains, yielding measurable accuracy gains. The gains are dataset-dependent, however, and fragile when labeled data is very scarce.

Summary TLDR

The authors use GPT-4 to synthesize paragraph contexts and extractive question-answer pairs via one- or two-shot prompting, then filter the outputs with a round-trip consistency test. Fine-tuning RoBERTa-Base on the augmented data improves Exact Match (EM) and F1 on two moderately low-resource datasets (CovidQA: EM +6.1 pts, F1 +7.8 pts; PolicyQA: EM +1.6 pts, F1 +1.5 pts) but does not reliably help on the smallest dataset (TechQA). The pipeline and filtered augmented datasets are released for research.

Problem Statement

Annotating extractive reading-comprehension data is costly, and many domains have too few labeled examples. The paper asks: can a large LLM (GPT-4) generate synthetic contexts and QA pairs that substitute for, or supplement, human labels in low-resource extractive QA?

Main Contribution

A two-stage GPT-4 pipeline that first synthesizes paragraph contexts (one/two-shot) and then generates QA pairs conditioned on those contexts.

A round-trip (cycle-consistency) filter: regenerate an answer given the question and keep QA pairs only if answers match.

Empirical study on three low-resource datasets (CovidQA, PolicyQA, TechQA) showing consistent gains on CovidQA and PolicyQA but not on TechQA.

Release of augmented versions of the three low-resource datasets to encourage follow-up work.
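The round-trip filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: `answer_fn` is a hypothetical callable standing in for whatever model regenerates the answer, passing the context alongside the question is an assumption, and the light normalization before comparison is also an assumption.

```python
import string
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def round_trip_filter(
    examples: List[Dict[str, str]],
    answer_fn: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
) -> List[Dict[str, str]]:
    """Keep an example only if re-answering its question reproduces its answer."""
    kept = []
    for ex in examples:
        regenerated = answer_fn(ex["context"], ex["question"])
        if normalize(regenerated) == normalize(ex["answer"]):
            kept.append(ex)
    return kept
```

Requiring a match after normalization keeps the check strict while tolerating casing and punctuation differences; a looser criterion (for example, token overlap) would keep more examples at the cost of precision.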

Key Findings

On CovidQA, one-shot generation plus round-trip filtration improved RoBERTa EM and F1 over the original training set.

Numbers: EM 25.81 -> 31.90 (+6.09); F1 50.91 -> 58.66 (+7.75)

On PolicyQA, unfiltered one-shot synthetic data produced modest improvements.

Numbers: EM 30.56 -> 32.18 (+1.62); F1 58.15 -> 59.61 (+1.46)

On TechQA (smallest set), synthetic augmentation did not reliably improve performance and evaluation was high-variance.

Numbers: augmented EMs 11.11–22.22 vs best baseline EM 44.44; the tiny test set (9 examples) increases variance
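To see why a 9-example test set is so noisy, note that each test example is worth 100/9 ≈ 11.11 EM points. The small sketch below (function names are illustrative) shows that the reported TechQA scores are consistent with 1, 2, and 4 correct answers out of 9, so a single flipped prediction moves EM by more than 11 points.

```python
def em_step(num_test_examples: int) -> float:
    """Exact Match percentage points contributed by one test example."""
    return 100.0 / num_test_examples


def em_from_correct(num_correct: int, num_test_examples: int) -> float:
    """Exact Match (percent, 2 decimals) for a given number of correct answers."""
    return round(100.0 * num_correct / num_test_examples, 2)


# With 9 test examples, EM can only take values in steps of ~11.11 points.
scores = {k: em_from_correct(k, 9) for k in (1, 2, 4)}
# scores == {1: 11.11, 2: 22.22, 4: 44.44}
```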

Results

CovidQA, Exact Match: 31.90 (baseline 25.81, original trainset)

CovidQA, F1 Score: 58.66 (baseline 50.91, original trainset)

PolicyQA, Exact Match: 32.18 (baseline 30.56, original trainset)

PolicyQA, F1 Score: 59.61 (baseline 58.15, original trainset)

TechQA, Exact Match: 22.22 (baseline 44.44, best baseline)

What To Try In 7 Days

Run one-shot GPT-4 to synthesize paragraph contexts from a handful of in-domain examples.

Generate QA pairs conditioned on those synthetic contexts.

Apply round-trip filtration: regenerate answers and keep QA pairs when answers match exactly to boost precision.
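The first two steps above might be wired together as in the sketch below. Everything here is an assumption made for illustration: the prompt wording, the JSON reply format, and the `complete` callable (a stand-in for any text-completion client, not a real API) are hypothetical and are not taken from the paper.

```python
import json
from typing import Callable, Dict, List

# Hypothetical one-shot prompt templates; the paper's actual prompts may differ.
CONTEXT_PROMPT = (
    "Here is an example paragraph from the target domain:\n\n{example}\n\n"
    "Write one new paragraph in the same style and domain."
)
QA_PROMPT = (
    "Paragraph:\n\n{context}\n\n"
    "Write one question answerable by an exact span of the paragraph. "
    'Reply as JSON: {{"question": "...", "answer": "..."}}'
)


def synthesize(
    seed_paragraphs: List[str],
    complete: Callable[[str], str],  # hypothetical: prompt -> model reply
    n_per_seed: int = 1,
) -> List[Dict[str, str]]:
    """Stage 1: generate a context per seed. Stage 2: generate a QA pair for it."""
    rows = []
    for seed in seed_paragraphs:
        for _ in range(n_per_seed):
            context = complete(CONTEXT_PROMPT.format(example=seed)).strip()
            raw = complete(QA_PROMPT.format(context=context))
            try:
                qa = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip replies that are not valid JSON
            if not isinstance(qa, dict):
                continue
            question = str(qa.get("question", "")).strip()
            answer = str(qa.get("answer", "")).strip()
            # Extractive QA: the answer must appear verbatim in the context.
            if question and answer and answer in context:
                rows.append(
                    {"context": context, "question": question, "answer": answer}
                )
    return rows
```

Keeping the model behind a plain callable makes the pipeline easy to test with a stub before spending on API calls, and the verbatim-span check discards pairs that cannot be used for extractive training.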

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on original training size; very small datasets may not benefit.
  • Evaluation variance is high when test sets are tiny (TechQA test had 9 examples).
  • No cost or latency figures for GPT-4 generation in the paper.
  • Possible domain mismatch: synthetic contexts may miss real-world technical detail.

When Not To Use

  • When your labeled training set is extremely small (≈1–2k or fewer) and test sets are tiny.
  • For very technical domains where one-shot prompts cannot capture domain breadth.
  • If you cannot afford the API cost or need deterministic human-level labels.

Failure Modes

  • Generated contexts or answers may hallucinate facts not present in real documents.
  • Round-trip filtering trades recall for precision and may remove useful diverse examples.
  • Models fine-tuned on synthetic data may overfit synthetic patterns and fail on real distributions.
  • High evaluation variance with tiny test sets can hide true performance.

Core Entities

Models

  • GPT-4
  • RoBERTa-Base
  • T5 (question generation baseline)

Metrics

  • Exact Match
  • F1
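For reference, the two metrics are commonly computed per example as in the sketch below. It assumes the SQuAD-style convention (lowercasing, stripping punctuation and English articles, token-level F1); the summary does not state which exact evaluation script the paper used.

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level EM and F1 are then the averages of these per-example scores, usually reported as percentages.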

Datasets

  • CovidQA (2,019 QA pairs)
  • PolicyQA (12,102 QA pairs)
  • TechQA (1,808 examples)

Context Entities

Models

  • GPT-3 (cited)
  • GPT-2 (cited)