GPT-4 can generate synthetic paragraphs and QA to improve some low-resource extractive QA datasets, but results depend on dataset size and c

September 21, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The method is practical and reproducible (RoBERTa fine-tuning, GPT-4 prompts). Evidence is limited to three datasets; results are convincing for moderate data sizes but weak for very small datasets.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Vinay Samuel, Houda Aynaou, Arijit Ghosh Chowdhury, Karthik Venkat Ramanan, Aman Chadha

Why It Matters For Business

LLMs can cheaply expand labeled training sets and reduce manual annotation in moderately low-resource QA domains, yielding measurable accuracy gains. The gains are dataset-dependent, however, and fragile when labeled data is very scarce.

Summary TLDR

The authors use GPT-4 to synthesize paragraph contexts and extractive question-answer pairs via one- or two-shot prompting, then filter outputs with a round-trip consistency test. Fine-tuning RoBERTa-Base on the augmented data improves Exact Match (EM) and F1 on two moderate low-resource datasets (CovidQA: EM +6.1 pts, F1 +7.8 pts; PolicyQA: EM +1.6 pts, F1 +1.5 pts) but does not reliably help the smallest dataset (TechQA). The pipeline and filtered augmented datasets are released for research.

Problem Statement

Annotating extractive reading-comprehension data is costly and many domains have too few labeled examples. The paper asks: can a large LLM (GPT-4) generate synthetic contexts and QA pairs to substitute or supplement human labels for low-resource extractive QA?

Main Contribution

A two-stage GPT-4 pipeline that first synthesizes paragraph contexts (one/two-shot) and then generates QA pairs conditioned on those contexts.

A round-trip (cycle-consistency) filter: regenerate an answer given the question and keep QA pairs only if answers match.
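The round-trip filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: `answer_fn` is a hypothetical callable standing in for whatever model regenerates the answer, and the light normalization before comparison is an assumption, since the summary does not say how answers are compared.

```python
import string
from typing import Callable, Dict, List

QAPair = Dict[str, str]  # keys assumed here: "context", "question", "answer"


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (an assumed comparison rule)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def round_trip_filter(
    qa_pairs: List[QAPair],
    answer_fn: Callable[[str, str], str],
) -> List[QAPair]:
    """Keep a QA pair only if the regenerated answer matches the original answer.

    answer_fn(context, question) is a hypothetical stand-in for the model call
    that answers the question again from the context.
    """
    kept = []
    for pair in qa_pairs:
        regenerated = answer_fn(pair["context"], pair["question"])
        if normalize(regenerated) == normalize(pair["answer"]):
            kept.append(pair)
    return kept
```

Passing the model call in as a function keeps the filter independent of any particular API and makes it easy to test with a stub.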

Key Findings

On CovidQA, one-shot generation plus round-trip filtration improved RoBERTa EM and F1 over the original training set.

Numbers: EM 25.81 -> 31.90 (+6.09); F1 50.91 -> 58.66 (+7.75)

Practical Use: For similar moderate low-resource domains, use one-shot GPT-4 generation plus round-trip filtering to get measurable QA gains.

Evidence Ref: Table 1 (CovidQA; One Shot (CC))

On PolicyQA, unfiltered one-shot synthetic data produced modest improvements.

Numbers: EM 30.56 -> 32.18 (+1.62); F1 58.15 -> 59.61 (+1.46)

Practical Use: When you already have thousands of examples, adding unfiltered one-shot synthetic data can give small but consistent boosts without extra filtering.

Evidence Ref: Table 1 (PolicyQA; One Shot)
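As a quick sanity check on the reported numbers, the deltas can be recomputed from the baseline and augmented scores. The scores below are copied from the findings above. The relative-gain column is an added convenience, not something the source reports.

```python
# (baseline, augmented) scores as reported in the findings above.
RESULTS = {
    ("CovidQA", "EM"): (25.81, 31.90),
    ("CovidQA", "F1"): (50.91, 58.66),
    ("PolicyQA", "EM"): (30.56, 32.18),
    ("PolicyQA", "F1"): (58.15, 59.61),
}


def delta_report(results):
    """Return {(dataset, metric): (absolute_delta_points, relative_gain_percent)}."""
    report = {}
    for key, (baseline, augmented) in results.items():
        absolute = round(augmented - baseline, 2)
        relative = round(100 * (augmented - baseline) / baseline, 1)
        report[key] = (absolute, relative)
    return report


if __name__ == "__main__":
    for (dataset, metric), (absolute, relative) in delta_report(RESULTS).items():
        print(f"{dataset} {metric}: +{absolute:.2f} pts ({relative:.1f}% relative)")
```

The absolute deltas reproduce the +6.09, +7.75, +1.62, and +1.46 figures quoted above.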

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Exact Match | 31.90 | 25.81 (original trainset) | +6.09 | CovidQA (validation) | Table 1 (One Shot (CC) vs Original Trainset) | Table 1
F1 Score | 58.66 | 50.91 (original trainset) | +7.75 | CovidQA (validation) | Table 1 (One Shot (CC) vs Original Trainset) | Table 1

What To Try In 7 Days

Run one-shot GPT-4 to synthesize paragraph contexts from a handful of in-domain examples.

Generate QA pairs conditioned on those synthetic contexts.

Apply round-trip filtration: regenerate the answer for each question and keep only the QA pairs whose regenerated answer exactly matches the original, which boosts precision.

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Performance depends on original training size; very small datasets may not benefit.

Evaluation variance is high when test sets are tiny (TechQA test had 9 examples).
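To see why a 9-example test set is so noisy, a rough calculation helps. The sketch below treats each test example as an independent pass/fail trial (a binomial assumption made here for illustration; the source does not report these figures).

```python
import math


def em_points_per_example(n_examples: int) -> float:
    """EM percentage points gained or lost when a single example flips."""
    return 100.0 / n_examples


def em_standard_error(em_percent: float, n_examples: int) -> float:
    """Binomial standard error of an EM score, in percentage points."""
    p = em_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)


if __name__ == "__main__":
    n = 9  # TechQA test size cited above
    print(f"One example is worth {em_points_per_example(n):.1f} EM points")
    print(f"Standard error at 50% EM: {em_standard_error(50.0, n):.1f} points")
```

With 9 examples, a single flipped prediction moves EM by about 11 points, which is larger than any of the improvements reported for the other datasets.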

When Not To Use

When your labeled training set is extremely small (≈1–2k or fewer) and test sets are tiny.

For very technical domains where one-shot prompts cannot capture domain breadth.

Failure Modes

Generated contexts or answers may hallucinate facts not present in real documents.

Round-trip filtering trades recall for precision and may remove useful diverse examples.

Core Entities

Models

GPT-4, RoBERTa-Base, T5 (question generation baseline)

Metrics

Exact Match, F1
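For reference, Exact Match and token-level F1 for extractive QA are commonly computed in the SQuAD style shown below. This is a sketch assuming SQuAD-like normalization (lowercasing, removing punctuation and English articles). The summary does not state which scoring script the paper used.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, fix whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are the mean of these per-example values, usually reported as percentages. F1 gives partial credit for overlapping spans, which is why it sits well above EM in the results reported here.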

Datasets

CovidQA (2,019 QA pairs), PolicyQA (12,102 QA pairs), TechQA (1,808 examples)

Context Entities

Models

GPT-3 (cited), GPT-2 (cited)