GPT-4 can generate synthetic paragraphs and QA to improve some low-resource extractive QA datasets, but results depend on dataset size and c

Overview

Decision Snapshot: Needs Validation

The method is practical and reproducible (RoBERTa fine-tuning, GPT-4 prompts). Evidence is limited to three datasets; results are convincing for moderate data sizes but weak for very small datasets.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Vinay Samuel, Houda Aynaou, Arijit Ghosh Chowdhury, Karthik Venkat Ramanan, Aman Chadha

Links

Abstract / PDF

Why It Matters For Business

LLMs can cheaply expand labeled training sets and reduce manual annotation in moderately low-resource QA domains, yielding measurable accuracy gains. The gains are dataset-dependent, however, and fragile when labeled data is very scarce.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors use GPT-4 to synthesize paragraph contexts and extractive question-answer pairs via one- or two-shot prompting, then filter outputs with a round-trip consistency test. Fine-tuning RoBERTa-Base on the augmented data improves Exact Match (EM) and F1 on two moderate low-resource datasets (CovidQA: EM +6.1 pts, F1 +7.8 pts; PolicyQA: EM +1.6 pts, F1 +1.5 pts) but does not reliably help the smallest dataset (TechQA). The pipeline and filtered augmented datasets are released for research.
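Exact Match and F1 for extractive QA are commonly computed SQuAD-style, on normalized answer strings. The sketch below assumes that convention (lowercasing, stripping punctuation and English articles); the summary does not state which evaluation script the authors used, so the function names and normalization rules here are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are the mean of these per-example values, multiplied by 100 to match the point scale used in the reported numbers.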

Problem Statement

Annotating extractive reading-comprehension data is costly, and many domains have too few labeled examples. The paper asks: can a large language model (GPT-4) generate synthetic contexts and QA pairs to substitute for or supplement human labels in low-resource extractive QA?

Main Contribution

A two-stage GPT-4 pipeline that first synthesizes paragraph contexts (one/two-shot) and then generates QA pairs conditioned on those contexts.

A round-trip (cycle-consistency) filter: regenerate an answer given the question and keep QA pairs only if answers match.
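The round-trip filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: `answer_fn` is a hypothetical callable standing in for the model that regenerates an answer, passing the context alongside the question is an assumption, and the whitespace and case normalization before comparison is also an assumption.

```python
from typing import Callable, Dict, Iterable, List

QAPair = Dict[str, str]  # assumed keys: "context", "question", "answer"


def _norm(text: str) -> str:
    """Case-fold and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def round_trip_filter(
    pairs: Iterable[QAPair],
    answer_fn: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
) -> List[QAPair]:
    """Keep only pairs whose regenerated answer matches the original answer."""
    kept = []
    for pair in pairs:
        regenerated = answer_fn(pair["context"], pair["question"])
        if _norm(regenerated) == _norm(pair["answer"]):
            kept.append(pair)
    return kept
```

A stricter or looser comparison (for example, byte-exact equality) can be swapped in by changing `_norm`.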

Key Findings

On CovidQA, one-shot generation plus round-trip filtration improved RoBERTa EM and F1 over the original training set.

Numbers: EM 25.81 -> 31.90 (+6.09); F1 50.91 -> 58.66 (+7.75)

Practical Use: For similar moderate low-resource domains, use one-shot GPT-4 generation plus round-trip filtering to get measurable QA gains.

Evidence Ref: Table 1 (CovidQA; One Shot (CC))

On PolicyQA, unfiltered one-shot synthetic data produced modest improvements.

Numbers: EM 30.56 -> 32.18 (+1.62); F1 58.15 -> 59.61 (+1.46)

Practical Use: When you already have thousands of examples, adding unfiltered one-shot synthetic data can give small but consistent boosts without extra filtering.

Evidence Ref: Table 1 (PolicyQA; One Shot)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exact Match	31.90	25.81 (original trainset)	+6.09	CovidQA (validation)	Table 1 (One Shot (CC) vs Original Trainset)	Table 1
F1 Score	58.66	50.91 (original trainset)	+7.75	CovidQA (validation)	Table 1 (One Shot (CC) vs Original Trainset)	Table 1
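The reported deltas can be re-derived from the baseline and augmented scores quoted in this summary. The snippet below is a small arithmetic check using `Decimal` to avoid floating-point noise; the `relative_gain` helper is an added convenience, not a figure reported in the source.

```python
from decimal import Decimal

# (dataset, metric) -> (baseline, augmented), as quoted in the findings above
SCORES = {
    ("CovidQA", "EM"): ("25.81", "31.90"),
    ("CovidQA", "F1"): ("50.91", "58.66"),
    ("PolicyQA", "EM"): ("30.56", "32.18"),
    ("PolicyQA", "F1"): ("58.15", "59.61"),
}


def deltas(scores):
    """Absolute improvement in points for each (dataset, metric) pair."""
    return {
        key: Decimal(new) - Decimal(base)
        for key, (base, new) in scores.items()
    }


def relative_gain(base: str, new: str) -> Decimal:
    """Improvement as a percentage of the baseline score."""
    return (Decimal(new) - Decimal(base)) / Decimal(base) * 100


if __name__ == "__main__":
    for key, delta in deltas(SCORES).items():
        print(key, f"+{delta}")
```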

What To Try In 7 Days

Run one-shot GPT-4 to synthesize paragraph contexts from a handful of in-domain examples.

Generate QA pairs conditioned on those synthetic contexts.

Apply round-trip filtration: regenerate answers and keep QA pairs when answers match exactly to boost precision.
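The three steps above can be sketched end to end. Everything named here is hypothetical: `llm` is any callable that maps a prompt string to a reply string (wrap your own GPT-4 client in it), the prompt wording is illustrative rather than the authors' prompts, and the check that the answer is a literal span of the context is an added assumption suited to extractive QA.

```python
from typing import Callable, Dict, List, Optional

LLM = Callable[[str], str]  # hypothetical: prompt in, reply text out


def synthesize_context(llm: LLM, seed: str) -> str:
    """Step 1: one-shot prompt that asks for a new in-domain paragraph."""
    prompt = (
        f"Example paragraph from the target domain:\n{seed}\n\n"
        "Write one new paragraph on a related topic in the same style."
    )
    return llm(prompt).strip()


def generate_qa(llm: LLM, context: str) -> Optional[Dict[str, str]]:
    """Step 2: ask for one extractive QA pair and parse the reply."""
    prompt = (
        f"Paragraph:\n{context}\n\n"
        "Write one question answerable by an exact span of the paragraph.\n"
        "Reply as:\nQuestion: <question>\nAnswer: <span>"
    )
    question, _, answer = llm(prompt).partition("\nAnswer:")
    question = question.replace("Question:", "", 1).strip()
    answer = answer.strip()
    # Assumption: discard malformed replies and non-extractive answers.
    if not question or not answer or answer not in context:
        return None
    return {"context": context, "question": question, "answer": answer}


def regenerate_answer(llm: LLM, context: str, question: str) -> str:
    """Step 3a: answer the generated question again, independently."""
    prompt = (
        f"Paragraph:\n{context}\n\nQuestion: {question}\n"
        "Answer with an exact span of the paragraph."
    )
    return llm(prompt).strip()


def build_synthetic_set(llm: LLM, seeds: List[str]) -> List[Dict[str, str]]:
    """Run steps 1-3 and keep pairs whose answers match exactly."""
    kept = []
    for seed in seeds:
        context = synthesize_context(llm, seed)
        pair = generate_qa(llm, context)
        if pair is None:
            continue
        # Step 3b: round-trip check with exact string equality.
        if regenerate_answer(llm, context, pair["question"]) == pair["answer"]:
            kept.append(pair)
    return kept
```

Keeping the model behind a plain callable makes the pipeline easy to dry-run with a stub before spending API budget.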

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Performance depends on original training size; very small datasets may not benefit.

Evaluation variance is high when test sets are tiny (TechQA test had 9 examples).
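To see why a 9-example test set is noisy, consider how much one example moves the score and how wide the sampling error is. The sketch below assumes each example is an independent pass/fail trial (a binomial approximation); the 30% EM used in the usage note is a hypothetical value for illustration, not a TechQA result.

```python
import math


def points_per_example(n_examples: int) -> float:
    """EM points gained or lost when a single test example flips."""
    return 100.0 / n_examples


def em_standard_error(em_percent: float, n_examples: int) -> float:
    """Binomial standard error of an EM score, in points."""
    p = em_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)


if __name__ == "__main__":
    n = 9  # TechQA test size quoted above
    print(f"one example = {points_per_example(n):.1f} EM points")
    print(f"standard error at a hypothetical 30% EM = {em_standard_error(30.0, n):.1f} points")
```

With 9 examples, a single flipped prediction shifts EM by about 11 points, which is larger than any of the improvements reported for the other datasets.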

When Not To Use

When your labeled training set is extremely small (≈1–2k or fewer) and test sets are tiny.

For very technical domains where one-shot prompts cannot capture domain breadth.

Failure Modes

Generated contexts or answers may hallucinate facts not present in real documents.

Round-trip filtering trades recall for precision and may remove useful diverse examples.

