Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
LLMs can expand labeled training sets and reduce manual annotation in moderately low-resource QA domains, yielding measurable accuracy gains (up to +6.1 EM points on CovidQA). However, the gains are dataset-dependent and fragile when labeled data is very scarce, and the paper reports no GPT-4 cost figures.
Summary TLDR
The authors use GPT-4 to synthesize paragraph contexts and extractive question-answer pairs via one- or two-shot prompting, then filter the outputs with a round-trip consistency test. Fine-tuning RoBERTa-Base on the augmented data improves Exact Match (EM) and F1 on two moderately low-resource datasets (CovidQA: EM +6.1 pts, F1 +7.8 pts; PolicyQA: EM +1.6 pts, F1 +1.5 pts) but does not reliably help on the smallest dataset (TechQA). The pipeline and filtered augmented datasets are released for research.
Problem Statement
Annotating extractive reading-comprehension data is costly and many domains have too few labeled examples. The paper asks: can a large LLM (GPT-4) generate synthetic contexts and QA pairs to substitute or supplement human labels for low-resource extractive QA?
Main Contribution
A two-stage GPT-4 pipeline that first synthesizes paragraph contexts (one/two-shot) and then generates QA pairs conditioned on those contexts.
A round-trip (cycle-consistency) filter: regenerate an answer for each generated question and keep the QA pair only if the regenerated answer matches the original one.
Empirical study on three low-resource datasets (CovidQA, PolicyQA, TechQA) showing consistent gains on CovidQA and PolicyQA but not on TechQA.
Release of augmented versions of the three low-resource datasets to encourage follow-up work.
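The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt wording, the `Q: ... | A: ...` output format, and the helper names (`build_context_prompt`, `build_qa_prompt`, `parse_qa`, `synthesize`) are all assumptions, and the model is passed in as a plain callable so that no particular API is implied.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: any function that maps a prompt string to model text.
LLM = Callable[[str], str]


def build_context_prompt(example_contexts: List[str]) -> str:
    """Stage 1 prompt: show one or two in-domain paragraphs, ask for a new one."""
    shots = "\n\n".join(
        f"Example paragraph {i + 1}:\n{c}" for i, c in enumerate(example_contexts)
    )
    return f"{shots}\n\nWrite one new paragraph in the same domain and style."


def build_qa_prompt(context: str) -> str:
    """Stage 2 prompt: ask for an extractive QA pair about a synthetic paragraph."""
    return (
        f"Paragraph:\n{context}\n\n"
        "Write a question whose answer is an exact span of the paragraph.\n"
        "Format: Q: <question> | A: <answer span>"
    )


def parse_qa(raw: str) -> Tuple[str, str]:
    """Parse the assumed 'Q: ... | A: ...' output format."""
    q_part, a_part = raw.split("|", 1)
    question = q_part.replace("Q:", "", 1).strip()
    answer = a_part.replace("A:", "", 1).strip()
    return question, answer


def synthesize(llm: LLM, example_contexts: List[str]) -> Dict[str, str]:
    """Run both stages once and return one synthetic training item."""
    context = llm(build_context_prompt(example_contexts))
    question, answer = parse_qa(llm(build_qa_prompt(context)))
    return {"context": context, "question": question, "answer": answer}
```

Passing one example paragraph gives the one-shot setting and passing two gives the two-shot setting. For extractive QA it is worth checking that `answer` is a literal substring of `context` before keeping an item.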
Key Findings
On CovidQA, one-shot generation plus round-trip filtration improved RoBERTa EM and F1 compared with training on the original training set alone.
On PolicyQA, unfiltered one-shot synthetic data produced modest improvements.
On TechQA (smallest set), synthetic augmentation did not reliably improve performance and evaluation was high-variance.
Results
Exact Match and F1 Score: CovidQA EM +6.1 pts, F1 +7.8 pts; PolicyQA EM +1.6 pts, F1 +1.5 pts; TechQA showed no reliable improvement.
Who Should Care
What To Try In 7 Days
Run one-shot GPT-4 to synthesize paragraph contexts from a handful of in-domain examples.
Generate QA pairs conditioned on those synthetic contexts.
Apply round-trip filtration: regenerate answers and keep QA pairs when answers match exactly to boost precision.
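The round-trip filtration step can be sketched as below. This is an illustration under stated assumptions: `answer_fn` is a hypothetical callable standing in for the model that re-answers each question, and the light normalization (lowercasing, stripping punctuation and extra whitespace) is an assumption added here. Drop it for a strict character-level match.

```python
import string
from typing import Callable, Dict, List

# Hypothetical type: takes (context, question) and returns a regenerated answer.
AnswerFn = Callable[[str, str], str]


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def round_trip_filter(
    pairs: List[Dict[str, str]], answer_fn: AnswerFn
) -> List[Dict[str, str]]:
    """Keep only QA pairs whose regenerated answer matches the original."""
    kept = []
    for pair in pairs:
        regenerated = answer_fn(pair["context"], pair["question"])
        if normalize(regenerated) == normalize(pair["answer"]):
            kept.append(pair)
    return kept
```

Because only matching pairs survive, the filter raises precision at the cost of discarding some valid but differently phrased answers.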
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance depends on original training size; very small datasets may not benefit.
- Evaluation variance is high when test sets are tiny (TechQA test had 9 examples).
- No cost or latency figures for GPT-4 generation in the paper.
- Possible domain mismatch: synthetic contexts may miss real-world technical detail.
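A short calculation shows why a 9-example test set produces high-variance scores. The sketch below is illustrative only: the function names are hypothetical, and the binomial standard error is a rough approximation at such a small sample size.

```python
import math


def em_step(n_test: int) -> float:
    """EM change, in percentage points, when a single test answer flips."""
    return 100.0 / n_test


def em_standard_error(em_percent: float, n_test: int) -> float:
    """Binomial standard error of EM in percentage points (rough at small n)."""
    p = em_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_test)


# With 9 test examples, one flipped answer moves EM by about 11.1 points.
print(round(em_step(9), 1))
# At 50% EM the standard error is about 16.7 points.
print(round(em_standard_error(50.0, 9), 1))
```

With a step size this coarse, score differences of a few points between training configurations cannot be separated from noise.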
When Not To Use
- When your labeled training set is extremely small (≈1–2k or fewer) and test sets are tiny.
- For very technical domains where one-shot prompts cannot capture domain breadth.
- If you cannot afford the API cost or need deterministic human-level labels.
Failure Modes
- Generated contexts or answers may hallucinate facts not present in real documents.
- Round-trip filtering trades recall for precision and may remove useful diverse examples.
- Models fine-tuned on synthetic data may overfit synthetic patterns and fail on real distributions.
- High evaluation variance with tiny test sets can hide true performance.
Core Entities
Models
- GPT-4
- RoBERTa-Base
- T5 (question generation baseline)
Metrics
- Exact Match
- F1
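For reference, Exact Match and token-level F1 are commonly computed as in the sketch below, which follows the widely used SQuAD-style convention (lowercase, strip punctuation and English articles, compare tokens). That the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are the mean of these per-example values, usually reported as percentages.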
Datasets
- CovidQA (2,019 QA pairs)
- PolicyQA (12,102 QA pairs)
- TechQA (1,808 examples)
Context Entities
Models
- GPT-3 (cited)
- GPT-2 (cited)

