Use ChatGPT to generate paraphrases and improve open-intent detection on compositionally different test sets

Overview

Decision Snapshot: Needs Validation

Results show consistent small gains (2–5% F1) on three CG subsets with BERT+ADB. The evidence is empirical but limited to a specific backbone, specific CG splits, and specific prompt choices, so treat it as a promising proof of concept.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 50%

Authors

Yihao Fang, Xianzhi Li, Stephen W. Thomas, Xiaodan Zhu

Why It Matters For Business

Adding LLM-generated paraphrases can cheaply raise intent-detection performance under realistic language variation, reducing missed or misrouted user requests and improving conversational UX.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

The authors build compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning training/test pairs with high ROUGE-L overlap. They use ChatGPT to synthesize paraphrases and add them to BERT+ADB training under three strategies (4 paraphrases, 10 paraphrases, or 10 for wrongly predicted instances). Across the CG subsets, adding ChatGPT paraphrases (GPTAUG-F4/F10) usually raises F1 and accuracy by a few percent versus baselines; the targeted wrong-prediction strategy (GPTAUG-WP10) often hurts. Results are limited to BERT+ADB/DA-ADB on three datasets and depend on the chosen prompts and pruning thresholds.

Problem Statement

Open intent detection must find previously unseen user intents. Existing benchmarks leak compositional overlap between train and test, which hides how models handle new combinations of words. The paper asks: can paraphrases from an LLM (ChatGPT) act as cheap synthetic data to improve compositional generalization?

Main Contribution

Construct compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning high ROUGE-L overlap pairs

Use ChatGPT to generate paraphrases as data augmentation for open intent detection
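The pruning step can be illustrated with a minimal sketch. It assumes whitespace tokenization, ROUGE-L computed as an F1 over the longest common subsequence, and a hypothetical threshold of 0.5; it also assumes the test utterance is the side that gets dropped. The summary does not state the paper's tokenizer, threshold, or which side of a pair is pruned, so treat all of these as placeholders.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (a simplification)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def prune_test_set(train: List[str], test: List[str], threshold: float = 0.5) -> List[str]:
    """Keep test utterances whose best ROUGE-L match in train is below threshold.

    The threshold value is a hypothetical default, not the paper's setting.
    """
    return [
        t for t in test
        if max((rouge_l_f1(t, tr) for tr in train), default=0.0) < threshold
    ]


if __name__ == "__main__":
    train = ["how do i activate my card", "what is my balance"]
    test = ["how do i activate my new card", "why was my transfer declined"]
    print(prune_test_set(train, test))
```

In this toy example the first test utterance nearly repeats a training utterance and is pruned, while the second shares only one token with any training utterance and is kept.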

Key Findings

Adding ChatGPT paraphrases to BERT+ADB improves overall F1 on the compositionally challenging Banking_CG subset.

Numbers: F1-All: 54.87 -> 58.90 (+4.03)

Practical Use: If you train BERT+ADB for open intent detection, adding ~10 ChatGPT paraphrases per instance can raise F1 by ~4 points on compositionally dissimilar test sets; test this augmentation to close train/test composition gaps.

Evidence Ref: Table 1 (ADB vs ADB+GPTAUG-F10 on Banking_CG)

Uniform augmentation (generate paraphrases for all train instances) outperformed targeting only wrongly predicted instances.

Numbers: Banking_CG F1-All: GPTAUG-F4 58.06 vs GPTAUG-WP10 51.06 (−7.00)

Practical Use: Prefer broad paraphrase augmentation (4–10 per example) over focused paraphrasing of mistakes; targeted paraphrasing can reduce diversity and harm performance.

Evidence Ref: Table 1 (ADB+GPTAUG-F4 vs ADB+GPTAUG-WP10 on Banking_CG)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
F1-All (Banking_CG)	ADB 54.87 -> ADB+GPTAUG-F10 58.90	ADB	+4.03	Banking_CG	Table 1 shows ADB baseline and ADB+GPTAUG-F10	Table 1
F1-All (Banking_CG)	ADB+GPTAUG-F4 58.06 -> ADB+GPTAUG-WP10 51.06	ADB+GPTAUG-F4	-7.00	Banking_CG	Table 1 compares uniform vs wrong-prediction augmentation	Table 1
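
The deltas in the table can be rechecked with a few lines of arithmetic. The sketch below uses only the four Banking_CG F1-All scores quoted above; the helper name `delta` is illustrative. It also derives two comparisons the table does not list directly, both of which follow from the same four numbers.

```python
# F1-All scores on Banking_CG, copied from the Results table above.
scores = {
    "ADB": 54.87,
    "ADB+GPTAUG-F4": 58.06,
    "ADB+GPTAUG-F10": 58.90,
    "ADB+GPTAUG-WP10": 51.06,
}


def delta(system: str, baseline: str) -> float:
    """Absolute F1-All difference in points, rounded to two decimals."""
    return round(scores[system] - scores[baseline], 2)


if __name__ == "__main__":
    print(delta("ADB+GPTAUG-F10", "ADB"))             # reported row: +4.03
    print(delta("ADB+GPTAUG-WP10", "ADB+GPTAUG-F4"))  # reported row: -7.00
    print(delta("ADB+GPTAUG-F4", "ADB"))              # derived from the same scores
    print(delta("ADB+GPTAUG-WP10", "ADB"))            # derived from the same scores
```

The last comparison shows that, on these numbers, WP10 falls below the plain ADB baseline as well as below the uniform strategies.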

What To Try In 7 Days

Run ROUGE-L pruning on your train/test splits to measure compositional overlap

Use ChatGPT to generate 4–10 paraphrases per training utterance and add them to fine-tuning

Compare uniform augmentation vs mistake-focused augmentation; prefer uniform first
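
The two augmentation strategies being compared can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and `stub_paraphrase` is a placeholder standing in for an LLM call (it returns tagged copies, not real paraphrases). Replace it with your own paraphrase generator.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (utterance, intent label)
Paraphraser = Callable[[str, int], List[str]]


def augment_uniform(train: Sequence[Example], paraphrase: Paraphraser, n: int) -> List[Example]:
    """F4/F10-style: add n paraphrases for every training example."""
    augmented = list(train)
    for text, label in train:
        augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


def augment_wrong_only(
    train: Sequence[Example],
    predictions: Sequence[str],
    paraphrase: Paraphraser,
    n: int = 10,
) -> List[Example]:
    """WP10-style: paraphrase only examples the current model predicted wrongly."""
    augmented = list(train)
    for (text, label), pred in zip(train, predictions):
        if pred != label:
            augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


def stub_paraphrase(text: str, n: int) -> List[str]:
    """Placeholder for an LLM call; returns n tagged copies, not real paraphrases."""
    return [f"{text} (paraphrase {i + 1})" for i in range(n)]


if __name__ == "__main__":
    train = [("activate my card", "card_activation"), ("check balance", "balance")]
    preds = ["card_activation", "transfer"]  # hypothetical model outputs
    print(len(augment_uniform(train, stub_paraphrase, n=4)))
    print(len(augment_wrong_only(train, preds, stub_paraphrase, n=10)))
```

Each paraphrase inherits the label of its source utterance, which is why paraphrases that change the intent introduce label noise.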

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Experiments limited to BERT-base with ADB/DA-ADB; other models may behave differently

ChatGPT prompts and paraphrase quality are not fully detailed; gains depend on prompt design

When Not To Use

When data cannot be sent to an external LLM due to privacy or compliance

When your train/test split is already compositionally similar, so augmentation mainly adds a risk of label drift

Failure Modes

ChatGPT paraphrases that change intent meaning, causing label noise

Targeted paraphrasing of only wrong examples (WP10) can reduce diversity and worsen performance

Core Entities

Models

BERT-base, ChatGPT, ADB (Deep Open Intent Classification with Adaptive Decision Boundary), DA-ADB

Metrics

F1-IND, F1-OOD, F1-All, Acc-All
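
This summary does not define the four metrics. The sketch below assumes the definitions commonly used in open-intent work: Acc-All is accuracy over all classes, F1-All is macro F1 over all classes including the open class, F1-IND is macro F1 over known intents only, and F1-OOD is the F1 of the open class. The label name `<open>` and both function names are hypothetical.

```python
from typing import Dict, Sequence

OPEN_LABEL = "<open>"  # hypothetical name for the unknown-intent class


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 for every label that appears in the gold or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def open_intent_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Acc-All, F1-All, F1-IND, F1-OOD under the assumed definitions above."""
    f1 = per_class_f1(y_true, y_pred)
    known = [score for label, score in f1.items() if label != OPEN_LABEL]
    return {
        "Acc-All": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "F1-All": sum(f1.values()) / len(f1),
        "F1-IND": sum(known) / len(known) if known else 0.0,
        "F1-OOD": f1.get(OPEN_LABEL, 0.0),
    }


if __name__ == "__main__":
    gold = ["balance", "balance", "transfer", OPEN_LABEL]
    pred = ["balance", "transfer", "transfer", OPEN_LABEL]
    print(open_intent_metrics(gold, pred))
```

Reporting F1-IND and F1-OOD separately shows whether a gain in F1-All comes from better known-intent classification or from better rejection of unseen intents.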

Datasets

Banking, OOS, StackOverflow, Banking_CG, OOS_CG, StackOverflow_CG

Benchmarks

Banking_CG, OOS_CG, StackOverflow_CG
