Use ChatGPT to generate paraphrases and improve open-intent detection on compositionally different test sets

August 25, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

Results show consistent small gains (2–5% F1) on three CG subsets with BERT+ADB. The evidence is empirical but limited to a specific backbone, specific CG splits, and specific prompt choices, so treat it as a promising proof of concept.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 50%

Authors

Yihao Fang, Xianzhi Li, Stephen W. Thomas, Xiaodan Zhu

Why It Matters For Business

Adding LLM-generated paraphrases can cheaply raise intent-detection performance under realistic language variation, reducing missed or misrouted user requests and improving conversational UX.

Summary TLDR

The authors build compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning training/test pairs with high ROUGE-L overlap. They use ChatGPT to synthesize paraphrases and add them to BERT+ADB training under three strategies (4 paraphrases, 10 paraphrases, or 10 for wrongly predicted instances). Across the CG subsets, adding ChatGPT paraphrases (GPTAUG-F4/F10) usually raises F1 and accuracy by a few percent versus baselines; the targeted wrong-prediction strategy (GPTAUG-WP10) often hurts. Results are limited to BERT+ADB/DA-ADB on three datasets and depend on the chosen prompts and pruning thresholds.

Problem Statement

Open intent detection must find previously unseen user intents. Existing benchmarks leak compositional overlap between train and test, which hides how models handle new combinations of words. The paper asks: can paraphrases from a large language model (ChatGPT) act as cheap synthetic data to improve compositional generalization?

Main Contribution

Construct compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning high ROUGE-L overlap pairs

Use ChatGPT to generate paraphrases as data augmentation for open intent detection
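The ROUGE-L pruning behind the CG subsets can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes whitespace tokenization, an LCS-based ROUGE-L F-measure, pruning on the test side, and a hypothetical threshold of 0.5. The function names `lcs_length`, `rouge_l_f1`, and `prune_test_set` are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure between two strings (lowercased, whitespace tokens)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def prune_test_set(train, test, threshold=0.5):
    """Keep test utterances whose best ROUGE-L match in train is below threshold.

    The threshold value is an assumption for illustration only.
    """
    return [
        t for t in test
        if max((rouge_l_f1(t, s) for s in train), default=0.0) < threshold
    ]


train = ["how do i reset my card pin", "what is my account balance"]
test = ["how do i reset my pin", "can i freeze a lost card"]
kept = prune_test_set(train, test, threshold=0.5)
print(kept)
```

In this toy example the first test utterance is a near-copy of a training utterance and is dropped, while the second shares only two tokens with any training utterance and is kept. The pairwise comparison is quadratic in dataset size, so a real split may need batching or an indexed pre-filter.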

Key Findings

Adding ChatGPT paraphrases to BERT+ADB improves overall F1 on compositionally-challenging Banking_CG.

Numbers: F1-All 54.87 -> 58.90 (+4.03)

Practical Use: If you train BERT+ADB for open intent detection, adding ~10 ChatGPT paraphrases per instance can raise F1 by ~4 points on compositionally dissimilar test sets; test this augmentation to close train/test composition gaps.

Evidence Ref: Table 1 (ADB vs ADB+GPTAUG-F10 on Banking_CG)

Uniform augmentation (generate paraphrases for all train instances) outperformed targeting only wrongly predicted instances.

Numbers: Banking_CG F1-All, GPTAUG-F4 58.06 vs GPTAUG-WP10 51.06 (−7.00)

Practical Use: Prefer broad paraphrase augmentation (4–10 per example) over focused paraphrasing of mistakes; targeted paraphrasing can reduce diversity and harm performance.

Evidence Ref: Table 1 (ADB+GPTAUG-F4 vs ADB+GPTAUG-WP10 on Banking_CG)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| F1-All (Banking_CG) | ADB 54.87 -> ADB+GPTAUG-F10 58.90 | ADB | +4.03 | Banking_CG | Table 1 shows ADB baseline and ADB+GPTAUG-F10 | Table 1 |
| F1-All (Banking_CG) | ADB+GPTAUG-F4 58.06 -> ADB+GPTAUG-WP10 51.06 | ADB+GPTAUG-F4 | -7.00 | Banking_CG | Table 1 compares uniform vs wrong-prediction augmentation | Table 1 |
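As a quick arithmetic check, the deltas above can be recomputed from the four Banking_CG F1-All scores quoted in this summary. The snippet below uses only those numbers. The `delta` and `relative_gain` helpers are illustrative names, and the relative figure is derived here rather than reported by the source.

```python
# Banking_CG F1-All scores as quoted in this summary (Table 1 of the paper).
f1_all = {
    "ADB": 54.87,
    "ADB+GPTAUG-F4": 58.06,
    "ADB+GPTAUG-F10": 58.90,
    "ADB+GPTAUG-WP10": 51.06,
}


def delta(system, baseline):
    """Absolute difference in F1 points, rounded to two decimals."""
    return round(f1_all[system] - f1_all[baseline], 2)


def relative_gain(system, baseline):
    """Difference expressed as a percentage of the baseline score."""
    return 100 * (f1_all[system] - f1_all[baseline]) / f1_all[baseline]


print(delta("ADB+GPTAUG-F10", "ADB"))             # 4.03
print(delta("ADB+GPTAUG-WP10", "ADB+GPTAUG-F4"))  # -7.0
print(delta("ADB+GPTAUG-WP10", "ADB"))            # -3.81
print(f"{relative_gain('ADB+GPTAUG-F10', 'ADB'):.2f}% relative to ADB")
```

The third line makes explicit something the table only implies: on these numbers, the wrong-prediction strategy lands below the unaugmented ADB baseline, not just below uniform augmentation.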

What To Try In 7 Days

Run ROUGE-L pruning on your train/test splits to measure compositional overlap

Use ChatGPT to generate 4–10 paraphrases per training utterance and add them to fine-tuning

Compare uniform augmentation vs mistake-focused augmentation; prefer uniform first
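A minimal sketch of the two augmentation strategies to compare. It assumes a hypothetical `paraphrase(text, n)` callable that wraps whatever LLM call you use, and a hypothetical `predict(text)` callable for a first-pass model. No real API is called here, and treating "wrongly predicted instances" as training utterances the first-pass model mislabels is an assumption about the paper's setup.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]                      # (utterance, intent label)
Paraphraser = Callable[[str, int], List[str]]  # hypothetical: (text, n) -> paraphrases
Predictor = Callable[[str], str]               # hypothetical: text -> predicted label


def augment_uniform(train: List[Example], paraphrase: Paraphraser, n: int) -> List[Example]:
    """GPTAUG-F4 / F10 style: add n paraphrases for every training utterance."""
    augmented = list(train)
    for text, label in train:
        augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


def augment_wrong_predictions(
    train: List[Example], predict: Predictor, paraphrase: Paraphraser, n: int = 10
) -> List[Example]:
    """GPTAUG-WP10 style: add paraphrases only for utterances the model gets wrong."""
    augmented = list(train)
    for text, label in train:
        if predict(text) != label:
            augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


# Stand-ins so the sketch runs offline; replace with real components.
def fake_paraphrase(text: str, n: int) -> List[str]:
    return [f"{text} (variant {i})" for i in range(n)]


def fake_predict(text: str) -> str:
    return "card_lost"


train_set = [("i lost my card", "card_lost"), ("what is my balance", "balance")]
uniform = augment_uniform(train_set, fake_paraphrase, n=4)
targeted = augment_wrong_predictions(train_set, fake_predict, fake_paraphrase, n=10)
print(len(uniform), len(targeted))
```

Each paraphrase inherits the label of its source utterance, which is where label noise enters if a paraphrase changes the intent.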

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Experiments limited to BERT-base with ADB/DA-ADB; other models may behave differently

ChatGPT prompts and paraphrase quality are not fully detailed; gains depend on prompt design

When Not To Use

When data cannot be sent to an external LLM due to privacy or compliance

If your train/test split is already compositionally similar and augmentation risks label drift

Failure Modes

ChatGPT paraphrases that change intent meaning, causing label noise

Targeted paraphrasing of only wrong examples (WP10) can reduce diversity and worsen performance

Core Entities

Models

BERT-base, ChatGPT, ADB (Deep Open Intent Classification with Adaptive Decision Boundary), DA-ADB

Metrics

F1-IND, F1-OOD, F1-All, Acc-All
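This summary does not define the four metrics, so the sketch below assumes the convention common in ADB-style open intent evaluation: Acc-All is accuracy over all classes, F1-All is macro F1 over all classes including the open class, F1-IND is macro F1 over known intents only, and F1-OOD is the F1 of the open class. The label name `"oos"` and the function names are placeholders.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 over every label seen in either y_true or y_pred."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def open_intent_metrics(y_true, y_pred, open_label="oos"):
    """Compute Acc-All, F1-All, F1-IND, F1-OOD under the assumed definitions."""
    per_class = f1_per_class(y_true, y_pred)
    known = [s for c, s in per_class.items() if c != open_label]
    return {
        "Acc-All": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "F1-All": sum(per_class.values()) / len(per_class),
        "F1-IND": sum(known) / len(known) if known else 0.0,
        "F1-OOD": per_class.get(open_label, 0.0),
    }


y_true = ["a", "a", "b", "oos"]
y_pred = ["a", "oos", "b", "oos"]
metrics = open_intent_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting F1-IND and F1-OOD separately shows whether augmentation helps known intents, open-intent rejection, or both, which a single F1-All number can hide.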

Datasets

Banking, OOS, StackOverflow, Banking_CG, OOS_CG, StackOverflow_CG

Benchmarks

Banking_CG, OOS_CG, StackOverflow_CG