Use ChatGPT to generate paraphrases and improve open-intent detection on compositionally different test sets

August 25, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

6

Authors

Yihao Fang, Xianzhi Li, Stephen W. Thomas, Xiaodan Zhu

Links

Abstract / PDF

Why It Matters For Business

Adding LLM-generated paraphrases can cheaply raise intent-detection performance under realistic language variation, reducing missed or misrouted user requests and improving conversational UX.

Summary TLDR

The authors build compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning train/test pairs with high ROUGE-L overlap. They use ChatGPT to synthesize paraphrases and add them to BERT+ADB training under three strategies: 4 paraphrases per training instance (GPTAUG-F4), 10 per instance (GPTAUG-F10), or 10 only for wrongly predicted instances (GPTAUG-WP10). Across the CG subsets, uniform augmentation (GPTAUG-F4/F10) usually raises F1 and accuracy by a few points over the baselines, while the targeted strategy (GPTAUG-WP10) often hurts. Results are limited to BERT+ADB/DA-ADB on three datasets and depend on the chosen prompts and pruning thresholds.

Problem Statement

Open intent detection must flag user intents that were never seen in training. Existing benchmarks share a lot of compositional overlap between train and test, which hides how models handle new combinations of words. The paper asks: can paraphrases from an LLM (ChatGPT) act as cheap synthetic data to improve compositional generalization?

Main Contribution

Construct compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning high ROUGE-L overlap pairs

Use ChatGPT to generate paraphrases as data augmentation for open intent detection

Compare three augmentation strategies (GPTAUG-F4, GPTAUG-F10, GPTAUG-WP10) with BERT+ADB and DA-ADB across the CG subsets
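The pruning step in the first contribution can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes ROUGE-L F1 computed on lowercased whitespace tokens, assumes that overlapping utterances are dropped from the test side, and uses a hypothetical threshold of 0.5. The function names (`lcs_length`, `rouge_l_f1`, `prune_test_set`) are made up for this example.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (a simplification)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def prune_test_set(train: List[str], test: List[str], threshold: float = 0.5) -> List[str]:
    """Keep test utterances whose best ROUGE-L match in train is below threshold.

    Assumption: pruning removes test items; the threshold value is hypothetical.
    """
    kept = []
    for utt in test:
        max_overlap = max((rouge_l_f1(utt, t) for t in train), default=0.0)
        if max_overlap < threshold:
            kept.append(utt)
    return kept


train = ["how do i reset my card pin", "what is my account balance"]
test = ["how do i reset my pin", "can you freeze my stolen card"]
kept = prune_test_set(train, test, threshold=0.5)
print(kept)
```

The comparison is quadratic in the number of utterances, which is fine for intent datasets of a few thousand examples; larger sets would need blocking or an approximate index.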

Key Findings

Adding ChatGPT paraphrases to BERT+ADB improves overall F1 on the compositionally challenging Banking_CG subset.

Numbers: F1-All 54.87 -> 58.90 (+4.03) with GPTAUG-F10

Uniform augmentation (generate paraphrases for all train instances) outperformed targeting only wrongly predicted instances.

Numbers: Banking_CG F1-All: GPTAUG-F4 58.06 vs GPTAUG-WP10 51.06 (−7.00)

Existing adaptive decision-boundary methods struggled on compositionally diverse splits; results vary by dataset.

Numbers: OOS_CG F1-All: DA-ADB 39.64 vs ADB 50.19 (≈−10.6)

Results

F1-All (Banking_CG)

Value: ADB 54.87 -> ADB+GPTAUG-F10 58.90

Baseline: ADB

F1-All (Banking_CG)

Value: ADB+GPTAUG-F4 58.06 -> ADB+GPTAUG-WP10 51.06

Baseline: ADB+GPTAUG-F4

F1-All (OOS_CG)

Value: DA-ADB 39.64 vs ADB 50.19

Baseline: ADB
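As a quick arithmetic check, the differences quoted in this summary can be recomputed from the F1-All values listed above. The dictionary layout and the `delta` helper are illustrative only; the numbers are copied from this page.

```python
# F1-All scores as reported in this summary, keyed by (dataset, system).
results = {
    ("Banking_CG", "ADB"): 54.87,
    ("Banking_CG", "ADB+GPTAUG-F4"): 58.06,
    ("Banking_CG", "ADB+GPTAUG-F10"): 58.90,
    ("Banking_CG", "ADB+GPTAUG-WP10"): 51.06,
    ("OOS_CG", "ADB"): 50.19,
    ("OOS_CG", "DA-ADB"): 39.64,
}


def delta(dataset: str, system: str, baseline: str) -> float:
    """Absolute F1-All difference (system minus baseline), in points."""
    return round(results[(dataset, system)] - results[(dataset, baseline)], 2)


print(delta("Banking_CG", "ADB+GPTAUG-F10", "ADB"))             # 4.03
print(delta("Banking_CG", "ADB+GPTAUG-WP10", "ADB+GPTAUG-F4"))  # -7.0
print(delta("OOS_CG", "DA-ADB", "ADB"))                         # -10.55
```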

Who Should Care

What To Try In 7 Days

Run ROUGE-L pruning on your train/test splits to measure compositional overlap

Use ChatGPT to generate 4–10 paraphrases per training utterance and add them to fine-tuning

Compare uniform augmentation vs mistake-focused augmentation; prefer uniform first
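The uniform and mistake-focused strategies can be compared with a small harness like the one below. It is a sketch under stated assumptions: `fake_paraphrase` and `fake_predict` are hypothetical stand-ins for a ChatGPT call and for the current classifier, and the wrongly predicted instances are assumed to come from the training set. Swap in your own paraphraser and model before drawing any conclusions.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]                      # (utterance, intent label)
Paraphraser = Callable[[str, int], List[str]]  # (text, n) -> n paraphrases
Predictor = Callable[[str], str]               # text -> predicted label


def augment_uniform(train: List[Example], paraphrase: Paraphraser, n: int) -> List[Example]:
    """F4/F10-style: add n paraphrases for every training example."""
    augmented = list(train)
    for text, label in train:
        augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


def augment_wrong_only(
    train: List[Example], predict: Predictor, paraphrase: Paraphraser, n: int
) -> List[Example]:
    """WP10-style: add n paraphrases only where the current model is wrong."""
    augmented = list(train)
    for text, label in train:
        if predict(text) != label:
            augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


def fake_paraphrase(text: str, n: int) -> List[str]:
    """Hypothetical stand-in for an LLM call; returns n trivially varied strings."""
    return [f"{text} (variant {i + 1})" for i in range(n)]


def fake_predict(text: str) -> str:
    """Hypothetical stand-in classifier that always answers 'balance'."""
    return "balance"


train = [("what is my balance", "balance"), ("send money to sam", "transfer")]
uniform_f4 = augment_uniform(train, fake_paraphrase, n=4)
wrong_p10 = augment_wrong_only(train, fake_predict, fake_paraphrase, n=10)
print(len(uniform_f4), len(wrong_p10))  # 10 12
```

Each paraphrase inherits the label of its source utterance, so it is worth spot-checking a sample for intent drift before training on the augmented set.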

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments limited to BERT-base with ADB/DA-ADB; other models may behave differently
  • ChatGPT prompts and paraphrase quality are not fully detailed; gains depend on prompt design
  • No code or data release linked in the paper, limiting immediate reproducibility

When Not To Use

  • When data cannot be sent to an external LLM due to privacy or compliance
  • When your train/test split is already compositionally similar, so augmentation mainly adds label-drift risk
  • When the inference cost of generating many paraphrases is prohibitive

Failure Modes

  • ChatGPT paraphrases that change intent meaning, causing label noise
  • Targeted paraphrasing of only wrong examples (WP10) can reduce diversity and worsen performance
  • Overfitting to LLM style paraphrases and losing robustness to real user language

Core Entities

Models

  • BERT-base
  • ChatGPT
  • ADB (Deep Open Intent Classification with Adaptive Decision Boundary)
  • DA-ADB

Metrics

  • F1-IND
  • F1-OOD
  • F1-All
  • Acc-All
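The four metrics can be computed as in the sketch below. It assumes the usual reading in open intent detection: F1-IND is macro F1 over the known (in-domain) intents, F1-OOD is the F1 of the single open class, F1-All is macro F1 over all classes including the open one, and Acc-All is accuracy over all test examples. The `"open"` label name and the function names are hypothetical.

```python
from typing import Dict, List

OPEN = "open"  # hypothetical name for the open (unknown-intent) class


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 for each label that appears in the gold or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def open_intent_metrics(
    y_true: List[str], y_pred: List[str], open_label: str = OPEN
) -> Dict[str, float]:
    """Acc-All, F1-All, F1-IND and F1-OOD under the assumptions stated above."""
    f1 = per_class_f1(y_true, y_pred)
    known = [score for label, score in f1.items() if label != open_label]
    return {
        "Acc-All": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "F1-All": sum(f1.values()) / len(f1),
        "F1-IND": sum(known) / len(known) if known else 0.0,
        "F1-OOD": f1.get(open_label, 0.0),
    }


y_true = ["balance", "balance", "transfer", "open", "open"]
y_pred = ["balance", "open", "transfer", "open", "transfer"]
metrics = open_intent_metrics(y_true, y_pred)
print(metrics)
```

Reporting F1-IND and F1-OOD separately matters because a model can score well on F1-All while rarely detecting open intents when known classes dominate the label set.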

Datasets

  • Banking
  • OOS
  • StackOverflow
  • Banking_CG
  • OOS_CG
  • StackOverflow_CG

Benchmarks

  • Banking_CG
  • OOS_CG
  • StackOverflow_CG