Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.5
Citation Count
6
Why It Matters For Business
Adding LLM-generated paraphrases can cheaply raise intent-detection performance under realistic language variation, reducing missed or misrouted user requests and improving conversational UX.
Summary TLDR
The authors build compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning training/test pairs with high ROUGE-L overlap. They use ChatGPT to synthesize paraphrases and add them to BERT+ADB training under three strategies (4 paraphrases, 10 paraphrases, or 10 for wrongly predicted instances). Across the CG subsets, adding ChatGPT paraphrases (GPTAUG-F4/F10) usually raises F1 and accuracy by a few percent versus baselines; the targeted wrong-prediction strategy (GPTAUG-WP10) often hurts. Results are limited to BERT+ADB/DA-ADB on three datasets and depend on the chosen prompts and pruning thresholds.
Problem Statement
Open intent detection must find previously unseen user intents. Existing benchmarks leak compositional overlap between train and test, hiding how models handle new combinations of words. The paper asks: can paraphrases from an LLM (ChatGPT) act as cheap synthetic data to improve compositional generalization?
Main Contribution
Construct compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning high ROUGE-L overlap pairs
Use ChatGPT to generate paraphrases as data augmentation for open intent detection
Compare three augmentation strategies (GPTAUG-F4, GPTAUG-F10, GPTAUG-WP10) with BERT+ADB and DA-ADB across the CG subsets
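The pruning step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes whitespace tokenization, the F1 form of ROUGE-L, a hypothetical threshold of 0.5, and that overlapping test utterances (rather than training ones) are the side that gets dropped. The function names are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two utterances (lowercased, split on whitespace)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def max_overlap(utterance, train_utterances):
    """Highest ROUGE-L F1 of one utterance against any training utterance."""
    return max((rouge_l_f1(utterance, t) for t in train_utterances), default=0.0)


def prune_test_set(train_utterances, test_utterances, threshold=0.5):
    """Keep test utterances whose best overlap with train is below threshold.

    The threshold value and the choice to prune the test side are assumptions.
    """
    return [u for u in test_utterances
            if max_overlap(u, train_utterances) < threshold]


train = ["how do i reset my card pin", "what is my account balance"]
test = ["how do i reset my pin", "cancel the transfer i made yesterday"]
print(prune_test_set(train, test))
```

Reporting `max_overlap` for every test utterance before pruning also gives a quick read on how much compositional overlap a split contains.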
Key Findings
Adding ChatGPT paraphrases to BERT+ADB improves overall F1 on the compositionally challenging Banking_CG subset.
Uniform augmentation (generate paraphrases for all train instances) outperformed targeting only wrongly predicted instances.
Existing adaptive decision-boundary methods struggled on compositionally diverse splits; results vary by dataset.
Results
F1-All (Banking_CG)
F1-All (OOS_CG)
Who Should Care
What To Try In 7 Days
Run ROUGE-L pruning on your train/test splits to measure compositional overlap
Use ChatGPT to generate 4–10 paraphrases per training utterance and add them to fine-tuning
Compare uniform augmentation vs mistake-focused augmentation; prefer uniform first
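The two augmentation styles can be compared with a small harness like the one below. It is a sketch under stated assumptions: `toy_paraphrase` is a hypothetical stand-in for an LLM call, `predict` is any baseline classifier function you supply, and "wrongly predicted" is taken to mean training utterances the baseline misclassifies. The paper's exact prompts and selection procedure are not given here.

```python
def toy_paraphrase(text, n):
    """Hypothetical placeholder; replace with a real LLM paraphrasing call."""
    return [f"{text} (variant {i})" for i in range(1, n + 1)]


def augment_uniform(train, paraphrase, n):
    """Add n paraphrases for every (text, label) pair, as in the F4/F10 setups."""
    augmented = list(train)
    for text, label in train:
        augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


def augment_wrong_predictions(train, predict, paraphrase, n=10):
    """Add n paraphrases only for pairs the baseline gets wrong (WP10-style)."""
    augmented = list(train)
    for text, label in train:
        if predict(text) != label:
            augmented.extend((p, label) for p in paraphrase(text, n))
    return augmented


train = [("freeze my card", "card_freeze"), ("send money abroad", "transfer")]


def baseline_predict(text):
    """Toy baseline that always answers 'transfer' (illustration only)."""
    return "transfer"


uniform_4 = augment_uniform(train, toy_paraphrase, n=4)
targeted_10 = augment_wrong_predictions(train, baseline_predict, toy_paraphrase)
print(len(uniform_4), len(targeted_10))
```

Each paraphrase inherits the label of its source utterance, so a paraphrase that shifts the intent becomes label noise. Spot-checking a sample before fine-tuning is a cheap safeguard.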
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Experiments limited to BERT-base with ADB/DA-ADB; other models may behave differently
- ChatGPT prompts and paraphrase quality are not fully detailed; gains depend on prompt design
- No code or data release linked in the paper, limiting immediate reproducibility
When Not To Use
- When data cannot be sent to an external LLM due to privacy or compliance
- When your train/test split is already compositionally similar and augmentation risks label drift
- When the inference cost of generating many paraphrases is prohibitive
Failure Modes
- ChatGPT paraphrases that change intent meaning, causing label noise
- Targeted paraphrasing of only wrong examples (WP10) can reduce diversity and worsen performance
- Overfitting to LLM-style paraphrases and losing robustness to real user language
Core Entities
Models
- BERT-base
- ChatGPT
- ADB (Deep Open Intent Classification with Adaptive Decision Boundary)
- DA-ADB
Metrics
- F1-IND
- F1-OOD
- F1-All
- Acc-All
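The summary does not define these metrics. The sketch below assumes the convention common in open intent detection work: F1-IND is macro F1 over known intents, F1-OOD is the F1 of the single open class, F1-All is macro F1 over all classes, and Acc-All is overall accuracy. The `open_label` name is hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 over every label seen in the gold or predicted labels."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def open_intent_metrics(y_true, y_pred, open_label="open"):
    """Compute the four metrics under the assumed definitions above."""
    scores = f1_per_class(y_true, y_pred)
    known = [s for c, s in scores.items() if c != open_label]
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {
        "F1-IND": sum(known) / len(known) if known else 0.0,
        "F1-OOD": scores.get(open_label, 0.0),
        "F1-All": sum(scores.values()) / len(scores) if scores else 0.0,
        "Acc-All": correct / len(y_true) if y_true else 0.0,
    }


y_true = ["balance", "balance", "transfer", "open"]
y_pred = ["balance", "transfer", "transfer", "open"]
metrics = open_intent_metrics(y_true, y_pred)
print(metrics)
```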
Datasets
- Banking
- OOS
- StackOverflow
- Banking_CG
- OOS_CG
- StackOverflow_CG
Benchmarks
- Banking_CG
- OOS_CG
- StackOverflow_CG

