Overview
Results show consistent small gains (2–5% F1) on three CG subsets with BERT+ADB. The evidence is empirical but limited to one backbone, specific CG splits, and particular prompt choices, so treat it as a promising proof of concept.
Citations: 6
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
Adding LLM-generated paraphrases can cheaply raise intent-detection performance under realistic language variation, reducing missed or misrouted user requests and improving conversational UX.
Summary TLDR
The authors build compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning training/test pairs with high ROUGE-L overlap. They use ChatGPT to synthesize paraphrases and add them to BERT+ADB training under three strategies (4 paraphrases, 10 paraphrases, or 10 for wrongly predicted instances). Across the CG subsets, adding ChatGPT paraphrases (GPTAUG-F4/F10) usually raises F1 and accuracy by a few percent versus baselines; the targeted wrong-prediction strategy (GPTAUG-WP10) often hurts. Results are limited to BERT+ADB/DA-ADB on three datasets and depend on the chosen prompts and pruning thresholds.
Problem Statement
Open intent detection must find previously unseen user intents. Existing benchmarks leak compositional overlap between train and test, hiding how models handle new combinations of words. The paper asks: can paraphrases from a large language model (ChatGPT) act as cheap synthetic data to improve compositional generalization?
Main Contribution
Construct compositionally diverse subsets (Banking_CG, OOS_CG, StackOverflow_CG) by pruning high ROUGE-L overlap pairs
Use ChatGPT to generate paraphrases as data augmentation for open intent detection
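The ROUGE-L pruning step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes ROUGE-L is computed as an LCS-based F-measure over lowercased whitespace tokens, that the overlapping test instance is the one dropped, and that the cutoff is 0.5. The function names and the `threshold` value are hypothetical; the summary does not state the actual thresholds.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (utterance, intent label)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure on whitespace tokens (tokenization is an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def prune_test_set(
    train: List[Example], test: List[Example], threshold: float = 0.5
) -> List[Example]:
    """Keep test examples whose best ROUGE-L match in train is below threshold."""
    kept = []
    for text, label in test:
        max_overlap = max((rouge_l_f1(text, t) for t, _ in train), default=0.0)
        if max_overlap < threshold:  # hypothetical cutoff, not from the paper
            kept.append((text, label))
    return kept


train = [("how do i activate my card", "activate_card")]
test = [
    ("how do i activate my new card", "activate_card"),
    ("my transfer has not arrived yet", "transfer_delay"),
]
pruned = prune_test_set(train, test)
```

With these toy utterances, the first test example nearly repeats a training utterance and is dropped, while the second shares only one token with it and is kept. The loop is quadratic in dataset size, which is acceptable for a one-off audit of a split.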
Key Findings
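Adding ChatGPT paraphrases to BERT+ADB improves overall F1 on the compositionally challenging Banking_CG subset (F1-All 54.87 to 58.90 with GPTAUG-F10, Table 1).
Uniform augmentation (generating paraphrases for all training instances) outperformed targeting only wrongly predicted instances: on Banking_CG, GPTAUG-F4 reaches 58.06 F1-All versus 51.06 for GPTAUG-WP10 (Table 1).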
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| F1-All (Banking_CG) | ADB 54.87 -> ADB+GPTAUG-F10 58.90 | ADB | +4.03 | Banking_CG | Table 1 shows ADB baseline and ADB+GPTAUG-F10 | Table 1 |
| F1-All (Banking_CG) | ADB+GPTAUG-F4 58.06 -> ADB+GPTAUG-WP10 51.06 | ADB+GPTAUG-F4 | -7.00 | Banking_CG | Table 1 compares uniform vs wrong-prediction augmentation | Table 1 |
What To Try In 7 Days
Run ROUGE-L pruning on your train/test splits to measure compositional overlap
Use ChatGPT to generate 4–10 paraphrases per training utterance and add them to fine-tuning
Compare uniform augmentation vs mistake-focused augmentation; prefer uniform first
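The uniform versus mistake-focused comparison can be set up with a small helper like the one below. It is a sketch under stated assumptions: `augment`, `only_indices`, and `dummy_paraphrase` are hypothetical names, and the placeholder paraphraser stands in for an LLM call whose prompt the summary does not specify. Paraphrases are assumed to inherit the label of their source utterance.

```python
from typing import Callable, List, Optional, Set, Tuple

Example = Tuple[str, str]  # (utterance, intent label)


def augment(
    train: List[Example],
    paraphrase: Callable[[str, int], List[str]],
    n: int,
    only_indices: Optional[Set[int]] = None,
) -> List[Example]:
    """Return train plus n paraphrases per selected example.

    only_indices=None augments every example (uniform strategy);
    otherwise only the listed positions are augmented (e.g. the
    examples a baseline model predicted wrongly).
    """
    augmented = list(train)
    for i, (text, label) in enumerate(train):
        if only_indices is not None and i not in only_indices:
            continue
        for new_text in paraphrase(text, n):
            augmented.append((new_text, label))  # label inherited (assumption)
    return augmented


def dummy_paraphrase(text: str, n: int) -> List[str]:
    """Placeholder for an LLM paraphraser; replace with a real call."""
    return [f"{text} (paraphrase {k + 1})" for k in range(n)]


train = [
    ("how do i activate my card", "activate_card"),
    ("my transfer has not arrived", "transfer_delay"),
    ("card still not working", "activate_card"),
]
wrong = {1}  # hypothetical indices misclassified by a baseline model

f4 = augment(train, dummy_paraphrase, n=4)      # uniform, 4 per utterance
f10 = augment(train, dummy_paraphrase, n=10)    # uniform, 10 per utterance
wp10 = augment(train, dummy_paraphrase, n=10, only_indices=wrong)
```

Training the same classifier on `f4`, `f10`, and `wp10` and scoring each on a held-out CG split gives the three-way comparison. Keeping the original examples in every variant means the only difference between runs is which paraphrases were added.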
Risks & Boundaries
Limitations
Experiments limited to BERT-base with ADB/DA-ADB; other models may behave differently
ChatGPT prompts and paraphrase quality are not fully detailed; gains depend on prompt design
When Not To Use
When data cannot be sent to an external LLM due to privacy or compliance
If your train/test split is already compositionally similar and augmentation risks label drift
Failure Modes
ChatGPT paraphrases that change intent meaning, causing label noise
Targeted paraphrasing of only wrong examples (WP10) can reduce diversity and worsen performance

