Overview
Production Readiness
0.4
Novelty Score
0.35
Cost Impact Score
0.3
Citation Count
21
Why It Matters For Business
ChatGPT can be used zero-shot to prototype multi-turn dialogue state tracking with near research-level JGA, but is unreliable for precise slot extraction without careful prompt design and output checks.
Summary TLDR
This paper measures ChatGPT (Jan 30 version) on zero-shot dialogue understanding: spoken language understanding (SLU) and dialogue state tracking (DST). ChatGPT matches or beats GPT-3.5/Codex on multi-turn DST when fed a multi-turn interactive prompt, achieving JGA ~60% on MultiWOZ. But it underperforms at slot filling in SLU (slot F1 as low as ~15.7 on ATIS). The authors also document format errors, undefined slot values, and forgetting in long (>10) conversations.
Problem Statement
Can ChatGPT perform zero-shot dialogue understanding (intent detection, slot filling, and per-turn dialogue state tracking) without task-specific training? The paper tests whether prompt design and multi-turn interaction can unlock reliable zero-shot performance and identifies practical failure modes.
Main Contribution
Empirical zero-shot evaluation of ChatGPT on two SLU datasets (ATIS, SNIPS) and two DST datasets (MultiWOZ2.1, MultiWOZ2.4).
A multi-turn interactive prompt framework that feeds the dialogue turn-by-turn to ChatGPT for DST.
Analysis of prompt components (slot names, descriptions, examples) and documentation of practical failure modes (format violations, undefined values, forgetting).
Key Findings
ChatGPT achieves competitive multi-turn DST but lags fine-tuned SOTA.
Multi-turn interactive prompting improves ChatGPT's DST accuracy over single-turn prompts.
ChatGPT struggles at slot filling in SLU without examples/descriptions.
Providing slot names, descriptions, and examples raises slot performance.
ChatGPT shows output-format and memory failures in dialogue settings.
Results
Accuracy
SNIPS Slot F1 (ChatGPT)
Accuracy
ATIS Slot F1 (ChatGPT)
MultiWOZ2.1 JGA (ChatGPT)
Accuracy
MultiWOZ2.4 JGA (ChatGPT)
Accuracy
Multi-turn vs Single-turn JGA (MultiWOZ2.1)
Prompt design effect on SNIPS slot F1
Who Should Care
What To Try In 7 Days
Prototype a DST flow using multi-turn incremental prompts to maintain state.
Add slot descriptions and 1–2 examples in prompts before extracting slot values.
Wrap ChatGPT outputs with a small validator that enforces formats and fills or flags missing slots.
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation uses one ChatGPT snapshot (Jan 30); results may change as model updates.
- Only compares to GPT-3.5 and Codex; broader LLM baselines are missing.
- Does not cover zero-shot cross-domain or cross-lingual settings.
- No released code or prompt templates in paper for exact reproduction.
When Not To Use
- For high-precision slot-filling applications (e.g., billing, bookings) without extra validation.
- In very long conversations (>10 turns) where ChatGPT may forget earlier context.
- When you need reproducible exact behavior across model updates.
Failure Modes
- Undefined slot values (outputs like <unknown> or special tokens).
- Slot format violations (wrong time formats, extra words).
- Verbose natural-language replies instead of strict slot-value pairs.
- Forgetting early turns in long dialogues due to prompt length limits.
Core Entities
Models
- ChatGPT (Jan 30 version)
- GPT-3.5 (text-davinci-003)
- Codex
Metrics
- Accuracy
- Slot F1
Datasets
- ATIS
- SNIPS
- MultiWOZ2.1
- MultiWOZ2.4
Benchmarks
- zero-shot SLU
- zero-shot DST

