ChatGPT can track multi-turn dialogue states zero-shot, but struggles with slot-filling and long conversations

April 9, 20237 min

Overview

Production Readiness

0.4

Novelty Score

0.35

Cost Impact Score

0.3

Citation Count

21

Authors

Wenbo Pan, Qiguang Chen, Xiao Xu, Wanxiang Che, Libo Qin

Links

Abstract / PDF

Why It Matters For Business

ChatGPT can be used zero-shot to prototype multi-turn dialogue state tracking with near research-level JGA, but is unreliable for precise slot extraction without careful prompt design and output checks.

Summary TLDR

This paper measures ChatGPT (Jan 30 version) on zero-shot dialogue understanding: spoken language understanding (SLU) and dialogue state tracking (DST). ChatGPT matches or beats GPT-3.5/Codex on multi-turn DST when fed a multi-turn interactive prompt, achieving JGA ~60% on MultiWOZ. But it underperforms at slot filling in SLU (slot F1 as low as ~15.7 on ATIS). The authors also document format errors, undefined slot values, and forgetting in long (>10) conversations.

Problem Statement

Can ChatGPT perform zero-shot dialogue understanding (intent detection, slot filling, and per-turn dialogue state tracking) without task-specific training? The paper tests whether prompt design and multi-turn interaction can unlock reliable zero-shot performance and identifies practical failure modes.

Main Contribution

Empirical zero-shot evaluation of ChatGPT on two SLU datasets (ATIS, SNIPS) and two DST datasets (MultiWOZ2.1, MultiWOZ2.4).

A multi-turn interactive prompt framework that feeds the dialogue turn-by-turn to ChatGPT for DST.

Analysis of prompt components (slot names, descriptions, examples) and documentation of practical failure modes (format violations, undefined values, forgetting).

Key Findings

ChatGPT achieves competitive multi-turn DST but lags fine-tuned SOTA.

NumbersMultiWOZ2.1 JGA 60.28% vs fine-tuned SOTA 61.02% (Table 3)

Multi-turn interactive prompting improves ChatGPT's DST accuracy over single-turn prompts.

NumbersJGA +1.97 points (58.05% -> 60.02%) on MultiWOZ2.1 (Table 4)

ChatGPT struggles at slot filling in SLU without examples/descriptions.

NumbersATIS slot F1 15.71%; SNIPS slot F1 58.24% (Table 3)

Providing slot names, descriptions, and examples raises slot performance.

NumbersSNIPS slot F1: name only 25.78% -> w/ examples 58.24% (Table 5)

ChatGPT shows output-format and memory failures in dialogue settings.

NumbersReported behaviors: undefined slot tokens, format violations, verbose natural-language outputs, forgetting after >10+ tw

Results

Accuracy

Value97.71%

BaselineGPT-3.5 98.00%

SNIPS Slot F1 (ChatGPT)

Value58.24%

BaselineFinetuned SoTA 97.10%

Accuracy

Value75.22%

BaselineFinetuned SoTA 98.00%

ATIS Slot F1 (ChatGPT)

Value15.71%

BaselineFinetuned SoTA 96.10%

MultiWOZ2.1 JGA (ChatGPT)

Value60.28%

BaselineFinetuned SoTA 61.02%

Accuracy

Value97.83%

BaselineFinetuned SoTA 98.05%

MultiWOZ2.4 JGA (ChatGPT)

Value64.23%

BaselineCodex 37.50% (other baselines vary)

Accuracy

Value98.12%

BaselineCodex 95.68%

Multi-turn vs Single-turn JGA (MultiWOZ2.1)

Value60.02% vs 58.05%

Prompt design effect on SNIPS slot F1

Valuename only 25.78% -> w/ examples 58.24%

Who Should Care

What To Try In 7 Days

Prototype a DST flow using multi-turn incremental prompts to maintain state.

Add slot descriptions and 1–2 examples in prompts before extracting slot values.

Wrap ChatGPT outputs with a small validator that enforces formats and fills or flags missing slots.

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses one ChatGPT snapshot (Jan 30); results may change as model updates.
  • Only compares to GPT-3.5 and Codex; broader LLM baselines are missing.
  • Does not cover zero-shot cross-domain or cross-lingual settings.
  • No released code or prompt templates in paper for exact reproduction.

When Not To Use

  • For high-precision slot-filling applications (e.g., billing, bookings) without extra validation.
  • In very long conversations (>10 turns) where ChatGPT may forget earlier context.
  • When you need reproducible exact behavior across model updates.

Failure Modes

  • Undefined slot values (outputs like <unknown> or special tokens).
  • Slot format violations (wrong time formats, extra words).
  • Verbose natural-language replies instead of strict slot-value pairs.
  • Forgetting early turns in long dialogues due to prompt length limits.

Core Entities

Models

  • ChatGPT (Jan 30 version)
  • GPT-3.5 (text-davinci-003)
  • Codex

Metrics

  • Accuracy
  • Slot F1

Datasets

  • ATIS
  • SNIPS
  • MultiWOZ2.1
  • MultiWOZ2.4

Benchmarks

  • zero-shot SLU
  • zero-shot DST