ChatGPT can track multi-turn dialogue states zero-shot, but struggles with slot-filling and long conversations

April 9, 20237 min

Overview

Decision SnapshotNeeds Validation

The paper gives direct automatic-metric comparisons and ablations but uses a single ChatGPT snapshot and lacks released code, so findings are useful but preliminary.

Citations21

Evidence Strength0.65

Confidence0.75

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/10

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 35%

Authors

Wenbo Pan, Qiguang Chen, Xiao Xu, Wanxiang Che, Libo Qin

Links

Abstract / PDF

Why It Matters For Business

ChatGPT can be used zero-shot to prototype multi-turn dialogue state tracking with near research-level JGA, but is unreliable for precise slot extraction without careful prompt design and output checks.

Who Should Care

Summary TLDR

This paper measures ChatGPT (Jan 30 version) on zero-shot dialogue understanding: spoken language understanding (SLU) and dialogue state tracking (DST). ChatGPT matches or beats GPT-3.5/Codex on multi-turn DST when fed a multi-turn interactive prompt, achieving JGA ~60% on MultiWOZ. But it underperforms at slot filling in SLU (slot F1 as low as ~15.7 on ATIS). The authors also document format errors, undefined slot values, and forgetting in long (>10) conversations.

Problem Statement

Can ChatGPT perform zero-shot dialogue understanding (intent detection, slot filling, and per-turn dialogue state tracking) without task-specific training? The paper tests whether prompt design and multi-turn interaction can unlock reliable zero-shot performance and identifies practical failure modes.

Main Contribution

Empirical zero-shot evaluation of ChatGPT on two SLU datasets (ATIS, SNIPS) and two DST datasets (MultiWOZ2.1, MultiWOZ2.4).

A multi-turn interactive prompt framework that feeds the dialogue turn-by-turn to ChatGPT for DST.

Key Findings

ChatGPT achieves competitive multi-turn DST but lags fine-tuned SOTA.

NumbersMultiWOZ2.1 JGA 60.28% vs fine-tuned SOTA 61.02% (Table 3)

Practical UseUse ChatGPT zero-shot for multi-turn state tracking prototypes; expect near research baseline JGA but not the top fine-tuned numbers.

Evidence RefTable 3

Multi-turn interactive prompting improves ChatGPT's DST accuracy over single-turn prompts.

NumbersJGA +1.97 points (58.05% -> 60.02%) on MultiWOZ2.1 (Table 4)

Practical UseWhen tracking dialogue state with ChatGPT, send context incrementally (turn-by-turn) rather than isolated single-turn prompts.

Evidence RefTable 4

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Accuracy97.71%GPT-3.5 98.00%SNIPS testTable 3 shows ChatGPT intent accuracy 97.71%Table 3
SNIPS Slot F1 (ChatGPT)58.24%Finetuned SoTA 97.10%SNIPS testTable 3 shows ChatGPT slot F1 58.24%Table 3

What To Try In 7 Days

Prototype a DST flow using multi-turn incremental prompts to maintain state.

Add slot descriptions and 1–2 examples in prompts before extracting slot values.

Wrap ChatGPT outputs with a small validator that enforces formats and fills or flags missing slots.

Reproducibility

Code AvailableNo
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Evaluation uses one ChatGPT snapshot (Jan 30); results may change as model updates.

Only compares to GPT-3.5 and Codex; broader LLM baselines are missing.

When Not To Use

For high-precision slot-filling applications (e.g., billing, bookings) without extra validation.

In very long conversations (>10 turns) where ChatGPT may forget earlier context.

Failure Modes

Undefined slot values (outputs like <unknown> or special tokens).

Slot format violations (wrong time formats, extra words).

Core Entities

Models

ChatGPT (Jan 30 version)GPT-3.5 (text-davinci-003)Codex

Metrics

AccuracySlot F1

Datasets

ATISSNIPSMultiWOZ2.1MultiWOZ2.4

Benchmarks

zero-shot SLUzero-shot DST