ChatGPT can track multi-turn dialogue states zero-shot, but struggles with slot-filling and long conversations

Overview

Decision SnapshotNeeds Validation

The paper gives direct automatic-metric comparisons and ablations but uses a single ChatGPT snapshot and lacks released code, so findings are useful but preliminary.

Citations21

Evidence Strength0.65

Confidence0.75

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/10

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 35%

Authors

Wenbo Pan, Qiguang Chen, Xiao Xu, Wanxiang Che, Libo Qin

Links

Abstract / PDF

Why It Matters For Business

ChatGPT can be used zero-shot to prototype multi-turn dialogue state tracking with near research-level JGA, but is unreliable for precise slot extraction without careful prompt design and output checks.

Who Should Care

Product Manager ML Engineer Data Scientist Engineering Lead

Summary TLDR

This paper measures ChatGPT (Jan 30 version) on zero-shot dialogue understanding: spoken language understanding (SLU) and dialogue state tracking (DST). ChatGPT matches or beats GPT-3.5/Codex on multi-turn DST when fed a multi-turn interactive prompt, achieving JGA ~60% on MultiWOZ. But it underperforms at slot filling in SLU (slot F1 as low as ~15.7 on ATIS). The authors also document format errors, undefined slot values, and forgetting in long (>10) conversations.

Problem Statement

Can ChatGPT perform zero-shot dialogue understanding (intent detection, slot filling, and per-turn dialogue state tracking) without task-specific training? The paper tests whether prompt design and multi-turn interaction can unlock reliable zero-shot performance and identifies practical failure modes.

Main Contribution

Empirical zero-shot evaluation of ChatGPT on two SLU datasets (ATIS, SNIPS) and two DST datasets (MultiWOZ2.1, MultiWOZ2.4).

A multi-turn interactive prompt framework that feeds the dialogue turn-by-turn to ChatGPT for DST.

Key Findings

ChatGPT achieves competitive multi-turn DST but lags fine-tuned SOTA.

NumbersMultiWOZ2.1 JGA 60.28% vs fine-tuned SOTA 61.02% (Table 3)

Practical UseUse ChatGPT zero-shot for multi-turn state tracking prototypes; expect near research baseline JGA but not the top fine-tuned numbers.

Evidence RefTable 3

Multi-turn interactive prompting improves ChatGPT's DST accuracy over single-turn prompts.

NumbersJGA +1.97 points (58.05% -> 60.02%) on MultiWOZ2.1 (Table 4)

Practical UseWhen tracking dialogue state with ChatGPT, send context incrementally (turn-by-turn) rather than isolated single-turn prompts.

Evidence RefTable 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	97.71%	GPT-3.5 98.00%	—	SNIPS test	Table 3 shows ChatGPT intent accuracy 97.71%	Table 3
SNIPS Slot F1 (ChatGPT)	58.24%	Finetuned SoTA 97.10%	—	SNIPS test	Table 3 shows ChatGPT slot F1 58.24%	Table 3

What To Try In 7 Days

Prototype a DST flow using multi-turn incremental prompts to maintain state.

Add slot descriptions and 1–2 examples in prompts before extracting slot values.

Wrap ChatGPT outputs with a small validator that enforces formats and fills or flags missing slots.

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Risks & Boundaries

Limitations

Evaluation uses one ChatGPT snapshot (Jan 30); results may change as model updates.

Only compares to GPT-3.5 and Codex; broader LLM baselines are missing.

When Not To Use

For high-precision slot-filling applications (e.g., billing, bookings) without extra validation.

In very long conversations (>10 turns) where ChatGPT may forget earlier context.

Failure Modes

Undefined slot values (outputs like <unknown> or special tokens).

Slot format violations (wrong time formats, extra words).

Core Entities

Models

ChatGPT (Jan 30 version)GPT-3.5 (text-davinci-003)Codex

Metrics

AccuracySlot F1

Datasets

ATISSNIPSMultiWOZ2.1MultiWOZ2.4

Benchmarks

zero-shot SLUzero-shot DST

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

ChatGPT achieves competitive multi-turn DST but lags fine-tuned SOTA.

Multi-turn interactive prompting improves ChatGPT's DST accuracy over single-turn prompts.

Results

What To Try In 7 Days

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Make LLM judges reliable and cheaper: EIF and tuned PPI give tighter, valid evaluation with few labels

Key finding

Model rankings from LLM judges become less biased and get valid confidence intervals when you down-weight unreliable judges

Key finding

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

MCTS-Judge: Use Monte Carlo Tree Search at test time to double LLM judge accuracy on code tasks

Key finding