TravelPlanner: a realistic travel-planning benchmark — GPT-4 reaches only 0.6% full success on test tasks

February 2, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides a well-documented benchmark, multiple model comparisons, and error analyses showing consistent failures; evidence is experimental and systematic but limited to a static sandbox.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 100%

Novelty: 70%

Authors

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su


Why It Matters For Business

Current LLM agents are not yet reliable enough to fully automate complex multi-constraint planning, but they can draft plans quickly and cut human effort when paired with verification and robust data collection.

Who Should Care

Summary TLDR

TravelPlanner is a realistic benchmark that tests language agents on multi-day travel planning with many interdependent decisions and constraints. It provides a static sandbox of about 4 million records, six queryable tools (flights, hotels, restaurants, attractions, distances, city search), and 1,225 human-validated user queries with reference plans. Evaluations across five LLMs and four planning strategies show that current agents fail at holistic constrained planning: GPT-4 achieves a 0.6% final pass rate in the full (two-stage) setting, and other models score near zero. Agents struggle with tool arguments, repetitive dead loops, tracking multiple constraints, and global budget/night constraints.

Problem Statement

Can modern LLM-powered agents perform realistic long-horizon, multi-constraint planning that requires iterative tool use, memory, and commonsense? TravelPlanner builds a controlled sandbox and a set of diverse travel queries to measure whether agents can gather information via tools and produce feasible plans that satisfy both explicit user constraints (budget, room rules, cuisine, transport) and commonsense constraints.
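To make the idea of an explicit (hard) constraint concrete, here is a minimal sketch of a feasibility check over a finished plan. It is not the benchmark's evaluator: the `DayPlan` fields, the function names, and the two constraints chosen (total budget and required cuisines) are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DayPlan:
    """One day of a hypothetical plan; every field name here is an assumption."""
    transport_cost: float = 0.0
    lodging_cost: float = 0.0
    meal_costs: list = field(default_factory=list)
    cuisines: set = field(default_factory=set)


def total_cost(plan):
    """Sum transport, lodging, and meal costs over all days."""
    return sum(
        day.transport_cost + day.lodging_cost + sum(day.meal_costs)
        for day in plan
    )


def check_hard_constraints(plan, budget, required_cuisines):
    """Return one pass/fail flag per hard constraint (illustrative subset)."""
    covered = set().union(*(day.cuisines for day in plan))
    return {
        "within_budget": total_cost(plan) <= budget,
        "cuisines_covered": set(required_cuisines) <= covered,
    }


# Usage: a two-day plan checked against a global budget and two cuisines.
plan = [
    DayPlan(transport_cost=120, lodging_cost=90, meal_costs=[15, 30],
            cuisines={"Italian"}),
    DayPlan(lodging_cost=90, meal_costs=[20, 25], cuisines={"Mexican"}),
]
print(check_hard_constraints(plan, budget=400,
                             required_cuisines={"Italian", "Mexican"}))
```

The budget check is global: it can only be evaluated once every day has been filled in, which is one reason per-step greedy choices tend to violate it.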

Main Contribution

A realistic planning benchmark focused on travel: 1,225 validated queries plus reference plans and a static sandbox of ~4M data entries accessible via six tools.

An automated evaluation suite that scores delivery rate, commonsense constraints (micro/macro), hard constraints (micro/macro), and final pass rate.
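The metrics listed above can be sketched as follows. This is a rough illustration under stated assumptions, not the benchmark's scoring code: micro is taken to mean the share of individual constraint checks passed across all plans, macro the share of plans passing every check in a category, and final pass the share of delivered plans passing everything. The record layout (`delivered`, `commonsense`, `hard`) is hypothetical, and undelivered plans are assumed to have all their checks recorded as failed.

```python
def score(results):
    """Aggregate per-query check results into benchmark-style rates.

    Each result is a dict with 'delivered' (bool) and two lists of booleans,
    'commonsense' and 'hard', one entry per constraint check (assumed layout).
    """
    n = len(results)

    def micro(key):
        # Share of individual checks passed, pooled over all plans.
        checks = [c for r in results for c in r[key]]
        return sum(checks) / len(checks) if checks else 0.0

    def macro(key):
        # Share of plans that are delivered and pass every check of this kind.
        return sum(r["delivered"] and all(r[key]) for r in results) / n

    final = sum(
        r["delivered"] and all(r["commonsense"]) and all(r["hard"])
        for r in results
    ) / n

    return {
        "delivery_rate": sum(r["delivered"] for r in results) / n,
        "commonsense_micro": micro("commonsense"),
        "commonsense_macro": macro("commonsense"),
        "hard_micro": micro("hard"),
        "hard_macro": macro("hard"),
        "final_pass_rate": final,
    }


# Usage with four made-up results (not data from the paper).
results = [
    {"delivered": True, "commonsense": [True, True], "hard": [True]},
    {"delivered": True, "commonsense": [True, False], "hard": [False]},
    {"delivered": False, "commonsense": [False, False], "hard": [False]},
    {"delivered": True, "commonsense": [True, True], "hard": [False]},
]
print(score(results))
```

In this toy input, three of four plans are delivered but only one passes every check, which mirrors the pattern of a high delivery rate alongside a low final pass rate.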

Key Findings

State-of-the-art LLMs largely fail to produce fully feasible travel plans.

Numbers: GPT-4 final pass rate = 0.6% on test set (two-stage)

Practical Use: Do not expect current off-the-shelf agents to reliably solve multi-constraint planning; use human verification or simpler workflows in production.

Evidence Ref: Table 3 (test, two-stage)

Providing full information (sole-planning) helps but does not solve the problem.

Numbers: Direct GPT-4 final pass rate ≈ 4.4% (validation/test) vs 0.6% two-stage

Practical Use: Separating information retrieval from planning (human or reliable retriever) improves results; invest engineering effort in robust data collection before planning.

Evidence Ref: Table 3 (validation & test)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Final Pass Rate (GPT-4-Turbo, two-stage, test) | 0.6% | - | - | Test (two-stage) | GPT-4 final pass rate 0.6% on test | Table 3
Delivery Rate (GPT-4-Turbo, two-stage, test) | 93.1% | - | - | Test (two-stage) | High delivery rate but low final pass | Table 3

What To Try In 7 Days

Run TravelPlanner (or a subset) on your agent to measure tool-use and final pass gaps.

Add argument validation and loop-detection for tool calls to cut invalid-action errors.

Separate reliable data retrieval from planning (human-orchestrated retriever) and run sole-planning to improve results quickly.
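The argument-validation and loop-detection suggestion above could look roughly like the sketch below. The tool names come from the benchmark's tool list, but the argument schemas, the `validate_call` helper, and the `LoopDetector` class with its window and repeat thresholds are all hypothetical choices made for illustration.

```python
from collections import deque

# Hypothetical argument schemas; the real tools may take different parameters.
TOOL_SCHEMAS = {
    "FlightSearch": ("origin", "destination", "date"),
    "CitySearch": ("state",),
}


def validate_call(tool, args):
    """Return a list of problems; an empty list means the call looks well-formed."""
    if tool not in TOOL_SCHEMAS:
        return [f"unknown tool: {tool}"]
    expected = TOOL_SCHEMAS[tool]
    problems = [f"missing argument: {name}" for name in expected
                if not args.get(name)]
    problems += [f"unexpected argument: {name}" for name in args
                 if name not in expected]
    return problems


class LoopDetector:
    """Flags an agent that keeps issuing the same call within a recent window."""

    def __init__(self, window=6, max_repeats=3):
        self.recent = deque(maxlen=window)  # most recent calls only
        self.max_repeats = max_repeats

    def record(self, tool, args):
        """Store the call and return True if it has now repeated too often."""
        key = (tool, tuple(sorted(args.items())))
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats


# Usage: reject a malformed call, then detect a repeated one.
print(validate_call("FlightSearch", {"origin": "Seattle"}))
detector = LoopDetector()
call = ("CitySearch", {"state": "Texas"})
for _ in range(3):
    looping = detector.record(*call)
print(looping)
```

Validation errors can be fed back to the agent as an observation, while a loop flag is better handled by the controller, for example by forcing a different action or ending the episode.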

Agent Features

Memory
Notebook (task working memory); short-term in-context memory (working memory)
Planning
ReAct, Reflexion, Direct, Zero-shot Chain-of-Thought
Tool Use
FlightSearch, DistanceMatrix, RestaurantSearch, AttractionSearch, AccommodationSearch, CitySearch, NotebookWrite
Frameworks
ReAct, Reflexion, Direct, ZS-CoT
Is Agentic

Yes

Architectures
Large language models (decoder/decoder-encoder style)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Static sandbox may not reflect live web changes or noisy real-world APIs.

Commonsense evaluation is author-defined and may not match all user expectations.

When Not To Use

For narrow single-objective tasks where traditional benchmarks suffice.

When you need live, up-to-date web data rather than a controlled static sandbox.

Failure Modes

Incorrect or malformed tool arguments (argument errors)

Dead loops / repeated invalid actions

Core Entities

Models

GPT-4-Turbo, GPT-3.5-Turbo, Gemini Pro, Mixtral-8x7B-MoE, Mistral-7B-32K

Metrics

Delivery Rate, Commonsense Pass Rate (micro), Commonsense Pass Rate (macro), Hard Constraint Pass Rate (micro), Hard Constraint Pass Rate (macro), Final Pass Rate

Datasets

TravelPlanner (1,225 queries, static sandbox ~4M entries)

Benchmarks

TravelPlanner