TravelPlanner: a realistic travel-planning benchmark — GPT-4 reaches only 0.6% full success on test tasks

February 2, 2024 · 7 min read

Overview

Production Readiness

1

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

13

Authors

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su

Why It Matters For Business

Current LLM agents are not yet reliable enough to fully automate complex multi-constraint planning, but they can draft plans quickly and cut human effort when paired with verification and robust data collection.

Summary TLDR

TravelPlanner is a realistic benchmark that tests language agents on multi-day travel planning with many interdependent decisions and constraints. It provides a static sandbox of about 4 million records, six queryable tools (flights, hotels, restaurants, attractions, distances, city search), and 1,225 human-validated user queries with reference plans. Evaluations across five LLMs and four planning strategies show that current agents fail at holistic constrained planning: GPT-4 achieves a 0.6% final pass rate in the full (two-stage) setting, and other models score near zero. Agents struggle to supply correct tool arguments, fall into repetitive dead loops, lose track of multiple constraints, and violate global budget and night constraints.

Problem Statement

Can modern LLM-powered agents perform realistic long-horizon, multi-constraint planning that requires iterative tool use, memory, and commonsense? TravelPlanner builds a controlled sandbox and a set of diverse travel queries to measure whether agents can gather information via tools and produce feasible plans that satisfy both explicit user constraints (budget, room rules, cuisine, transport) and commonsense constraints.

Main Contribution

A realistic planning benchmark focused on travel: 1,225 validated queries plus reference plans and a static sandbox of ~4M data entries accessible via six tools.

An automated evaluation suite that scores delivery rate, commonsense constraints (micro/macro), hard constraints (micro/macro), and final pass rate.

Comprehensive experiments across five LLMs (GPT-4-Turbo, GPT-3.5-Turbo, Gemini Pro, Mixtral-8x7B-MoE, Mistral-7B-32K) and four planning strategies (Direct, ZS-CoT, ReAct, Reflexion).

A diagnostic analysis of failure modes (tool-argument errors, dead loops, hallucinations, lost-in-the-middle) and concrete dataset/tool statistics to guide improvements.
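The micro and macro pass rates named in the evaluation suite can be illustrated with a small sketch. It assumes the usual reading of the two terms: micro pools every individual constraint check across plans, while macro counts a plan only if it passes all of its checks. The data layout, function names, and constraint labels below are hypothetical and are not taken from the benchmark's own evaluation scripts, which may differ in detail.

```python
from typing import Dict, List

# Assumption: each plan is summarized as {constraint name: passed?}.
# The constraint names used here are illustrative only.
PlanResult = Dict[str, bool]


def micro_pass_rate(results: List[PlanResult]) -> float:
    """Share of individual constraint checks that passed, pooled over all plans."""
    total = sum(len(r) for r in results)
    passed = sum(sum(r.values()) for r in results)
    return passed / total if total else 0.0


def macro_pass_rate(results: List[PlanResult]) -> float:
    """Share of plans that passed every one of their constraint checks."""
    if not results:
        return 0.0
    return sum(all(r.values()) for r in results) / len(results)


def final_pass_rate(commonsense: List[PlanResult], hard: List[PlanResult]) -> float:
    """Share of plans that passed all commonsense AND all hard constraints."""
    if not commonsense:
        return 0.0
    both = [all(c.values()) and all(h.values()) for c, h in zip(commonsense, hard)]
    return sum(both) / len(both)


# Three hypothetical plans: most checks pass, but only one plan passes everything.
hard = [
    {"budget": True, "cuisine": False},
    {"budget": True, "cuisine": True},
    {"budget": False, "cuisine": True},
]
commonsense = [
    {"within_sandbox": True},
    {"within_sandbox": True},
    {"within_sandbox": True},
]

print(micro_pass_rate(hard))               # 4 of 6 checks pass
print(macro_pass_rate(hard))               # 1 of 3 plans passes all checks
print(final_pass_rate(commonsense, hard))  # 1 of 3 plans passes both groups
```

The toy data shows why the two views can diverge: two thirds of the individual checks pass, yet only one plan in three is fully feasible.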

Key Findings

State-of-the-art LLMs largely fail to produce fully feasible travel plans.

Numbers: GPT-4 final pass rate = 0.6% on test set (two-stage)

Providing full information (sole-planning) helps but does not solve the problem.

Numbers: Direct GPT-4 final pass rate ≈ 4.4% (validation/test) vs 0.6% two-stage

Agents underuse tools compared to human reference plans.

Numbers: for 7-day plans, agent FlightSearch avg 0.8 calls vs reference 4.0 calls

Tool-use and environment feedback errors are common causes of failure.

Numbers: invalid actions = 37.3% of errors; repetitive action loops = 6.0%

Agents satisfy some constraints individually but fail holistically.

Numbers: high micro pass rates but near-zero macro pass rates for many agents (final pass rate often 0%)

Human annotators take ~12 minutes per plan; agents produce drafts in 1–2 minutes.

Numbers: human annotation time ≈ 12 min; agent generation 1–2 min

Results

Final Pass Rate (GPT-4-Turbo, two-stage, test)

Value: 0.6%

Delivery Rate (GPT-4-Turbo, two-stage, test)

Value: 93.1%

Commonsense Pass Rate (micro) (GPT-4-Turbo, two-stage, test)

Value: 63.3%

Hard Constraint Pass Rate (micro) (GPT-4-Turbo, two-stage, test)

Value: 10.5%

Final Pass Rate (Direct GPT-4, sole-planning, test)

Value: 4.4%

What To Try In 7 Days

Run TravelPlanner (or a subset) on your agent to measure tool-use and final pass gaps.

Add argument validation and loop-detection for tool calls to cut invalid-action errors.

Separate reliable data retrieval from planning (human-orchestrated retriever) and run sole-planning to improve results quickly.
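The argument-validation and loop-detection suggestion above could look roughly like the following sketch. It is one possible guard, not the benchmark's own tooling. The tool names FlightSearch and AccommodationSearch come from the benchmark, but the argument names, the `ToolCallGuard` class, and the window size are hypothetical choices made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Tuple

# Hypothetical schema: tool name -> argument names that must be non-empty.
# These argument names are assumptions, not the benchmark's real signatures.
REQUIRED_ARGS: Dict[str, Tuple[str, ...]] = {
    "FlightSearch": ("origin", "destination", "date"),
    "AccommodationSearch": ("city",),
}

Call = Tuple[str, Tuple[Tuple[str, str], ...]]


class ToolCallGuard:
    """Rejects malformed tool calls and flags a call repeated `window` times in a row."""

    def __init__(self, window: int = 3) -> None:
        self.recent: Deque[Call] = deque(maxlen=window)

    def check(self, tool: str, args: Dict[str, str]) -> str:
        if tool not in REQUIRED_ARGS:
            return f"invalid: unknown tool {tool!r}"
        missing = [name for name in REQUIRED_ARGS[tool] if not args.get(name)]
        if missing:
            return f"invalid: missing arguments {missing}"
        # Only well-formed calls enter the history used for loop detection.
        self.recent.append((tool, tuple(sorted(args.items()))))
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            return "loop: same call repeated"
        return "ok"


guard = ToolCallGuard(window=3)
good = {"origin": "Seattle", "destination": "Denver", "date": "2022-03-16"}

print(guard.check("FlightSearch", {"origin": "Seattle"}))  # rejected: arguments missing
print(guard.check("FlightSearch", good))                   # ok
print(guard.check("FlightSearch", good))                   # ok
print(guard.check("FlightSearch", good))                   # flagged as a loop
```

Returning a short status string lets the calling agent feed the message back into its context as an observation instead of executing the call. A real deployment would likely want richer type checks, such as date formats and known city names.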

Agent Features

Memory

  • Notebook (task working memory)
  • Short-term in-context memory (working memory)

Planning

  • ReAct
  • Reflexion
  • Direct
  • Zero-shot Chain-of-Thought

Tool Use

  • FlightSearch
  • DistanceMatrix
  • RestaurantSearch
  • AttractionSearch
  • AccommodationSearch
  • CitySearch
  • NotebookWrite

Frameworks

  • ReAct
  • Reflexion
  • Direct
  • ZS-CoT

Is Agentic

true

Architectures

  • Large language models (decoder/decoder-encoder style)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Static sandbox may not reflect live web changes or noisy real-world APIs.
  • Commonsense evaluation is author-defined and may not match all user expectations.
  • Budget recalibration from annotators might bias feasibility toward annotated plans.

When Not To Use

  • For narrow single-objective tasks where traditional benchmarks suffice.
  • When you need live, up-to-date web data rather than a controlled static sandbox.

Failure Modes

  • Incorrect or malformed tool arguments (argument errors)
  • Dead loops / repeated invalid actions
  • Hallucinations from missing data or confused information
  • Lost-in-the-middle when handling long contexts
  • Mismatch between internal reasoning and executed actions

Core Entities

Models

  • GPT-4-Turbo
  • GPT-3.5-Turbo
  • Gemini Pro
  • Mixtral-8x7B-MoE
  • Mistral-7B-32K

Metrics

  • Delivery Rate
  • Commonsense Pass Rate (micro)
  • Commonsense Pass Rate (macro)
  • Hard Constraint Pass Rate (micro)
  • Hard Constraint Pass Rate (macro)
  • Final Pass Rate

Datasets

  • TravelPlanner (1,225 queries, static sandbox ~4M entries)

Benchmarks

  • TravelPlanner