TravelPlanner: a realistic travel-planning benchmark — GPT-4 reaches only 0.6% full success on test tasks

Overview

Decision Snapshot: Ready For Pilot

The paper provides a well-documented benchmark, multiple model comparisons, and error analyses showing consistent failures; evidence is experimental and systematic but limited to a static sandbox.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 100%

Novelty: 70%

Authors

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Current LLM agents are not yet reliable enough to fully automate complex multi-constraint planning, but they can draft plans quickly and cut human effort when paired with verification and robust data collection.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

TravelPlanner is a realistic benchmark that tests language agents on multi-day travel planning with many interdependent decisions and constraints. It provides a static sandbox of about 4 million records, six queryable tools (flights, hotels, restaurants, attractions, distances, city search), and 1,225 human-validated user queries with reference plans. Evaluations across five LLMs and four planning strategies show that current agents fail at holistic constrained planning: GPT-4 achieves a 0.6% final pass rate in the full (two-stage) setting, and other models score near zero. Agents struggle with tool arguments, repetitive dead loops, tracking multiple constraints at once, and global constraints on budget and nights.

Problem Statement

Can modern LLM-powered agents perform realistic long-horizon, multi-constraint planning that requires iterative tool use, memory, and commonsense? TravelPlanner builds a controlled sandbox and a set of diverse travel queries to measure whether agents can gather information via tools and produce feasible plans that satisfy both explicit user constraints (budget, room rules, cuisine, transport) and commonsense constraints.

Main Contribution

A realistic planning benchmark focused on travel: 1,225 validated queries plus reference plans and a static sandbox of ~4M data entries accessible via six tools.

An automated evaluation suite that scores delivery rate, commonsense constraints (micro/macro), hard constraints (micro/macro), and final pass rate.
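The metrics named above can be illustrated with a minimal sketch. It assumes the usual reading of the terms: micro pools individual constraint checks, macro counts plans that pass every constraint in a category, and the final pass rate counts plans that pass both categories. The constraint names and the treatment of undelivered plans (`None`) are assumptions made for illustration, not the benchmark's official evaluation script.

```python
from typing import Dict, List, Optional

# One query's result: constraint name -> passed? None means no plan was delivered.
PlanResult = Optional[Dict[str, bool]]


def delivery_rate(results: List[PlanResult]) -> float:
    """Share of queries for which the agent delivered any plan at all."""
    return sum(r is not None for r in results) / len(results)


def micro_pass_rate(results: List[PlanResult]) -> float:
    """Passed constraint checks divided by all checks, pooled over delivered plans.

    Assumption: undelivered plans contribute no checks to the pool.
    """
    checks = [ok for r in results if r is not None for ok in r.values()]
    return sum(checks) / len(checks) if checks else 0.0


def macro_pass_rate(results: List[PlanResult]) -> float:
    """Share of all queries whose plan passes every constraint in the category."""
    return sum(r is not None and all(r.values()) for r in results) / len(results)


def final_pass_rate(commonsense: List[PlanResult], hard: List[PlanResult]) -> float:
    """Share of all queries whose plan passes every commonsense AND hard constraint."""
    passed = sum(
        c is not None and h is not None and all(c.values()) and all(h.values())
        for c, h in zip(commonsense, hard)
    )
    return passed / len(commonsense)


# Hypothetical results for three queries; the third produced no plan.
commonsense = [
    {"within_sandbox": True, "diverse_restaurants": True},
    {"within_sandbox": True, "diverse_restaurants": False},
    None,
]
hard = [{"budget": True}, {"budget": True}, None]
```

In this toy data the second plan satisfies most individual checks, so the micro rate looks healthy while the macro and final rates stay low. The same gap shows up in the reported results, where delivery is high and final pass is near zero.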

Key Findings

State-of-the-art LLMs largely fail to produce fully feasible travel plans.

Numbers: GPT-4 final pass rate = 0.6% on test set (two-stage)

Practical Use: Do not expect current off-the-shelf agents to reliably solve multi-constraint planning; use human verification or simpler workflows in production.

Evidence Ref: Table 3 (test, two-stage)

Providing full information (sole-planning) helps but does not solve the problem.

Numbers: Direct GPT-4 final pass rate ≈ 4.4% (validation/test) vs 0.6% two-stage

Practical Use: Separating information retrieval from planning (human or reliable retriever) improves results; invest engineering effort in robust data collection before planning.

Evidence Ref: Table 3 (validation & test)
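One way to act on the verification advice in these findings is to run a cheap automatic pre-check on a drafted plan before a human reviews it. The sketch below is illustrative only: the `PlanItem` fields, the `check_plan` helper, and the three rules it applies (total budget, every day covered, accommodation for every night but the last) are assumptions, not the benchmark's own constraint definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanItem:
    day: int     # 1-based day of the trip
    kind: str    # hypothetical categories: "flight", "hotel", "meal", ...
    name: str
    cost: float


def check_plan(items: List[PlanItem], budget: float, days: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Global constraint: the whole trip must fit the budget.
    total = sum(item.cost for item in items)
    if total > budget:
        problems.append(f"total cost {total:.2f} exceeds budget {budget:.2f}")

    # Every day of the trip should have at least one planned item.
    covered = {item.day for item in items}
    for day in range(1, days + 1):
        if day not in covered:
            problems.append(f"day {day} has no planned items")

    # Every night except the last needs accommodation.
    hotel_days = {item.day for item in items if item.kind == "hotel"}
    for night in range(1, days):
        if night not in hotel_days:
            problems.append(f"no accommodation for night {night}")

    return problems


# Hypothetical 3-day draft that overspends and forgets the second night's hotel.
draft = [
    PlanItem(1, "flight", "F1", 300.0),
    PlanItem(1, "hotel", "H1", 120.0),
    PlanItem(2, "meal", "R1", 40.0),
    PlanItem(3, "flight", "F2", 280.0),
]
problems = check_plan(draft, budget=700.0, days=3)
```

A check like this only catches violations that can be computed from the plan itself. It does not replace review of commonsense issues such as unreasonable routing.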

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Final Pass Rate (GPT-4-Turbo, two-stage, test)	0.6%	—	—	Test (two-stage)	GPT-4 final pass rate 0.6% on test	Table 3
Delivery Rate (GPT-4-Turbo, two-stage, test)	93.1%	—	—	Test (two-stage)	High delivery rate but low final pass	Table 3

What To Try In 7 Days

Run TravelPlanner (or a subset) on your agent to measure tool-use and final pass gaps.

Add argument validation and loop-detection for tool calls to cut invalid-action errors.

Separate reliable data retrieval from planning (human-orchestrated retriever) and run sole-planning to improve results quickly.
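The argument-validation and loop-detection step above could start as a small guard that sits between the agent and its tools. This is a minimal sketch under stated assumptions: the `ToolCallGuard` class, the `TOOL_SCHEMAS` table, and the argument names are hypothetical and do not reproduce the benchmark's actual tool signatures.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical schemas: tool name -> required argument names.
TOOL_SCHEMAS: Dict[str, Set[str]] = {
    "FlightSearch": {"origin", "destination", "date"},
    "CitySearch": {"state"},
}


class ToolCallGuard:
    """Rejects malformed tool calls and calls repeated too many times."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self.seen: Counter = Counter()

    def check(self, tool: str, args: Dict[str, str]) -> Tuple[bool, str]:
        # Count every attempt, valid or not, so repeated invalid actions
        # are also flagged as a loop.
        key = (tool, tuple(sorted(args.items())))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return False, "repeated identical call; likely a dead loop"

        if tool not in TOOL_SCHEMAS:
            return False, f"unknown tool: {tool}"

        missing = TOOL_SCHEMAS[tool] - set(args)
        if missing:
            return False, f"missing arguments: {sorted(missing)}"

        empty = sorted(k for k in TOOL_SCHEMAS[tool] if not args[k].strip())
        if empty:
            return False, f"empty arguments: {empty}"

        return True, "ok"


guard = ToolCallGuard(max_repeats=2)
call = {"origin": "A", "destination": "B", "date": "2022-03-16"}
first = guard.check("FlightSearch", call)
second = guard.check("FlightSearch", call)
third = guard.check("FlightSearch", call)  # exceeds max_repeats
```

Returning the reason string lets the agent loop feed the rejection back into the prompt instead of silently retrying, which is a design choice of this sketch rather than something the paper prescribes.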

Agent Features

Memory

Notebook (task working memory), Short-term in-context memory (working memory)

Planning

ReAct, Reflexion, Direct, Zero-shot Chain-of-Thought

Tool Use

FlightSearch, DistanceMatrix, RestaurantSearch, AttractionSearch, AccommodationSearch, CitySearch, NotebookWrite

Frameworks

ReAct, Reflexion, Direct, ZS-CoT

Is Agentic

Yes

Architectures

Large language models (decoder/decoder-encoder style)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://osu-nlp-group.github.io/TravelPlanner

Data URLs

https://osu-nlp-group.github.io/TravelPlanner

Risks & Boundaries

Limitations

Static sandbox may not reflect live web changes or noisy real-world APIs.

Commonsense evaluation is author-defined and may not match all user expectations.

When Not To Use

For narrow single-objective tasks where traditional benchmarks suffice.

When you need live, up-to-date web data rather than a controlled static sandbox.

Failure Modes

Incorrect or malformed tool arguments (argument errors)

Dead loops / repeated invalid actions

Core Entities

Models

GPT-4-Turbo, GPT-3.5-Turbo, Gemini Pro, Mixtral-8x7B-MoE, Mistral-7B-32K

Metrics

Delivery Rate, Commonsense Pass Rate (micro), Commonsense Pass Rate (macro), Hard Constraint Pass Rate (micro), Hard Constraint Pass Rate (macro), Final Pass Rate

Datasets

TravelPlanner (1,225 queries, static sandbox ~4M entries)

Benchmarks

TravelPlanner

