Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
5
Why It Matters For Business
TDAG reduces failure cascades and improves partial progress tracking, so agent-driven multi-step workflows are more reliable and auditable.
Summary TLDR
This paper introduces TDAG, a multi-agent system that (1) dynamically decomposes a complex task into subtasks that can change as results arrive, and (2) auto-generates a tailored subagent (via LLM prompting) for each subtask. The authors pair TDAG with ItineraryBench, a travel-planning benchmark that scores partial progress at three levels (executability, constraint satisfaction, efficiency). On ItineraryBench, TDAG averages 49.08, against roughly 43 to 45 for the baselines, and ablations show that both dynamic decomposition and agent generation contribute to that gain. Code and data are available.
Problem Statement
LLM-based agents struggle on long, multi-step real-world tasks because fixed task decompositions cause error propagation and manually built subagents lack adaptability. Existing benchmarks often report only binary success/failure and miss partial progress.
Main Contribution
ItineraryBench: a travel-planning benchmark with 364 test scenarios and fine-grained, three-level scoring.
TDAG: a multi-agent framework that dynamically adjusts task decomposition and generates subagents tailored per subtask.
Empirical evaluation and ablations showing TDAG improves overall scores and reduces cascading failures compared to popular baselines.
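To make the contrast between binary success and three-level scoring concrete, here is a minimal sketch. The `PlanCheck` type, the `tiered_score` function, and the 40/40/20 point split are all hypothetical placeholders chosen for illustration; they are not ItineraryBench's actual scoring rules, which this summary does not spell out.

```python
from dataclasses import dataclass


@dataclass
class PlanCheck:
    """Hypothetical per-plan measurements, each a fraction in [0, 1]."""
    executability: float  # share of plan steps that can actually be carried out
    constraints: float    # share of user constraints the plan satisfies
    efficiency: float     # how close the plan is to the best known plan


def tiered_score(check: PlanCheck, points=(40, 40, 20)) -> float:
    """Partial-credit score out of 100. The point split is an assumption."""
    exec_pts, cons_pts, eff_pts = points
    return (
        exec_pts * check.executability
        + cons_pts * check.constraints
        + eff_pts * check.efficiency
    )


def binary_success(check: PlanCheck) -> bool:
    """All-or-nothing view: any shortfall counts as failure."""
    return (
        check.executability == 1.0
        and check.constraints == 1.0
        and check.efficiency == 1.0
    )


# A plan that runs end to end but meets only half the constraints.
partial = PlanCheck(executability=1.0, constraints=0.5, efficiency=0.0)
partial_score = tiered_score(partial)
partial_binary = binary_success(partial)
```

Under the binary view this plan is simply a failure, while the tiered view records that it was executable and partly constraint-compliant, which is the kind of partial progress the benchmark is designed to surface.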
Key Findings
TDAG achieves a higher average score on ItineraryBench than the baselines
Removing components degrades performance
TDAG greatly reduces cascading task failures
TDAG generalizes to other simulated tasks
Results
ItineraryBench average score
Ablation: remove agent generation
Cascading Task Failure (CTF) share
WebShop reward / success
TextCraft success rate
Who Should Care
What To Try In 7 Days
Run ItineraryBench on your agent to measure partial-task performance.
Prototype dynamic decomposition: split a complex workflow and replan when a subtask fails.
Generate simple subagents via LLM prompts and add a small skill library for reuse.
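The dynamic-decomposition prototype suggested above can be sketched as a loop that re-derives the remaining subtasks whenever one fails. This is a minimal sketch, not TDAG's implementation: `run_with_replanning`, `demo_decompose`, and `demo_execute` are hypothetical names, and the two demo functions are stubs standing in for what would be LLM calls (the planner and the generated subagents).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubtaskResult:
    name: str
    ok: bool
    output: str


def run_with_replanning(
    task: str,
    decompose: Callable[[str, List[SubtaskResult]], List[str]],
    execute: Callable[[str], SubtaskResult],
    max_replans: int = 3,
) -> List[SubtaskResult]:
    """Run subtasks in order; on failure, ask the planner for a new tail."""
    history: List[SubtaskResult] = []
    pending = decompose(task, history)
    replans = 0
    while pending:
        result = execute(pending.pop(0))
        history.append(result)
        if not result.ok:
            if replans >= max_replans:
                break  # give up, but keep the partial progress in history
            replans += 1
            # Re-derive the remaining subtasks from everything seen so far.
            pending = decompose(task, history)
    return history


def demo_decompose(task: str, history: List[SubtaskResult]) -> List[str]:
    """Stub planner: switches from train to bus once the train booking fails."""
    done = {r.name for r in history if r.ok}
    failed = {r.name for r in history if not r.ok}
    plan = ["find_route", "book_train", "book_hotel"]
    if "book_train" in failed:
        plan = ["find_route", "book_bus", "book_hotel"]
    return [step for step in plan if step not in done]


def demo_execute(name: str) -> SubtaskResult:
    """Stub subagent: every subtask succeeds except the train booking."""
    return SubtaskResult(name, ok=(name != "book_train"), output=f"ran {name}")


history = run_with_replanning("plan a two-city trip", demo_decompose, demo_execute)
executed = [r.name for r in history]
```

Because completed subtasks stay in `history` and are excluded from the new plan, a single failure changes only the remaining steps instead of invalidating the whole run. The `max_replans` cap is an added safeguard against a planner that keeps proposing failing steps.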
Agent Features
Memory
- incremental skill library (retrieval via SentenceBERT)
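A skill library with similarity-based retrieval can be sketched as follows. The summary says retrieval uses SentenceBERT; to stay self-contained, this sketch substitutes a toy bag-of-words embedding (`toy_embed`), which is an assumption made for illustration only. The `SkillLibrary` class, its method names, and the example skills are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    """Grows incrementally; retrieves the stored skills closest to a query."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed):
        self.embed = embed
        self.skills: List[Tuple[str, str, Vector]] = []

    def add(self, description: str, procedure: str) -> None:
        self.skills.append((description, procedure, self.embed(description)))

    def retrieve(self, query: str, k: int = 1) -> List[Tuple[str, str]]:
        q = self.embed(query)
        ranked = sorted(self.skills, key=lambda s: cosine(q, s[2]), reverse=True)
        return [(desc, proc) for desc, proc, _ in ranked[:k]]


lib = SkillLibrary()
lib.add("book a train ticket between two cities",
        "query the train table, then pick the cheapest valid option")
lib.add("find a hotel near an attraction",
        "query the hotel table and filter by distance")
top = lib.retrieve("book a train from Beijing to Shanghai")
```

Swapping `toy_embed` for a real sentence encoder only requires passing a different `embed` callable, provided `cosine` is adapted to that encoder's vector type. Retrieved skills should still be treated as hints, since nothing here verifies that a stored procedure is correct.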
Planning
- dynamic task decomposition
- sequential subtask planning
Tool Use
- database access
- python interpreter
Frameworks
- TDAG
Is Agentic
true
Architectures
- multi-agent
Collaboration
- main agent coordinates subagents
Optimization Features
Token Efficiency
- decomposition reduces irrelevant context
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Benchmark focuses on travel planning; generality beyond tested simulators is limited.
- Skill correctness in the library is not guaranteed and requires ongoing refinement.
- Approach increases LLM call volume and cost due to agent generation and summaries.
- Tool set is narrow (database + Python); real-world tool diversity not evaluated.
When Not To Use
- For cheap, single-step tasks where a single LLM is sufficient.
- When token/compute budget cannot afford multiple generated subagents per task.
Failure Modes
- Cascading failures if decomposition or replan logic is flawed.
- Hallucinated external information that conflicts with the contents of the databases.
- Skill drift: stored skills become outdated or incorrect over time.
Core Entities
Models
- gpt-3.5-turbo-16k
- gpt-3.5-turbo
- gpt-3.5-turbo-instruct
- all-mpnet-base-v2 (SentenceBERT)
Metrics
- three-level fine-grained score (Executability / Constraint / Efficiency)
- binary success (for comparison)
- reward score (WebShop)
- success rate (TextCraft)
Datasets
- ItineraryBench (new)
- WebShop
- TextCraft
Benchmarks
- ItineraryBench
- WebShop
- TextCraft

