TDAG: dynamically split complex tasks and auto-generate subagents to improve multi-step agent performance

February 15, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

5

Authors

Yaoxiang Wang, Zhiyong Wu, Junfeng Yao, Jinsong Su

Why It Matters For Business

TDAG reduces failure cascades and improves partial progress tracking, so agent-driven multi-step workflows are more reliable and auditable.

Summary TLDR

This paper introduces TDAG, a multi-agent system that (1) dynamically decomposes a complex task into subtasks that can change as results arrive, and (2) auto-generates a tailored subagent (via LLM prompting) for each subtask. The authors pair TDAG with ItineraryBench, a travel-planning benchmark that scores partial progress at three levels (executability, constraint satisfaction, efficiency). On ItineraryBench, TDAG averages 49.08 against roughly 43 to 45 for the baselines, and ablations show that both dynamic decomposition and agent generation contribute to the gain. Code and data are available.

Problem Statement

LLM-based agents struggle on long, multi-step real-world tasks because fixed task decompositions cause error propagation and manually built subagents lack adaptability. Existing benchmarks often report only binary success/failure and miss partial progress.

Main Contribution

ItineraryBench: a travel-planning benchmark with 364 test scenarios and fine-grained, three-level scoring.

TDAG: a multi-agent framework that dynamically adjusts task decomposition and generates subagents tailored per subtask.

Empirical evaluation and ablations showing TDAG improves overall scores and reduces cascading failures compared to popular baselines.

Key Findings

TDAG achieves a higher average score on ItineraryBench than the baselines

Numbers: TDAG avg 49.08 vs ReAct 43.02 (Table 2)

Removing components degrades performance

Numbers: w/o agent generation avg 46.69; w/o dynamic decomposition avg 46.23 (Table 2)

TDAG greatly reduces cascading task failures

Numbers: CTF share: TDAG 4.35% vs ReAct 32.61% (Table 3)

TDAG generalizes to other simulated tasks

Numbers: WebShop reward 64.5 vs ReAct 42.1; TextCraft success 73.5% vs ReAct 19% (Table 4)

Results

ItineraryBench average score

Value: 49.08 (TDAG)

Baseline: ReAct 43.02

Ablation: remove agent generation

Value: 46.69 (TDAG w/o Agent Generation)

Baseline: TDAG 49.08

Cascading Task Failure (CTF) share

Value: 4.35% (TDAG)

Baseline: ReAct 32.61%

WebShop reward / success

Value: 64.5 reward, 45.0% success (TDAG)

Baseline: ReAct 42.1 reward, 29.0% success

TextCraft success rate

Value: 73.5% (TDAG)

Baseline: ReAct 19.0%
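
To put the figures above on a common footing, the short script below computes the absolute and relative gain of TDAG over ReAct for each benchmark. It uses only the numbers reported in this summary; the dictionary keys and the `gains` helper are illustrative names, not from the paper.

```python
# Figures copied from the Results entries above (TDAG, ReAct).
results = {
    "ItineraryBench avg score": (49.08, 43.02),
    "WebShop reward": (64.5, 42.1),
    "WebShop success %": (45.0, 29.0),
    "TextCraft success %": (73.5, 19.0),
}


def gains(tdag: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = tdag - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


summary = {name: gains(t, b) for name, (t, b) in results.items()}
for name, (abs_gain, rel_gain) in summary.items():
    print(f"{name}: +{abs_gain} points ({rel_gain:+}% relative to ReAct)")
```

On ItineraryBench the gain is about 6 points (roughly 14% relative), while the TextCraft gap is far larger in relative terms because the ReAct baseline is low.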

Who Should Care

What To Try In 7 Days

Run ItineraryBench on your agent to measure partial-task performance.

Prototype dynamic decomposition: split a complex workflow and replan when a subtask fails.

Generate simple subagents via LLM prompts and add a small skill library for reuse.
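
The prototype steps above can be sketched as a small control loop. This is a minimal sketch, not the authors' implementation: `run_task`, `SubtaskResult`, and the two callables are hypothetical names, and the toy functions stand in for what would be LLM calls in a real system. The key idea shown is that the remaining plan is rebuilt after every subtask result, and each subtask gets its own freshly generated agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubtaskResult:
    subtask: str
    ok: bool
    output: str


# Hypothetical interfaces; in practice each would wrap an LLM call.
Decompose = Callable[[str, list[SubtaskResult]], list[str]]
MakeAgent = Callable[[str], Callable[[str], SubtaskResult]]


def run_task(task: str, decompose: Decompose, make_agent: MakeAgent,
             max_steps: int = 10) -> list[SubtaskResult]:
    """Run subtasks one at a time, re-planning the remainder after each result."""
    history: list[SubtaskResult] = []
    while len(history) < max_steps:
        pending = decompose(task, history)   # plan depends on results so far
        if not pending:
            break
        subtask = pending[0]
        agent = make_agent(subtask)          # subagent tailored to this subtask
        history.append(agent(subtask))
    return history


def toy_decompose(task: str, history: list[SubtaskResult]) -> list[str]:
    """Toy planner: swaps in an alternative when a step has failed."""
    done = {r.subtask for r in history if r.ok}
    failed = {r.subtask for r in history if not r.ok}
    plan = ["book train", "book hotel", "plan sightseeing"]
    if "book train" in failed:
        plan[0] = "book bus"                 # replan around the failed step
    return [s for s in plan if s not in done]


def toy_make_agent(subtask: str) -> Callable[[str], SubtaskResult]:
    """Toy generator: the prompt string stands in for an LLM-written one."""
    prompt = f"You are a specialist for: {subtask}"

    def agent(s: str) -> SubtaskResult:
        ok = s != "book train"               # simulate one failing subtask
        return SubtaskResult(s, ok, f"[{prompt}] {'done' if ok else 'no seats'}")

    return agent


history = run_task("3-day trip", toy_decompose, toy_make_agent)
for r in history:
    print(r.subtask, "ok" if r.ok else "failed")
```

In this toy run the failed train booking does not sink the later steps: the planner substitutes a bus and the hotel and sightseeing subtasks still complete. The `max_steps` cap is a design choice to bound cost if replanning never converges.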

Agent Features

Memory

  • incremental skill library (retrieval via SentenceBERT)
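
A skill library with similarity-based retrieval might look like the sketch below. It is an illustration under stated assumptions: `SkillLibrary`, `toy_embed`, and `cosine` are hypothetical names, and the bag-of-words embedding is a stand-in for a sentence-embedding model, not SentenceBERT itself. The `embed` parameter is where a real encoder (for example all-mpnet-base-v2, listed under Models) would be plugged in, adapted to return whatever vector type `cosine` expects.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Bag-of-words stand-in for a sentence-embedding model (NOT SentenceBERT)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    """Stores (description, procedure) pairs and retrieves them by similarity."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed):
        self.embed = embed
        self.skills: list[tuple[str, str, Vector]] = []

    def add(self, description: str, procedure: str) -> None:
        # Incremental: each solved subtask can append a new skill.
        self.skills.append((description, procedure, self.embed(description)))

    def retrieve(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        q = self.embed(query)
        ranked = sorted(self.skills, key=lambda s: cosine(q, s[2]), reverse=True)
        return [(d, p) for d, p, _ in ranked[:k]]


lib = SkillLibrary()
lib.add("book a train ticket between two cities",
        "query the train table, pick the earliest feasible departure")
lib.add("find a hotel near an attraction",
        "query hotels, filter by distance, sort by price")
top = lib.retrieve("book train from Shanghai to Beijing")
print(top[0][0])
```

Because stored skills are not guaranteed to be correct, a retrieved procedure is best treated as a hint for the subagent rather than as trusted code.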

Planning

  • dynamic task decomposition
  • sequential subtask planning

Tool Use

  • database access
  • python interpreter

Frameworks

  • TDAG

Is Agentic

true

Architectures

  • multi-agent

Collaboration

  • main agent coordinates subagents

Optimization Features

Token Efficiency

  • decomposition reduces irrelevant context

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Benchmark focuses on travel planning; generality beyond tested simulators is limited.
  • Skill correctness in the library is not guaranteed and requires ongoing refinement.
  • Approach increases LLM call volume and cost due to agent generation and summaries.
  • Tool set is narrow (database + Python); real-world tool diversity not evaluated.

When Not To Use

  • For cheap, single-step tasks where a single LLM is sufficient.
  • When token/compute budget cannot afford multiple generated subagents per task.

Failure Modes

  • Cascading failures if decomposition or replan logic is flawed.
  • Hallucinations: the agent asserts details that conflict with the information in the databases.
  • Skill drift: stored skills become outdated or incorrect over time.

Core Entities

Models

  • gpt-3.5-turbo-16k
  • gpt-3.5-turbo
  • gpt-3.5-turbo-instruct
  • all-mpnet-base-v2 (SentenceBERT)

Metrics

  • three-level fine-grained score (Executability / Constraint / Efficiency)
  • binary success (for comparison)
  • reward score (WebShop)
  • success rate (TextCraft)
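
To illustrate how a three-level score can credit partial progress where a binary metric reports only failure, here is a hedged sketch. The weights, the gating rule (a later level earns credit only once the earlier one is fully passed), the efficiency ratio, and every name in the code are assumptions made for illustration; this summary does not state how ItineraryBench combines its levels.

```python
from dataclasses import dataclass


@dataclass
class PlanCheck:
    executable_steps: int    # steps that ran without error
    total_steps: int         # assumed > 0
    constraints_met: int
    total_constraints: int   # assumed > 0
    cost: float              # e.g. time or money used by the plan, assumed > 0
    best_known_cost: float   # reference cost for the same scenario


# Hypothetical weights; not taken from the paper.
WEIGHTS = {"executability": 0.6, "constraint": 0.2, "efficiency": 0.2}


def score(c: PlanCheck) -> float:
    """Return a 0-100 score with partial credit at each level (assumed scheme)."""
    exe = c.executable_steps / c.total_steps
    total = WEIGHTS["executability"] * exe
    if exe == 1.0:                           # level 2 only if fully executable
        con = c.constraints_met / c.total_constraints
        total += WEIGHTS["constraint"] * con
        if con == 1.0:                       # level 3 only if all constraints hold
            eff = min(1.0, c.best_known_cost / c.cost)
            total += WEIGHTS["efficiency"] * eff
    return round(100 * total, 2)


def binary_success(c: PlanCheck) -> int:
    """Binary metric for comparison: 1 only if everything is satisfied."""
    return int(c.executable_steps == c.total_steps
               and c.constraints_met == c.total_constraints)


partial = PlanCheck(3, 4, 0, 2, cost=120.0, best_known_cost=100.0)
feasible = PlanCheck(4, 4, 1, 2, cost=120.0, best_known_cost=100.0)
print(score(partial), binary_success(partial))
print(score(feasible), binary_success(feasible))
```

Both example plans count as failures under the binary metric, yet the graded score separates the plan that executes fully from the one that breaks partway through.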

Datasets

  • ItineraryBench (new)
  • WebShop
  • TextCraft

Benchmarks

  • ItineraryBench
  • WebShop
  • TextCraft