StackPlanner: centralized coordinator + active task stack + reusable experience memory for stable long-horizon multi-agent collaboration

January 9, 2026 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Ruizhe Zhang, Xinke Jiang, Zhibang Yang, Zhixin Zhang, Jiaran Gao, Yuzhen Xiao, Hongbin Lai, Xu Chu, Junfeng Zhao, Yasha Wang

Links

Abstract / PDF

Why It Matters For Business

Active memory control and reusable experience reduce error propagation in multi-agent workflows, improving reliability and reuse across tasks so teams get better multi-step outputs with fewer retries.

Summary TLDR

StackPlanner is a centralized, hierarchical multi-agent framework that treats memory as an explicit control target. A central coordinator issues PLAN/DELEGATE/REVISE actions while sub-agents execute tasks. Key features: (1) an active task-memory stack with explicit update/condense/prune (REVISE) operations to avoid context bloat; (2) a structured experience memory (user profiles, semantic facts, procedural SOPs) to reuse coordination experience; (3) a coordinator trained with a token-level RL scheme (GRPO) that interleaves retrieval, reasoning, and memory actions. On multi-hop QA and agentic benchmarks, StackPlanner outperforms baselines (e.g., 32.92% vs 29.55% F1 on 2Wiki with Qwen2.5-3B) and

Problem Statement

Centralized multi-agent coordinators suffer two linked problems: (1) task memory grows noisy and bloated during long, multi-step workflows, causing error accumulation and degraded plans; (2) coordinators lack reusable cross-task experience, so they cold-start on new tasks and fail to generalize coordination strategies.

Main Contribution

A hierarchical centralized architecture that decouples high-level coordination from sub-agent execution.

An active task-memory stack with explicit REVISE actions: Update, Condense (summarize), and Prune.

A structured experience memory storing user profiles, factual (semantic) memory, and procedural SOPs for cross-task reuse.

A reinforcement-learning training pipeline for the coordinator using Group Relative Policy Optimization (GRPO) that conditions on retrieval and memory operations.

Empirical evaluation on multi-hop QA and agentic benchmarks showing improved F1 and better generalization.
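The task-memory stack and its three REVISE operations can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names `MemoryEntry` and `TaskMemoryStack`, the method signatures, and the string-joining summarizer are hypothetical and are not taken from the paper's code. In the paper, the coordinator decides when to apply each operation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryEntry:
    step: int      # coordinator step that produced the entry
    content: str   # free-text record of a plan, result, or observation


@dataclass
class TaskMemoryStack:
    """Hypothetical task-memory stack exposing Update, Condense, and Prune."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def push(self, step: int, content: str) -> None:
        self.entries.append(MemoryEntry(step, content))

    def update(self, index: int, content: str) -> None:
        # Update: overwrite an entry in place, e.g. after a corrected result.
        self.entries[index].content = content

    def condense(self, start: int, end: int,
                 summarize: Callable[[List[str]], str]) -> None:
        # Condense: replace entries[start:end] with a single summary entry.
        span = self.entries[start:end]
        if not span:
            return
        summary = summarize([e.content for e in span])
        self.entries[start:end] = [MemoryEntry(span[0].step, summary)]

    def prune(self, keep: Callable[[MemoryEntry], bool]) -> List[MemoryEntry]:
        # Prune: drop entries that fail keep(); return the dropped ones.
        dropped = [e for e in self.entries if not keep(e)]
        self.entries = [e for e in self.entries if keep(e)]
        return dropped


stack = TaskMemoryStack()
stack.push(0, "plan: find the director of film A")
stack.push(1, "search: unrelated result about film C")
stack.push(2, "search: film A was directed by B")
stack.update(2, "search: film A was directed by B (verified)")
dropped = stack.prune(lambda e: "unrelated" not in e.content)
# A real system would call an LLM summarizer; joining strings is a stand-in.
stack.condense(0, 2, lambda texts: " | ".join(texts))
print([e.content for e in stack.entries])
```

Returning the dropped entries from `prune` is a design choice of this sketch: it lets the caller inspect or log what was removed.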

Key Findings

StackPlanner yields higher F1 than prior agentic RL baselines on multi-hop QA.

Numbers: 2Wiki F1 32.92% (Ours, Qwen2.5-3B) vs 29.55% (ARPO); +3.37 pts

Memory modules materially improve performance; removing both causes the largest drop.

Numbers: Removing both memories drops 2Wiki F1 by 15.80 pts; other datasets drop 5–10 pts

Experience memory particularly helps multi-step retrieval tasks.

Numbers: Excluding experience memory reduces MuSiQue F1 by 7.49 pts (Qwen2.5-3B)

Results

F1: Value 32.92%, Baseline ARPO 29.55%

F1: Value 16.48%, Baseline ARPO 13.38%

F1: Value 7.71%, Baseline ARPO 7.71%

F1: Value 16.23%, Baseline ARPO 13.49%

F1: Value 38.34%, Baseline ARPO 30.71%
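To make the margins explicit, the short script below computes the absolute F1 gain for the five value/baseline pairs listed above, in the same order. The summary does not say which dataset or model each row belongs to, except that the first pair matches the 2Wiki numbers quoted under Key Findings, so the rows are left unlabeled.

```python
# (StackPlanner F1, ARPO F1) pairs in the order they appear in the Results list.
# Dataset and model labels are not given in the summary, so none are assumed.
rows = [
    (32.92, 29.55),
    (16.48, 13.38),
    (7.71, 7.71),
    (16.23, 13.49),
    (38.34, 30.71),
]

# Absolute gain in F1 percentage points, rounded to two decimals.
deltas = [round(ours - baseline, 2) for ours, baseline in rows]
print(deltas)  # [3.37, 3.1, 0.0, 2.74, 7.63]
```

The third pair shows no difference between StackPlanner and the baseline as reported here.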

Who Should Care

What To Try In 7 Days

Add a central coordinator that issues high-level PLAN/DELEGATE/REVISE commands.

Implement a task-memory stack with explicit condense/prune operations and log pruning reasons.

Capture procedural patterns (SOPs) from completed tasks and add a small experience store retrievable by task type.
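The second and third steps above could start from something as small as the following sketch. All names (`SOP`, `ExperienceStore`, `prune_with_log`), the exact-match lookup by task type, and the "keep the newest N" pruning rule are assumptions chosen for brevity; the paper's experience memory also stores user profiles and semantic facts, which are omitted here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SOP:
    task_type: str
    steps: List[str]
    successes: int = 1   # how many completed tasks followed these steps


class ExperienceStore:
    """Hypothetical store of procedural patterns, keyed by task type."""

    def __init__(self) -> None:
        self._by_type: Dict[str, List[SOP]] = defaultdict(list)

    def record(self, task_type: str, steps: List[str]) -> None:
        # Count a repeat of a known procedure instead of storing a duplicate.
        for sop in self._by_type[task_type]:
            if sop.steps == steps:
                sop.successes += 1
                return
        self._by_type[task_type].append(SOP(task_type, list(steps)))

    def retrieve(self, task_type: str, k: int = 1) -> List[SOP]:
        # Return the k most frequently successful procedures for the type.
        candidates = self._by_type.get(task_type, [])
        return sorted(candidates, key=lambda s: s.successes, reverse=True)[:k]


def prune_with_log(entries: List[str], max_entries: int,
                   log: List[str]) -> List[str]:
    """Keep the newest max_entries items and log why older ones were dropped."""
    cut = max(len(entries) - max_entries, 0)
    for entry in entries[:cut]:
        log.append(f"pruned {entry!r}: exceeded max_entries={max_entries}")
    return entries[cut:]


store = ExperienceStore()
store.record("multi_hop_qa", ["decompose question", "search each hop", "compose answer"])
store.record("multi_hop_qa", ["decompose question", "search each hop", "compose answer"])
store.record("multi_hop_qa", ["search once", "answer"])
best = store.retrieve("multi_hop_qa")[0]

prune_log: List[str] = []
kept = prune_with_log(["obs 1", "obs 2", "obs 3"], max_entries=2, log=prune_log)
```

Keeping the pruning log alongside the memory makes it easier to check later whether a bad answer traces back to context that was dropped.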

Agent Features

Memory

  • task-memory stack (explicit revise ops)
  • structured experience memory (profiles, semantic, procedural)

Planning

  • high-level coordinator planning
  • discrete action space (PLAN/DELEGATE/REVISE)
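A discrete PLAN/DELEGATE/REVISE action space can be sketched as a small dispatch loop. This is an illustrative assumption, not the paper's implementation: the `policy` and `sub_agent` callables stand in for the trained coordinator and the tool-using sub-agents, returning `None` to stop is a convention of this sketch, and REVISE is reduced here to keeping the newest `max_memory` entries.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Action(Enum):
    PLAN = "PLAN"
    DELEGATE = "DELEGATE"
    REVISE = "REVISE"


# Hypothetical interfaces: the policy maps memory to (action, argument) or None.
Policy = Callable[[List[str]], Optional[Tuple[Action, str]]]
SubAgent = Callable[[str], str]


def run_coordinator(policy: Policy, sub_agent: SubAgent,
                    max_steps: int = 8, max_memory: int = 4) -> List[str]:
    memory: List[str] = []
    for _ in range(max_steps):
        decision = policy(memory)
        if decision is None:          # sketch-only stop signal
            break
        action, arg = decision
        if action is Action.PLAN:
            memory.append(f"plan: {arg}")
        elif action is Action.DELEGATE:
            memory.append(f"result: {sub_agent(arg)}")
        elif action is Action.REVISE:
            # Stand-in for update/condense/prune: keep the newest entries.
            memory = memory[max(len(memory) - max_memory, 0):]
    return memory


# A scripted policy replaces the learned coordinator for demonstration.
script = iter([
    (Action.PLAN, "find director"),
    (Action.DELEGATE, "search director"),
    (Action.REVISE, ""),
    None,
])
memory = run_coordinator(lambda mem: next(script),
                         lambda task: task.upper(),
                         max_memory=1)
print(memory)  # ['result: SEARCH DIRECTOR']
```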

Tool Use

  • search and web tools
  • sub-agent tool invocation (ReAct)

Frameworks

  • ReAct
  • RAG-style retrieval
  • GRPO

Is Agentic

true

Architectures

  • centralized hierarchical

Collaboration

  • central coordinator delegating to specialized sub-agents

Optimization Features

Token Efficiency

  • memory condensation to reduce context length

Training Optimization

  • GRPO
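The core idea of GRPO is that each sampled rollout is scored relative to the other rollouts in its group rather than by a learned value function. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the `eps` term are assumptions of this example; the summary does not describe StackPlanner's reward design or any other training detail.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)           # population std; an assumption here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Identical rewards carry no learning signal: all advantages are zero.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy update favors the better samples within each group.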

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Limited support for multi-turn conversational dependencies; current task memory targets single-turn workflows.
  • Cold-start issues for long-term experience memory; initial stored experiences may not generalize to diverse real users.

When Not To Use

  • Applications that require rich multi-turn conversational state across many user turns.
  • Low-resource settings where building a useful experience memory is impractical.

Failure Modes

  • If REVISE is misconfigured, useful context can be pruned and harm downstream reasoning.
  • Experience retrieval mismatch: retrieving irrelevant SOPs can mislead planning.
  • High inference latency: reported 40–300s per sample for complex tasks may be too slow for real-time use.

Core Entities

Models

  • Qwen2.5-3B
  • Qwen2.5-7B

Metrics

  • F1

Datasets

  • 2WikiMultiHopQA
  • MuSiQue
  • GAIA
  • FRAMES

Benchmarks

  • multi-hop QA
  • GAIA
  • FRAMES