StackPlanner: centralized coordinator + active task stack + reusable experience memory for stable long-horizon multi-agent collaboration

January 9, 2026 · 6 min

Overview

Decision Snapshot: Needs Validation

The idea is practical: explicit memory control and experience reuse directly address known failure modes in centralized agents; experiments on public benchmarks support improvements, but real-world readiness needs more multi-turn and cold-start testing.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 7

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Ruizhe Zhang, Xinke Jiang, Zhibang Yang, Zhixin Zhang, Jiaran Gao, Yuzhen Xiao, Hongbin Lai, Xu Chu, Junfeng Zhao, Yasha Wang

Why It Matters For Business

Active memory control and reusable experience reduce error propagation in multi-agent workflows, improving reliability and reuse across tasks so teams get better multi-step outputs with fewer retries.

Summary TLDR

StackPlanner is a centralized, hierarchical multi-agent framework that treats memory as an explicit control target. A central coordinator issues PLAN, DELEGATE, and REVISE actions while sub-agents execute the delegated tasks. Key features: (1) an active task-memory stack with explicit REVISE operations (update, condense, prune) to avoid context bloat; (2) a structured experience memory (user profiles, semantic facts, procedural SOPs) for reusing coordination experience across tasks; (3) a coordinator trained with a token-level RL scheme (GRPO) that interleaves retrieval, reasoning, and memory actions. On multi-hop QA and agentic benchmarks, StackPlanner outperforms baselines (e.g., 32.92% vs. 29.55% F1 for ARPO on 2Wiki with Qwen2.5-3B).
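The coordinator's discrete action space can be pictured with a small sketch. Everything below is illustrative rather than the authors' implementation: the `Coordinator` class, its `step` method, the payload shapes, and the stub `search` sub-agent are assumptions, and the actions are scripted by hand where the paper's coordinator would choose them with a learned policy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Action(Enum):
    PLAN = "plan"          # decompose the task into subtasks
    DELEGATE = "delegate"  # hand the next subtask to a sub-agent
    REVISE = "revise"      # edit the task memory


@dataclass
class Coordinator:
    # Hypothetical registry: sub-agent name -> callable taking a subtask string.
    sub_agents: dict[str, Callable[[str], str]]
    pending: list[tuple[str, str]] = field(default_factory=list)  # (agent, subtask)
    task_memory: list[str] = field(default_factory=list)

    def step(self, action: Action, payload=None) -> None:
        if action is Action.PLAN:
            # payload: list of (agent_name, subtask) pairs
            self.pending.extend(payload)
        elif action is Action.DELEGATE:
            if not self.pending:
                raise ValueError("nothing to delegate; PLAN first")
            agent, subtask = self.pending.pop(0)
            result = self.sub_agents[agent](subtask)
            self.task_memory.append(f"{subtask} -> {result}")
        elif action is Action.REVISE:
            # payload: function mapping the old memory to a cleaned-up memory
            self.task_memory = payload(self.task_memory)


# Scripted episode with a stub sub-agent (assumption: no real search tool).
coord = Coordinator(sub_agents={"search": lambda q: f"stub answer to: {q}"})
coord.step(Action.PLAN, [("search", "who directed Film A"),
                         ("search", "where was the director born")])
coord.step(Action.DELEGATE)
coord.step(Action.DELEGATE)
coord.step(Action.REVISE, lambda memory: memory[-1:])  # keep only the latest entry
print(coord.task_memory)
```

Keeping REVISE as a first-class action, rather than an implicit side effect of delegation, is what makes memory something the coordinator controls directly.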

Problem Statement

Centralized multi-agent coordinators suffer two linked problems: (1) task memory grows noisy and bloated during long, multi-step workflows, causing error accumulation and degraded plans; (2) coordinators lack reusable cross-task experience, so they cold-start on new tasks and fail to generalize coordination strategies.

Main Contribution

A hierarchical centralized architecture that decouples high-level coordination from sub-agent execution.

An active task-memory stack with explicit REVISE actions: Update, Condense (summarize), and Prune.
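A minimal sketch of such a stack follows, assuming a simple keyed-entry design. The class and method names (`TaskMemoryStack`, `update`, `condense`, `prune`) and the string-joining summarizer are hypothetical; in a real system the summarizer would likely be an LLM call, and the paper's data structures may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Entry:
    key: str
    text: str


@dataclass
class TaskMemoryStack:
    entries: list[Entry] = field(default_factory=list)
    prune_log: list[tuple[str, str]] = field(default_factory=list)

    def push(self, key: str, text: str) -> None:
        self.entries.append(Entry(key, text))

    def update(self, key: str, text: str) -> None:
        """Update: overwrite the most recent entry with this key."""
        for entry in reversed(self.entries):
            if entry.key == key:
                entry.text = text
                return
        raise KeyError(key)

    def condense(self, n: int, summarize: Callable[[list[str]], str]) -> None:
        """Condense: replace the top n entries with a single summary entry."""
        if n <= 0:
            return
        top = self.entries[-n:]
        self.entries[-n:] = [Entry("summary", summarize([e.text for e in top]))]

    def prune(self, key: str, reason: str) -> None:
        """Prune: drop every entry with this key and record the reason."""
        self.entries = [e for e in self.entries if e.key != key]
        self.prune_log.append((key, reason))


stack = TaskMemoryStack()
stack.push("hop1", "Film A was directed by B")
stack.push("hop2", "B was born in C (unverified)")
stack.push("noise", "Unrelated search snippet")
stack.prune("noise", "off-topic retrieval")
stack.update("hop2", "B was born in C")
stack.condense(2, lambda texts: "; ".join(texts))  # stand-in for a real summarizer
print([e.text for e in stack.entries])
```

Logging a reason with every prune keeps the operation auditable, which helps when a later step fails because needed context was removed.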

Key Findings

StackPlanner yields higher F1 than prior agentic RL baselines on multi-hop QA.

Numbers: 2Wiki F1 32.92% (Ours, Qwen2.5-3B) vs 29.55% (ARPO); +3.37 pts

Practical Use: Use StackPlanner-style coordination to raise answer quality on multi-step retrieval QA by a few F1 points in similar setups.

Evidence Ref: Table 1; Section 3.2

Memory modules materially improve performance; removing both causes the largest drop.

Numbers: Removing both memories drops 2Wiki F1 by 15.80 pts; other datasets drop 5 to 10 pts

Practical Use: Invest engineering effort in both task-stack control and long-term experience storage to avoid big quality regressions.

Evidence Ref: Section 3.3; Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
F1 | 32.92% | ARPO 29.55% | +3.37 pts | 2Wiki (Qwen2.5-3B) | Table 1; main text | Table 1
F1 | 16.48% | ARPO 13.38% | +3.10 pts | MuSiQue (Qwen2.5-3B) | Table 1; main text | Table 1
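The deltas in these rows are absolute F1 point differences. The short check below recomputes them from the reported values and also derives the relative gain over the ARPO baseline; the relative figures are plain arithmetic on the numbers above, not values reported in the source, and the `deltas` helper is a name introduced here for illustration.

```python
# Reported F1 scores (percent) from the results rows above: (StackPlanner, ARPO).
results = {
    "2Wiki": (32.92, 29.55),
    "MuSiQue": (16.48, 13.38),
}


def deltas(ours: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, (ours, baseline) in results.items():
    absolute, relative = deltas(ours, baseline)
    print(f"{name}: +{absolute:.2f} pts absolute, +{relative:.1f}% relative")
# 2Wiki: +3.37 pts absolute, +11.4% relative
# MuSiQue: +3.10 pts absolute, +23.2% relative
```

The same absolute gain of roughly three points is a much larger relative improvement on MuSiQue, where the baseline score is lower.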

What To Try In 7 Days

Add a central coordinator that issues high-level PLAN/DELEGATE/REVISE commands.

Implement a task-memory stack with explicit condense/prune operations and log pruning reasons.

Capture procedural patterns (SOPs) from completed tasks and add a small experience store retrievable by task type.
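As a starting point for the experience store suggested above, here is a small in-memory sketch keyed by task type. The `SOP` record, the `successes` counter used for ranking, and the `ExperienceStore` API are all assumptions made for illustration; the paper's experience memory also holds user profiles and semantic facts, which this sketch leaves out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SOP:
    task_type: str
    steps: tuple[str, ...]
    successes: int = 1  # hypothetical count of tasks this procedure completed


class ExperienceStore:
    def __init__(self) -> None:
        self._by_type: dict[str, list[SOP]] = defaultdict(list)

    def add(self, sop: SOP) -> None:
        self._by_type[sop.task_type].append(sop)

    def retrieve(self, task_type: str, k: int = 1) -> list[SOP]:
        """Return up to k SOPs for this task type, most successful first."""
        candidates = self._by_type.get(task_type, [])
        return sorted(candidates, key=lambda s: s.successes, reverse=True)[:k]


store = ExperienceStore()
store.add(SOP("multi_hop_qa", ("search entity", "search relation", "answer"), successes=4))
store.add(SOP("multi_hop_qa", ("search question verbatim", "answer"), successes=1))

best = store.retrieve("multi_hop_qa")
print(best[0].steps)
print(store.retrieve("unseen_task_type"))  # empty list: a cold start
```

An empty result for an unseen task type makes the cold-start case explicit, so the coordinator can fall back to planning from scratch.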

Agent Features

Memory: task-memory stack (explicit revise ops); structured experience memory (profiles, semantic, procedural)
Planning: high-level coordinator planning; discrete action space (PLAN/DELEGATE/REVISE)
Tool Use: search and web tools; sub-agent tool invocation (ReAct)
Frameworks: ReAct; RAG-style retrieval; GRPO
Is Agentic: Yes
Architectures: centralized hierarchical
Collaboration: central coordinator delegating to specialized sub-agents

Optimization Features

Token Efficiency: memory condensation to reduce context length
Training Optimization: GRPO
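For readers unfamiliar with GRPO (Group Relative Policy Optimization), its central step is to score each sampled rollout against the other rollouts drawn for the same prompt, instead of against a learned value function. The sketch below shows that standard group-normalized advantage only. The reward values are made up, and how StackPlanner defines rewards or applies the advantage at the token level is not specified here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled for the same question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Rollouts that beat their group's average get a positive advantage and the rest a negative one; when every rollout earns the same reward, all advantages are zero and the group contributes no learning signal.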

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Limited support for multi-turn conversational dependencies; current task memory targets single-turn workflows.

Cold-start issues for long-term experience memory; initial stored experiences may not generalize to diverse real users.

When Not To Use

Applications that require rich multi-turn conversational state across many user turns.

Low-resource settings where building a useful experience memory is impractical.

Failure Modes

If REVISE is misconfigured, useful context can be pruned and harm downstream reasoning.

Experience retrieval mismatch: retrieving irrelevant SOPs can mislead planning.
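One simple guard against the retrieval-mismatch failure is to return nothing when no stored SOP is similar enough to the current task. The sketch below uses Jaccard token overlap as a stand-in similarity measure; the `retrieve_sop` function, the `min_score` threshold of 0.3, and the example SOP descriptions are assumptions, and a production system would more likely use embedding similarity.

```python
from typing import Optional


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_sop(query: str, sops: list[str], min_score: float = 0.3) -> Optional[str]:
    """Return the best-matching SOP description, or None if nothing clears min_score."""
    best = max(sops, key=lambda s: jaccard(query, s), default=None)
    if best is None or jaccard(query, best) < min_score:
        return None
    return best


sops = [
    "multi-hop question answering over wikipedia",
    "book a flight and hotel",
]
print(retrieve_sop("multi-hop question about wikipedia entities", sops))
print(retrieve_sop("translate a poem into french", sops))  # None: no SOP is close enough
```

Returning `None` lets the coordinator plan without prior experience rather than follow an irrelevant procedure.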

Core Entities

Models

Qwen2.5-3B, Qwen2.5-7B

Metrics

F1

Datasets

2WikiMultiHopQA, MuSiQue, GAIA, FRAMES

Benchmarks

multi-hop QA, GAIA, FRAMES