Overview
The approach is practical: explicit memory control and experience reuse directly address known failure modes of centralized agents. Experiments on public benchmarks show improvements, but real-world readiness requires more testing in multi-turn and cold-start settings.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 7
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Active memory control and reusable experience reduce error propagation in multi-agent workflows. Teams get more reliable multi-step outputs with fewer retries, and coordination strategies carry over between tasks.
Who Should Care
Summary TLDR
StackPlanner is a centralized, hierarchical multi-agent framework that treats memory as an explicit control target. A central coordinator issues PLAN/DELEGATE/REVISE actions while sub-agents execute tasks. Key features: (1) an active task-memory stack with explicit update/condense/prune (REVISE) operations to avoid context bloat; (2) a structured experience memory (user profiles, semantic facts, procedural SOPs) to reuse coordination experience; (3) a coordinator trained with a token-level RL scheme (GRPO) that interleaves retrieval, reasoning, and memory actions. On multi-hop QA and agentic benchmarks, StackPlanner outperforms baselines (e.g., 32.92% vs 29.55% F1 on 2Wiki with Qwen2.5-3B) and
Problem Statement
Centralized multi-agent coordinators suffer two linked problems: (1) task memory grows noisy and bloated during long, multi-step workflows, causing error accumulation and degraded plans; (2) coordinators lack reusable cross-task experience, so they cold-start on new tasks and fail to generalize coordination strategies.
Main Contribution
A hierarchical centralized architecture that decouples high-level coordination from sub-agent execution.
An active task-memory stack with explicit REVISE actions: Update, Condense (summarize), and Prune.
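The three REVISE operations can be pictured with a minimal sketch. This is not the paper's implementation: the class `TaskMemoryStack`, its method names, and the caller-supplied `summarize` and `keep` functions are all hypothetical, chosen only to show how Update, Condense, and Prune differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryEntry:
    step: int       # stable identifier assigned when the entry is pushed
    content: str


@dataclass
class TaskMemoryStack:
    """Hypothetical task-memory stack (illustrative names, not the paper's API)."""
    entries: List[MemoryEntry] = field(default_factory=list)
    _next_step: int = 0

    def push(self, content: str) -> None:
        self.entries.append(MemoryEntry(self._next_step, content))
        self._next_step += 1

    def update(self, step: int, content: str) -> bool:
        """Update: overwrite one entry in place; returns False if not found."""
        for entry in self.entries:
            if entry.step == step:
                entry.content = content
                return True
        return False

    def condense(self, last_n: int, summarize: Callable[[List[str]], str]) -> None:
        """Condense: replace the top `last_n` entries with a single summary."""
        if last_n <= 0 or not self.entries:
            return
        tail = self.entries[-last_n:]
        del self.entries[-last_n:]
        self.push(summarize([e.content for e in tail]))

    def prune(self, keep: Callable[[MemoryEntry], bool]) -> List[MemoryEntry]:
        """Prune: drop entries rejected by `keep`; return them for logging."""
        removed = [e for e in self.entries if not keep(e)]
        self.entries = [e for e in self.entries if keep(e)]
        return removed
```

Returning the pruned entries rather than discarding them silently makes it possible to audit what a REVISE action removed.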
Key Findings
StackPlanner yields higher F1 than prior agentic RL baselines on multi-hop QA.
Memory modules materially improve performance; removing both causes the largest drop.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| F1 | 32.92% | ARPO 29.55% | +3.37 pts | 2Wiki (Qwen2.5-3B) | Table 1; main text | Table 1 |
| F1 | 16.48% | ARPO 13.38% | +3.10 pts | MuSiQue (Qwen2.5-3B) | Table 1; main text | Table 1 |
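The deltas in the table are absolute percentage points. As a quick check, the snippet below recomputes them from the reported F1 values and also derives the relative gain over ARPO, which the table does not report. The helper name `deltas` is an illustrative choice.

```python
# F1 values copied from the table above: (dataset, StackPlanner, ARPO baseline)
rows = [("2Wiki", 32.92, 29.55), ("MuSiQue", 16.48, 13.38)]


def deltas(ours: float, baseline: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    abs_pts = round(ours - baseline, 2)
    rel_pct = round(100 * (ours - baseline) / baseline, 1)
    return abs_pts, rel_pct


for name, ours, base in rows:
    abs_pts, rel_pct = deltas(ours, base)
    print(f"{name}: +{abs_pts:.2f} pts absolute, +{rel_pct:.1f}% relative to ARPO")
```

The absolute gains are similar on both datasets, but the relative gain is larger on MuSiQue because its baseline F1 is lower.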
What To Try In 7 Days
Add a central coordinator that issues high-level PLAN/DELEGATE/REVISE commands.
Implement a task-memory stack with explicit condense/prune operations and log pruning reasons.
Capture procedural patterns (SOPs) from completed tasks and add a small experience store retrievable by task type.
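A first prototype of these three steps might look like the sketch below. It makes several assumptions: the actions are scripted rather than produced by a trained coordinator, a fixed oldest-first rule stands in for a learned pruning policy, and `ExperienceStore`, `run_coordinator`, and `max_memory` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (kind, payload), kind in {"PLAN", "DELEGATE", "REVISE"}


class ExperienceStore:
    """Hypothetical SOP store keyed by task type."""

    def __init__(self) -> None:
        self._sops: Dict[str, List[str]] = defaultdict(list)

    def add(self, task_type: str, sop: str) -> None:
        self._sops[task_type].append(sop)

    def retrieve(self, task_type: str, k: int = 3) -> List[str]:
        """Return up to k most recently stored SOPs for this task type."""
        if k <= 0:
            return []
        return self._sops.get(task_type, [])[-k:]


def run_coordinator(
    actions: List[Action],
    sub_agent: Callable[[str], str],
    max_memory: int = 3,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Apply scripted actions; return (task memory, pruning log)."""
    memory: List[str] = []
    prune_log: List[Tuple[str, str]] = []  # (dropped entry, reason)
    for kind, payload in actions:
        if kind == "PLAN":
            memory.append(f"plan: {payload}")
        elif kind == "DELEGATE":
            memory.append(f"result: {sub_agent(payload)}")
        elif kind == "REVISE":
            # Placeholder policy: drop oldest entries until under budget,
            # recording the REVISE payload as the pruning reason.
            while len(memory) > max_memory:
                prune_log.append((memory.pop(0), payload))
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return memory, prune_log
```

Keeping the pruning log separate from the task memory means the reasons stay inspectable without adding to the context the coordinator reads.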
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Training Optimization
Reproducibility
Risks & Boundaries
Limitations
Limited support for multi-turn conversational dependencies; current task memory targets single-turn workflows.
Cold-start issues for long-term experience memory; initial stored experiences may not generalize to diverse real users.
When Not To Use
Applications that require rich multi-turn conversational state across many user turns.
Low-resource settings where building a useful experience memory is impractical.
Failure Modes
If REVISE is misconfigured, useful context can be pruned and harm downstream reasoning.
Experience retrieval mismatch: retrieving irrelevant SOPs can mislead planning.
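One possible guard against the retrieval-mismatch failure mode is to return no SOP at all when the best match is weak. The sketch below uses token-level Jaccard overlap as a stand-in similarity measure. The source does not specify how StackPlanner scores retrieval, so `jaccard`, `retrieve_sop`, and the `min_score` threshold are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_sop(
    query: str,
    sops: List[Tuple[str, str]],  # (task description, SOP text)
    min_score: float = 0.3,       # assumed threshold; tune on real data
) -> Optional[str]:
    """Return the best-matching SOP, or None if nothing clears min_score."""
    best_score, best_sop = 0.0, None
    for description, sop in sops:
        score = jaccard(query, description)
        if score > best_score:
            best_score, best_sop = score, sop
    return best_sop if best_score >= min_score else None
```

Returning `None` lets the coordinator plan from scratch instead of following an unrelated procedure.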

