StackPlanner: centralized coordinator + active task stack + reusable experience memory for stable long-horizon multi-agent collaboration

Overview

Decision Snapshot: Needs Validation

The idea is practical: explicit memory control and experience reuse directly address known failure modes in centralized agents; experiments on public benchmarks support improvements, but real-world readiness needs more multi-turn and cold-start testing.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 7

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Ruizhe Zhang, Xinke Jiang, Zhibang Yang, Zhixin Zhang, Jiaran Gao, Yuzhen Xiao, Hongbin Lai, Xu Chu, Junfeng Zhao, Yasha Wang

Links

Abstract / PDF

Why It Matters For Business

Active memory control and reusable experience reduce error propagation in multi-agent workflows, improving reliability and reuse across tasks so teams get better multi-step outputs with fewer retries.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

StackPlanner is a centralized, hierarchical multi-agent framework that treats memory as an explicit control target. A central coordinator issues PLAN/DELEGATE/REVISE actions while sub-agents execute tasks. Key features: (1) an active task-memory stack with explicit update/condense/prune (REVISE) operations to avoid context bloat; (2) a structured experience memory (user profiles, semantic facts, procedural SOPs) to reuse coordination experience; (3) coordinator trained with a token-level RL scheme (GRPO) that interleaves retrieval, reasoning, and memory actions. On multi-hop QA and agentic benchmarks, StackPlanner outperforms baselines (e.g., 32.92% vs 29.55% F1 on 2Wiki with Qwen2.5-3B) and

Problem Statement

Centralized multi-agent coordinators suffer two linked problems: (1) task memory grows noisy and bloated during long, multi-step workflows, causing error accumulation and degraded plans; (2) coordinators lack reusable cross-task experience, so they cold-start on new tasks and fail to generalize coordination strategies.

Main Contribution

A hierarchical centralized architecture that decouples high-level coordination from sub-agent execution.

An active task-memory stack with explicit REVISE actions: Update, Condense (summarize), and Prune.
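The three REVISE operations can be illustrated with a small in-memory stack. This is a minimal sketch, not the authors' implementation: the class names, the `pinned` flag, and the caller-supplied `summarize` function are assumptions made for illustration, and a real system would most likely use an LLM call for condensation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryEntry:
    """One item on the task stack (field names are hypothetical)."""
    task_id: str
    content: str
    pinned: bool = False  # assumption: pinned entries are never pruned


@dataclass
class TaskMemoryStack:
    entries: List[MemoryEntry] = field(default_factory=list)

    def push(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def update(self, task_id: str, content: str) -> bool:
        """REVISE/Update: overwrite the most recent entry for task_id."""
        for entry in reversed(self.entries):
            if entry.task_id == task_id:
                entry.content = content
                return True
        return False

    def condense(self, n: int, summarize: Callable[[List[str]], str]) -> None:
        """REVISE/Condense: replace the top n entries with one summary entry."""
        if n < 2 or n > len(self.entries):
            return
        top = self.entries[-n:]
        summary = summarize([e.content for e in top])
        keep_pinned = any(e.pinned for e in top)
        self.entries[-n:] = [MemoryEntry("summary", summary, pinned=keep_pinned)]

    def prune(self, should_drop: Callable[[MemoryEntry], bool]) -> List[MemoryEntry]:
        """REVISE/Prune: drop unpinned entries matching the predicate.

        Returns the dropped entries so the caller can log why they were removed.
        """
        dropped = [e for e in self.entries if not e.pinned and should_drop(e)]
        self.entries = [e for e in self.entries if e.pinned or not should_drop(e)]
        return dropped

    def context(self) -> str:
        """Render the stack as the text the coordinator would read."""
        return "\n".join(f"[{e.task_id}] {e.content}" for e in self.entries)
```

Returning the dropped entries from `prune` makes it easy to audit what was removed, which matters because an over-aggressive prune can discard context that later steps need.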

Key Findings

StackPlanner yields higher F1 than prior agentic RL baselines on multi-hop QA.

Numbers: 2Wiki F1 32.92% (Ours, Qwen2.5-3B) vs 29.55% (ARPO); +3.37 pts

Practical Use: Use StackPlanner-style coordination to raise answer quality on multi-step retrieval QA by a few F1 points in similar setups.

Evidence Ref: Table 1; Section 3.2

Memory modules materially improve performance; removing both causes the largest drop.

Numbers: Removing both memories drops 2Wiki F1 by 15.80 pts; other datasets drop 5–10 pts

Practical Use: Invest engineering effort in both task-stack control and long-term experience storage to avoid big quality regressions.

Evidence Ref: Section 3.3; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
F1	32.92%	ARPO 29.55%	+3.37 pts	2Wiki (Qwen2.5-3B)	Table 1; main text	Table 1
F1	16.48%	ARPO 13.38%	+3.10 pts	MuSiQue (Qwen2.5-3B)	Table 1; main text	Table 1
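
The Delta column reports absolute percentage-point differences, not relative gains. A short check of the arithmetic, using only the two rows above (the helper name `deltas` is made up for this example):

```python
from typing import Tuple

# (dataset, StackPlanner F1, ARPO F1), copied from the table rows above
results = [
    ("2Wiki", 32.92, 29.55),
    ("MuSiQue", 16.48, 13.38),
]


def deltas(ours: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta in points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, ours, baseline in results:
    absolute, relative = deltas(ours, baseline)
    print(f"{name}: +{absolute:.2f} pts ({relative:.1f}% relative)")
```

The relative gain is larger on MuSiQue because its baseline score is lower, even though the absolute deltas are similar.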

What To Try In 7 Days

Add a central coordinator that issues high-level PLAN/DELEGATE/REVISE commands.

Implement a task-memory stack with explicit condense/prune operations and log pruning reasons.

Capture procedural patterns (SOPs) from completed tasks and add a small experience store retrievable by task type.
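
The three steps above can be prototyped together before any training is involved. The sketch below is an assumption-heavy starting point, not the paper's system: `policy` stands in for the trained coordinator, `sub_agent` for a tool-using worker, and `ExperienceStore`, `run_coordinator`, and `max_memory` are hypothetical names.

```python
from collections import defaultdict
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class Action(Enum):
    PLAN = "plan"
    DELEGATE = "delegate"
    REVISE = "revise"


class ExperienceStore:
    """Small SOP store keyed by task type (hypothetical schema)."""

    def __init__(self) -> None:
        self._sops: Dict[str, List[str]] = defaultdict(list)

    def add(self, task_type: str, sop: str) -> None:
        self._sops[task_type].append(sop)

    def retrieve(self, task_type: str, k: int = 3) -> List[str]:
        # .get avoids creating empty entries for unseen task types
        return self._sops.get(task_type, [])[-k:]


Step = Optional[Tuple[Action, str]]


def run_coordinator(
    question: str,
    policy: Callable[[List[str]], Step],
    sub_agent: Callable[[str], str],
    max_steps: int = 8,
    max_memory: int = 4,
) -> Tuple[List[str], List[str]]:
    """Run PLAN/DELEGATE/REVISE steps until the policy returns None."""
    memory: List[str] = [f"goal: {question}"]
    prune_log: List[str] = []
    for _ in range(max_steps):
        step = policy(memory)
        if step is None:  # the policy signals that it is done
            break
        action, arg = step
        if action is Action.PLAN:
            memory.append(f"plan: {arg}")
        elif action is Action.DELEGATE:
            memory.append(f"result: {sub_agent(arg)}")
        elif action is Action.REVISE:
            # Keep the goal, drop the oldest other entries, and log the reason.
            while len(memory) > max_memory:
                removed = memory.pop(1)
                prune_log.append(f"pruned '{removed}' ({arg})")
    return memory, prune_log
```

A scripted policy (for example, an iterator of fixed actions) is enough to exercise the loop and inspect the prune log before swapping in an LLM-backed coordinator.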

Agent Features

Memory

task-memory stack (explicit revise ops)

structured experience memory (profiles, semantic, procedural)

Planning

high-level coordinator planning

discrete action space (PLAN/DELEGATE/REVISE)

Tool Use

search and web tools

sub-agent tool invocation (ReAct)

Frameworks

ReAct

RAG-style retrieval

GRPO

Is Agentic

Yes

Architectures

centralized hierarchical

Collaboration

central coordinator delegating to specialized sub-agents

Optimization Features

Token Efficiency

memory condensation to reduce context length

Training Optimization

GRPO
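
GRPO is generally described as scoring each sampled rollout relative to the other rollouts in its group rather than against a learned value function. The sketch below shows only that group-normalization step. The reward definition, the token-level handling, and the use of population standard deviation here are assumptions; this summary does not specify them.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward by the group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.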

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Limited support for multi-turn conversational dependencies; current task memory targets single-turn workflows.

Cold-start issues for long-term experience memory; initial stored experiences may not generalize to diverse real users.

When Not To Use

Applications that require rich multi-turn conversational state across many user turns.

Low-resource settings where building a useful experience memory is impractical.

Failure Modes

If REVISE is misconfigured, useful context can be pruned and harm downstream reasoning.

Experience retrieval mismatch: retrieving irrelevant SOPs can mislead planning.
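
One way to limit the retrieval-mismatch failure is to return no SOP at all when nothing in the store is similar enough to the current task. The sketch below uses token-level Jaccard overlap as a stand-in for whatever similarity measure a real system would use (most likely embeddings). The function names and the `min_score` threshold are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_sop(
    query: str,
    sops: List[Tuple[str, str]],
    min_score: float = 0.3,
) -> Optional[str]:
    """Return the best-matching procedure, or None if nothing is close enough.

    sops holds (task description, procedure) pairs.
    """
    best_score, best_proc = 0.0, None
    for description, procedure in sops:
        score = jaccard(query, description)
        if score > best_score:
            best_score, best_proc = score, procedure
    return best_proc if best_score >= min_score else None
```

Falling back to `None` lets the coordinator plan from scratch instead of following an unrelated procedure.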

Core Entities

Models

Qwen2.5-3B, Qwen2.5-7B

Metrics

Datasets

2WikiMultiHopQA, MuSiQue, GAIA, FRAMES

Benchmarks

multi-hop QA, GAIA, FRAMES

