Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.25
Citation Count
3
Why It Matters For Business
If your product needs reliable multi-step decisions, splitting planning into specialized LLM modules reduces incorrect actions and improves transferability; you can also trade off accuracy and cost by using smaller models and caching.
Summary TLDR
The paper introduces MAP, a Modular Agentic Planner that composes multiple specialized LLM-based modules (TaskDecomposer, Actor, Monitor, Predictor, Evaluator, Orchestrator) that interact recurrently to produce and check plans. On four benchmarks (CogEval graph traversal, Tower of Hanoi (ToH), PlanBench, and StrategyQA), MAP solves more problems and produces far fewer invalid actions than single-run LLM prompting and other multi-agent or tree-search baselines. MAP also works with a smaller LLM (Llama3-70B), and caching reduces its cost, but it remains computationally expensive and still fails on some hard cases.
Problem Statement
LLMs are strong at single-step outputs but fail on goal-directed, multi-step planning: they hallucinate invalid actions, get stuck in loops, or lose track of the downstream consequences of earlier actions. The paper asks whether planning improves when its component functions are split into specialized LLM modules that propose actions, predict next states, evaluate outcomes, monitor validity, decompose goals, and orchestrate progress.
Main Contribution
Design of MAP, a modular agentic planner made of specialized LLM modules that interact recurrently to search at the level of states.
Implementation details: each module is an LLM prompt with at most three few-shot examples; tree search runs at the state level (branching factor B=2, depth L=2), and a Monitor filters proposed actions.
Empirical evaluation across four domains: graph traversal (CogEval), Tower of Hanoi, two PlanBench domains, and StrategyQA, showing better % solved, fewer invalid actions, transfer gains, and useful ablations.
Ablations show that the Monitor is critical, that MAP still works with a smaller LLM (Llama3-70B), and that caching cuts API/token cost substantially.
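The interaction described above can be illustrated with a minimal sketch. It is not the paper's implementation: every module here is a hypothetical stub function standing in for an LLM call, the toy graph and goal are invented for illustration, and the TaskDecomposer is omitted (a plain goal check stands in for the Orchestrator). It only shows how an Actor, Monitor, Predictor, and Evaluator could be wired into a depth-limited state-level search with B=2 and L=2.

```python
from typing import Dict, List, Optional, Tuple

# Toy graph (node -> neighbours); an invented stand-in for a traversal task.
GRAPH: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
GOAL = "E"


def actor(state: str, b: int) -> List[str]:
    """Stub Actor: proposes up to b moves, plus one deliberately invalid move."""
    return GRAPH[state][:b] + ["Z"]  # "Z" imitates a hallucinated action


def monitor(state: str, action: str) -> bool:
    """Stub Monitor: accept only moves along an existing edge."""
    return action in GRAPH.get(state, [])


def predictor(state: str, action: str) -> str:
    """Stub Predictor: in this toy domain the next state is the target node."""
    return action


def evaluator(state: str) -> float:
    """Stub Evaluator: negative hop distance to GOAL (higher is better)."""
    frontier, seen, depth = [state], {state}, 0
    while frontier:
        if GOAL in frontier:
            return -float(depth)
        nxt: List[str] = []
        for node in frontier:
            for neighbour in GRAPH[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    nxt.append(neighbour)
        frontier, depth = nxt, depth + 1
    return float("-inf")


def search(state: str, depth: int, b: int) -> Tuple[float, Optional[str]]:
    """Depth-limited lookahead: (best leaf value, first action leading to it)."""
    if depth == 0 or state == GOAL:
        return evaluator(state), None
    best_value, best_action = float("-inf"), None
    for action in actor(state, b):
        if not monitor(state, action):  # rejected actions are never expanded
            continue
        value, _ = search(predictor(state, action), depth - 1, b)
        if value > best_value:
            best_value, best_action = value, action
    if best_action is None:
        return evaluator(state), None
    return best_value, best_action


def plan(start: str, b: int = 2, l: int = 2, max_steps: int = 10) -> List[str]:
    """Repeat search-then-act until the goal check succeeds or steps run out."""
    path, state = [start], start
    while state != GOAL and len(path) <= max_steps:
        _, action = search(state, l, b)
        if action is None:
            break
        state = predictor(state, action)
        path.append(state)
    return path


print(plan("A"))
```

In a real system each stub would be replaced by a prompted model call, and the Evaluator would have to estimate value without access to the ground-truth graph.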
Key Findings
MAP solved the Valuepath graph task on evaluated problems
On 3-disk Tower of Hanoi, MAP solved more problems than baselines
MAP produces far fewer invalid actions (filters hallucinations)
MAP transfers better between planning problems
MAP can be implemented with a smaller LLM and still help
Results
Valuepath % solved (MAP)
Steppath % solved (MAP)
Tower of Hanoi % solved (3-disk)
Tower of Hanoi % invalid actions
PlanBench (subset Logistics) % solved
Accuracy
Transfer n7tree→n15star % solved
Who Should Care
What To Try In 7 Days
Prototype a Monitor module to check and reject invalid actions in an existing LLM pipeline.
Add a simple TaskDecomposer to break a multi-step task into subgoals and compare error rates.
Run a MAP-style Actor+Predictor+Evaluator loop with a smaller model (Llama3-70B) and cache repeated module outputs to estimate cost vs. gain.
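As a starting point for the first item, here is a hedged sketch of a Monitor wrapped around an existing action proposer, using Tower of Hanoi rules as the validity check. The names `monitor`, `propose_with_monitor`, and `scripted_proposer` are hypothetical, the rule-based check stands in for what would be an LLM prompt in MAP, and passing rejection reasons back to the proposer is an assumption about how a retry loop might be organized.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, List[int]]  # peg name -> disks listed bottom to top
Move = Tuple[str, str]        # (from_peg, to_peg)


def monitor(state: State, move: Move) -> Optional[str]:
    """Return None if the move is legal, otherwise a short rejection reason."""
    src, dst = move
    if src not in state or dst not in state:
        return f"unknown peg in move {move}"
    if src == dst:
        return "source and destination pegs are the same"
    if not state[src]:
        return f"peg {src} is empty"
    if state[dst] and state[dst][-1] < state[src][-1]:
        return f"cannot place disk {state[src][-1]} on smaller disk {state[dst][-1]}"
    return None


def propose_with_monitor(
    state: State,
    propose: Callable[[State, List[str]], Move],
    max_tries: int = 3,
) -> Tuple[Optional[Move], int]:
    """Ask the proposer for a move, retrying with feedback when it is rejected.

    Returns (accepted move or None, number of rejected proposals).
    """
    feedback: List[str] = []
    for _ in range(max_tries):
        move = propose(state, feedback)
        reason = monitor(state, move)
        if reason is None:
            return move, len(feedback)
        feedback.append(reason)
    return None, len(feedback)


def scripted_proposer(state: State, feedback: List[str]) -> Move:
    """Hypothetical stand-in for an LLM: its first proposal is invalid."""
    candidates = [("B", "C"), ("A", "B"), ("A", "C")]
    return candidates[min(len(feedback), len(candidates) - 1)]


start: State = {"A": [3, 2, 1], "B": [], "C": []}
accepted, rejected = propose_with_monitor(start, scripted_proposer)
print(accepted, rejected)
```

Counting rejected proposals per task gives a rough invalid-action rate to compare against the unmonitored pipeline.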
Agent Features
Memory
- short-term caching of module outputs
Planning
- task decomposition
- tree search
- action proposal loop
- goal orchestration
Frameworks
- prompting + few-shot in-context learning
Is Agentic
true
Architectures
- modular multi-agent
- recurrent module interaction
- state-level tree search
Collaboration
- multi-agent interaction between specialized modules
Optimization Features
Token Efficiency
- search depth L=2 is recommended; L=3 gave marginal gains at much higher token usage (Table 14)
Infra Optimization
- high API-call throughput and token budget required; caching reduces calls (Table 16)
Model Optimization
- works with the smaller Llama3-70B model
System Optimization
- separate modules to enable selective caching and parallel calls
Inference Optimization
- cache and reuse module outputs to cut cost
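Caching module outputs, as listed above, might look like the following sketch. It assumes module calls are deterministic (for example, temperature 0), so that reusing a stored response does not change behaviour. `ModuleCache`, `fake_model`, and the `cacheable` flag are hypothetical names for illustration, not part of the paper.

```python
import hashlib
from typing import Callable, Dict, List


class ModuleCache:
    """In-memory cache keyed by (module name, prompt)."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(module: str, prompt: str) -> str:
        # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        return hashlib.sha256(f"{module}\x00{prompt}".encode("utf-8")).hexdigest()

    def query(self, module: str, prompt: str, cacheable: bool = True) -> str:
        """Return a cached response if present; otherwise call the model."""
        if not cacheable:  # selective caching: some modules may opt out
            self.misses += 1
            return self._call_model(module, prompt)
        key = self._key(module, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(module, prompt)
        self._store[key] = response
        return response


calls: List[str] = []


def fake_model(module: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; records each invocation."""
    calls.append(module)
    return f"{module} response to: {prompt}"


cache = ModuleCache(fake_model)
cache.query("Predictor", "state S1, action a1")
cache.query("Predictor", "state S1, action a1")  # served from the cache
cache.query("Evaluator", "state S2")
print(cache.hits, cache.misses, len(calls))
```

Because tree search revisits the same state and action pairs, the hit rate observed on a small run is a quick way to estimate how many API calls caching would save.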
Reproducibility
Data Urls
- StrategyQA
- PlanBench
- CogEval tasks
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High computational and token cost compared with single-call prompting (see Table 16).
- Performance still suboptimal on some tasks (Reward Revaluation, large PlanBench problems, ToH 4-disk OOD).
- Relies on prompting/few-shot to specialize modules; may improve with joint fine-tuning of module models.
When Not To Use
- Low-latency or tight-cost real-time systems where many API calls are infeasible.
- Tasks without clear state/action language or where environment feedback is required per step.
- When you cannot add a validation/monitoring step to your pipeline.
Failure Modes
- Incorrect task decomposition (subgoal errors)
- No-progress actions or poor action proposals
- Loops and repeated states due to Actor/Evaluator mistakes
Core Entities
Models
- GPT-4
- Llama3-70B
Metrics
- % solved
- % invalid actions
- Accuracy
- avg plan steps
Datasets
- StrategyQA
- PlanBench (Logistics, Mystery Blocksworld)
- CogEval graph traversal tasks
- Tower of Hanoi (text reformulation)
Benchmarks
- PlanBench
- StrategyQA
- CogEval

