MAP: split planning into specialized LLM modules to get more reliable multi-step plans

September 30, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

MAP shows clear empirical gains on several planning tasks and ablations support module roles, but it is computationally costly and depends on prompt-based specialization rather than fine-tuning.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 50%

Novelty: 60%

Authors

Taylor Webb, Shanka Subhra Mondal, Ida Momennejad

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product needs reliable multi-step decisions, splitting planning into specialized LLM modules reduces incorrect actions and improves transferability; you can also trade off accuracy and cost by using smaller models and caching.

Summary TLDR

The paper introduces MAP, a Modular Agentic Planner that composes multiple specialized LLM-based modules (TaskDecomposer, Actor, Monitor, Predictor, Evaluator, Orchestrator) that interact recurrently to produce and check plans. On several benchmarks, namely graph traversal (CogEval tasks), Tower of Hanoi (ToH), PlanBench, and StrategyQA, MAP improves correctness and dramatically reduces invalid actions versus single-run LLM prompting and other multi-agent or tree-search baselines. MAP can also run on a smaller LLM (Llama3-70B) and benefits from caching to reduce cost, but it remains computationally expensive and still fails on some hard cases.

Problem Statement

LLMs are strong at single-step outputs but fail on goal-directed, multi-step planning: they hallucinate invalid actions, loop, or lose track of multi-step consequences. The paper asks whether planning improves if planning functions are split into specialized LLM modules that propose actions, predict next states, evaluate outcomes, monitor validity, decompose goals, and orchestrate progress.

Main Contribution

Design of MAP, a modular agentic planner made of specialized LLM modules that interact recurrently to search at the level of states.

Implementation details: each module is an LLM prompt plus at most three few-shot examples; planning uses tree search at the state level (B=2 branches, depth L=2), with a Monitor filtering out invalid actions.
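The module loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every module is a deterministic stub standing in for an LLM prompt, the toy graph `EDGES` and the heuristic table `STEPS_TO_GOAL` are invented for the example, B is read as the number of branches per node, and the goal check in `plan` only approximates the Orchestrator's role.

```python
from typing import List, Optional, Tuple

# Hypothetical toy environment: moving along edges of a small directed graph.
EDGES = {"A": ["B", "C"], "B": ["D"], "C": ["G"], "D": ["G"], "G": []}
# Hypothetical heuristic standing in for an LLM Evaluator's judgement.
STEPS_TO_GOAL = {"A": 2, "B": 2, "C": 1, "D": 1, "G": 0}


def actor(state: str, goal: str) -> List[str]:
    """Stub Actor: proposes moves, including one hallucinated node 'Z'."""
    return ["Z"] + EDGES[state]


def monitor(state: str, action: str) -> bool:
    """Stub Monitor: accept only moves along an existing edge."""
    return action in EDGES.get(state, [])


def predictor(state: str, action: str) -> str:
    """Stub Predictor: moving to a node makes that node the next state."""
    return action


def evaluator(state: str, goal: str) -> int:
    """Stub Evaluator: higher is better (fewer estimated steps to the goal)."""
    return -STEPS_TO_GOAL[state]


def search(state: str, goal: str, depth: int, branch: int) -> Tuple[int, Optional[str]]:
    """Return (value, first_action) of the best predicted path up to `depth` steps."""
    if depth == 0 or state == goal:
        return evaluator(state, goal), None
    # Simplification: filter with the Monitor, then keep the first `branch` proposals.
    proposals = [a for a in actor(state, goal) if monitor(state, a)][:branch]
    if not proposals:
        return evaluator(state, goal), None
    best_value, best_action = float("-inf"), None
    for action in proposals:
        value, _ = search(predictor(state, action), goal, depth - 1, branch)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


def plan(start: str, goal: str, max_steps: int = 10, depth: int = 2, branch: int = 2) -> List[str]:
    """Repeatedly search, commit to the best first action, and stop at the goal."""
    state, path = start, []
    for _ in range(max_steps):
        if state == goal:  # stands in for the Orchestrator's goal check
            break
        _, action = search(state, goal, depth, branch)
        if action is None:
            break
        path.append(action)
        state = predictor(state, action)
    return path


print(plan("A", "G"))  # ['C', 'G']
```

In a real prototype each stub would be replaced by a prompted LLM call; the control flow is the part this sketch is meant to show.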

Key Findings

MAP solved the Valuepath graph task on evaluated problems

Numbers: 100% solved (Valuepath, Table 4)

Practical Use: Use MAP's modular loop plus Monitor to eliminate invalid moves on small navigation-style planning tasks.

Evidence Ref: Table 4, Figure 2
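One way to picture the Monitor's role on a navigation-style task is a propose, check, retry loop. The sketch below is an assumption-laden illustration rather than the paper's prompts: `ROOMS` is an invented map, `monitor` is a rule-based stand-in for the LLM Monitor, and `scripted_actor` replays a fixed list of guesses instead of calling a model.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical adjacency list for a small navigation task.
ROOMS = {"room1": ["room2"], "room2": ["room1", "room3"], "room3": ["room2"]}


def monitor(state: str, action: str) -> Tuple[bool, str]:
    """Return (is_valid, feedback). Rule-based stand-in for an LLM Monitor."""
    if action not in ROOMS:
        return False, f"'{action}' is not a known room."
    if action not in ROOMS[state]:
        return False, f"'{action}' is not adjacent to '{state}'."
    return True, ""


def propose_with_monitor(
    state: str,
    actor: Callable[[str, List[str]], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the actor for a move, feeding back Monitor objections until one passes."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        action = actor(state, feedback)
        ok, reason = monitor(state, action)
        if ok:
            return action
        feedback.append(reason)
    return None  # no valid action found within the attempt budget


def scripted_actor(state: str, feedback: List[str]) -> str:
    """Stand-in for an LLM call: one guess per rejection received so far."""
    guesses = ["room9", "room3", "room2"]
    return guesses[min(len(feedback), len(guesses) - 1)]


print(propose_with_monitor("room1", scripted_actor))  # room2
```

The first two guesses are rejected (unknown room, then non-adjacent room) and only the third reaches the caller, so invalid moves never enter the plan.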

On 3-disk Tower of Hanoi, MAP solved more problems than baselines

Numbers: 74% solved vs GPT-4 ICL 46% (Table 8)

Practical Use: A modular agent with tree search and a TaskDecomposer helps LLMs reach multi-step goals more often in constrained puzzles.

Evidence Ref: Table 8, Figure 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Valuepath % solved (MAP) | 100% | GPT-4 ICL 91%, GPT-4 zero-shot 54% | +9 percentage points vs ICL | Valuepath (graph traversal) | Table 4 (Valuepath) | Table 4
Steppath % solved (MAP) | 100% (2- & 3-step), 95% (4-step) | GPT-4 CoT 95%/79%/47% | up to +48 percentage points vs best baseline on 4-step | Steppath (2/3/4-step) | Table 5, Figure 2 | Table 5

What To Try In 7 Days

Prototype a Monitor module to check and reject invalid actions in an existing LLM pipeline.

Add a simple TaskDecomposer to break a multi-step task into subgoals and compare error rates.

Run a MAP-style Actor+Predictor+Evaluator loop with a smaller model (Llama3-70B) and cache repeated module outputs to estimate cost vs. gain.
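For the caching experiment in the last item, a simple memoizing wrapper with hit and miss counters may be enough to estimate how many calls are saved. The sketch below is hypothetical: `CachedModule` and `fake_llm` are names invented here, and it assumes module outputs are deterministic for a given prompt (for example, temperature 0), since otherwise reusing a cached response changes behavior.

```python
import hashlib
from typing import Callable, Dict


class CachedModule:
    """Wrap a module's LLM call and memoize responses per (module name, prompt)."""

    def __init__(self, name: str, call_llm: Callable[[str], str]) -> None:
        self.name = name
        self._call_llm = call_llm
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, prompt: str) -> str:
        # Hash the module name together with the prompt so modules never share entries.
        key = hashlib.sha256(f"{self.name}\n{prompt}".encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        response = self._call_llm(prompt)
        self._cache[key] = response
        return response


def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call; replace with your own client."""
    return prompt.upper()


predictor = CachedModule("Predictor", fake_llm)
for prompt in ["state A, move to B", "state A, move to C", "state A, move to B"]:
    predictor(prompt)

print(predictor.hits, predictor.misses)  # 1 2
```

Comparing `misses` (calls actually paid for) against `hits + misses` (calls requested) gives a rough first estimate of the savings on a given task.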

Agent Features

Memory: short-term caching of module outputs
Planning: task decomposition, tree search, action proposal loop, goal orchestration
Frameworks: prompting + few-shot in-context learning
Is Agentic: Yes
Architectures: modular multi-agent, recurrent module interaction, state-level tree search
Collaboration: multi-agent interaction between specialized modules

Optimization Features

Token Efficiency: L=2 depth recommended; L=3 had marginal gains with much higher tokens (Table 14)
Infra Optimization: high API-call throughput and token budget required; caching reduces calls (Table 16)
Model Optimization: works with smaller Llama3-70B model
System Optimization: separate modules to enable selective caching and parallel calls
Inference Optimization: cache and reuse module outputs to cut cost
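A back-of-the-envelope count helps explain why deeper search costs so much more. The sketch assumes a full tree in which every node expands exactly B children and counts only predicted states; it ignores Monitor retries, early stopping, and caching, so it is an illustration of the growth rate, not a reproduction of the paper's token figures.

```python
def predicted_states(branch: int, depth: int) -> int:
    """States predicted in one full search: B + B^2 + ... + B^L."""
    return sum(branch ** d for d in range(1, depth + 1))


def leaf_states(branch: int, depth: int) -> int:
    """States at the deepest layer, which is where evaluation happens in this sketch."""
    return branch ** depth


for depth in (2, 3):
    print(depth, predicted_states(2, depth), leaf_states(2, depth))
# 2 6 4
# 3 14 8
```

Under these assumptions, going from L=2 to L=3 with B=2 more than doubles the states predicted per search (6 to 14), and that search is repeated at every planning step.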

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

StrategyQA, PlanBench, CogEval tasks

Risks & Boundaries

Limitations

High computational and token cost compared with single-call prompting (see Table 16).

Performance still suboptimal on some tasks (Reward Revaluation, large PlanBench problems, ToH 4-disk OOD).

When Not To Use

Low-latency or tight-cost real-time systems where many API calls are infeasible.

Tasks without clear state/action language or where environment feedback is required per step.

Failure Modes

Incorrect task decomposition (subgoal errors)

No-progress actions or poor action proposals

Core Entities

Models

GPT-4, Llama3-70B

Metrics

% solved, % invalid actions, accuracy, avg plan steps

Datasets

StrategyQA, PlanBench (Logistics, Mystery Blocksworld), CogEval graph traversal tasks, Tower of Hanoi (text reformulation)

Benchmarks

PlanBench, StrategyQA, CogEval