MAP: split planning into specialized LLM modules to get more reliable multi-step plans

Overview

Decision Snapshot: Ready For Pilot

MAP shows clear empirical gains on several planning tasks and ablations support module roles, but it is computationally costly and depends on prompt-based specialization rather than fine-tuning.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 50%

Novelty: 60%

Authors

Taylor Webb, Shanka Subhra Mondal, Ida Momennejad

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product needs reliable multi-step decisions, splitting planning into specialized LLM modules reduces incorrect actions and improves transferability; you can also trade off accuracy and cost by using smaller models and caching.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces MAP, a Modular Agentic Planner that composes multiple small LLM-based modules (TaskDecomposer, Actor, Monitor, Predictor, Evaluator, Orchestrator) that interact recurrently to produce and check plans. On several benchmarks—graph traversal (CogEval tasks), Tower of Hanoi (ToH), PlanBench, and StrategyQA—MAP improves correctness and dramatically reduces invalid actions versus single-run LLM prompting and other multi-agent/tree search baselines. MAP can run with a smaller LLM (Llama3-70B) and benefits from caching to reduce cost, but it is computationally expensive and still fails on some hard cases.

Problem Statement

LLMs are strong at single-step outputs but fail on goal-directed, multi-step planning: they hallucinate invalid actions, loop, or lose track of multi-step consequences. The paper asks whether planning improves if planning functions are split into specialized LLM modules that propose actions, predict next states, evaluate outcomes, monitor validity, decompose goals, and orchestrate progress.

Main Contribution

Design of MAP, a modular agentic planner made of specialized LLM modules that interact recurrently to search at the level of states.

Implementation details: each module is an LLM prompt with at most three few-shot examples; planning uses tree search at the state level (branching factor B=2, depth L=2), with a Monitor that filters out invalid actions.
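The interaction between modules can be sketched as a small state-level search. This is a minimal illustration, not the authors' implementation: the toy graph, the function names, and the rule-based stubs are all assumptions standing in for the prompted LLM modules, and the TaskDecomposer and Orchestrator are left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy environment: a directed graph whose nodes are the states.
GRAPH: Dict[str, List[str]] = {
    "A": ["B", "C"], "B": ["D"], "C": ["G"], "D": [], "G": [],
}

def actor(state: str) -> List[str]:
    """Stub Actor: proposes moves, deliberately including an invalid one ("Z")."""
    return ["Z"] + GRAPH[state]

def monitor(state: str, action: str) -> bool:
    """Stub Monitor: accept only moves along an existing edge."""
    return action in GRAPH[state]

def predictor(state: str, action: str) -> str:
    """Stub Predictor: moving to a node makes that node the next state."""
    return action

def evaluator(state: str, goal: str) -> float:
    """Stub Evaluator: 1.0 at the goal, 0.0 elsewhere."""
    return 1.0 if state == goal else 0.0

def search(state: str, goal: str, depth: int, branches: int) -> Tuple[float, Optional[str]]:
    """Return (best value found, first action that leads to it)."""
    if depth == 0 or state == goal:
        return evaluator(state, goal), None
    # Filter proposals through the Monitor, then keep at most `branches` of them.
    valid = [a for a in actor(state) if monitor(state, a)][:branches]
    best_value, best_action = evaluator(state, goal), None
    for action in valid:
        value, _ = search(predictor(state, action), goal, depth - 1, branches)
        if best_action is None or value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

def plan(start: str, goal: str, depth: int = 2, branches: int = 2,
         max_steps: int = 10) -> List[str]:
    """Repeatedly search from the current state and commit to the best first action."""
    state, actions = start, []
    while state != goal and len(actions) < max_steps:
        _, action = search(state, goal, depth, branches)
        if action is None:  # no valid move from this state
            break
        actions.append(action)
        state = predictor(state, action)
    return actions

print(plan("A", "G"))  # ['C', 'G']
```

In a real setup each stub would be a prompted model call, so every expanded state costs several calls; that is where the computational overhead noted elsewhere in this summary comes from.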

Key Findings

MAP solved the Valuepath graph task on evaluated problems

Numbers: 100% solved (Valuepath, Table 4)

Practical Use: Use MAP's modular loop plus Monitor to eliminate invalid moves on small navigation-style planning tasks.

Evidence Ref: Table 4, Figure 2

On 3-disk Tower of Hanoi, MAP solved more problems than baselines

Numbers: 74% solved vs GPT-4 ICL 46% (Table 8)

Practical Use: A modular agent with tree search and a TaskDecomposer helps LLMs reach multi-step goals more often in constrained puzzles.

Evidence Ref: Table 8, Figure 3
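To make "invalid action" concrete for Tower of Hanoi, the sketch below checks moves against the puzzle rules. It is a rule-based stand-in for illustration only: in MAP the Monitor is a prompted LLM, and the state encoding and function names here are assumptions, not the paper's text reformulation.

```python
from typing import Dict, List

# Assumed encoding: peg name -> disks from bottom to top; larger number = larger disk.
State = Dict[str, List[int]]

def is_valid_move(state: State, src: str, dst: str) -> bool:
    """A move is valid if it takes the top disk of a non-empty peg
    and places it on an empty peg or on a larger disk."""
    if src not in state or dst not in state or src == dst:
        return False
    if not state[src]:
        return False
    disk = state[src][-1]
    return not state[dst] or state[dst][-1] > disk

def apply_move(state: State, src: str, dst: str) -> State:
    """Return the next state without mutating the input; reject invalid moves."""
    if not is_valid_move(state, src, dst):
        raise ValueError(f"invalid move {src} -> {dst}")
    new_state = {peg: list(disks) for peg, disks in state.items()}
    new_state[dst].append(new_state[src].pop())
    return new_state

start: State = {"A": [3, 2, 1], "B": [], "C": []}
after = apply_move(start, "A", "B")       # move the smallest disk to peg B
print(is_valid_move(after, "A", "B"))     # False: disk 2 cannot sit on disk 1
```

A deterministic checker like this can also serve as ground truth when measuring how often an LLM-based Monitor lets an invalid move through.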

Results

Metric	Value	Baseline	Delta (percentage points)	Split / Dataset	Evidence	Evidence Ref
Valuepath % solved (MAP)	100%	GPT-4 ICL 91%, GPT-4 zero-shot 54%	+9 pts vs ICL	Valuepath (graph traversal)	Table 4 (Valuepath)	Table 4
Steppath % solved (MAP)	100% (2- & 3-step), 95% (4-step)	GPT-4 CoT 95%/79%/47%	up to +48 pts vs best baseline on 4-step	Steppath (2/3/4-step)	Table 5, Figure 2	Table 5
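The deltas in this table are differences between two solve rates, so they are percentage points rather than relative improvements. The short check below recomputes them from the values quoted in the table; the dictionary layout and function name are illustrative assumptions.

```python
# Solve rates (%) copied from the table above: (MAP, listed baseline).
solve_rates = {
    "Valuepath vs GPT-4 ICL": (100, 91),
    "Steppath 2-step vs GPT-4 CoT": (100, 95),
    "Steppath 3-step vs GPT-4 CoT": (100, 79),
    "Steppath 4-step vs GPT-4 CoT": (95, 47),
}

def delta_points(map_pct: int, baseline_pct: int) -> int:
    """Absolute difference in percentage points (not a relative gain)."""
    return map_pct - baseline_pct

deltas = {name: delta_points(*pair) for name, pair in solve_rates.items()}
for name, delta in deltas.items():
    print(f"{name}: +{delta} pts")
```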

What To Try In 7 Days

Prototype a Monitor module to check and reject invalid actions in an existing LLM pipeline.

Add a simple TaskDecomposer to break a multi-step task into subgoals and compare error rates.

Run MAP-style Actor+Predictor+Evaluator loop with a small model (Llama3-70B) and cache repeated module outputs to estimate cost vs. gain.
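For the caching experiment in the last item, a thin wrapper around each module is enough to measure how many backend calls repeated prompts would save. This is a sketch under assumptions: `fake_llm` is a hypothetical placeholder for a real model call, and exact-match caching presumes that a repeated prompt should return the same output (most defensible with deterministic decoding).

```python
from typing import Callable, Dict

def fake_llm(module: str, prompt: str) -> str:
    """Hypothetical placeholder for a real model call."""
    return f"[{module}] response to: {prompt}"

class CachedModule:
    """Wraps one module and reuses outputs for prompts it has already seen."""

    def __init__(self, name: str, backend: Callable[[str, str], str]) -> None:
        self.name = name
        self.backend = backend
        self.cache: Dict[str, str] = {}
        self.requests = 0        # total times the module was invoked
        self.backend_calls = 0   # times the backend actually had to run

    def __call__(self, prompt: str) -> str:
        self.requests += 1
        if prompt not in self.cache:
            self.backend_calls += 1
            self.cache[prompt] = self.backend(self.name, prompt)
        return self.cache[prompt]

predictor = CachedModule("Predictor", fake_llm)
for prompt in ["s1 + move a", "s1 + move b", "s1 + move a", "s1 + move a"]:
    predictor(prompt)

saved = predictor.requests - predictor.backend_calls
print(f"{predictor.backend_calls} backend calls for {predictor.requests} requests ({saved} saved)")
```

Logging `requests` and `backend_calls` per module shows which modules repeat themselves most and are therefore the best caching targets.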

Agent Features

Memory

short-term caching of module outputs

Planning

task decomposition, tree search, action proposal loop, goal orchestration

Frameworks

prompting + few-shot in-context learning

Is Agentic

Yes

Architectures

modular multi-agent, recurrent module interaction, state-level tree search

Collaboration

multi-agent interaction between specialized modules

Optimization Features

Token Efficiency

L=2 depth recommended; L=3 had marginal gains with much higher tokens (Table 14)

Infra Optimization

high API-call throughput and token budget required; caching reduces calls (Table 16)

Model Optimization

works with smaller Llama3-70B model

System Optimization

separate modules to enable selective caching and parallel calls

Inference Optimization

cache and reuse module outputs to cut cost
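The depth trade-off noted under Token Efficiency can be approximated with a back-of-envelope count of expanded states. This is a rough sketch, not a reproduction of the measured token counts in Table 14: it assumes a full tree with no pruning or caching and treats the cost per expanded state as constant.

```python
def nodes_expanded(branching: int, depth: int) -> int:
    """States expanded below the root in a full tree: B + B^2 + ... + B^L."""
    return sum(branching ** level for level in range(1, depth + 1))

def relative_cost(branching: int, shallow: int, deep: int) -> float:
    """How many times more states the deeper search expands."""
    return nodes_expanded(branching, deep) / nodes_expanded(branching, shallow)

for depth in (2, 3):
    print(f"B=2, L={depth}: {nodes_expanded(2, depth)} states per search")
print(f"L=3 expands about {relative_cost(2, 2, 3):.2f}x as many states as L=2")
```

Under these assumptions, going from L=2 to L=3 with B=2 raises the count from 6 to 14 states per search, which is consistent in direction with the higher token use reported for L=3.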

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/MAPLLM/MAPICLR2025sub

Data URLs

StrategyQA, PlanBench, CogEval tasks

Risks & Boundaries

Limitations

High computational and token cost compared with single-call prompting (see Table 16).

Performance still suboptimal on some tasks (Reward Revaluation, large PlanBench problems, ToH 4-disk OOD).

When Not To Use

Low-latency or tight-cost real-time systems where many API calls are infeasible.

Tasks without clear state/action language or where environment feedback is required per step.

Failure Modes

Incorrect task decomposition (subgoal errors)

No-progress actions or poor action proposals

Core Entities

Models

GPT-4, Llama3-70B

Metrics

% solved, % invalid actions, accuracy, avg plan steps

Datasets

StrategyQA, PlanBench (Logistics, Mystery Blocksworld), CogEval graph traversal tasks, Tower of Hanoi (text reformulation)

Benchmarks

PlanBench, StrategyQA, CogEval
