MAP: split planning into specialized LLM modules to get more reliable multi-step plans

September 30, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.25

Citation Count

3

Authors

Taylor Webb, Shanka Subhra Mondal, Ida Momennejad

Links

Abstract / PDF

Why It Matters For Business

If your product needs reliable multi-step decisions, splitting planning into specialized LLM modules reduces incorrect actions and improves transferability; you can also trade off accuracy and cost by using smaller models and caching.

Summary TLDR

The paper introduces MAP, a Modular Agentic Planner that composes multiple specialized LLM-based modules (TaskDecomposer, Actor, Monitor, Predictor, Evaluator, Orchestrator) that interact recurrently to produce and check plans. On four benchmarks, namely CogEval graph traversal, Tower of Hanoi (ToH), PlanBench, and StrategyQA, MAP improves correctness and sharply reduces invalid actions versus single-run LLM prompting and other multi-agent and tree-search baselines. MAP can run with a smaller LLM (Llama3-70B) and benefits from caching to reduce cost, but it is computationally expensive and still fails on some hard cases.

Problem Statement

LLMs are strong at single-step outputs but fail on goal-directed, multi-step planning: they hallucinate invalid actions, loop, or lose track of multi-step consequences. The paper asks whether planning improves if planning functions are split into specialized LLM modules that propose actions, predict next states, evaluate outcomes, monitor validity, decompose goals, and orchestrate progress.
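The propose-then-check idea behind the Actor and Monitor roles can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy graph, the function names, the rule-based `monitor`, and the deterministic `flaky_actor` that stands in for an LLM call. Feeding rejected actions back to the Actor is also an assumption of this sketch, not a detail taken from the summary above.

```python
from typing import Callable, List, Optional

# Hypothetical toy environment: a small directed graph of rooms.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def monitor(state: str, action: str) -> bool:
    """Accept an action only if it is a legal edge out of the current state."""
    return action in GRAPH.get(state, [])


def propose_valid_action(
    actor: Callable[[str, List[str]], str],
    state: str,
    max_tries: int = 3,
) -> Optional[str]:
    """Query the actor until the monitor accepts, passing back rejected actions."""
    rejected: List[str] = []
    for _ in range(max_tries):
        action = actor(state, rejected)
        if monitor(state, action):
            return action
        rejected.append(action)
    return None  # give up: no valid action within the retry budget


def flaky_actor(state: str, rejected: List[str]) -> str:
    """Stand-in for an LLM Actor: first hallucinates room 'Z', then a real neighbour."""
    for candidate in ["Z"] + GRAPH[state]:
        if candidate not in rejected:
            return candidate
    return "Z"


print(propose_valid_action(flaky_actor, "A"))
print(propose_valid_action(flaky_actor, "D"))
```

In a real pipeline both `flaky_actor` and `monitor` would be prompted LLM calls; the retry budget keeps a stubborn Actor from looping forever.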

Main Contribution

Design of MAP, a modular agentic planner made of specialized LLM modules that interact recurrently to search at the level of states.

Implementation details: each module is an LLM prompt with at most 3 few-shot examples; planning uses tree search at the state level (B=2 branches, L=2 layers of depth), with a Monitor filtering out invalid actions.

Empirical evaluation across four domains: graph traversal (CogEval), Tower of Hanoi, two PlanBench domains, and StrategyQA, showing better % solved, fewer invalid actions, transfer gains, and useful ablations.

Ablations show the Monitor is critical; MAP still works with a smaller LLM (Llama3-70B); caching cuts API and token cost substantially.
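A state-level tree search with Monitor filtering, as described in the contributions above, might look roughly like the following. This is a sketch under stated assumptions, not the paper's implementation: the number-line task, the bounds, and all four module functions are hypothetical rule-based stand-ins for prompted LLM calls, and only the values B=2 and L=2 come from the summary.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy task: reach GOAL on a bounded number line.
GOAL, LOW, HIGH = 10, 0, 20
ACTIONS: Dict[str, Callable[[int], int]] = {
    "dbl": lambda s: s * 2,
    "inc": lambda s: s + 1,
    "dec": lambda s: s - 1,
}


def actor(state: int) -> List[str]:
    """Stand-in Actor: ranked proposals, including one hallucinated action."""
    return ["teleport", "dbl", "inc", "dec"]


def monitor(state: int, action: str) -> bool:
    """Reject unknown actions and moves that leave the allowed range."""
    return action in ACTIONS and LOW <= ACTIONS[action](state) <= HIGH


def predictor(state: int, action: str) -> int:
    """Predict the next state for a valid action."""
    return ACTIONS[action](state)


def evaluator(state: int) -> float:
    """Higher is better: negative distance to the goal."""
    return -abs(GOAL - state)


def search(state: int, layer: int, B: int, L: int) -> Tuple[float, Optional[str]]:
    """Return the best leaf value reachable and the first action toward it."""
    if layer == L or state == GOAL:
        return evaluator(state), None
    best_value, best_action = float("-inf"), None
    candidates = [a for a in actor(state) if monitor(state, a)][:B]
    for action in candidates:
        value, _ = search(predictor(state, action), layer + 1, B, L)
        if value > best_value:
            best_value, best_action = value, action
    if best_action is None:  # every proposal was filtered out
        return evaluator(state), None
    return best_value, best_action


def plan(start: int, B: int = 2, L: int = 2, max_steps: int = 10) -> List[str]:
    """Repeatedly search, commit to the best first action, and re-plan."""
    state, steps = start, []
    while state != GOAL and len(steps) < max_steps:
        _, action = search(state, 0, B, L)
        if action is None:
            break
        state = predictor(state, action)
        steps.append(action)
    return steps


print(plan(3))
```

Because the Monitor runs before the Predictor, the hallucinated `teleport` action never enters the tree, which is the property the invalid-action numbers in this summary are measuring.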

Key Findings

MAP solved every evaluated problem on the Valuepath graph task

Numbers: 100% solved (Valuepath, Table 4)

On 3-disk Tower of Hanoi, MAP solved more problems than baselines

Numbers: 74% solved vs GPT-4 ICL 46% (Table 8)

MAP produces far fewer invalid actions (filters hallucinations)

Numbers: 0% invalid actions on many tasks (ToH 3-disk % invalid = 0, Table 8)

MAP transfers better between planning problems

Numbers: 80% transfer success (n7tree→n15star) vs GPT-4 ICL 51% and CoT 65% (Table 3)

MAP can be implemented with a smaller LLM and still help

Numbers: Llama3-70B MAP solved 50% of ToH 3-disk problems vs GPT-4 ICL 46% (Table 10)

Results

Valuepath % solved (MAP)

Value: 100%

Baseline: GPT-4 ICL 91%, GPT-4 zero-shot 54%

Steppath % solved (MAP)

Value: 100% (2- and 3-step), 95% (4-step)

Baseline: GPT-4 CoT 95% / 79% / 47% (2-, 3-, and 4-step)

Tower of Hanoi % solved (3-disk)

Value: 74%

Baseline: GPT-4 ICL 46%

Tower of Hanoi % invalid actions

Value: 0%

Baseline: GPT-4 ICL 12%

PlanBench Logistics % solved (subset)

Value: 53.3% (subset of 30 problems)

Baseline: ToT 10.4%

StrategyQA accuracy

Value: 84.7% ± 0.3

Baseline: CoT 87.7% ± 0.7 (table shows mixed baselines)

Transfer n7tree→n15star % solved

Value: 80%

Baseline: GPT-4 ICL 51%, GPT-4 CoT 65%

Who Should Care

What To Try In 7 Days

Prototype a Monitor module to check and reject invalid actions in an existing LLM pipeline.

Add a simple TaskDecomposer to break a multi-step task into subgoals and compare error rates.

Run a MAP-style Actor+Predictor+Evaluator loop with a smaller model (Llama3-70B) and cache repeated module outputs to estimate cost vs. gain.
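For the first item, a Monitor prototype can start as a plain rule check before it becomes an LLM prompt, which also gives you ground truth to score the prompted version against. The sketch below assumes a Tower of Hanoi state encoded as a dict of pegs; the encoding and the names `monitor_move` and `apply_move` are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

# Peg name -> disks listed bottom to top; a bigger number is a bigger disk.
Pegs = Dict[str, List[int]]


def monitor_move(pegs: Pegs, disk: int, src: str, dst: str) -> Tuple[bool, str]:
    """Return (is_valid, reason) for moving `disk` from `src` to `dst`."""
    if src not in pegs or dst not in pegs:
        return False, "unknown peg"
    if src == dst:
        return False, "source and destination are the same peg"
    if not pegs[src] or pegs[src][-1] != disk:
        return False, f"disk {disk} is not on top of peg {src}"
    if pegs[dst] and pegs[dst][-1] < disk:
        return False, f"disk {disk} cannot sit on smaller disk {pegs[dst][-1]}"
    return True, "ok"


def apply_move(pegs: Pegs, disk: int, src: str, dst: str) -> Pegs:
    """Apply a move only after the monitor accepts it; the input is not mutated."""
    ok, reason = monitor_move(pegs, disk, src, dst)
    if not ok:
        raise ValueError(reason)
    new_pegs = {peg: list(disks) for peg, disks in pegs.items()}
    new_pegs[src].pop()
    new_pegs[dst].append(disk)
    return new_pegs


start = {"A": [3, 2, 1], "B": [], "C": []}
print(monitor_move(start, 2, "A", "B"))   # disk 2 is buried under disk 1
after = apply_move(start, 1, "A", "B")
print(monitor_move(after, 2, "A", "B"))   # would land on the smaller disk 1
print(monitor_move(after, 2, "A", "C"))   # legal
```

Returning a reason string alongside the verdict makes it easy to pass the rejection back into the next Actor prompt.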

Agent Features

Memory

  • short-term caching of module outputs

Planning

  • task decomposition
  • tree search
  • action proposal loop
  • goal orchestration

Frameworks

  • prompting + few-shot in-context learning

Is Agentic

true

Architectures

  • modular multi-agent
  • recurrent module interaction
  • state-level tree search

Collaboration

  • multi-agent interaction between specialized modules

Optimization Features

Token Efficiency

  • L=2 depth recommended; L=3 had marginal gains with much higher tokens (Table 14)
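A back-of-envelope count helps explain why deeper search gets expensive quickly. The sketch below assumes a full tree with no pruning and no caching, and counts one predicted state per node; these are illustrative assumptions, and the figures are not the token numbers reported in Table 14.

```python
def expanded_nodes(branching: int, depth: int) -> int:
    """Number of predicted states in a full tree: B + B^2 + ... + B^L."""
    return sum(branching ** layer for layer in range(1, depth + 1))


for depth in (2, 3):
    print(f"B=2, L={depth}: {expanded_nodes(2, depth)} predicted states per search")
```

Under these assumptions, going from L=2 to L=3 with B=2 raises the count from 6 to 14 per search, and every committed step of the plan triggers a fresh search.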

Infra Optimization

  • high API-call throughput and token budget required; caching reduces calls (Table 16)

Model Optimization

  • works with smaller Llama3-70B model

System Optimization

  • separate modules to enable selective caching and parallel calls
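Per-module caching can be prototyped with the standard library before wiring in a real model. In this sketch `call_llm_predictor` is a hypothetical stand-in for an LLM-backed Predictor, and the call counter exists only to make the saving visible. It also assumes deterministic module outputs (for example, temperature 0), since caching a sampled response changes behaviour.

```python
from functools import lru_cache

CALLS = {"predictor": 0}  # counts how often the (simulated) backend is hit


def call_llm_predictor(state: str, action: str) -> str:
    """Hypothetical stand-in for an expensive LLM-backed Predictor call."""
    CALLS["predictor"] += 1
    return f"{state}->{action}"


@lru_cache(maxsize=None)
def cached_predictor(state: str, action: str) -> str:
    """Memoised wrapper: identical (state, action) pairs reuse the first answer."""
    return call_llm_predictor(state, action)


queries = [("s0", "a"), ("s0", "b"), ("s0", "a"), ("s1", "a"), ("s0", "b")]
results = [cached_predictor(state, action) for state, action in queries]

print(CALLS["predictor"], "backend calls for", len(queries), "queries")
print(cached_predictor.cache_info())
```

Keeping one cache per module, rather than one for the whole pipeline, is what lets you cache the Predictor and Monitor aggressively while leaving a more context-dependent module uncached.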

Inference Optimization

  • cache and reuse module outputs to cut cost

Reproducibility

Data Urls

  • StrategyQA
  • PlanBench
  • CogEval tasks

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High computational and token cost compared with single-call prompting (see Table 16).
  • Performance still suboptimal on some tasks (Reward Revaluation, large PlanBench problems, ToH 4-disk OOD).
  • Relies on prompting/few-shot to specialize modules; may improve with joint fine-tuning of module models.

When Not To Use

  • Low-latency or tight-cost real-time systems where many API calls are infeasible.
  • Tasks without clear state/action language or where environment feedback is required per step.
  • When you cannot add a validation/monitoring step to your pipeline.

Failure Modes

  • Incorrect task decomposition (subgoal errors)
  • No-progress actions or poor action proposals
  • Loops and repeated states due to Actor/Evaluator mistakes
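Loops and repeated states are the easiest of these failure modes to detect mechanically. A minimal sketch, assuming states can be represented as hashable values (strings or tuples); the function name is hypothetical:

```python
from typing import Dict, Hashable, Optional, Sequence, Tuple


def first_repeated_state(trajectory: Sequence[Hashable]) -> Optional[Tuple[int, int]]:
    """Return (first_index, repeat_index) of the first revisited state, else None."""
    seen: Dict[Hashable, int] = {}
    for index, state in enumerate(trajectory):
        if state in seen:
            return seen[state], index
        seen[state] = index
    return None


print(first_repeated_state(["A", "B", "C", "B", "D"]))
print(first_repeated_state(["A", "B", "C"]))
```

Logging the returned index pair per episode gives a cheap loop-rate metric to track alongside % solved and % invalid actions.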

Core Entities

Models

  • GPT-4
  • Llama3-70B

Metrics

  • % solved
  • % invalid actions
  • Accuracy
  • avg plan steps

Datasets

  • StrategyQA
  • PlanBench (Logistics, Mystery Blocksworld)
  • CogEval graph traversal tasks
  • Tower of Hanoi (text reformulation)

Benchmarks

  • PlanBench
  • StrategyQA
  • CogEval