Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
15
Why It Matters For Business
ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.
Summary TLDR
ReWOO is a modular prompting pattern for tool-augmented language systems that splits work into a Planner (writes the plan), a Worker (calls tools and collects evidence), and a Solver (combines plans and evidence into an answer). This avoids the usual thought→tool→observation→repeat loop and the repeated context tokens it incurs. Across six benchmarks ReWOO cut token use by about 64% on average while matching or slightly improving accuracy. It also lets you fine-tune a small 7B Planner to emulate the planning of a much larger model, which enables lighter deployments, and it holds up better when tools fail.
Problem Statement
Current augmented language models (ALMs) interleave reasoning and tool calls. Each tool response forces the LLM to be re-invoked with the entire history, causing quadratic growth in prompt tokens, high API cost, and slow execution. The paper asks: can we separate reasoning from observations to save tokens while keeping or improving task performance?
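To make the quadratic growth concrete, here is a back-of-the-envelope model, not a measurement from the paper. It assumes a fixed prompt prefix of `prefix` tokens and that every reasoning step adds `step` tokens (thought, action, observation) to the history. It counts prompt tokens only. The function names and all numbers are illustrative assumptions.

```python
def interleaved_prompt_tokens(prefix: int, step: int, n_steps: int) -> int:
    """Total prompt tokens when the full history is resent on every step."""
    # Call k (0-based) sends the prefix plus the k steps accumulated so far.
    return sum(prefix + k * step for k in range(n_steps))


def decoupled_prompt_tokens(prefix: int, step: int, n_steps: int) -> int:
    """Total prompt tokens for one planning call plus one solving call."""
    planner_call = prefix                  # question and instructions only
    solver_call = prefix + n_steps * step  # all plans and evidence, sent once
    return planner_call + solver_call
```

With `prefix=500`, `step=200`, and 8 steps, the interleaved total is 9600 prompt tokens against 2600 for the decoupled layout. The first grows with the square of the step count, the second linearly. For very short tasks (2 steps under the same assumptions) the decoupled layout is slightly more expensive, because the prefix is paid twice.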
Main Contribution
- Identify 'foreseeable reasoning': LLMs can plan plausible next steps without immediate tool observations, enabling prompt-efficient workflows.
- Design ReWOO, a Plan-Work-Solve modular paradigm that decouples planning, external tool calls, and final solving to avoid repeating long prompts.
- Show that specializing/offloading planning into a small 7B model can reproduce much of a large LLM's planning ability, enabling parameter-efficient ALMs.
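The Plan-Work-Solve split can be sketched in a few functions. This is a minimal illustration, not the authors' implementation. The plan line format `#E1 = ToolName[input]`, the prompt layout, and every function name below are assumptions made for the example. `planner`, `solver`, and each tool are plain callables that you would back with real models or APIs.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed plan line format: "#E1 = ToolName[tool input]".
STEP_RE = re.compile(r"^(#E\d+)\s*=\s*(\w+)\[(.*)\]\s*$")
SLOT_RE = re.compile(r"#E\d+")

Step = Tuple[str, str, str]  # (evidence slot, tool name, tool input)


def parse_plan(plan_text: str) -> List[Step]:
    """Extract tool-call steps from the Planner output, skipping prose lines."""
    steps = []
    for line in plan_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append((match.group(1), match.group(2), match.group(3)))
    return steps


def run_workers(steps: List[Step],
                tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Run steps in order, filling earlier evidence slots into later inputs."""
    evidence: Dict[str, str] = {}
    for slot, tool_name, tool_input in steps:
        # Replace whole slot names only, so "#E1" never matches inside "#E10".
        filled = SLOT_RE.sub(lambda m: evidence.get(m.group(0), m.group(0)),
                             tool_input)
        evidence[slot] = tools[tool_name](filled)
    return evidence


def build_solver_prompt(question: str, plan_text: str,
                        evidence: Dict[str, str]) -> str:
    """Pack the question, plan, and all evidence into one Solver prompt."""
    lines = [f"Question: {question}", "Plan:", plan_text, "Evidence:"]
    lines += [f"{slot} = {value}" for slot, value in evidence.items()]
    return "\n".join(lines)


def rewoo(question: str,
          planner: Callable[[str], str],
          tools: Dict[str, Callable[[str], str]],
          solver: Callable[[str], str]) -> str:
    plan_text = planner(question)                  # Planner call, no observations
    evidence = run_workers(parse_plan(plan_text), tools)  # Planner is not re-invoked
    return solver(build_solver_prompt(question, plan_text, evidence))  # Solver call
```

The Planner is invoked once and the Solver once, regardless of how many tool calls the plan contains. That is where the prompt savings come from in this sketch.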
Key Findings
- ReWOO uses about 5× fewer tokens on HotpotQA than an observation-dependent ALM (ReAct).
- Averaged over six public benchmarks, ReWOO cut input tokens by ~64% and raised accuracy by ~4.4 percentage points (absolute) versus ReAct-like ALMs.
- ReWOO enables offloading planning ability into a small 7B model that performs comparably to a much larger GPT-3.5 planner on several tasks.
- ReWOO degrades less than ReAct when tools fail: when tool responses are forced to 'No evidence found', ReWOO's accuracy drops by less.
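The tool-failure finding relies on tools returning 'No evidence found'. A Worker could produce that placeholder itself, so the Solver always receives a complete evidence set. The wrapper below is a hypothetical helper, not part of the paper. Treating every exception and every empty result as missing evidence is an assumption of this sketch.

```python
from typing import Callable

NO_EVIDENCE = "No evidence found"


def safe_tool(tool: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a tool so errors and empty results become a fixed placeholder."""
    def wrapped(tool_input: str) -> str:
        try:
            result = tool(tool_input)
        except Exception:
            # Assumption: any tool error counts as missing evidence.
            return NO_EVIDENCE
        return result if result and result.strip() else NO_EVIDENCE
    return wrapped
```

Wrapping each tool before handing it to the Worker, for example `tools = {name: safe_tool(fn) for name, fn in tools.items()}`, also gives you a simple way to rerun the same ablation on your own pipeline by swapping in a tool that always fails.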
Results
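HotpotQA tokens
≈5× fewer than ReAct
Accuracy
Average token reduction
≈64% across six benchmarks
Accuracy
≈4.4 percentage points higher (absolute) than ReAct-like ALMs, averaged over six benchmarks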
What To Try In 7 Days
- Prototype a Planner/Worker/Solver split for an existing tool-augmented QA flow.
- Measure token usage per query before/after decoupling planning from tool calls.
- Fine-tune a small Planner on a few hundred planning examples to move planning off a large API.
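For the before/after measurement, one low-effort option is to wrap whatever callable sends prompts to the model and tally prompt sizes. The class below is a hypothetical sketch. The default whitespace split is only a rough proxy for tokens, so pass your model's real tokenizer as `count_tokens` if you need billing-accurate numbers.

```python
from typing import Callable, List


def _whitespace_count(text: str) -> int:
    """Rough token proxy: number of whitespace-separated words."""
    return len(text.split())


class TokenMeter:
    """Wrap an LLM callable and record approximate prompt tokens per call."""

    def __init__(self, llm: Callable[[str], str],
                 count_tokens: Callable[[str], int] = _whitespace_count):
        self.llm = llm
        self.count_tokens = count_tokens
        self.prompt_tokens: List[int] = []

    def __call__(self, prompt: str) -> str:
        self.prompt_tokens.append(self.count_tokens(prompt))
        return self.llm(prompt)

    @property
    def total(self) -> int:
        return sum(self.prompt_tokens)


def reduction(before: int, after: int) -> float:
    """Fractional reduction; 0.64 means 64% fewer tokens than before."""
    return (before - after) / before
```

Run the same set of queries through the interleaved flow and the decoupled flow, each behind its own `TokenMeter`, then compare the two totals with `reduction`.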
Agent Features
Memory
- Short-term evidence slots (#E1, #E2 ...) to pass observations
Planning
- Step decomposition into explicit plans
- Foreseeable reasoning: plan later steps up front, without waiting for tool observations
Tool Use
- Designated Workers call external tools (search, calculator, LLMs)
- Workers populate evidence variables (#E)
Frameworks
- ReWOO
Is Agentic
true
Architectures
- Plan-Work-Solve modular pipeline
Collaboration
- Planner issues plans; Worker executes tool calls; Solver integrates plans+evidence
Optimization Features
Token Efficiency
- Empirical ~64% average token reduction
- ≈5× token efficiency on HotpotQA
Infra Optimization
- Run planning on a local 7B model to lower API/hosting costs
Model Optimization
- Offload planning to a smaller specialized model (Planner 7B)
System Optimization
- Decouple parametric LLM parts from non-parametric tool calls for modular updates
Training Optimization
- Instruction fine-tuning and specialization on Planner data
- LoRA
Inference Optimization
- Reduce repeated prompt context by batching plans and evidence into a single Solver call
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- When the environment state is unknown and planning would require enumerating many possibilities, foreseeable reasoning can be impractical (AlfWorld example).
- Adding many irrelevant tools in context can harm performance via tool misuse.
- Solver can still make wrong final inferences even with correct plans and evidence.
When Not To Use
- Interactive embodied tasks where the planner lacks prior environment info and must act adaptively.
- Workflows that require immediate observation-dependent branching at every step.
- Situations where tooling is highly unreliable and evidence is often missing.
Failure Modes
- Tool misuse: Workers that call the wrong tool produce irrelevant evidence.
- Solver mistakes: the final synthesis step draws the wrong conclusion despite valid evidence.
- Context blow-up: observation-dependent baselines can loop until they hit the token limit, while ReWOO Planners in low-context settings may need to enumerate many possibilities.
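One narrow class of tool misuse can be caught before any Worker runs: a plan that names a tool which is not registered. The check below is a hypothetical sketch that assumes plan lines of the form `#E1 = ToolName[input]`. It does not detect a registered tool being used for the wrong purpose.

```python
import re
from typing import Iterable, List

# Assumed plan line format: "#E1 = ToolName[tool input]".
TOOL_CALL_RE = re.compile(r"#E\d+\s*=\s*(\w+)\[")


def unknown_tools(plan_text: str, registered: Iterable[str]) -> List[str]:
    """Return the tool names used in the plan that are not registered."""
    allowed = set(registered)
    used = TOOL_CALL_RE.findall(plan_text)
    return sorted({name for name in used if name not in allowed})
```

If the returned list is non-empty, the pipeline can re-prompt the Planner or fail fast instead of collecting irrelevant evidence.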
Core Entities
Models
- gpt-3.5-turbo
- text-davinci-003
- LLaMA-7B
- Alpaca-7B
- Planner_7B
Metrics
- Accuracy
- F1
- Exact Match (EM)
- Total tokens
- # reasoning steps
- Cost per 1k queries (USD)
Datasets
- HotpotQA
- TriviaQA
- GSM8K
- StrategyQA
- PhysicsQuestions
- SportsUnderstanding
- SOTUQA (curated)
Benchmarks
- HotpotQA
- TriviaQA
- GSM8K
- StrategyQA
- PhysicsQuestions
- SportsUnderstanding
- SOTUQA
Context Entities
Models
- Auto-GPT (mentioned)
- Toolformer (prior work)
Datasets
- STaR/Self-Instruct-style data (used to bootstrap planners)

