ReWOO separates planning from evidence fetching to cut repeated prompt tokens and run on smaller models

Overview

Decision Snapshot: Ready For Pilot

The idea is simple and practical: plan all steps up front, then fetch evidence, so the long context is not re-sent at every step. Experiments on multiple benchmarks and ablations back this up.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, Dongkuan Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

ReWOO is a modular prompting pattern for tool-augmented language systems that splits work into three roles: a Planner (writes the plan), a Worker (calls tools and collects evidence), and a Solver (combines plan and evidence into an answer). This avoids the usual thought→tool→observation→repeat loop and so reduces repeated context tokens. On six benchmarks ReWOO cut token use by about 64% on average while matching or slightly improving accuracy. It also lets you fine-tune a small 7B Planner to emulate the planning of a much larger model, which enables lighter deployments and better robustness when tools fail.

Problem Statement

Current augmented language models (ALMs) interleave reasoning and tool calls. Each tool response forces the LLM to be re-invoked with the entire history, causing quadratic growth in prompt tokens, high API cost, and slow execution. The paper asks: can we separate reasoning from observations to save tokens while keeping or improving task performance?
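The quadratic growth can be illustrated with a toy cost model. The sketch below assumes a fixed context of `base` tokens (instructions, exemplars, question) and `step` tokens added by each thought/tool/observation round; both parameters and the function names are illustrative assumptions, not figures or code from the paper.

```python
def interleaved_prompt_tokens(base: int, step: int, n_steps: int) -> int:
    """Total prompt tokens when every step re-sends the full history.

    Call k (1-indexed) sends the base context plus the k-1 earlier rounds,
    so the total is n*base + step*n*(n-1)/2, which is quadratic in n.
    """
    return sum(base + step * (k - 1) for k in range(1, n_steps + 1))


def decoupled_prompt_tokens(base: int, step: int, n_steps: int) -> int:
    """Total prompt tokens with one planning call and one solving call.

    The planner sees the base context once; the solver sees the base
    context plus all n_steps plan/evidence pairs once (linear in n).
    """
    planner_call = base
    solver_call = base + step * n_steps
    return planner_call + solver_call


if __name__ == "__main__":
    for n in (2, 4, 8):
        a = interleaved_prompt_tokens(base=1000, step=200, n_steps=n)
        b = decoupled_prompt_tokens(base=1000, step=200, n_steps=n)
        print(f"steps={n}: interleaved={a} decoupled={b}")
```

Under these assumed numbers the two approaches cost about the same for two steps, and the gap widens quickly as the number of steps grows.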

Main Contribution

Identify 'foreseeable reasoning': LLMs can plan plausible next steps without immediate tool observations, which enables prompt-efficient workflows.

Design ReWOO, a Plan-Work-Solve modular paradigm that decouples planning, external tool calls, and final solving to avoid repeating long prompts.
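A minimal sketch of the Plan-Work-Solve control flow may help make the decoupling concrete. It is not the authors' implementation: the `Step` type, `run_rewoo`, and the toy planner, tools, and solver are hypothetical stand-ins, and in a real system the planner and solver would each be a single LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    plan: str        # natural-language reasoning for this step
    var: str         # evidence variable to fill, e.g. "#E1"
    tool: str        # name of the tool the Worker should call
    tool_input: str  # may reference earlier evidence variables


Planner = Callable[[str], List[Step]]
Tool = Callable[[str], str]
Solver = Callable[[str, List[Step], Dict[str, str]], str]


def run_rewoo(question: str, planner: Planner,
              tools: Dict[str, Tool], solver: Solver) -> str:
    steps = planner(question)  # one Planner call, made before any tool runs
    evidence: Dict[str, str] = {}
    for step in steps:  # Worker phase: tool calls only, no LLM re-invocation
        arg = step.tool_input
        # Substitute longest names first so "#E1" never clobbers "#E10".
        for var in sorted(evidence, key=len, reverse=True):
            arg = arg.replace(var, evidence[var])
        evidence[step.var] = tools[step.tool](arg)
    return solver(question, steps, evidence)  # one Solver call at the end


# Toy stand-ins (assumptions for illustration only).
def toy_planner(question: str) -> List[Step]:
    return [
        Step("Look up the capital.", "#E1", "Lookup", "capital of France"),
        Step("Count its letters.", "#E2", "Length", "#E1"),
    ]


TOY_TOOLS: Dict[str, Tool] = {
    "Lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
    "Length": lambda s: str(len(s)),
}


def toy_solver(question: str, steps: List[Step],
               evidence: Dict[str, str]) -> str:
    return evidence[steps[-1].var]


if __name__ == "__main__":
    print(run_rewoo("How many letters are in the capital of France?",
                    toy_planner, TOY_TOOLS, toy_solver))
```

The point of the structure is that the number of LLM calls stays at two regardless of how many tool steps the plan contains.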

Key Findings

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)

Practical Use: Use ReWOO to cut API/token costs on multi-step QA workloads; expect roughly 4–5× fewer tokens for HotpotQA-like tasks.

Evidence Ref: Table 2 (HotpotQA token counts)
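The "about 5×" figure follows directly from the two token counts quoted above. A quick check, using only those two numbers (the variable names are ours):

```python
react_tokens = 9795.1  # ReAct, HotpotQA, as quoted from Table 2
rewoo_tokens = 1986.2  # ReWOO, HotpotQA, as quoted from Table 2

ratio = react_tokens / rewoo_tokens
reduction = 1 - rewoo_tokens / react_tokens

print(f"ratio: {ratio:.2f}x")         # ratio: 4.93x
print(f"reduction: {reduction:.1%}")  # reduction: 79.7%
```

So the HotpotQA reduction (about 80%) is larger than the ~64% average reported across benchmarks.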

Averaged over six public benchmarks, ReWOO cut input tokens by ~64% and raised absolute accuracy by ~4.4% versus ReAct-like ALMs.

Numbers: Average token reduction 64%; accuracy +4.4% (six benchmarks)

Practical Use: For mixed multi-step NLP workloads, switching to ReWOO can lower inference cost and often improve or maintain accuracy on evaluated datasets.

Evidence Ref: Section 3.2 and Table 2 averages

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HotpotQA tokens	ReAct 9795.1 → ReWOO 1986.2	ReAct	~5× reduction	HotpotQA (1000 examples)	Table 2 token counts	Table 2
Accuracy	ReAct 40.8 → ReWOO 42.4	ReAct	+1.6 absolute	HotpotQA (1000 examples)	Table 2 accuracy	Table 2

What To Try In 7 Days

Prototype a Planner/Worker/Solver split for an existing tool-augmented QA flow.

Measure token usage per query before/after decoupling planning from tool calls.

Fine-tune a small Planner on a few hundred planning examples to move planning off a large API.
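For the measurement step, a small ledger that accumulates prompt tokens per query is usually enough to compare the two designs. The sketch below is an assumption-laden starting point: `TokenLedger` and `approx_tokens` are hypothetical names, and whitespace splitting is only a crude proxy that should be replaced by the tokenizer of the model actually in use.

```python
from collections import defaultdict
from typing import DefaultDict, Dict


def approx_tokens(text: str) -> int:
    """Crude proxy: count whitespace-separated words (not real tokens)."""
    return len(text.split())


class TokenLedger:
    """Accumulates prompt tokens per pipeline and per query."""

    def __init__(self) -> None:
        self._totals: DefaultDict[str, Dict[str, int]] = defaultdict(dict)

    def record(self, pipeline: str, query_id: str, prompt: str) -> None:
        """Add the size of one prompt sent while answering query_id."""
        per_query = self._totals[pipeline]
        per_query[query_id] = per_query.get(query_id, 0) + approx_tokens(prompt)

    def mean_tokens(self, pipeline: str) -> float:
        """Average prompt tokens per query for the given pipeline."""
        per_query = self._totals.get(pipeline)
        if not per_query:
            raise ValueError(f"no prompts recorded for pipeline {pipeline!r}")
        return sum(per_query.values()) / len(per_query)


if __name__ == "__main__":
    ledger = TokenLedger()
    # Interleaved flow: the second call re-sends the first call's context.
    ledger.record("before", "q1", "context question")
    ledger.record("before", "q1", "context question thought observation")
    # Decoupled flow: one planning prompt and one solving prompt.
    ledger.record("after", "q1", "context question")
    ledger.record("after", "q1", "question plan evidence")
    print(ledger.mean_tokens("before"), ledger.mean_tokens("after"))
```

Call `record` once for every prompt sent to an LLM, tagged with the pipeline variant, then compare `mean_tokens` for the two variants on the same query set.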

Agent Features

Memory

Short-term evidence slots (#E1, #E2 ...) to pass observations
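Evidence slots work like variables: a later step can name an earlier slot, and the Worker fills in the stored observation before calling the tool. The sketch below assumes plan lines of the form `#E1 = Tool[input]`; that line format, the regular expressions, and the function names are assumptions for illustration and may differ from the released code.

```python
import re
from typing import Dict, List, Tuple

# Assumed step format: "#E<number> = <ToolName>[<input>]"
STEP_RE = re.compile(r"^(#E\d+)\s*=\s*(\w+)\[(.*)\]\s*$")
VAR_RE = re.compile(r"#E\d+")


def parse_steps(plan_text: str) -> List[Tuple[str, str, str]]:
    """Extract (variable, tool, input) triples; other lines are ignored."""
    steps = []
    for line in plan_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append((match.group(1), match.group(2), match.group(3)))
    return steps


def fill_evidence(tool_input: str, evidence: Dict[str, str]) -> str:
    """Replace each known #E variable with its stored observation.

    Unknown variables are left untouched so the gap stays visible.
    """
    return VAR_RE.sub(lambda m: evidence.get(m.group(0), m.group(0)),
                      tool_input)


if __name__ == "__main__":
    plan = """Plan: Find the director of the film.
#E1 = Search[director of Inception]
Plan: Find the birth year of that person.
#E2 = Search[birth year of #E1]"""
    steps = parse_steps(plan)
    print(steps)
    print(fill_evidence(steps[1][2], {"#E1": "Christopher Nolan"}))
```

Because `VAR_RE` matches the whole digit run, `#E10` is never mistaken for `#E1` followed by `0`.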

Planning

Step decomposition into explicit plans

Foreseeable reasoning: predict evidence-free outcomes

Tool Use

Designated Workers call external tools (search, calculator, LLMs)

Workers populate evidence variables (#E)

Frameworks

ReWOO

Is Agentic

Yes

Architectures

Plan-Work-Solve modular pipeline

Collaboration

Planner issues plans; Worker executes tool calls; Solver integrates plans+evidence

Optimization Features

Token Efficiency

Empirical ~64% average token reduction

≈5× token efficiency on HotpotQA

Infra Optimization

Run planning on a local 7B model to lower API and hosting costs

Model Optimization

Offload planning to a smaller specialized model (Planner 7B)

System Optimization

Decouple parametric LLM parts from non-parametric tool calls for modular updates

Training Optimization

Instruction fine-tuning and specialization on Planner data

LoRA

Inference Optimization

Reduce repeated prompt context by batching plans and evidence into a single Solver call

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/billxbf/ReWOO https://huggingface.co/rewoo/planner_7B

Data URLs

https://github.com/billxbf/ReWOO (curated SOTUQA and trajectories referenced)

Risks & Boundaries

Limitations

When the environment state is unknown and planning would require enumerating many possibilities, foreseeable reasoning can be impractical (AlfWorld example).

Adding many irrelevant tools in context can harm performance via tool misuse.

When Not To Use

Interactive embodied tasks where the planner lacks prior environment info and must act adaptively.

Workflows that require immediate observation-dependent branching at every step.

Failure Modes

Tool misuse: a Worker that calls the wrong tool produces irrelevant evidence.

Solver mistakes: the final synthesis step draws the wrong conclusion despite valid evidence.
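One way to contain tool-level failures in a prototype is to make every Worker call return a string, even when the tool is unknown or raises, so the Solver still receives a complete evidence set and can see which slot is empty. This is a hedged mitigation sketch rather than something the paper prescribes; `safe_tool_call` and the bracketed placeholder messages are hypothetical.

```python
from typing import Callable, Dict

Tool = Callable[[str], str]


def safe_tool_call(tools: Dict[str, Tool], name: str, arg: str) -> str:
    """Call a tool by name and never raise.

    Returns the tool output, or a bracketed note explaining why no
    evidence is available for this slot.
    """
    tool = tools.get(name)
    if tool is None:
        # The plan named a tool that is not registered.
        return f"[no evidence: unknown tool '{name}']"
    try:
        result = tool(arg)
    except Exception as exc:  # broad on purpose: any tool failure is contained
        return f"[no evidence: {name} failed with {type(exc).__name__}]"
    if not result.strip():
        return f"[no evidence: {name} returned nothing]"
    return result


if __name__ == "__main__":
    def broken(_: str) -> str:
        raise TimeoutError("backend unavailable")

    tools: Dict[str, Tool] = {"Search": lambda q: f"results for {q}",
                              "Flaky": broken}
    print(safe_tool_call(tools, "Search", "ReWOO"))
    print(safe_tool_call(tools, "Flaky", "ReWOO"))
    print(safe_tool_call(tools, "Calculator", "1+1"))
```

This does not address Solver mistakes; those need evaluation of the final answer against the collected evidence.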

Core Entities

Models

gpt-3.5-turbo, text-davinci-003, LLaMA-7B, Alpaca-7B, Planner_7B

Metrics

Accuracy, F1, Exact Match (EM), Total tokens, # reasoning steps, Cost per 1k queries (USD)

Datasets

HotpotQA, TriviaQA, GSM8K, StrategyQA, PhysicsQuestions, SportsUnderstanding, SOTUQA (curated)

Benchmarks

HotpotQA, TriviaQA, GSM8K, StrategyQA, PhysicsQuestions, SportsUnderstanding, SOTUQA

Context Entities

Models

Auto-GPT (mentioned), Toolformer (prior work)

Datasets

Star/self-instruct style data (used to bootstrap planners)

