ReWOO separates planning from evidence fetching to cut repeated prompt tokens and run on smaller models

May 23, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The idea is simple and practical: plan all steps up front, then fetch evidence, which avoids re-sending the long context at every step. Experiments on multiple benchmarks, plus ablations, back this up.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, Dongkuan Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Summary TLDR

ReWOO is a modular prompting pattern for tool-augmented language systems that splits work into a Planner (makes plans), a Worker (calls tools and collects evidence), and a Solver (uses plans and evidence to answer). This avoids the common thought→tool→observation→repeat loop and so reduces repeated context tokens. On six benchmarks ReWOO cut token use by about 64% on average while matching or slightly improving accuracy. It also lets you fine-tune a small 7B Planner to emulate the reasoning of a much larger model, enabling lighter deployments and better robustness when tools fail.
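The Planner/Worker/Solver split described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `planner` and `solver` functions are stand-ins for LLM calls, and the tool names (`Search`, `Summarize`) and the tuple layout of a step are assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

# A plan step is (evidence_slot, tool_name, tool_input); this layout is illustrative.
Step = Tuple[str, str, str]

def planner(question: str) -> List[Step]:
    """Stand-in for the Planner LLM: returns every step up front, before any tool runs."""
    return [
        ("#E1", "Search", question),
        ("#E2", "Summarize", "#E1"),  # refers to evidence that does not exist yet
    ]

def worker(steps: List[Step], tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Run each tool call in order, substituting earlier evidence into later inputs."""
    evidence: Dict[str, str] = {}
    for slot, tool_name, tool_input in steps:
        # Replace whole slot references only, so "#E1" never clobbers "#E10".
        resolved = re.sub(r"#E\d+", lambda m: evidence.get(m.group(0), m.group(0)), tool_input)
        evidence[slot] = tools[tool_name](resolved)
    return evidence

def solver(question: str, steps: List[Step], evidence: Dict[str, str]) -> str:
    """Stand-in for the Solver LLM: here it only assembles the single final prompt."""
    lines = [f"{slot} = {tool}[{arg}] -> {evidence[slot]}" for slot, tool, arg in steps]
    return f"Question: {question}\n" + "\n".join(lines)

# Toy tools so the sketch runs without network access.
TOOLS: Dict[str, Callable[[str], str]] = {
    "Search": lambda q: f"results for '{q}'",
    "Summarize": lambda text: text.upper(),
}

def rewoo(question: str) -> str:
    steps = planner(question)          # one planning call
    evidence = worker(steps, TOOLS)    # tool calls only, no LLM re-invocation
    return solver(question, steps, evidence)  # one solving call

print(rewoo("who wrote Dune"))
```

The point of the structure is that the language model is invoked twice per question regardless of how many tool calls the plan contains.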

Problem Statement

Current augmented language models (ALMs) interleave reasoning and tool calls. Each tool response forces the LLM to be re-invoked with the entire history, causing quadratic growth in prompt tokens, high API cost, and slow execution. The paper asks: can we separate reasoning from observations to save tokens while keeping or improving task performance?
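To see where the quadratic growth comes from, consider a simplified model in which every step adds a fixed number of tokens to the history. The sketch below is not the paper's own accounting: the sizes (`base`, `per_step`) are hypothetical, output tokens are ignored, and the Planner and Solver are assumed to share the same base context.

```python
def react_prompt_tokens(k: int, base: int, per_step: int) -> int:
    """Interleaved loop: call i re-sends the base context plus all i earlier steps."""
    return sum(base + i * per_step for i in range(k))

def rewoo_prompt_tokens(k: int, base: int, per_step: int) -> int:
    """Decoupled: one Planner call, then one Solver call that sees all k steps once."""
    planner_call = base
    solver_call = base + k * per_step
    return planner_call + solver_call

# Hypothetical sizes: 500 tokens of fixed context, 200 tokens added per step.
for k in (2, 4, 8):
    print(k, react_prompt_tokens(k, 500, 200), rewoo_prompt_tokens(k, 500, 200))
```

Under these assumptions the interleaved total is `k*base + per_step*k*(k-1)/2`, which grows with the square of the step count, while the decoupled total grows linearly.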

Main Contribution

Identify 'foreseeable reasoning' — LLMs can plan plausible next steps without immediate tool observations, enabling prompt-efficient workflows.

Design ReWOO, a Plan-Work-Solve modular paradigm that decouples planning, external tool calls, and final solving to avoid repeating long prompts.

Key Findings

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)

Practical Use: Use ReWOO to cut API/token costs on multi-step QA workloads; expect roughly 4–5× fewer tokens for HotpotQA-like tasks.

Evidence Ref: Table 2 (HotpotQA token counts)
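The "about 5×" figure follows directly from the two token counts quoted above; a quick check of the arithmetic:

```python
# Token counts as reported for HotpotQA in this summary.
react_tokens = 9795.1
rewoo_tokens = 1986.2

ratio = react_tokens / rewoo_tokens          # how many times fewer tokens
reduction = 1 - rewoo_tokens / react_tokens  # fraction of tokens saved

print(f"{ratio:.2f}x fewer tokens, {reduction:.1%} reduction")
# prints "4.93x fewer tokens, 79.7% reduction"
```

The HotpotQA saving is therefore larger than the roughly 64% average reported across the six benchmarks.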

Averaged over six public benchmarks, ReWOO cut input tokens by ~64% and raised absolute accuracy by ~4.4% versus ReAct-like ALMs.

Numbers: Average token reduction 64%; accuracy +4.4% (six benchmarks)

Practical Use: For mixed multi-step NLP workloads, switching to ReWOO can lower inference cost and often improve or maintain accuracy on evaluated datasets.

Evidence Ref: Section 3.2 and Table 2 averages
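As a rough way to translate the average token reduction into a budget figure, the sketch below computes cost per 1k queries. Only the 64% reduction comes from the summary above; the baseline token count and the price per 1k tokens are hypothetical placeholders to be replaced with your own measurements and your provider's pricing.

```python
def cost_per_1k_queries(tokens_per_query: float, usd_per_1k_tokens: float) -> float:
    """USD for 1,000 queries: (tokens / 1000) * price per 1k tokens, times 1,000 queries."""
    return tokens_per_query * usd_per_1k_tokens

BASELINE_TOKENS = 5000.0    # hypothetical tokens per query for the interleaved baseline
USD_PER_1K_TOKENS = 0.002   # hypothetical price; check your provider's current rates
AVERAGE_REDUCTION = 0.64    # average token reduction reported in the summary

baseline_cost = cost_per_1k_queries(BASELINE_TOKENS, USD_PER_1K_TOKENS)
rewoo_cost = cost_per_1k_queries(BASELINE_TOKENS * (1 - AVERAGE_REDUCTION), USD_PER_1K_TOKENS)

print(f"baseline ${baseline_cost:.2f} vs ReWOO ${rewoo_cost:.2f} per 1k queries")
```

Because cost is linear in tokens under this model, the percentage saved in dollars equals the percentage saved in tokens; actual savings depend on how input and output tokens are priced.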

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HotpotQA tokens | ReAct 9795.1 → ReWOO 1986.2 | ReAct | ~5× reduction | HotpotQA (1000 examples) | Table 2 token counts | Table 2
Accuracy | ReAct 40.8 → ReWOO 42.4 | ReAct | +1.6 absolute | HotpotQA (1000 examples) | Table 2 accuracy | Table 2

What To Try In 7 Days

Prototype a Planner/Worker/Solver split for an existing tool-augmented QA flow.

Measure token usage per query before/after decoupling planning from tool calls.

Fine-tune a small Planner on a few hundred planning examples to move planning off a large API.
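For the measurement step in the list above, a small meter that records every prompt sent to the model is usually enough for a first comparison. The sketch below is an assumption-laden starting point: `approx_tokens` counts whitespace-separated words as a crude proxy and should be swapped for your provider's tokenizer, and `TokenMeter` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass, field
from typing import List

def approx_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words. Swap in a real tokenizer for billing-grade numbers."""
    return len(text.split())

@dataclass
class TokenMeter:
    """Collects every prompt sent to the model for one pipeline variant."""
    prompts: List[str] = field(default_factory=list)

    def record(self, prompt: str) -> None:
        self.prompts.append(prompt)

    @property
    def total(self) -> int:
        return sum(approx_tokens(p) for p in self.prompts)

def percent_saved(before: TokenMeter, after: TokenMeter) -> float:
    """Percentage of prompt tokens saved by the 'after' variant."""
    if before.total == 0:
        raise ValueError("baseline meter recorded no tokens")
    return 100.0 * (1 - after.total / before.total)

# Toy usage: the interleaved flow re-sends context, the decoupled flow does not.
interleaved, decoupled = TokenMeter(), TokenMeter()
interleaved.record("ctx question step1")
interleaved.record("ctx question step1 obs1 step2")
decoupled.record("ctx question plan")
print(f"{percent_saved(interleaved, decoupled):.1f}% saved")
```

Call `record` at the same point in both pipelines (immediately before each model request) so the two totals are comparable.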

Agent Features

Memory
Short-term evidence slots (#E1, #E2 ...) to pass observations
Planning
Step decomposition into explicit plans
Foreseeable reasoning: predict evidence-free outcomes
Tool Use
Designated Workers call external tools (search, calculator, LLMs)
Workers populate evidence variables (#E)
Frameworks
ReWOO
Is Agentic

Yes

Architectures
Plan-Work-Solve modular pipeline
Collaboration
Planner issues plans; Worker executes tool calls; Solver integrates plans+evidence
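The evidence slots (#E1, #E2 ...) listed above imply that the Planner's output must be parsed into tool calls and their dependencies before Workers can run. A minimal parser sketch follows; the line format `#E1 = Tool[input]` and the `Plan:` prefix are assumptions chosen for illustration, so adapt the pattern to whatever format your Planner prompt actually produces.

```python
import re
from typing import List, NamedTuple

class PlanStep(NamedTuple):
    slot: str              # evidence variable this step fills, e.g. "#E2"
    tool: str              # name of the tool the Worker should call
    tool_input: str        # raw input, possibly containing earlier slots
    depends_on: List[str]  # earlier slots referenced in the input

# Assumed step format: "#E<n> = ToolName[free-text input]"
STEP_RE = re.compile(r"^(#E\d+)\s*=\s*(\w+)\[(.*)\]\s*$")

def parse_plan(plan_text: str) -> List[PlanStep]:
    """Extract tool-call steps from planner output, skipping free-text 'Plan:' lines."""
    steps: List[PlanStep] = []
    for line in plan_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            slot, tool, tool_input = match.groups()
            steps.append(PlanStep(slot, tool, tool_input, re.findall(r"#E\d+", tool_input)))
    return steps

PLAN = """Plan: Find the author of the book.
#E1 = Search[author of Dune]
Plan: Find the author's birth year.
#E2 = LLM[When was #E1 born?]"""

for step in parse_plan(PLAN):
    print(step)
```

The `depends_on` field makes the data flow explicit: steps with no dependencies on each other could in principle be dispatched to Workers concurrently.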

Optimization Features

Token Efficiency
Empirical ~64% average token reduction
≈5× token efficiency on HotpotQA
Infra Optimization
Run planning on a local 7B model to lower API/hosting costs
Model Optimization
Offload planning to a smaller specialized model (Planner 7B)
System Optimization
Decouple parametric LLM parts from non-parametric tool calls for modular updates
Training Optimization
Instruction fine-tuning and specialization on Planner data
LoRA
Inference Optimization
Reduce repeated prompt context by batching plans and evidence into a single Solver call

Risks & Boundaries

Limitations

When the environment state is unknown and planning would require enumerating many possibilities, foreseeable reasoning can be impractical (AlfWorld example).

Adding many irrelevant tools in context can harm performance via tool misuse.

When Not To Use

Interactive embodied tasks where the planner lacks prior environment info and must act adaptively.

Workflows that require immediate observation-dependent branching at every step.

Failure Modes

Tool misuse: Workers invoked with the wrong tool produce irrelevant evidence.

Solver mistakes: the final synthesis step draws the wrong conclusion despite valid evidence.
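Some tool-related failures can be caught mechanically in the Worker stage. The sketch below is one possible guard, not something described in the source: it rejects plan steps that name an unregistered tool and records an empty evidence slot when a tool raises, so the Solver still receives a complete set of slots. It cannot detect a registered tool being chosen for the wrong purpose, and the step layout and tool names are illustrative.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (evidence_slot, tool_name, tool_input); illustrative layout

def run_steps(
    steps: List[Step], tools: Dict[str, Callable[[str], str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Execute plan steps, degrading to empty evidence instead of aborting the query."""
    evidence: Dict[str, str] = {}
    problems: List[str] = []
    for slot, tool_name, tool_input in steps:
        tool = tools.get(tool_name)
        if tool is None:
            # The plan named a tool that is not registered.
            problems.append(f"{slot}: unknown tool '{tool_name}'")
            evidence[slot] = ""
            continue
        try:
            evidence[slot] = tool(tool_input)
        except Exception as exc:  # broad on purpose: any tool failure yields empty evidence
            problems.append(f"{slot}: {tool_name} failed ({exc})")
            evidence[slot] = ""
    return evidence, problems

# Toy registry with a single tool that fails on non-numeric input.
tools = {"Calc": lambda s: str(int(s) + 1)}
steps = [("#E1", "Calc", "41"), ("#E2", "Search", "x"), ("#E3", "Calc", "abc")]
evidence, problems = run_steps(steps, tools)
print(evidence)
print(problems)
```

Logging the `problems` list per query gives a simple signal for how often plans reference tools incorrectly, which helps when deciding how many tools to expose to the Planner.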

Core Entities

Models

gpt-3.5-turbo, text-davinci-003, LLaMA-7B, Alpaca-7B, Planner_7B

Metrics

Accuracy, F1, Exact Match (EM), Total tokens, # reasoning steps, Cost per 1k queries (USD)

Datasets

HotpotQA, TriviaQA, GSM8K, StrategyQA, PhysicsQuestions, SportsUnderstanding, SOTUQA (curated)

Benchmarks

HotpotQA, TriviaQA, GSM8K, StrategyQA, PhysicsQuestions, SportsUnderstanding, SOTUQA

Context Entities

Models

Auto-GPT (mentioned), Toolformer (prior work)

Datasets

Star/self-instruct style data (used to bootstrap planners)