ReWOO separates planning from evidence fetching to cut repeated prompt tokens and run on smaller models

May 23, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

15

Authors

Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, Dongkuan Xu

Why It Matters For Business

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines run more cheaply and can scale on smaller models.

Summary TLDR

ReWOO is a modular prompting pattern for tool-augmented language systems that splits work into a Planner (writes the plans up front), a Worker (calls tools and collects evidence), and a Solver (answers from the plans plus evidence). This avoids the common thought → tool → observation loop, in which the growing context is re-sent on every step. On six benchmarks ReWOO cut token use by about 64% on average while matching or slightly improving accuracy. It also lets you fine-tune a small 7B Planner to imitate the planning of a much larger model, which enables lighter deployments, and it holds up better when tools fail.

Problem Statement

Current augmented language models (ALMs) interleave reasoning and tool calls. Each tool response forces the LLM to be re-invoked with the entire history, so prompt tokens grow quadratically with the number of steps, which drives up API cost and slows execution. The paper asks: can we separate reasoning from observations to save tokens while keeping or improving task performance?
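To make the quadratic growth concrete, here is a rough cost model comparing an interleaved loop with a plan-once pipeline. The token sizes (500 for the base context, 200 per step) are made-up placeholders, not figures from the paper, and treating the decoupled pipeline as exactly one Planner call plus one Solver call is a simplifying assumption.

```python
def interleaved_prompt_tokens(context: int, step: int, n_steps: int) -> int:
    """Total prompt tokens when every call re-sends the context plus all earlier steps."""
    total = 0
    for k in range(1, n_steps + 1):
        # Call k sees the base context and the (k - 1) earlier thought/action/observation blocks.
        total += context + (k - 1) * step
    return total


def decoupled_prompt_tokens(context: int, step: int, n_steps: int) -> int:
    """Total prompt tokens for one Planner call plus one Solver call."""
    planner = context                  # the Planner sees only the task context
    solver = context + n_steps * step  # the Solver sees all plans and evidence once
    return planner + solver


# Hypothetical sizes: 500-token context, 200 tokens per step.
for n in (2, 4, 8):
    print(n, interleaved_prompt_tokens(500, 200, n), decoupled_prompt_tokens(500, 200, n))
```

Under these assumptions the interleaved total grows with the square of the step count while the decoupled total grows linearly, so the gap widens as tasks need more steps; for very short tasks the two can be close.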

Main Contribution

Identify 'foreseeable reasoning' — LLMs can plan plausible next steps without immediate tool observations, enabling prompt-efficient workflows.

Design ReWOO, a Plan-Work-Solve modular paradigm that decouples planning, external tool calls, and final solving to avoid repeating long prompts.

Show that specializing/offloading planning into a small 7B model can reproduce much of a large LLM's planning ability, enabling parameter-efficient ALMs.
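The Plan-Work-Solve split can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `stub_planner` and `stub_solver` stand in for LLM calls, the `#E1 = Tool[input]` plan syntax is an assumed format built around the evidence slots described on this page, and the `Lookup` and `Calculator` tools are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (evidence variable, tool name, tool input)


def stub_planner(question: str) -> str:
    """Stand-in for the Planner LLM: emits the whole plan before any tool runs."""
    return (
        "Plan: Look up the base value.\n"
        "#E1 = Lookup[base]\n"
        "Plan: Double the value that was found.\n"
        "#E2 = Calculator[#E1 * 2]\n"
    )


def parse_plan(plan: str) -> List[Step]:
    """Pull every '#En = Tool[input]' line out of the plan text."""
    return re.findall(r"(#E\d+)\s*=\s*(\w+)\[(.*)\]", plan)


def run_workers(steps: List[Step], tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Run tool calls in order, substituting earlier evidence into later inputs."""
    evidence: Dict[str, str] = {}
    for var, tool_name, tool_input in steps:
        filled = re.sub(r"#E\d+", lambda m: evidence.get(m.group(0), m.group(0)), tool_input)
        evidence[var] = tools[tool_name](filled)
    return evidence


def stub_solver(question: str, steps: List[Step], evidence: Dict[str, str]) -> str:
    """Stand-in for the Solver LLM; here it simply returns the last piece of evidence."""
    return evidence[steps[-1][0]]


def multiply(expr: str) -> str:
    """Hypothetical calculator tool that only understands 'a * b'."""
    left, right = expr.split("*")
    return str(int(left) * int(right))


TOOLS: Dict[str, Callable[[str], str]] = {
    "Lookup": lambda key: {"base": "21"}.get(key, "No evidence found"),
    "Calculator": multiply,
}


def rewoo(question: str) -> str:
    steps = parse_plan(stub_planner(question))  # one Planner call
    evidence = run_workers(steps, TOOLS)        # tool calls only, no LLM in the loop
    return stub_solver(question, steps, evidence)  # one Solver call


answer = rewoo("What is twice the base value?")
print(answer)
```

The point of the structure is that the language model is invoked twice regardless of how many tool calls the plan contains; the Worker loop touches only tools.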

Key Findings

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)

Averaged over six public benchmarks, ReWOO cut input tokens by ~64% and raised absolute accuracy by ~4.4% versus ReAct-like ALMs.

Numbers: Average token reduction 64%; accuracy +4.4% (six benchmarks)

ReWOO enables offloading planning ability into a small 7B model that performs comparably to a much larger GPT-3.5 planner on several tasks.

Numbers: Planner 7B approximates GPT-3.5 (~175B) on HotpotQA/TriviaQA/StrategyQA (paper cites ~25× size gap)

ReWOO degrades less than ReAct when tools fail: the accuracy drop with forced 'No evidence found' responses is smaller for ReWOO.

Numbers: HotpotQA tool-failure accuracy change: ReAct −40.8 vs ReWOO −29.2
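The "about 5×" figure for HotpotQA can be checked directly from the two token counts reported above. The snippet uses only those two numbers and derives the ratio and the equivalent percentage reduction.

```python
react_tokens = 9795.1  # ReAct, HotpotQA (reported above)
rewoo_tokens = 1986.2  # ReWOO, HotpotQA (reported above)

ratio = react_tokens / rewoo_tokens          # how many times fewer tokens ReWOO uses
reduction = 1 - rewoo_tokens / react_tokens  # the same gap as a fraction saved

print(f"{ratio:.2f}x fewer tokens, {reduction:.1%} reduction")
```

Note that this per-dataset reduction is larger than the ~64% average across the six benchmarks, so HotpotQA is one of the more favourable cases.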

Results

HotpotQA tokens

Value: ReAct 9795.1 → ReWOO 1986.2

Baseline: ReAct

Accuracy

Value: ReAct 40.8 → ReWOO 42.4

Baseline: ReAct

Average token reduction

Value: 64% fewer input tokens

Baseline: ReAct-like observation-dependent ALM

Accuracy

Value: ReAct 64.8% acc, 1840.3 tokens → ReWOO 70.2% acc, 1048.8 tokens

Baseline: ReAct

Who Should Care

What To Try In 7 Days

Prototype a Planner/Worker/Solver split for an existing tool-augmented QA flow.

Measure token usage per query before/after decoupling planning from tool calls.

Fine-tune a small Planner on a few hundred planning examples to move planning off a large API.
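For the before/after measurement, a small meter that records the size of every prompt sent is usually enough. The sketch below is one way to do it, with loud assumptions: `approx_tokens` counts whitespace-separated words rather than real model tokens (swap in your model's tokenizer for real numbers), and the question and observations are toy strings.

```python
from dataclasses import dataclass, field
from typing import List


def approx_tokens(text: str) -> int:
    """Crude whitespace count; an assumption standing in for a real tokenizer."""
    return len(text.split())


@dataclass
class TokenMeter:
    """Records the approximate token count of every prompt sent to a model."""
    prompts: List[int] = field(default_factory=list)

    def record(self, prompt: str) -> None:
        self.prompts.append(approx_tokens(prompt))

    @property
    def total(self) -> int:
        return sum(self.prompts)


def percent_reduction(before: int, after: int) -> float:
    return 100.0 * (before - after) / before


question = "Which city hosted the event"
observations = ["evidence about the event", "evidence about the city"]

# Before: interleaved loop, the history is re-sent on every call.
before = TokenMeter()
history = question
for obs in observations:
    before.record(history)
    history = history + " " + obs
before.record(history)

# After: one Planner prompt plus one Solver prompt carrying all evidence.
after = TokenMeter()
after.record(question)
after.record(question + " " + " ".join(observations))

print(before.total, after.total, round(percent_reduction(before.total, after.total), 1))
```

In a real pipeline, call `record` wherever a prompt is sent to the model and compare totals per query across a fixed evaluation set.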

Agent Features

Memory

  • Short-term evidence slots (#E1, #E2 ...) to pass observations

Planning

  • Step decomposition into explicit plans
  • Foreseeable reasoning: plan later steps without waiting for tool observations

Tool Use

  • Designated Workers call external tools (search, calculator, LLMs)
  • Workers populate evidence variables (#E)

Frameworks

  • ReWOO

Is Agentic

true

Architectures

  • Plan-Work-Solve modular pipeline

Collaboration

  • Planner issues plans; Worker executes tool calls; Solver integrates plans+evidence

Optimization Features

Token Efficiency

  • Empirical ~64% average token reduction
  • ≈5× token efficiency on HotpotQA

Infra Optimization

  • Run planning on a local 7B model to lower API/hosting costs

Model Optimization

  • Offload planning to a smaller specialized model (Planner 7B)

System Optimization

  • Decouple parametric LLM parts from non-parametric tool calls for modular updates

Training Optimization

  • Instruction fine-tuning and specialization on Planner data
  • LoRA

Inference Optimization

  • Reduce repeated prompt context by batching plans and evidence into a single Solver call
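Batching plans and evidence into one Solver call amounts to assembling a single prompt. The sketch below shows one possible layout; the wording of the prompt, the `build_solver_prompt` helper, and the example plans are all illustrative assumptions rather than the paper's template.

```python
from typing import Dict, List, Tuple


def build_solver_prompt(
    question: str,
    plans: List[Tuple[str, str]],  # (plan text, evidence variable such as "#E1")
    evidence: Dict[str, str],
) -> str:
    """Pair each plan with its evidence and emit one prompt for one Solver call."""
    lines = ["Solve the task using the plans and evidence below."]
    for plan_text, var in plans:
        lines.append(f"Plan: {plan_text}")
        # Missing evidence is made explicit instead of silently dropped.
        lines.append(f"Evidence {var}: {evidence.get(var, 'No evidence found')}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_solver_prompt(
    "When was the author born?",
    [("Find the author of the book.", "#E1"), ("Find the author's birth year.", "#E2")],
    {"#E1": "placeholder author name"},  # "#E2" is deliberately missing
)
print(prompt)
```

Each plan and each piece of evidence appears exactly once in the prompt, which is where the saving over re-sending the history comes from.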

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • When the environment state is unknown and planning would require enumerating many possibilities, foreseeable reasoning can be impractical (AlfWorld example).
  • Adding many irrelevant tools in context can harm performance via tool misuse.
  • Solver can still make wrong final inferences even with correct plans and evidence.

When Not To Use

  • Interactive embodied tasks where the planner lacks prior environment info and must act adaptively.
  • Workflows that require immediate observation-dependent branching at every step.
  • Situations where tooling is highly unreliable and evidence is often missing.
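To judge whether tooling is too unreliable for this pattern, it helps to make tool failures explicit and count them. The following sketch wraps each tool call so that errors and empty results become a "No evidence found" marker, then reports the share of missing evidence. The helper names and the failing `broken_search` tool are hypothetical, and choosing a cutoff for "too unreliable" is left open.

```python
from typing import Callable, Dict

NO_EVIDENCE = "No evidence found"


def safe_call(tool: Callable[[str], str], arg: str) -> str:
    """Call a tool; map exceptions and empty results to an explicit marker."""
    try:
        result = tool(arg)
    except Exception:
        return NO_EVIDENCE
    return result if result and result.strip() else NO_EVIDENCE


def missing_fraction(evidence: Dict[str, str]) -> float:
    """Share of evidence slots that hold the failure marker."""
    if not evidence:
        return 0.0
    missing = sum(1 for value in evidence.values() if value == NO_EVIDENCE)
    return missing / len(evidence)


def broken_search(query: str) -> str:
    """Hypothetical tool that always fails."""
    raise TimeoutError("search backend unavailable")


evidence = {
    "#E1": safe_call(broken_search, "query one"),
    "#E2": safe_call(str.upper, "query two"),  # str.upper stands in for a working tool
}
print(evidence, missing_fraction(evidence))
```

Tracking this fraction over a sample of real queries gives a concrete signal for the "evidence is often missing" condition.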

Failure Modes

  • Tool misuse: workers invoked on wrong tools produce irrelevant evidence.
  • Solver mistakes: final synthesis step draws wrong conclusion despite valid evidence.
  • Token-limit loops: observation-dependent baselines can loop until they hit the token limit (noted for contrast), while planners may need large enumerations in low-context settings.

Core Entities

Models

  • gpt-3.5-turbo
  • text-davinci-003
  • LLaMA-7B
  • Alpaca-7B
  • Planner_7B

Metrics

  • Accuracy
  • F1
  • Exact Match (EM)
  • Total tokens
  • # reasoning steps
  • Cost per 1k queries (USD)
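For readers reproducing the answer-quality metrics, Exact Match and token-level F1 are commonly computed as below. This follows the widely used SQuAD-style convention (lowercase, strip punctuation and English articles); whether the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"), token_f1("Paris France", "Paris"))
```

F1 gives partial credit when the prediction contains the gold answer plus extra words, which Exact Match scores as a miss.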

Datasets

  • HotpotQA
  • TriviaQA
  • GSM8K
  • StrategyQA
  • PhysicsQuestions
  • SportsUnderstanding
  • SOTUQA (curated)

Benchmarks

  • HotpotQA
  • TriviaQA
  • GSM8K
  • StrategyQA
  • PhysicsQuestions
  • SportsUnderstanding
  • SOTUQA

Context Entities

Models

  • Auto-GPT (mentioned)
  • Toolformer (prior work)

Datasets

  • Star/self-instruct style data (used to bootstrap planners)