ReAct's gains come from example-task similarity, not true stepwise reasoning

May 22, 2024 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.2

Citation Count

5

Authors

Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati

Why It Matters For Business

If you use ReAct-style prompts to power agentic workflows, expect brittle behavior: gains often come from near-identical examples, not true planning, which limits scalability and reliability.

Summary TLDR

This paper tests why ReAct prompting appears to help LLM agents on planning tasks. Using controlled prompt variations on AlfWorld (GPT-3.5, GPT-4, Claude), the authors show that (1) interleaving 'think' traces with actions is not the main reason for the gains, (2) the specific content of the thought traces matters little, since placebo or hindsight notes often perform similarly, and (3) performance depends strongly on having exemplar problems that closely match the query. The practical takeaway: ReAct often works because it exposes near-matching examples (contextual retrieval), not because the model is truly planning step by step.

Problem Statement

ReAct and similar prompting methods claim to boost LLM planning by interleaving reasoning traces with actions. The paper asks which parts of ReAct actually cause the gains: the interleaving of reasoning and actions, the content of the reasoning trace, or the similarity between example problems and the query.

Main Contribution

Systematic sensitivity study that breaks ReAct into three components: think-action interleaving, content of reasoning traces, and exemplar-query similarity.

Extensive experiments on AlfWorld across GPT-3.5-turbo, GPT-3.5-instruct, GPT-4, and Claude-Opus using controlled prompt variations.

Demonstration that exemplar-query similarity explains most performance gains; interleaving and rich reasoning traces are not necessary.

Quantified failure modes: minor prompt changes (synonyms, different exemplar tasks) cause major performance drops.

Key Findings

Interleaving reasoning with actions is not necessary for better performance.

Numbers: GPT-3.5-Turbo 27.6% → 46.6%; GPT-3.5-Instruct 44.7% → 61.9% (Base → Exemplar-CoT)

The exact content of the reasoning trace has limited effect; weak or placebo guidance often matches or improves outcomes.

Numbers: GPT-3.5-Turbo Base 27.6% vs Failure 43.3%; placebo (Magic) 30% (Table 1 RQ2)

Performance collapses when exemplars differ even slightly from the query task.

Numbers: GPT-3.5-Turbo Domain (synonyms) 1.6% vs Base 27.6%; 'Both' exemplars (different task): single-digit success (Table 2 RQ

LLM 'thoughts' often don't translate to correct actions.

Numbers: Invalid post-think actions: GPT-3.5-Instruct ~40%; GPT-3.5-Turbo ~80%; Claude-Haiku ~90% (manual analysis)

Results

Base vs Exemplar-CoT success (example)

Value: GPT-3.5-Turbo Base 27.6% → Exemplar-CoT 46.6%

Baseline: Base ReAct prompt

Effect of placebo/hindsight guidance

Value: GPT-3.5-Turbo Failure 43.3% vs Base 27.6%

Baseline: Base ReAct prompt

Collapse with synonym-domain exemplar

Value: GPT-3.5-Turbo Domain 1.6% vs Base 27.6%

Baseline: Base ReAct prompt
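
To make the synonym-domain variation concrete, here is a minimal sketch of how an exemplar could be perturbed by swapping object names for synonyms while leaving the rest of the text untouched. The synonym map, the function name `perturb_exemplar`, and the sample exemplar text are all hypothetical; the paper's actual substitutions are not reproduced here.

```python
import re

# Hypothetical synonym map for illustration; not the substitutions used in the paper.
SYNONYMS = {
    "mug": "cup",
    "countertop": "worktop",
    "fridge": "refrigerator",
}


def perturb_exemplar(text: str, synonyms: dict[str, str]) -> str:
    """Replace whole-word object names with synonyms, leaving all other text intact."""
    if not synonyms:
        return text
    # Longest keys first so a longer name wins over any shorter name it contains.
    keys = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A single regex pass means replaced words are never substituted again.
    return pattern.sub(lambda m: synonyms[m.group(1)], text)


# Made-up exemplar fragment in an AlfWorld-like style.
exemplar = "> take mug 1 from countertop 2\nYou pick up the mug 1 from the countertop 2."
perturbed = perturb_exemplar(exemplar, SYNONYMS)
print(perturbed)
```

Running the same tasks with the original and the perturbed exemplar is one way to check whether an agent's success depends on surface vocabulary.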

Who Should Care

What To Try In 7 Days

Run prompt sensitivity tests: vary exemplar similarity, exemplar ordering, and placebo guidance, then measure the success rate for each variant.

Avoid assuming internal 'thoughts' are actionable; add action validation or constrained decoding.

If you need generalization, invest in retrieval, fine-tuning, or explicit planning modules rather than hand-curated exemplars.
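
A prompt sensitivity test like the one suggested above can be sketched as a small harness that runs the same tasks under several prompt variants and reports a success rate per variant. This is a sketch under stated assumptions: `success_rates`, `fake_run_episode`, the variant names, and the sample prompts are hypothetical, and the stand-in runner is a toy whose numbers say nothing about any real model. In practice `run_episode` would call your agent and environment.

```python
from typing import Callable, Dict, Sequence


def success_rates(
    variants: Dict[str, str],
    tasks: Sequence[str],
    run_episode: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the fraction of tasks solved under each prompt variant."""
    if not tasks:
        raise ValueError("tasks must be non-empty")
    rates = {}
    for name, prompt in variants.items():
        solved = sum(1 for task in tasks if run_episode(prompt, task))
        rates[name] = solved / len(tasks)
    return rates


def fake_run_episode(prompt: str, task: str) -> bool:
    """Toy stand-in (assumption): 'succeeds' when the task's last word appears in the prompt."""
    return task.split()[-1] in prompt


# Hypothetical prompt variants: one close to the tasks, one with synonym-swapped vocabulary.
variants = {
    "base": "Exemplar: put a clean mug in the cabinet",
    "synonym": "Exemplar: put a clean cup in the cupboard",
}
tasks = ["find the mug", "open the cabinet"]
rates = success_rates(variants, tasks, fake_run_episode)
print(rates)
```

Keeping the runner as an injected callable lets the same harness compare exemplar similarity, ordering, or guidance variants without changing the measurement code.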

Agent Features

Memory

  • short-term context window (few-shot exemplars)

Planning

  • reasoning trace interleaving
  • exemplar-based plan guidance

Tool Use

  • API prompting (OpenAI/Claude)

Frameworks

  • ReAct
  • Chain-of-Thought

Is Agentic

true

Architectures

  • LLM-based agent (few-shot prompting)

Reproducibility

Data Urls

  • AlfWorld (mentioned; publicly available dataset)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments confined to AlfWorld (a synthetic PDDL household domain).
  • Different sample sizes: GPT-3.5 models run on 134 instances, GPT-4/Claude on 60 instances.
  • Code availability asserted via supplementary material but no public repo URL given in text.
  • Results may not directly transfer to non-PDDL or multi-modal agent environments.
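
The difference in sample sizes matters when comparing success rates across models. As a rough illustration, the sketch below computes a 95% Wilson score interval for a success rate. It assumes that 27.6% on 134 instances corresponds to 37 successes (an inference, not a figure stated in the source), and the 17-of-60 case is a purely hypothetical rate chosen only to show how the interval widens at the smaller sample size.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: 37 successes out of 134 instances (about 27.6%).
low_134, high_134 = wilson_interval(37, 134)
# Hypothetical: a similar rate observed on only 60 instances.
low_60, high_60 = wilson_interval(17, 60)

print(f"n=134: [{low_134:.3f}, {high_134:.3f}]")
print(f"n=60:  [{low_60:.3f}, {high_60:.3f}]")
```

The interval at 60 instances is wider than at 134 for a similar rate, so small gaps between models evaluated on different sample sizes should be read with caution.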

When Not To Use

  • Don't assume ReAct will enable cross-task planning without curated exemplars.
  • Avoid using free-form thought traces as a safety or correctness check for actions.
  • Don't rely on ReAct for large-scale agent deployment where exemplar curation is infeasible.

Failure Modes

  • Performance collapses when exemplar vocabulary or task differs from the query.
  • Model generates valid-looking thoughts that lead to invalid or nonsensical actions.
  • Small syntactic changes or instruction tags can drastically reduce success rate.
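
One guard against the second failure mode is to check each proposed action against the set of actions the environment currently accepts before executing it. The sketch below is a minimal version of that idea, assuming the caller can obtain a list of admissible action strings. The function name `check_action` and the sample actions are hypothetical. Near matches are returned only as suggestions for a re-prompt and are never executed automatically, since a close string can refer to a different object.

```python
import difflib
from typing import List, Sequence, Tuple


def _normalize(action: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(action.lower().split())


def check_action(proposed: str, admissible: Sequence[str]) -> Tuple[bool, List[str]]:
    """Return (is_valid, actions): the matched action if valid, else up to 3 close suggestions."""
    lookup = {_normalize(a): a for a in admissible}
    normalized = _normalize(proposed)
    if normalized in lookup:
        return True, [lookup[normalized]]
    close = difflib.get_close_matches(normalized, list(lookup), n=3, cutoff=0.6)
    return False, [lookup[c] for c in close]


# Hypothetical admissible actions in an AlfWorld-like style.
admissible = ["go to fridge 1", "take mug 1 from countertop 2", "open cabinet 3"]

print(check_action("Take  mug 1 from countertop 2", admissible))
print(check_action("take mug 1 from countertop 3", admissible))
print(check_action("fly to the moon", admissible))
```

Logging how often `is_valid` is false after a 'think' step gives a direct measure of the invalid-action-after-think rate for your own agent.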

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-3.5-turbo-instruct
  • gpt-4
  • claude-3-opus
  • claude-3-sonnet
  • claude-3-haiku

Metrics

  • Success Rate (%)
  • Failure Rate (%)
  • Invalid-action-after-think (%)

Datasets

  • AlfWorld (text-based PDDL planning domain)