Make long-step reasoning models ask the web when they’re unsure and inject concise, refined facts back into the chain

January 9, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou

Why It Matters For Business

Search-o1 lets deployed reasoning models fetch and condense live web facts as they reason, raising accuracy on complex, multi-step queries while cutting noise from raw documents.

Summary TLDR

Search-o1 is a system that lets large reasoning models (LRMs) with o1-style long chains of thought call a web search when they hit an uncertain fact. A separate "Reason-in-Documents" step then compresses the retrieved pages and injects only the relevant facts back into the reasoning chain. On multiple hard reasoning and open-domain QA benchmarks, Search-o1 improves accuracy over direct reasoning, standard RAG, and an agentic-RAG baseline. The authors release code and report that the method outscores human expert groups in aggregate on a graduate-level science benchmark (GPQA). The key trade-offs are extra latency and dependence on web search quality, offset by much lower document noise thanks to the refinement step.

Problem Statement

Large reasoning models that generate long step-by-step chains often hit local knowledge gaps. Single-shot retrieval (standard RAG) is too coarse because each reasoning step may need different facts, and directly inserting long web documents breaks chain coherence. The problem: enable LRMs to retrieve knowledge on-demand during generation and inject only concise, step-relevant facts without disrupting coherence.

Main Contribution

Search-o1: an agentic RAG framework that lets the reasoning model generate search queries mid-chain to retrieve external documents on demand.

Reason-in-Documents: a separate refinement module that reads retrieved pages and produces concise, step-focused knowledge to insert back into the chain.

Batch inference and an evaluation showing improvements across five complex reasoning datasets and six open-domain QA benchmarks, including comparisons to human experts on GPQA.
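
How the first two contributions fit together can be sketched as a short loop: the model reasons until it emits a search query, the query is run, the refinement step condenses the documents, and the condensed facts are appended before reasoning resumes. This is a minimal sketch under stated assumptions, not the authors' released code. The marker strings and the `generate`, `search`, and `refine` callables are hypothetical placeholders for the reasoning model, the search backend, and the Reason-in-Documents module.

```python
from typing import Callable, List, Optional

# Hypothetical markers; not necessarily the strings the released system uses.
BEGIN_QUERY, END_QUERY = "<search>", "</search>"
BEGIN_RESULT, END_RESULT = "<result>", "</result>"


def extract_trailing_query(step: str) -> Optional[str]:
    """Return the query if the step ends with a complete query span, else None."""
    text = step.rstrip()
    if not text.endswith(END_QUERY):
        return None
    start = text.rfind(BEGIN_QUERY)
    if start == -1:
        return None
    return text[start + len(BEGIN_QUERY): -len(END_QUERY)].strip()


def search_o1_loop(
    question: str,
    generate: Callable[[str], str],                # continues the chain; assumed to stop right after END_QUERY
    search: Callable[[str], List[str]],            # query -> raw documents
    refine: Callable[[str, str, List[str]], str],  # (query, chain so far, docs) -> concise facts
    max_searches: int = 5,
) -> str:
    """Alternate between reasoning and retrieval until the model stops asking."""
    chain = question
    searches = 0
    while True:
        step = generate(chain)
        chain += step
        query = extract_trailing_query(step)
        # No pending query, or search budget spent: return the chain as it stands.
        if query is None or searches >= max_searches:
            return chain
        docs = search(query)
        facts = refine(query, chain, docs)
        # Only the refined facts enter the chain, never the raw documents.
        chain += f"{BEGIN_RESULT}{facts}{END_RESULT}"
        searches += 1
```

The design point worth noting is that `refine` sees the raw documents while the main chain sees only its short output, which is what keeps long web pages from breaking the coherence of the reasoning.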

Key Findings

Search-o1 improves average accuracy across five complex reasoning datasets compared to agentic-RAG and direct reasoning.

Numbers: average +4.7% vs RAgent-QwQ-32B; +3.1% vs QwQ-32B

On the GPQA extended set, Search-o1 reaches 57.9 overall, outscoring expert groups in aggregate.

Numbers: Search-o1 57.9 vs best human-expert group 48.9

Agentic RAG combined with Reason-in-Documents needs far fewer documents to help reasoning than standard RAG.

Numbers: 1 retrieved doc (agentic + refine) outperforms standard RAG using 10 docs (reported)

Search-o1 yields larger relative gains on multi-hop and complex QA than on single-hop QA.

Numbers: multi-hop EM +29.6% vs RAG-QwQ-32B (average EM improvement reported)

Results

Pass@1 (GPQA overall)

Value: 63.6

Baseline: QwQ-32B direct 58.1; RAgent-QwQ-32B 61.6

Pass@1 (MATH500)

Value: 86.4

Baseline: QwQ-32B direct 83.2; RAgent-QwQ-32B 85.0

GPQA extended overall

Value: 57.9

Baseline: Human experts best 48.9 (chemists)

Multi-hop QA (average EM)

Value: Search-o1 > baselines

Baseline: RAG-QwQ-32B; RAgent-QwQ-32B

Who Should Care

What To Try In 7 Days

Run Search-o1’s inference loop on a small LRM instance to compare direct reasoning vs agentic retrieval on your task.

Enable a Reason-in-Documents step that summarizes only step-relevant facts before appending them to the chain.

Measure token and API costs when switching from standard RAG (top-10 docs) to agentic+refine with 1–3 docs.
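
For the cost measurement in the last step, a back-of-the-envelope token model can help frame what to log. The sketch below is an assumption-laden estimate, not a measurement from the paper: `tokens_per_doc`, `refined_tokens`, and the number of searches per question are hypothetical inputs to replace with values observed on your own traffic.

```python
from dataclasses import dataclass


@dataclass
class RetrievalCost:
    chain_tokens: int  # tokens injected into the main reasoning chain
    side_tokens: int   # tokens read by separate refinement calls

    @property
    def total(self) -> int:
        return self.chain_tokens + self.side_tokens


def standard_rag_cost(top_k: int, tokens_per_doc: int) -> RetrievalCost:
    """Standard RAG: all top-k documents are placed in the prompt once."""
    return RetrievalCost(chain_tokens=top_k * tokens_per_doc, side_tokens=0)


def agentic_refine_cost(
    searches: int, docs_per_search: int, tokens_per_doc: int, refined_tokens: int
) -> RetrievalCost:
    """Agentic + refine: raw docs are read off-chain; only refined facts enter the chain."""
    return RetrievalCost(
        chain_tokens=searches * refined_tokens,
        side_tokens=searches * docs_per_search * tokens_per_doc,
    )


# Illustrative placeholder values only; substitute your own measurements.
baseline = standard_rag_cost(top_k=10, tokens_per_doc=800)
agentic = agentic_refine_cost(searches=2, docs_per_search=3, tokens_per_doc=800, refined_tokens=120)
print(baseline.chain_tokens, agentic.chain_tokens, agentic.total)
```

Separating chain tokens from side tokens matters because the refinement calls still cost money even though they keep the main chain short, so both numbers should be tracked.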

Agent Features

Memory

  • short-term retrieval context injected into chain

Planning

  • agentic search query generation
  • decision when to pause generation for retrieval

Tool Use

  • web search (Bing API)
  • URL fetch via Jina Reader

Frameworks

  • Search-o1
  • Reason-in-Documents

Is Agentic

true

Architectures

  • o1-like long-chain reasoning LRM

Optimization Features

Token Efficiency

  • refinement reduces verbose documents before inserting into chain

System Optimization

  • batch retrieval and batch Reason-in-Documents processing

Inference Optimization

  • batch inference for parallel query extraction and refinement
  • retrieve-on-demand to avoid unnecessary documents
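
One way to picture the batching described above: when several chains are paused on a search query at the same time, their queries can be collected, deduplicated, searched in one call, and refined in one call. This is a sketch under assumptions rather than the released implementation. `batch_search` and `batch_refine` are hypothetical callables, and the reasoning context that the refinement step would also receive is omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple


def batched_retrieval_step(
    pending: Dict[int, str],  # chain id -> pending search query
    batch_search: Callable[[List[str]], List[List[str]]],
    batch_refine: Callable[[List[Tuple[str, List[str]]]], List[str]],
) -> Dict[int, str]:
    """Resolve all pending queries with one search call and one refinement call."""
    # Deduplicate queries while preserving first-seen order.
    unique_queries = list(dict.fromkeys(pending.values()))
    docs_by_query = dict(zip(unique_queries, batch_search(unique_queries)))

    chain_ids = list(pending)
    refine_inputs = [(pending[cid], docs_by_query[pending[cid]]) for cid in chain_ids]
    refined = batch_refine(refine_inputs)

    # Map each refined snippet back to the chain that asked for it.
    return dict(zip(chain_ids, refined))
```

Chains that did not emit a query simply stay out of `pending`, which is how retrieve-on-demand avoids fetching documents nobody asked for.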

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Depends on third-party web search quality and availability (Bing API used in experiments).
  • Adds latency and API cost compared to offline direct reasoning.
  • Evaluations use a single LRM backbone (QwQ-32B-Preview) for most ablations; transfer to other LRMs may vary.
  • Refinement uses the same LRM, so errors in document analysis can still propagate.

When Not To Use

  • When web access is restricted or data privacy forbids external queries.
  • When low latency is critical and extra retrieval/refinement latency is unacceptable.
  • For simple single-hop lookups where standard RAG or cached answers suffice.

Failure Modes

  • Retrieved pages contain incorrect or misleading facts that the refinement step mis-parses and injects.
  • Search query generation misses the correct query, leading to irrelevant retrievals.
  • Over-reliance on web sources causes brittle behavior when the web has conflicting information.

Core Entities

Models

  • QwQ-32B-Preview
  • Qwen2.5-32B-Instruct
  • Qwen2.5-72B-Instruct
  • Llama3.3-70B-Instruct
  • o1-preview
  • GPT-4o
  • DeepSeek-R1-Lite

Metrics

  • Pass@1
  • Exact Match (EM)
  • F1
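
For readers reproducing these metrics, the sketch below shows the common SQuAD-style definitions of Exact Match and token-level F1, plus Pass@1 as the fraction of problems solved with a single sampled answer. The normalization rules are an assumption; this summary does not state the exact answer normalization the authors used.

```python
import re
import string
from collections import Counter
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def pass_at_1(correct_flags: Sequence[bool]) -> float:
    """Fraction of problems whose single sampled answer was correct."""
    if not correct_flags:
        raise ValueError("need at least one problem")
    return sum(correct_flags) / len(correct_flags)
```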

Datasets

  • GPQA
  • MATH500
  • AMC2023
  • AIME2024
  • LiveCodeBench
  • Natural Questions
  • TriviaQA
  • HotpotQA
  • 2WIKI
  • MuSiQue
  • Bamboogle

Benchmarks

  • GPQA (diamond & extended)
  • MATH500
  • AMC2023
  • AIME2024
  • LiveCodeBench
  • Natural Questions
  • TriviaQA
  • HotpotQA
  • 2WIKI
  • MuSiQue
  • Bamboogle