Make long-step reasoning models ask the web when they’re unsure and inject concise, refined facts back into the chain

January 9, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The idea is straightforward: let the model call search when uncertain and refine documents before injection; experiments on many benchmarks support modest-to-strong gains, especially on multi-step tasks.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou

Links

Abstract / PDF / Code

Why It Matters For Business

Search-o1 lets deployed reasoning models fetch and condense live web facts as they reason, raising accuracy on complex, multi-step queries while cutting noise from raw documents.

Who Should Care

Summary TLDR

Search-o1 is a system that lets long chain-of-thought reasoning models (o1-style LRMs) call a web search when they hit uncertain facts, then runs a separate "Reason-in-Documents" step to compress and inject only the relevant facts back into the reasoning chain. On multiple hard reasoning and open-domain QA benchmarks, Search-o1 improves accuracy over direct reasoning, standard RAG, and an agentic-RAG baseline. The authors release code and show the method can beat expert-level scores on a graduate-level science benchmark in aggregate. Key trade-offs: extra latency and dependence on web search quality, but much lower document noise thanks to the refinement step.

Problem Statement

Large reasoning models that generate long step-by-step chains often hit local knowledge gaps. Single-shot retrieval (standard RAG) is too coarse because each reasoning step may need different facts, and directly inserting long web documents breaks chain coherence. The problem: enable LRMs to retrieve knowledge on-demand during generation and inject only concise, step-relevant facts without disrupting coherence.

Main Contribution

Search-o1: an agentic RAG framework that lets the reasoning model generate search queries mid-chain to retrieve external documents on demand.

Reason-in-Documents: a separate refinement module that reads retrieved pages and produces concise, step-focused knowledge to insert back into the chain.
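
The interplay between the two components can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the marker strings, the function names, and the `max_searches` budget are all hypothetical, and the model, search engine, and refinement module are replaced by toy stand-ins so the loop can run on its own.

```python
import re
from typing import Callable, List

# Hypothetical marker strings, chosen for this sketch only.
BEGIN_QUERY, END_QUERY = "<search>", "</search>"
BEGIN_RESULT, END_RESULT = "<result>", "</result>"

QUERY_PATTERN = re.compile(
    re.escape(BEGIN_QUERY) + r"(.*?)" + re.escape(END_QUERY), re.S
)


def run_agentic_loop(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], List[str]],
    refine: Callable[[str, str, List[str]], str],
    max_searches: int = 3,
) -> str:
    """Alternate between generation and retrieval until no query is emitted."""
    chain = question
    for _ in range(max_searches):
        step = generate(chain)  # the model continues the chain
        chain += step
        match = QUERY_PATTERN.search(step)
        if match is None:  # no search requested, so reasoning is finished
            return chain
        query = match.group(1).strip()
        docs = search(query)  # retrieve raw documents on demand
        facts = refine(chain, query, docs)  # compress to step-relevant facts
        chain += f"{BEGIN_RESULT}{facts}{END_RESULT}"
    # Search budget exhausted: let the model continue once more without retrieval.
    return chain + generate(chain)


# Toy stand-ins (assumptions) for the model, the search API, and the refiner.
def fake_generate(chain: str) -> str:
    if BEGIN_RESULT in chain:
        return " So the answer is 100 C."
    return " I need the boiling point." + BEGIN_QUERY + "boiling point of water" + END_QUERY


def fake_search(query: str) -> List[str]:
    return ["Water boils at 100 C at sea level. Unrelated text about kettles."]


def fake_refine(chain: str, query: str, docs: List[str]) -> str:
    # Keep only the first sentence of the first document.
    return docs[0].split(".")[0] + "."


result = run_agentic_loop(
    "Q: At what temperature does water boil?", fake_generate, fake_search, fake_refine
)
print(result)
```

The point of the design is that raw pages never enter the chain: only the output of `refine` is appended, which is what keeps the reasoning coherent.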

Key Findings

Search-o1 improves average accuracy across five complex reasoning datasets compared to agentic-RAG and direct reasoning.

Numbers: avg +4.7% vs RAgent-QwQ-32B; +3.1% vs QwQ-32B

Practical Use: If you already use o1-like LRMs, adding agentic search plus document refinement gives a modest but consistent accuracy boost across math, science, and code tasks.

Evidence Ref: Main Results (Sec 4.4), Table 1

On the GPQA extended set, Search-o1 reaches 57.9 overall, outscoring expert groups in aggregate.

Numbers: Search-o1 57.9 vs best human-expert group 48.9

Practical Use: For tough, domain-specialized QA, an LRM augmented with agentic retrieval and refinement can match or exceed aggregated expert accuracy on the evaluated set.

Evidence Ref: Table 2 (GPQA extended)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (GPQA overall) | 63.6 | QwQ-32B direct 58.1; RAgent-QwQ-32B 61.6 | +2.0 vs RAgent; +5.5 vs direct | GPQA (PhD-level science, Table 1) | Table 1 main results | Table 1
Pass@1 (MATH500) | 86.4 | QwQ-32B direct 83.2; RAgent-QwQ-32B 85.0 | +1.4 vs RAgent; +3.2 vs direct | MATH500 (Table 1) | Table 1 math column | Table 1
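
Per-dataset deltas are plain differences in Pass@1 points, so they can be recomputed from the scores listed in the rows above. The short sketch below does that arithmetic; the dictionary layout and the `deltas` helper are illustrative choices, not part of the paper's tooling.

```python
# Pass@1 scores copied from the two Results rows above.
scores = {
    "GPQA": {"Search-o1": 63.6, "QwQ-32B direct": 58.1, "RAgent-QwQ-32B": 61.6},
    "MATH500": {"Search-o1": 86.4, "QwQ-32B direct": 83.2, "RAgent-QwQ-32B": 85.0},
}


def deltas(row: dict, system: str = "Search-o1") -> dict:
    """Return the system's score minus each baseline, rounded to one decimal."""
    return {
        name: round(row[system] - value, 1)
        for name, value in row.items()
        if name != system
    }


for dataset, row in scores.items():
    print(dataset, deltas(row))
```

Note that the per-dataset differences are a separate quantity from the five-dataset averages quoted under Key Findings.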

What To Try In 7 Days

Run Search-o1’s inference loop on a small LRM instance to compare direct reasoning vs agentic retrieval on your task.

Enable a Reason-in-Documents step that summarizes only step-relevant facts before appending them to the chain.

Measure token and API costs when switching from standard RAG (top-10 docs) to agentic+refine with 1–3 docs.
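
For the cost comparison in the last step, a rough token budget is often enough to start. The sketch below is only a back-of-the-envelope model: every default (tokens per page, snippet length, price) is a placeholder assumption to be replaced with measured values, and `TokenBudget` is a hypothetical helper, not something shipped with Search-o1.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Rough per-question token accounting. All defaults are placeholders."""

    tokens_per_doc: int = 800  # assumed average length of a raw page
    refined_tokens: int = 120  # assumed length of one refined snippet
    price_per_1k_tokens: float = 0.002  # assumed flat price, in dollars

    def standard_rag(self, top_k: int = 10) -> dict:
        # Every retrieved document goes straight into the reasoning prompt.
        return {"refiner_read": 0, "injected": top_k * self.tokens_per_doc}

    def agentic_refine(self, searches: int, docs_per_search: int) -> dict:
        # The refinement step reads the raw pages; the chain only sees snippets.
        read = searches * docs_per_search * self.tokens_per_doc
        injected = searches * self.refined_tokens
        return {"refiner_read": read, "injected": injected}

    def dollars(self, usage: dict) -> float:
        total = usage["refiner_read"] + usage["injected"]
        return total / 1000 * self.price_per_1k_tokens


budget = TokenBudget()
baseline = budget.standard_rag(top_k=10)
agentic = budget.agentic_refine(searches=2, docs_per_search=3)
print("standard RAG:", baseline, round(budget.dollars(baseline), 4))
print("agentic+refine:", agentic, round(budget.dollars(agentic), 4))
```

Tracking `injected` separately from `refiner_read` matters because injected tokens are the ones that lengthen the reasoning chain. The model ignores generation tokens and search API fees, so treat it as a lower bound.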

Agent Features

Memory
short-term retrieval context injected into chain
Planning
agentic search query generation; decision when to pause generation for retrieval
Tool Use
web search (Bing API); URL fetch via Jina Reader
Frameworks
Search-o1; Reason-in-Documents
Is Agentic

Yes

Architectures
o1-like long-chain reasoning LRM

Optimization Features

Token Efficiency
refinement reduces verbose documents before inserting into chain
System Optimization
batch retrieval and batch Reason-in-Documents processing
Inference Optimization
batch inference for parallel query extraction and refinement; retrieve-on-demand to avoid unnecessary documents
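
As a rough illustration of batched refinement, the sketch below fans several (query, documents) jobs out to worker threads and collects results in input order. It is an assumption-laden stand-in: a thread pool substitutes for true batched model inference, and `refine_batch` and `first_sentence_refine` are hypothetical names with a toy refiner in place of a model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Job = Tuple[str, List[str]]  # (search query, retrieved documents)


def refine_batch(
    jobs: List[Job],
    refine: Callable[[str, List[str]], str],
    max_workers: int = 4,
) -> List[str]:
    """Refine many (query, documents) pairs concurrently, preserving input order."""
    if not jobs:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: refine(job[0], job[1]), jobs))


def first_sentence_refine(query: str, docs: List[str]) -> str:
    # Toy stand-in for a model call: keep the first sentence sharing a query word.
    words = set(query.lower().split())
    for doc in docs:
        for sentence in doc.split(". "):
            if words & set(sentence.lower().split()):
                return sentence.strip().rstrip(".") + "."
    return ""


jobs = [
    ("boiling point", ["Kettles are metal. The boiling point of water is 100 C."]),
    ("capital france", ["Paris is the capital of France. It is large."]),
]
print(refine_batch(jobs, first_sentence_refine))
```

Threads suit this shape of work when each refinement is a network-bound API call; with a locally hosted model, passing the whole batch to the inference engine in one call would be the closer analogue.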

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Depends on third-party web search quality and availability (Bing API used in experiments).

Adds latency and API cost compared to offline direct reasoning.

When Not To Use

When web access is restricted or data privacy forbids external queries.

When low latency is critical and extra retrieval/refinement latency is unacceptable.

Failure Modes

Retrieved pages contain incorrect or misleading facts that the refinement step mis-parses and injects.

Search query generation misses the correct query, leading to irrelevant retrievals.

Core Entities

Models

QwQ-32B-Preview, Qwen2.5-32B-Instruct, Qwen2.5-72B-Instruct, Llama3.3-70B-Instruct, o1-preview, GPT-4o, DeepSeek-R1-Lite

Metrics

Pass@1, Exact Match (EM), F1

Datasets

GPQA, MATH500, AMC2023, AIME2024, LiveCodeBench, NaturalQuestions, TriviaQA, HotpotQA, 2WIKI, MuSiQue, Bamboogle

Benchmarks

GPQA (diamond & extended), MATH500, AMC2023, AIME2024, LiveCodeBench, Natural Questions, TriviaQA, HotpotQA, 2WIKI, MuSiQue, Bamboogle