Make long-step reasoning models ask the web when they’re unsure and inject concise, refined facts back into the chain

Overview

Decision Snapshot: Ready For Pilot

The idea is straightforward: let the model call search when it is uncertain, then refine the retrieved documents before injecting them into the chain. Experiments on 11 benchmarks (five complex reasoning, six open-domain QA) report modest-to-strong gains, especially on multi-step tasks.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou

Links

Abstract / PDF / Code

Why It Matters For Business

Search-o1 lets deployed reasoning models fetch and condense live web facts as they reason, raising accuracy on complex, multi-step queries while cutting noise from raw documents.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

Search-o1 is a system that lets long chain-of-thought reasoning models (o1-style LRMs) call a web search when they hit uncertain facts, then runs a separate "Reason-in-Documents" step to compress and inject only the relevant facts back into the reasoning chain. On multiple hard reasoning and open-domain QA benchmarks, Search-o1 improves accuracy over direct reasoning, standard RAG, and an agentic-RAG baseline. The authors release code and show the method can beat expert-level scores on a graduate-level science benchmark in aggregate. Key trade-offs: extra latency and dependence on web search quality, but much lower document noise thanks to the refinement step.

Problem Statement

Large reasoning models that generate long step-by-step chains often hit local knowledge gaps. Single-shot retrieval (standard RAG) is too coarse because each reasoning step may need different facts, and directly inserting long web documents breaks chain coherence. The problem: enable LRMs to retrieve knowledge on-demand during generation and inject only concise, step-relevant facts without disrupting coherence.

Main Contribution

Search-o1: an agentic RAG framework that lets the reasoning model generate search queries mid-chain to retrieve external documents on demand.

Reason-in-Documents: a separate refinement module that reads retrieved pages and produces concise, step-focused knowledge to insert back into the chain.
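
The two contributions above can be read as one control loop. The following is a minimal sketch, not the authors' implementation: the marker strings and the `generate`, `search`, and `refine` callables are hypothetical stand-ins for the reasoning model, the web search tool, and the Reason-in-Documents step, and the `max_searches` cap is an added assumption.

```python
import re
from typing import Callable, List

# Hypothetical marker strings; the real system's special tokens may differ.
BEGIN_QUERY, END_QUERY = "<search>", "</search>"
BEGIN_RESULT, END_RESULT = "<result>", "</result>"

# Matches a search query that the model emitted at the very end of a step.
_QUERY_AT_END = re.compile(
    re.escape(BEGIN_QUERY) + r"(.*?)" + re.escape(END_QUERY) + r"\s*$", re.S
)


def run_search_o1_style(
    question: str,
    generate: Callable[[str], str],                # continues the chain from the text so far
    search: Callable[[str], List[str]],            # query -> raw documents
    refine: Callable[[str, str, List[str]], str],  # (query, chain, docs) -> concise facts
    max_searches: int = 5,                         # assumed safety cap, not from the source
) -> str:
    """Alternate between reasoning and retrieval until the model stops asking."""
    chain = question
    searches = 0
    while True:
        step = generate(chain)
        chain += step
        match = _QUERY_AT_END.search(step)
        # Stop when the model finished without a pending query, or the budget is spent.
        if match is None or searches >= max_searches:
            return chain
        searches += 1
        query = match.group(1).strip()
        docs = search(query)
        # Only the refined, step-relevant facts are injected, never the raw pages.
        facts = refine(query, chain, docs)
        chain += f"\n{BEGIN_RESULT}{facts}{END_RESULT}\n"
```

Passing the three components in as callables keeps the loop testable with stubs before any model or search API is wired in.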

Key Findings

Search-o1 improves average accuracy across five complex reasoning datasets compared to agentic-RAG and direct reasoning.

Numbers: avg +4.7% vs RAgent-QwQ-32B; +3.1% vs QwQ-32B

Practical Use: If you already use o1-like LRMs, adding agentic search plus document refinement gives a modest but consistent accuracy boost across math, science, and code tasks.

Evidence Ref: Main Results (Sec 4.4), Table 1

On the GPQA extended set, Search-o1 reaches 57.9 overall, outscoring expert groups in aggregate.

Numbers: Search-o1 57.9 vs best human-expert group 48.9

Practical Use: For tough, domain-specialized QA, an LRM augmented with agentic retrieval and refinement can match or exceed aggregated expert accuracy on the evaluated set.

Evidence Ref: Table 2 (GPQA extended)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (GPQA overall)	63.6	QwQ-32B direct 58.1; RAgent-QwQ-32B 61.6	+2.0 vs RAgent; +5.5 vs direct	GPQA (PhD-Level science, Table 1)	Table 1 main results	Table 1
Pass@1 (MATH500)	86.4	QwQ-32B direct 83.2; RAgent-QwQ-32B 85.0	+1.4 vs RAgent; +3.2 vs direct	MATH500 (Table 1)	Table 1 math column	Table 1

What To Try In 7 Days

Run Search-o1’s inference loop on a small LRM instance to compare direct reasoning vs agentic retrieval on your task.

Enable a Reason-in-Documents step that summarizes only step-relevant facts before appending them to the chain.

Measure token and API costs when switching from standard RAG (top-10 docs) to agentic+refine with 1–3 docs.
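
For the cost comparison in the last item, a back-of-the-envelope token model can help before running real measurements. This is a rough sketch under stated assumptions: the function names and every number in the usage note are illustrative placeholders, and the model ignores the tokens of the chain itself and any per-call search API fees.

```python
from dataclasses import dataclass


@dataclass
class TokenCost:
    reasoning_input: int  # retrieved tokens the reasoning model must read
    refiner_input: int    # raw document tokens the refinement step must read

    @property
    def total(self) -> int:
        return self.reasoning_input + self.refiner_input


def standard_rag_cost(num_docs: int, tokens_per_doc: int) -> TokenCost:
    """Single-shot RAG: all raw documents go straight into the reasoning context."""
    return TokenCost(reasoning_input=num_docs * tokens_per_doc, refiner_input=0)


def agentic_refine_cost(
    num_searches: int,
    docs_per_search: int,
    tokens_per_doc: int,
    refined_tokens: int,
) -> TokenCost:
    """Agentic + refine: the refiner reads raw pages, the reasoner reads only summaries."""
    return TokenCost(
        reasoning_input=num_searches * refined_tokens,
        refiner_input=num_searches * docs_per_search * tokens_per_doc,
    )
```

With placeholder inputs, `standard_rag_cost(10, 800)` puts 8,000 retrieved tokens in the reasoning context, while `agentic_refine_cost(2, 3, 800, 120)` puts 240 there and 4,800 in the refiner. Replace the placeholders with token counts logged from your own runs.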

Agent Features

Memory

short-term retrieval context injected into chain

Planning

agentic search query generation

decision when to pause generation for retrieval

Tool Use

web search (Bing API)

URL fetch via Jina Reader

Frameworks

Search-o1

Reason-in-Documents

Is Agentic

Yes

Architectures

o1-like long-chain reasoning LRM

Optimization Features

Token Efficiency

refinement reduces verbose documents before inserting into chain

System Optimization

batch retrieval and batch Reason-in-Documents processing

Inference Optimization

batch inference for parallel query extraction and refinement

retrieve-on-demand to avoid unnecessary documents
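
One way to picture the batching described above: when several reasoning chains pause for retrieval at the same time, their queries can be collected, deduplicated, and sent in a single batch. The sketch below is an assumption about how that could look, not the released code; `batch_retrieve`, `search_batch`, and the cache are hypothetical names.

```python
from typing import Callable, Dict, List, Optional


def batch_retrieve(
    pending: Dict[int, str],  # sequence id -> search query awaiting results
    search_batch: Callable[[List[str]], List[List[str]]],  # queries -> docs per query
    cache: Optional[Dict[str, List[str]]] = None,
) -> Dict[int, List[str]]:
    """Resolve all pending queries with at most one batched search call."""
    cache = {} if cache is None else cache
    # Deduplicate uncached queries while preserving first-seen order.
    new_queries = list(dict.fromkeys(q for q in pending.values() if q not in cache))
    if new_queries:
        for query, docs in zip(new_queries, search_batch(new_queries)):
            cache[query] = docs
    # Every paused sequence gets the documents for its own query.
    return {seq_id: cache[query] for seq_id, query in pending.items()}
```

Passing the same `cache` dict across calls avoids repeating a search when a later chain asks an identical query.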

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/sunnynexus/Search-o1

Risks & Boundaries

Limitations

Depends on third-party web search quality and availability (Bing API used in experiments).

Adds latency and API cost compared to offline direct reasoning.

When Not To Use

When web access is restricted or data privacy forbids external queries.

When low latency is critical and extra retrieval/refinement latency is unacceptable.

Failure Modes

Retrieved pages contain incorrect or misleading facts that the refinement step mis-parses and injects.

Search query generation misses the correct query, leading to irrelevant retrievals.

Core Entities

Models

QwQ-32B-Preview, Qwen2.5-32B-Instruct, Qwen2.5-72B-Instruct, Llama3.3-70B-Instruct, o1-preview, GPT-4o, DeepSeek-R1-Lite

Metrics

Pass@1, Exact Match (EM), F1
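
For readers who want to reproduce these metrics on their own data, here is a minimal sketch. It assumes the common SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and a single sample per problem for Pass@1; the exact normalization used in the paper's evaluation is not stated here and may differ.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def pass_at_1(first_attempt_correct: List[bool]) -> float:
    """Fraction of problems solved by the single sampled answer."""
    return sum(first_attempt_correct) / len(first_attempt_correct)
```

As an example, `f1_score("Eiffel Tower in Paris", "Eiffel Tower")` has precision 0.5 and recall 1.0, which gives an F1 of about 0.667, while `exact_match` for the same pair is 0.0.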

Datasets

GPQA, MATH500, AMC2023, AIME2024, LiveCodeBench, Natural Questions, TriviaQA, HotpotQA, 2WIKI, MuSiQue, Bamboogle

Benchmarks

GPQA (diamond & extended), MATH500, AMC2023, AIME2024, LiveCodeBench, Natural Questions, TriviaQA, HotpotQA, 2WIKI, MuSiQue, Bamboogle
