Overview
The idea is straightforward: let the model call search when it is uncertain, then refine the retrieved documents before injecting them into the reasoning chain. Experiments on complex reasoning and open-domain QA benchmarks show modest-to-strong gains, especially on multi-step tasks.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Search-o1 lets deployed reasoning models fetch and condense live web facts as they reason, raising accuracy on complex, multi-step queries while cutting noise from raw documents.
Who Should Care
Summary TLDR
Search-o1 is a system that lets long chain-of-thought reasoning models (o1-style LRMs) call a web search when they hit uncertain facts, then runs a separate "Reason-in-Documents" step to compress and inject only the relevant facts back into the reasoning chain. On multiple hard reasoning and open-domain QA benchmarks, Search-o1 improves accuracy over direct reasoning, standard RAG, and an agentic-RAG baseline. The authors release code and show the method can beat expert-level scores on a graduate-level science benchmark in aggregate. Key trade-offs: extra latency and dependence on web search quality, but much lower document noise thanks to the refinement step.
Problem Statement
Large reasoning models that generate long step-by-step chains often hit local knowledge gaps. Single-shot retrieval (standard RAG) is too coarse because each reasoning step may need different facts, and directly inserting long web documents breaks chain coherence. The problem: enable LRMs to retrieve knowledge on-demand during generation and inject only concise, step-relevant facts without disrupting coherence.
Main Contribution
Search-o1: an agentic RAG framework that lets the reasoning model generate search queries mid-chain to retrieve external documents on demand.
Reason-in-Documents: a separate refinement module that reads retrieved pages and produces concise, step-focused knowledge to insert back into the chain.
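The interaction between the two components above can be sketched as a loop. This is a minimal Python sketch, not the authors' implementation: the `<search>` markers, the `[retrieved]` prefix, the function names, and the `max_searches` budget are all illustrative assumptions, and the model, search engine, and refinement module are passed in as plain callables.

```python
import re
from typing import Callable, List

# Hypothetical marker strings; the source does not specify the exact tokens.
QUERY_OPEN, QUERY_CLOSE = "<search>", "</search>"
QUERY_RE = re.compile(
    re.escape(QUERY_OPEN) + r"(.*?)" + re.escape(QUERY_CLOSE), re.DOTALL
)


def run_search_loop(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], List[str]],
    refine: Callable[[str, str, List[str]], str],
    max_searches: int = 3,
) -> str:
    """Alternate between reasoning and retrieval until no query is emitted."""
    chain = question
    for _ in range(max_searches):
        step = generate(chain)
        chain += "\n" + step
        match = QUERY_RE.search(step)
        if match is None:
            return chain  # the model finished without asking for a search
        query = match.group(1).strip()
        docs = search(query)
        # Reason-in-Documents: condense raw pages into step-relevant facts
        # so that long documents never enter the chain directly.
        facts = refine(chain, query, docs)
        chain += "\n[retrieved] " + facts
    # Search budget exhausted: let the model finish with what it has.
    return chain + "\n" + generate(chain)


# Stub components, used only to exercise the loop.
def fake_generate(chain: str) -> str:
    if "[retrieved]" in chain:
        return "Answer: 100 C"
    return "I need the boiling point. <search>boiling point of water</search>"


def fake_search(query: str) -> List[str]:
    return ["Long page ... water boils at 100 C at sea level ... unrelated text"]


def fake_refine(chain: str, query: str, docs: List[str]) -> str:
    return "Water boils at 100 C at sea level."


result = run_search_loop(
    "At what temperature does water boil?", fake_generate, fake_search, fake_refine
)
print(result)
```

Keeping `refine` as a separate callable mirrors the separation described above: the reasoning chain only ever sees the condensed facts, while the raw pages stay inside the refinement step.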
Key Findings
Search-o1 improves average accuracy across five complex reasoning datasets compared to agentic-RAG and direct reasoning.
On the GPQA extended set, Search-o1 reaches 57.9 overall, outscoring expert groups in aggregate.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass@1 (GPQA overall) | 63.6 | QwQ-32B direct 58.1; RAgent-QwQ-32B 61.6 | +2.0 vs RAgent; +5.5 vs direct | GPQA (PhD-Level science, Table 1) | Table 1 main results | Table 1 |
| Pass@1 (MATH500) | 86.4 | QwQ-32B direct 83.2; RAgent-QwQ-32B 85.0 | +1.4 vs RAgent; +3.2 vs direct | MATH500 (Table 1) | Table 1 math column | Table 1 |
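
Each delta is the Search-o1 value minus the corresponding baseline, so the Delta column can be recomputed from the Value and Baseline columns. A short Python check, using only the numbers in the table (the dictionary keys are illustrative names):

```python
# Pass@1 scores copied from the Value and Baseline columns of the table.
rows = {
    "GPQA overall": {"search_o1": 63.6, "direct": 58.1, "ragent": 61.6},
    "MATH500": {"search_o1": 86.4, "direct": 83.2, "ragent": 85.0},
}


def deltas(row):
    """Return Search-o1 minus each baseline, rounded to one decimal place."""
    return {
        name: round(row["search_o1"] - row[name], 1)
        for name in ("direct", "ragent")
    }


table_deltas = {dataset: deltas(row) for dataset, row in rows.items()}

for dataset, d in table_deltas.items():
    print(f"{dataset}: +{d['ragent']} vs RAgent; +{d['direct']} vs direct")
```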
What To Try In 7 Days
Run Search-o1’s inference loop on a small LRM instance to compare direct reasoning vs agentic retrieval on your task.
Enable a Reason-in-Documents step that summarizes only step-relevant facts before appending them to the chain.
Measure token and API costs when switching from standard RAG (top-10 docs) to agentic+refine with 1–3 docs.
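For the cost comparison in the last item, a rough token model is enough to get started. The sketch below is an assumption-laden illustration: every numeric value (tokens per document, searches per question, size of the refined summary) is a placeholder to be replaced with measurements from your own logs, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetrievalSetup:
    name: str
    docs_per_search: int            # documents fetched per search call
    searches_per_question: int      # search calls issued per question
    injected_tokens_per_doc: int    # tokens added to the reasoning chain per document
    refine_tokens_per_doc: int = 0  # tokens the refinement step reads per document


def chain_tokens(setup: RetrievalSetup) -> int:
    """Tokens that end up inside the reasoning chain for one question."""
    return (
        setup.searches_per_question
        * setup.docs_per_search
        * setup.injected_tokens_per_doc
    )


def total_tokens(setup: RetrievalSetup) -> int:
    """Chain tokens plus the tokens consumed by the refinement step."""
    refine = (
        setup.searches_per_question
        * setup.docs_per_search
        * setup.refine_tokens_per_doc
    )
    return chain_tokens(setup) + refine


# Placeholder values only; measure these on your own workload.
standard_rag = RetrievalSetup("standard RAG", 10, 1, 800)
agentic_refine = RetrievalSetup("agentic + refine", 3, 2, 100, 800)

for setup in (standard_rag, agentic_refine):
    print(setup.name, chain_tokens(setup), total_tokens(setup))
```

Tracking chain tokens separately from total tokens matters because the refinement step still reads the full documents: the chain gets shorter, but the overall bill depends on how many searches the model issues per question.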
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Depends on third-party web search quality and availability (Bing API used in experiments).
Adds latency and API cost compared to offline direct reasoning.
When Not To Use
When web access is restricted or data privacy forbids external queries.
When low latency is critical and extra retrieval/refinement latency is unacceptable.
Failure Modes
Retrieved pages contain incorrect or misleading facts that the refinement step mis-parses and injects.
Search query generation misses the correct query, leading to irrelevant retrievals.

