Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
Search-o1 lets deployed reasoning models fetch and condense live web facts as they reason, raising accuracy on complex, multi-step queries while cutting noise from raw documents.
Summary TLDR
Search-o1 lets long chain-of-thought reasoning models (o1-style LRMs) issue a web search when they reach an uncertain fact, then runs a separate "Reason-in-Documents" step that compresses the retrieved pages and injects only the relevant facts back into the reasoning chain. On multiple hard reasoning and open-domain QA benchmarks, Search-o1 improves accuracy over direct reasoning, standard RAG, and an agentic-RAG baseline. The authors release code and report that the method outscores human expert groups in aggregate on the GPQA extended set. Key trade-offs: extra latency and dependence on web search quality, in exchange for much lower document noise thanks to the refinement step.
Problem Statement
Large reasoning models that generate long step-by-step chains often hit local knowledge gaps. Single-shot retrieval (standard RAG) is too coarse because each reasoning step may need different facts, and directly inserting long web documents breaks chain coherence. The problem: enable LRMs to retrieve knowledge on-demand during generation and inject only concise, step-relevant facts without disrupting coherence.
Main Contribution
Search-o1: an agentic RAG framework that lets the reasoning model generate search queries mid-chain to retrieve external documents on demand.
Reason-in-Documents: a separate refinement module that reads retrieved pages and produces concise, step-focused knowledge to insert back into the chain.
Batch inference and evaluation showing improvements across five complex reasoning domains and six open QA benchmarks, including comparisons to human experts on GPQA.
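The interleaving of generation, search, and refinement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the marker strings, the function name `search_o1_loop`, and the `generate`, `search`, and `refine` callables are all hypothetical placeholders that you would back with a real LRM, a search API, and a refinement prompt.

```python
from typing import Callable, List

# Illustrative marker strings (assumptions; not taken from the summary above).
BEGIN_QUERY, END_QUERY = "<search>", "</search>"
BEGIN_RESULT, END_RESULT = "<result>", "</result>"


def search_o1_loop(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], List[str]],
    refine: Callable[[str, str, List[str]], str],
    max_searches: int = 5,
) -> str:
    """Interleave generation with on-demand retrieval and refinement.

    `generate(prompt)` returns the next chunk of reasoning. It is assumed to
    stop either right after emitting END_QUERY or when the answer is complete.
    """
    chain = question
    for _ in range(max_searches):
        chunk = generate(chain)
        chain += chunk
        start = chunk.rfind(BEGIN_QUERY)
        if start == -1 or not chunk.rstrip().endswith(END_QUERY):
            return chain  # no pending search request: reasoning is finished
        # Pull out the query the model just wrote.
        query = chunk[start + len(BEGIN_QUERY):chunk.rfind(END_QUERY)].strip()
        docs = search(query)
        # Condense raw pages into short, step-relevant facts before injection.
        facts = refine(query, chain, docs)
        chain += f"{BEGIN_RESULT}{facts}{END_RESULT}"
    # Search budget exhausted: let the model produce one final chunk.
    return chain + generate(chain)
```

Passing the three components in as callables keeps the control flow testable with stubs before any model or search key is wired in.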
Key Findings
Search-o1 improves average accuracy across five complex reasoning datasets compared to agentic-RAG and direct reasoning.
On the GPQA extended set, Search-o1 reaches 57.9 overall, outscoring expert groups in aggregate.
Agentic RAG combined with Reason-in-Documents needs far fewer documents to help reasoning than standard RAG.
Search-o1 yields larger relative gains on multi-hop and complex QA than on single-hop QA.
Results
Pass@1 (GPQA overall)
Pass@1 (MATH500)
GPQA extended overall
Multi-hop QA (average EM)
Who Should Care
What To Try In 7 Days
Run Search-o1’s inference loop on a small LRM instance to compare direct reasoning vs agentic retrieval on your task.
Enable a Reason-in-Documents step that summarizes only step-relevant facts before appending them to the chain.
Measure token and API costs when switching from standard RAG (top-10 docs) to agentic retrieval plus refinement with 1–3 docs per search.
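For the cost comparison in the last item, a back-of-the-envelope estimator can help frame what to measure. The sketch below is an assumption-laden simplification: the `CostModel` fields, both function names, and every price or token count you feed in are hypothetical, and it counts input tokens only (it ignores output tokens and any re-reading of the growing chain).

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    # Placeholder prices; substitute your provider's actual rates.
    usd_per_1k_input_tokens: float
    usd_per_search_call: float


def standard_rag_cost(m: CostModel, docs: int, tokens_per_doc: int,
                      prompt_tokens: int) -> float:
    """One search call, all retrieved docs placed directly in the prompt."""
    input_tokens = prompt_tokens + docs * tokens_per_doc
    return m.usd_per_search_call + input_tokens / 1000 * m.usd_per_1k_input_tokens


def agentic_refine_cost(m: CostModel, searches: int, docs_per_search: int,
                        tokens_per_doc: int, refined_tokens: int,
                        prompt_tokens: int) -> float:
    """Several search calls; raw docs are read by the refinement pass,
    and only the short refined facts enter the reasoning chain."""
    refine_input = searches * docs_per_search * tokens_per_doc
    chain_input = prompt_tokens + searches * refined_tokens
    token_cost = (refine_input + chain_input) / 1000 * m.usd_per_1k_input_tokens
    return searches * m.usd_per_search_call + token_cost
```

Note that refinement does not make document tokens free: the refinement pass still reads them. What changes is how many documents are fetched and how much text reaches the main chain, so the measured outcome depends on how many searches your task triggers.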
Agent Features
Memory
- short-term retrieval context injected into chain
Planning
- agentic search query generation
- decision when to pause generation for retrieval
Tool Use
- web search (Bing API)
- URL fetch via Jina Reader
Frameworks
- Search-o1
- Reason-in-Documents
Is Agentic
true
Architectures
- o1-like long-chain reasoning LRM
Optimization Features
Token Efficiency
- refinement reduces verbose documents before inserting into chain
System Optimization
- batch retrieval and batch Reason-in-Documents processing
Inference Optimization
- batch inference for parallel query extraction and refinement
- retrieve-on-demand to avoid unnecessary documents
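One way to picture the batched retrieval and refinement listed above is a single step that handles every sequence with a pending query at once and leaves the rest untouched. This is a hedged sketch, not the released code: `batch_retrieve_and_refine`, `batch_search`, `batch_refine`, and the `[retrieved]` tag are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional


def batch_retrieve_and_refine(
    chains: List[str],
    pending_queries: List[Optional[str]],
    batch_search: Callable[[List[str]], List[List[str]]],
    batch_refine: Callable[[List[str], List[str], List[List[str]]], List[str]],
) -> List[str]:
    """Append refined facts to every chain that has a pending query.

    `pending_queries[i]` is None when chain i did not request a search.
    """
    idx = [i for i, q in enumerate(pending_queries) if q is not None]
    if not idx:
        return list(chains)
    queries = [pending_queries[i] for i in idx]
    # One batched call for retrieval and one for refinement.
    docs = batch_search(queries)
    facts = batch_refine(queries, [chains[i] for i in idx], docs)
    out = list(chains)
    for i, fact in zip(idx, facts):
        out[i] = f"{chains[i]}\n[retrieved] {fact}"
    return out
```

Sequences that never ask for a search incur no retrieval work, which is the retrieve-on-demand property noted above.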
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Depends on third-party web search quality and availability (Bing API used in experiments).
- Adds latency and API cost compared to offline direct reasoning.
- Evaluations use a single LRM backbone (QwQ-32B-Preview) for most ablations; transfer to other LRMs may vary.
- Refinement uses the same LRM, so errors in document analysis can still propagate.
When Not To Use
- When web access is restricted or data privacy forbids external queries.
- When low latency is critical and the added retrieval and refinement round trips are unacceptable.
- For simple single-hop lookups where standard RAG or cached answers suffice.
Failure Modes
- Retrieved pages contain incorrect or misleading facts that the refinement step mis-parses and injects.
- Search query generation misses the correct query, leading to irrelevant retrievals.
- Over-reliance on web sources causes brittle behavior when the web has conflicting information.
Core Entities
Models
- QwQ-32B-Preview
- Qwen2.5-32B-Instruct
- Qwen2.5-72B-Instruct
- Llama3.3-70B-Instruct
- o1-preview
- GPT-4o
- DeepSeek-R1-Lite
Metrics
- Pass@1
- Exact Match (EM)
- F1
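For reference, the three metrics above are commonly computed along the following lines. This sketch assumes the widely used SQuAD-style answer normalization for EM and F1 and a single sampled answer per problem for Pass@1; the summary does not state the exact scoring scripts used, so treat the details as assumptions.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def pass_at_1(solved: List[bool]) -> float:
    """Fraction of problems solved, assuming one sampled answer per problem."""
    return sum(solved) / len(solved)
```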
Datasets
- GPQA
- MATH500
- AMC2023
- AIME2024
- LiveCodeBench
- Natural Questions
- TriviaQA
- HotpotQA
- 2WIKI
- MuSiQue
- Bamboogle
Benchmarks
- GPQA (diamond & extended)
- MATH500
- AMC2023
- AIME2024
- LiveCodeBench
- Natural Questions
- TriviaQA
- HotpotQA
- 2WIKI
- MuSiQue
- Bamboogle

