Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge‑intensive tasks by ~10 percentage points, enabling faster research and automation at the cost of higher compute and external data reliance.
Summary TLDR
Agentic Reasoning is a system that lets a reasoning LLM call three specialized agents (Web-Search, Coding, and a Mind-Map knowledge-graph memory) during a single reasoning chain. The pipeline breaks queries into sub-queries, reranks web results, runs code via a coding LLM, and stores structured context in a graph so the LLM can maintain coherence over long, tool-heavy reasoning. On the evaluated benchmarks the approach boosts accuracy by about 10–14 percentage points over the base model and sets new public SOTA on several knowledge‑intensive tasks. Trade-offs: higher compute and latency, plus reliance on the quality of external sources.
Problem Statement
Large reasoning LLMs still fail on open-ended, knowledge‑intensive tasks that need web research, repeated verification, calculations, and long logical chains. Pure internal reasoning or single retrieval steps lose context and make errors; we need a structured way for LLMs to use external tools and retain reasoning memory.
Main Contribution
Agentic Reasoning: a lightweight pipeline that lets a reasoning LLM call external LLM-based agents (Web-Search, Coding, Mind‑Map) during inference.
Mind‑Map: a knowledge-graph memory that converts reasoning chains into entities and relations, clusters them, and supplies concise context to agents and the reasoner.
A Web-Search agent design combining query breakdown, reranking, RAG, and Mind‑Map context that outperforms prior search-in-reasoning variants.
Extensive ablations showing the three tools (search, coding, Mind‑Map) are more effective together than large generic toolboxes.
Benchmarks and human evals showing large gains on expert QA and deep research tasks; code released on GitHub.
Key Findings
Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, improving the base model by 14.4 percentage points.
On GPQA (graduate-level science QA) Agentic Reasoning with DeepSeek-R1 achieved 81.2% overall, about 9.7 points above the base model.
Agentic Reasoning sets new public SOTA on GAIA (Avg 66.13) and surpasses OpenAI Deep Research on Level 1 and 2 tasks while narrowing the Level 3 gap to ~2.3 points.
In deep-research long-form generation (FreshWiki), the method improved ROUGE-1 and entity recall versus other RAG/search baselines.
The Mind‑Map memory strongly improves long-chain coherence and strategic reasoning (Werewolf): win rate 72% with Mind‑Map vs 36% without.
A small, focused toolset (web search + coding + Mind‑Map) outperformed large toolboxes; adding more tools often degraded performance.
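The percentage-point figures above imply base-model scores that are not quoted directly. The short Python sketch below derives them using only the numbers stated in the findings; the helper names are illustrative and the rounding is a presentation choice.

```python
# Figures quoted in the findings above; nothing else is assumed.
hle_agentic, hle_gain = 23.8, 14.4        # accuracy (%), gain in percentage points
gpqa_agentic, gpqa_gain = 81.2, 9.7       # accuracy (%), approximate gain
werewolf_with, werewolf_without = 72.0, 36.0  # win rate (%)


def implied_base(score: float, gain_pp: float) -> float:
    """Base-model score implied by a reported score and its percentage-point gain."""
    return round(score - gain_pp, 1)


def relative_gain(score: float, base: float) -> float:
    """Ratio of the improved score to the base score (1.0 means no change)."""
    return round(score / base, 2)


hle_base = implied_base(hle_agentic, hle_gain)      # implied Humanity's Last Exam base
gpqa_base = implied_base(gpqa_agentic, gpqa_gain)   # implied GPQA base (approximate)

print(hle_base, relative_gain(hle_agentic, hle_base))
print(gpqa_base, relative_gain(gpqa_agentic, gpqa_base))
print(relative_gain(werewolf_with, werewolf_without))
```

Read this way, a similar absolute gain means very different things: the Humanity's Last Exam score more than doubles from a low base, while the GPQA gain is a smaller relative step from an already high base.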
Results
Accuracy
GAIA average score
FreshWiki ROUGE-1
Werewolf game win rate
Who Should Care
What To Try In 7 Days
Prototype a simple agentic pipeline: add web-search, code execution, and a small structured memory to your reasoning model.
Implement query breakdown + rerank (threshold ~0.7) for search and compare with plain RAG.
Run an ablation: start with web-search only, then add coding and a simple graph memory to measure marginal gains and latency trade-offs.
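A minimal sketch of the query breakdown + rerank step, for comparison against plain RAG. Everything here is an assumption made for illustration: `break_down` splits on the word "and" where a real system would ask an LLM for sub-queries, and `relevance` is a term-overlap stand-in for a trained reranker. Only the 0.7 threshold comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def break_down(query: str) -> list[str]:
    """Toy query breakdown: split a compound question on ' and ' (hypothetical)."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def relevance(sub_query: str, page: Page) -> float:
    """Stand-in reranker: fraction of sub-query terms that appear in the page."""
    terms = set(sub_query.lower().split())
    words = set(page.text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def rerank_filter(query: str, pages: list[Page], threshold: float = 0.7) -> list[Page]:
    """Keep pages whose best sub-query score meets the threshold, best first."""
    subs = break_down(query)
    scored = [
        (max((relevance(s, p) for s in subs), default=0.0), p) for p in pages
    ]
    kept = [(score, page) for score, page in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in kept]


pages = [
    Page("a", "graph memory stores entities"),
    Page("b", "sandboxed code execution in python"),
    Page("c", "cooking recipes for pasta"),
    Page("d", "memory foam mattress"),
]
kept = rerank_filter("graph memory and code execution", pages)
print([p.url for p in kept])
```

For the plain-RAG baseline, pass all retrieved pages to the generator without calling `rerank_filter`, then compare answer quality and token usage between the two runs.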
Agent Features
Memory
- Knowledge-graph Mind-Map: entity/relation store and cluster summaries
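One way to picture an entity/relation store with cluster summaries is the Python sketch below. It is not the paper's implementation: the `MindMap` class and its methods are hypothetical, clustering is approximated with connected components, and a "summary" is just the joined facts where a real system would likely use an LLM.

```python
from collections import defaultdict


class MindMap:
    """Toy entity/relation store with cluster-level context (illustrative only)."""

    def __init__(self) -> None:
        self.triples: list[tuple[str, str, str]] = []   # (head, relation, tail)
        self.neighbors: dict[str, set[str]] = defaultdict(set)

    def add(self, head: str, relation: str, tail: str) -> None:
        self.triples.append((head, relation, tail))
        self.neighbors[head].add(tail)
        self.neighbors[tail].add(head)

    def clusters(self) -> list[set[str]]:
        """Connected components, used here as a stand-in for graph clustering."""
        seen: set[str] = set()
        groups: list[set[str]] = []
        for start in sorted(self.neighbors):
            if start in seen:
                continue
            stack, group = [start], set()
            while stack:
                node = stack.pop()
                if node in group:
                    continue
                group.add(node)
                stack.extend(self.neighbors[node] - group)
            seen |= group
            groups.append(group)
        return groups

    def summary(self, group: set[str]) -> str:
        """Join the stored facts whose head entity belongs to the cluster."""
        return "; ".join(f"{h} {r} {t}" for h, r, t in self.triples if h in group)

    def context_for(self, entity: str) -> str:
        """Return the summary of the cluster containing the entity, if any."""
        for group in self.clusters():
            if entity in group:
                return self.summary(group)
        return ""


mm = MindMap()
mm.add("player_3", "claims_role", "seer")
mm.add("player_5", "accuses", "player_3")
mm.add("GPQA", "is_a", "benchmark")
print(mm.context_for("seer"))
```

The point of the cluster lookup is that an agent asking about one entity receives only the related facts, not the whole reasoning history.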
Planning
- Dynamic tool invocation during reasoning
- Iterative query refinement and reranking
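Dynamic tool invocation can be sketched as a loop that watches the reasoner's output for a tool request, runs the tool, and feeds the result back into the context. The tag format `<tool:name>...</tool>`, the `run_reasoning` function, and the scripted reasoner below are all assumptions for illustration; the source does not specify how tool calls are encoded.

```python
import re
from typing import Callable

# Hypothetical tag format for a tool request inside the reasoner's output.
TOOL_CALL = re.compile(r"<tool:(\w+)>(.*?)</tool>", re.DOTALL)


def run_reasoning(
    reasoner: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    question: str,
    max_calls: int = 5,
) -> str:
    """Alternate between the reasoner and tools until no tool is requested."""
    context = question
    for _ in range(max_calls):
        output = reasoner(context)
        match = TOOL_CALL.search(output)
        if match is None:
            return output  # no tool requested: treat the output as the answer
        name, payload = match.group(1), match.group(2).strip()
        handler = tools.get(name)
        result = handler(payload) if handler else f"unknown tool: {name}"
        # Keep the reasoning up to the call, then append the tool result.
        context += f"\n{output[:match.end()]}\n[result:{name}] {result}"
    return reasoner(context + "\n[tool budget exhausted, answer now]")


def scripted_reasoner(context: str) -> str:
    """Stand-in for an LLM: asks for one calculation, then answers."""
    if "[result:code]" not in context:
        return "I need to compute this. <tool:code>6*7</tool>"
    return "The answer is " + context.rsplit("[result:code] ", 1)[1] + "."


def fake_code_tool(expr: str) -> str:
    """Stand-in for a coding agent: multiplies two integers written as 'a*b'."""
    a, b = expr.split("*")
    return str(int(a) * int(b))


answer = run_reasoning(scripted_reasoner, {"code": fake_code_tool}, "What is 6*7?")
print(answer)
```

The `max_calls` cap is a design choice in this sketch, included because excessive tool calls are listed among the failure modes.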
Tool Use
- Web-Search agent (query breakdown + rerank + RAG)
- Coding agent (code gen + execute)
- Mind-Map agent (graph construction + cluster summaries)
Frameworks
- Hugging Face agents (for comparison)
- LangChain (ablation)
- Graph-RAG style retrieval
Is Agentic
true
Architectures
- LLM reasoning core with external agent calls
- Knowledge-graph (Mind-Map) memory component
Collaboration
- Multi-agent orchestration where specialized LLMs handle sub-tasks
Optimization Features
Token Efficiency
- Use agentic calls to break long reasoning into multiple LLM calls; supports >32k token context
Infra Optimization
- Route coding to a dedicated executor (Python 3.11) to avoid blocking the reasoner
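As a rough illustration of keeping code execution off the reasoner's critical path, the sketch below runs snippets in a separate Python process through a small thread pool. The function names and the 10-second timeout are assumptions, and a subprocess is not a security sandbox; a production executor would need real isolation.

```python
import subprocess
import sys
from concurrent.futures import Future, ThreadPoolExecutor


def run_snippet(code: str, timeout_s: float = 10.0) -> dict:
    """Run a Python snippet in a separate interpreter process and capture output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


# A small pool lets the caller submit code and continue without waiting.
executor = ThreadPoolExecutor(max_workers=2)


def submit_snippet(code: str) -> Future:
    """Schedule a snippet; call .result() on the returned future when needed."""
    return executor.submit(run_snippet, code)


future = submit_snippet("print(6 * 7)")
result = future.result()
print(result["ok"], result["stdout"].strip())
```

Using `sys.executable` runs the snippet with whatever interpreter hosts the caller; pinning a specific version such as Python 3.11 would be a deployment decision outside this sketch.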
System Optimization
- Iterative query refinement (max 3 iterations)
- Select the top-20 pages from Bing, then rerank them (top-10 rerank average)
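The bounded refinement loop can be sketched as follows. The `search`, `is_sufficient`, and `refine` callables are hypothetical placeholders (a real system would call a search API and an LLM); only the iteration cap of 3 and the top-20 cut-off come from the list above.

```python
from typing import Callable


def iterative_search(
    query: str,
    search: Callable[[str], list[str]],
    is_sufficient: Callable[[list[str]], bool],
    refine: Callable[[str, list[str]], str],
    max_iterations: int = 3,
    top_k: int = 20,
) -> tuple[list[str], list[str]]:
    """Search, check the evidence, and refine the query up to max_iterations times."""
    history: list[str] = []
    pages: list[str] = []
    for _ in range(max_iterations):
        history.append(query)
        pages = search(query)[:top_k]   # keep only the top-k results
        if is_sufficient(pages):
            break
        query = refine(query, pages)    # rewrite the query for the next round
    return pages, history


# Toy stand-ins: a fixed corpus and a refinement that appends one keyword.
corpus = {
    "gaia benchmark": [],
    "gaia benchmark leaderboard": ["page-a", "page-b"],
}
pages, history = iterative_search(
    "gaia benchmark",
    search=lambda q: corpus.get(q, []),
    is_sufficient=lambda found: len(found) >= 2,
    refine=lambda q, found: q + " leaderboard",
)
print(pages, history)
```

Returning the query history alongside the pages makes it easy to log how many refinement rounds each question needed when measuring latency.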
Inference Optimization
- Assign task-specific models to agents to save cost (e.g., small model for summarization, Claude for
- Use rerank thresholding (0.7) to limit pages for RAG
Reproducibility
Data Urls
- Humanity's Last Exam (dataset name)
- GPQA (dataset name)
- GAIA (dataset name)
- FreshWiki (dataset name)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High compute and inference latency from sequential agent calls and web retrieval
- No built-in source credibility verification; vulnerable to bad web content
- Dependence on the quality of agent models (coding/search) and on the underlying reasoning LLM
- Risk of cascading hallucinations despite Mind-Map; needs fact-checking for high-stakes use
When Not To Use
- Low-latency, real-time systems where added agent latency is unacceptable
- Safety-critical decisions without human verification or fact-checking
- Resource-constrained deployments where extra agent compute is infeasible
Failure Modes
- Hallucinated or incorrect web sources integrated into final reasoning
- Excessive or inappropriate tool calls when many tools are available
- Mind-Map construction errors that preserve incorrect intermediate claims
- Retrieval or rerank failures leading to missing critical evidence
Core Entities
Models
- DeepSeek-R1
- DeepSeek-V3
- QwQ-32B
- o1
- o3-mini
- Claude-3.5-sonnet
- GPT-4o
Metrics
- Accuracy
- ROUGE-1
- ROUGE-L
- Entity Recall
- Human evaluation scores (Interest/Organization/Relevance/Coverage)
- Win rate (Werewolf game)
Datasets
- Humanity's Last Exam
- GPQA
- GAIA
- FreshWiki
Benchmarks
- Humanity's Last Exam
- GPQA
- GAIA
- FreshWiki deep research

