Overview
The paper provides multiple benchmark wins and human evaluations, but the approach adds compute, depends on external sources, and uses some proprietary models, so expect engineering work before full deployment.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge‑intensive tasks by ~10 percentage points, enabling faster research and automation at the cost of higher compute and external data reliance.
Who Should Care
Summary TLDR
Agentic Reasoning is a system that lets a reasoning LLM call three specialized agents (Web-Search, Coding, and Mind-Map, a knowledge-graph memory) within a single reasoning chain. The pipeline breaks queries down, reranks web results, runs code via a coding LLM, and stores structured context in a graph so the LLM can stay coherent over long, tool-heavy reasoning. On the evaluated benchmarks the approach raises accuracy by about 10–14 percentage points over the base model and sets a new public state of the art on several knowledge-intensive tasks. Trade-offs: higher compute and latency, and reliance on the quality of external sources.
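The idea of calling agents from inside one reasoning chain can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the agent functions are stubs that only echo their input, and the step format `(kind, text)` and the names `AGENTS` and `run_reasoning` are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical agent stubs. Real agents would call a search API, a coding LLM,
# and a knowledge-graph store; these only echo their input.
def web_search_agent(query: str) -> str:
    return f"[search results for: {query}]"

def coding_agent(task: str) -> str:
    return f"[code output for: {task}]"

def mind_map_agent(question: str) -> str:
    return f"[memory summary for: {question}]"

# Assumed routing table from step kind to agent.
AGENTS: Dict[str, Callable[[str], str]] = {
    "search": web_search_agent,
    "code": coding_agent,
    "memory": mind_map_agent,
}

def run_reasoning(steps: List[Tuple[str, str]]) -> List[str]:
    """Walk a scripted reasoning chain, calling an agent at each tool step.

    Each step is (kind, text). A "think" step is kept as-is; any other kind
    is looked up in AGENTS and the agent's result is appended to the chain.
    """
    chain: List[str] = []
    for kind, text in steps:
        if kind == "think":
            chain.append(text)
        elif kind in AGENTS:
            chain.append(AGENTS[kind](text))
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return chain
```

In a real system the steps would be produced by the reasoning model one at a time, with each agent result fed back into its context before the next step.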
Problem Statement
Large reasoning LLMs still fail on open-ended, knowledge‑intensive tasks that need web research, repeated verification, calculations, and long logical chains. Pure internal reasoning or single retrieval steps lose context and make errors; we need a structured way for LLMs to use external tools and retain reasoning memory.
Main Contribution
Agentic Reasoning: a lightweight pipeline that lets a reasoning LLM call external LLM-based agents (Web-Search, Coding, Mind‑Map) during inference.
Mind‑Map: a knowledge-graph memory that converts reasoning chains into entities and relations, clusters them, and supplies concise context to agents and the reasoner.
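A graph memory of this kind might look like the sketch below. It assumes entities and relations have already been extracted as (head, relation, tail) triples, and it uses connected components as a simple stand-in for whatever clustering the paper applies. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

class MindMap:
    """Toy knowledge-graph memory (assumed design, not the paper's code)."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []
        self.neighbors: Dict[str, Set[str]] = defaultdict(set)

    def add(self, head: str, relation: str, tail: str) -> None:
        # Store the fact and link both entities in an undirected adjacency map.
        self.triples.append((head, relation, tail))
        self.neighbors[head].add(tail)
        self.neighbors[tail].add(head)

    def clusters(self) -> List[Set[str]]:
        """Group entities by connected component (stand-in for real clustering)."""
        seen: Set[str] = set()
        groups: List[Set[str]] = []
        for start in list(self.neighbors):
            if start in seen:
                continue
            group: Set[str] = set()
            stack = [start]
            while stack:
                node = stack.pop()
                if node in group:
                    continue
                group.add(node)
                stack.extend(self.neighbors[node] - group)
            seen |= group
            groups.append(group)
        return groups

    def context_for(self, entity: str) -> List[str]:
        """Return the stored facts that mention an entity, as short sentences."""
        return [f"{h} {r} {t}" for h, r, t in self.triples if entity in (h, t)]
```

The `context_for` output is what would be handed to an agent or the reasoner as concise context, in place of the full reasoning history.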
Key Findings
Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, improving the base model by 14.4 percentage points.
On GPQA (graduate-level science QA) Agentic Reasoning with DeepSeek-R1 achieved 81.2% overall, about 9.7 points above the base model.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 23.8% | 9.4% (DeepSeek-R1) | +14.4 | Humanity's Last Exam | Agentic Reasoning w/ DeepSeek-R1 outperforms base model | Table 1 |
| Accuracy | 81.2% | 71.5% (DeepSeek-R1) | +9.7 | GPQA (Physics/Chemistry/Biology aggregated) | Ours w/ DeepSeekR1 achieves new SOTA on GPQA | Table 2 |
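The Delta column is a difference in percentage points, which is not the same as a relative improvement. The short sketch below recomputes the deltas from the Value and Baseline columns. The relative gain it also computes is not reported in the table and is derived here only to show the distinction.

```python
def point_delta(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)

def relative_gain(value: float, baseline: float) -> float:
    """Improvement relative to the baseline, in percent (derived, not from the table)."""
    return round(100 * (value - baseline) / baseline, 1)

# (value, baseline) pairs copied from the Results table.
rows = {
    "Humanity's Last Exam": (23.8, 9.4),
    "GPQA": (81.2, 71.5),
}

for name, (value, baseline) in rows.items():
    print(name, point_delta(value, baseline), relative_gain(value, baseline))
```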
What To Try In 7 Days
Prototype a simple agentic pipeline: add web-search, code execution, and a small structured memory to your reasoning model.
Implement query breakdown + rerank (threshold ~0.7) for search and compare with plain RAG.
Run an ablation: start with web-search only, then add coding and a simple graph memory to measure marginal gains and latency trade-offs.
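The ablation in the last step could be organized along these lines. This is a sketch under stated assumptions: `build` is a hypothetical factory that returns a question-answering callable for a given set of tools, scoring is exact string match, and latency is plain wall-clock time per query.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Example = Tuple[str, str]          # (question, expected answer)
Pipeline = Callable[[str], str]    # question -> answer

def evaluate(pipeline: Pipeline, examples: Sequence[Example]) -> Dict[str, float]:
    """Measure exact-match accuracy and mean wall-clock seconds per query."""
    correct = 0
    start = time.perf_counter()
    for question, expected in examples:
        if pipeline(question).strip() == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(examples),
        "seconds_per_query": elapsed / len(examples),
    }

def run_ablation(
    build: Callable[[Tuple[str, ...]], Pipeline],
    examples: Sequence[Example],
) -> List[dict]:
    """Add tools one stage at a time and record the marginal accuracy gain."""
    stages = [("search",), ("search", "code"), ("search", "code", "memory")]
    results: List[dict] = []
    previous: Optional[float] = None
    for tools in stages:
        metrics = evaluate(build(tools), examples)
        gain = None if previous is None else metrics["accuracy"] - previous
        results.append({"tools": tools, **metrics, "marginal_gain": gain})
        previous = metrics["accuracy"]
    return results
```

Running every stage on the same example set keeps the marginal gains comparable, and the per-query timing shows what each added agent costs in latency.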
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Assign task-specific models to agents to save cost (e.g., small model for summarization, Claude for
Use rerank thresholding (0.7) to limit pages for RAG
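Rerank thresholding can be sketched as a simple filter over scored pages. The example assumes a reranker has already assigned each page a relevance score in [0, 1]. The inclusive comparison and the `max_pages` cap are assumptions for illustration; only the 0.7 threshold comes from the summary.

```python
from typing import List, Tuple

def filter_by_rerank(
    scored_pages: List[Tuple[str, float]],
    threshold: float = 0.7,
    max_pages: int = 5,  # hypothetical cap, not from the source
) -> List[str]:
    """Keep pages scoring at or above the threshold, best first."""
    kept = [(url, score) for url, score in scored_pages if score >= threshold]
    kept.sort(key=lambda page: page[1], reverse=True)
    return [url for url, _ in kept[:max_pages]]
```

Only the surviving pages would be passed on to retrieval-augmented generation, which bounds both the context length and the number of pages fetched.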
Risks & Boundaries
Limitations
High compute cost and inference latency from sequential agent calls and web retrieval
No built-in source credibility verification; vulnerable to bad web content
When Not To Use
Low-latency, real-time systems where added agent latency is unacceptable
Safety-critical decisions without human verification or fact-checking
Failure Modes
Hallucinated or incorrect web sources integrated into final reasoning
Excessive or inappropriate tool calls when many tools are available

