Call web search, code execution and a 'Mind‑Map' memory agent to make LLMs do long, research-style reasoning

February 7, 2025 · 9 min

Overview

Decision Snapshot: Needs Validation

The paper provides multiple benchmark wins and human evaluations, but the approach adds compute, depends on external sources, and uses some proprietary models, so expect engineering work before full deployment.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin

Why It Matters For Business

Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge-intensive tasks by roughly 10 to 14 percentage points on the evaluated benchmarks. This enables faster research and automation, at the cost of higher compute and reliance on external data.

Summary (TL;DR)

Agentic Reasoning is a system that lets a reasoning LLM call three specialized agents (Web-Search, Coding, and a Mind-Map knowledge-graph memory) during a single reasoning chain. The pipeline breaks queries into sub-queries, reranks web results, runs code via a coding LLM, and stores structured context in a graph so the LLM can stay coherent over long, tool-heavy reasoning. On the evaluated benchmarks the approach raises accuracy by about 10 to 14 percentage points over the base model and sets a new public SOTA on several knowledge-intensive tasks. Trade-offs: higher compute and latency, plus reliance on the quality of external sources.
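The control flow described above can be sketched as a loop that alternates between the reasoner and whichever agent it asks for. This is a minimal sketch, not the authors' implementation: the `<call:NAME>...</call>` tag format, the function names, and the stub reasoner and coding agent are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict

# Hypothetical tag format: the summary does not say how the reasoner signals a tool call.
TOOL_CALL = re.compile(r"<call:(\w+)>(.*?)</call>", re.DOTALL)


def run_agentic_chain(
    reasoner: Callable[[str], str],
    agents: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 8,
) -> str:
    """Alternate between reasoning and agent calls until no tool is requested."""
    context = question
    for _ in range(max_steps):
        output = reasoner(context)
        match = TOOL_CALL.search(output)
        if match is None:
            return output  # no tool requested, so treat this as the final answer
        name, payload = match.group(1), match.group(2).strip()
        handler = agents.get(name)
        result = handler(payload) if handler else f"unknown agent: {name}"
        # Append the agent result so the next reasoning step can use it.
        context += f"\n{output}\n<result:{name}>{result}</result>"
    return "step budget exhausted"


def stub_reasoner(context: str) -> str:
    """Stand-in for the reasoning LLM: asks for code once, then answers."""
    if "<result:code>" in context:
        return "The answer is 42."
    return "I need to compute this. <call:code>6 * 7</call>"


def stub_coding_agent(task: str) -> str:
    """Stand-in for the coding agent: multiplies two integers."""
    left, right = task.split("*")
    return str(int(left) * int(right))


answer = run_agentic_chain(stub_reasoner, {"code": stub_coding_agent}, "What is 6 times 7?")
print(answer)
```

A step budget such as `max_steps` is one simple way to bound the added latency from sequential agent calls; the value 8 is arbitrary.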

Problem Statement

Large reasoning LLMs still fail on open-ended, knowledge‑intensive tasks that need web research, repeated verification, calculations, and long logical chains. Pure internal reasoning or single retrieval steps lose context and make errors; we need a structured way for LLMs to use external tools and retain reasoning memory.

Main Contribution

Agentic Reasoning: a lightweight pipeline that lets a reasoning LLM call external LLM-based agents (Web-Search, Coding, Mind‑Map) during inference.

Mind‑Map: a knowledge-graph memory that converts reasoning chains into entities and relations, clusters them, and supplies concise context to agents and the reasoner.
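To make the Mind-Map idea concrete, here is a minimal sketch of a graph memory that stores (entity, relation, entity) triples, groups entities into clusters, and returns a short context string for a query entity. It rests on several assumptions: the class and method names are hypothetical, triples are added by hand rather than extracted from a reasoning chain by an LLM, connected components stand in for whatever clustering the authors use, and the "summary" is a plain join of facts rather than an LLM-written one.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


class MindMap:
    """Toy knowledge-graph memory (hypothetical API, for illustration only)."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []
        self.neighbors: Dict[str, Set[str]] = defaultdict(set)

    def add(self, head: str, relation: str, tail: str) -> None:
        self.triples.append((head, relation, tail))
        self.neighbors[head].add(tail)
        self.neighbors[tail].add(head)

    def clusters(self) -> List[Set[str]]:
        """Group entities by connected component (a simple stand-in for clustering)."""
        seen: Set[str] = set()
        groups: List[Set[str]] = []
        for start in list(self.neighbors):
            if start in seen:
                continue
            stack: List[str] = [start]
            group: Set[str] = set()
            while stack:
                node = stack.pop()
                if node in group:
                    continue
                group.add(node)
                stack.extend(self.neighbors[node] - group)
            seen |= group
            groups.append(group)
        return groups

    def context_for(self, entity: str) -> str:
        """Return the facts in the entity's cluster as short plain-text context."""
        for group in self.clusters():
            if entity in group:
                facts = [f"{h} {r} {t}" for h, r, t in self.triples if h in group]
                return "; ".join(facts)
        return ""


mind_map = MindMap()
mind_map.add("aspirin", "inhibits", "COX-1")
mind_map.add("COX-1", "produces", "thromboxane")
mind_map.add("Paris", "capital_of", "France")
print(mind_map.context_for("thromboxane"))
```

Querying by cluster rather than by the whole graph is what keeps the supplied context concise: the unrelated Paris fact is not returned for a question about thromboxane.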

Key Findings

Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, improving the base model by 14.4 percentage points.

Numbers: 23.8% (Agentic w/ DeepSeek-R1) vs 9.4% (DeepSeek-R1); +14.4

Practical Use: If you augment a strong reasoning model with these agents, expect double-digit accuracy gains on complex exam-like tasks; useful for research automation and expert QA.

Evidence Ref: Table 1

On GPQA (graduate-level science QA), Agentic Reasoning with DeepSeek-R1 achieved 81.2% overall, 9.7 percentage points above the base model.

Numbers: 81.2% (Ours w/ DeepSeek-R1) vs 71.5% (DeepSeek-R1); +9.7

Practical Use: For difficult fact- and reasoning-heavy QA, integrating search + code + structured memory can materially improve correctness on evaluated benchmarks.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta (pp) | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 23.8% | 9.4% (DeepSeek-R1) | +14.4 | Humanity's Last Exam | Agentic Reasoning w/ DeepSeek-R1 outperforms base model | Table 1 |
| Accuracy | 81.2% | 71.5% (DeepSeek-R1) | +9.7 | GPQA (Physics/Chemistry/Biology aggregated) | Ours w/ DeepSeek-R1 achieves new SOTA on GPQA | Table 2 |
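The deltas reported in the Results rows are absolute differences in percentage points. The short check below recomputes them from the reported accuracies and also derives the relative gain over the baseline. The relative figures are not stated in the source; they are simple arithmetic on the reported numbers, and the helper name `gains` is hypothetical.

```python
from typing import Dict, Tuple

# (agentic accuracy %, baseline accuracy %) as reported in the Results rows.
REPORTED: Dict[str, Tuple[float, float]] = {
    "Humanity's Last Exam": (23.8, 9.4),
    "GPQA": (81.2, 71.5),
}


def gains(agentic: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    delta_pp = round(agentic - baseline, 1)
    relative_pct = round(100 * (agentic - baseline) / baseline, 1)
    return delta_pp, relative_pct


for name, (agentic, baseline) in REPORTED.items():
    delta_pp, relative_pct = gains(agentic, baseline)
    print(f"{name}: +{delta_pp} pp, {relative_pct}% relative to baseline")
```

The two views tell different stories: the Humanity's Last Exam gain is large in relative terms because the baseline is low, while the GPQA gain is smaller in relative terms but starts from a much stronger baseline.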

What To Try In 7 Days

Prototype a simple agentic pipeline: add web-search, code execution, and a small structured memory to your reasoning model.

Implement query breakdown + rerank (threshold ~0.7) for search and compare with plain RAG.

Run an ablation: start with web-search only, then add coding and a simple graph memory to measure marginal gains and latency trade-offs.
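The ablation in the last step can be run with a very small harness that scores each configuration on the same questions and records latency per item. This is a sketch under stated assumptions: exact-match scoring, the function names, the two-question toy dataset, and the stub pipelines are all placeholders for your own model and evaluation set.

```python
import time
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def evaluate(pipeline: Callable[[str], str], dataset: List[Example]) -> Dict[str, float]:
    """Score one configuration with exact-match accuracy and mean latency."""
    correct = 0
    start = time.perf_counter()
    for question, gold in dataset:
        if pipeline(question).strip().lower() == gold.strip().lower():
            correct += 1
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(dataset), "sec_per_item": elapsed / len(dataset)}


def run_ablation(
    configs: Dict[str, Callable[[str], str]], dataset: List[Example]
) -> Dict[str, Dict[str, float]]:
    """Evaluate every configuration on the same dataset."""
    return {name: evaluate(pipeline, dataset) for name, pipeline in configs.items()}


# Stub pipelines standing in for real configurations.
def base_only(question: str) -> str:
    return "unknown"


def with_search(question: str) -> str:
    return "Paris" if "France" in question else base_only(question)


def with_search_and_code(question: str) -> str:
    return "1024" if "2**10" in question else with_search(question)


toy_dataset: List[Example] = [("Capital of France?", "Paris"), ("Value of 2**10?", "1024")]
report = run_ablation(
    {"base": base_only, "+search": with_search, "+search+code": with_search_and_code},
    toy_dataset,
)
for name, metrics in report.items():
    print(name, metrics)
```

Adding one agent at a time, in a fixed order, is what lets you attribute each accuracy change and each latency increase to a single component.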

Agent Features

Memory
Knowledge-graph Mind-Map: entity/relation store and cluster summaries
Planning
Dynamic tool invocation during reasoning; iterative query refinement and reranking
Tool Use
Web-Search agent (query breakdown + rerank + RAG); Coding agent (code gen + execute); Mind-Map agent (graph construction + cluster summaries)
Frameworks
Hugging Face agents (for comparison); LangChain (ablation); Graph-RAG style retrieval
Is Agentic

Yes

Architectures
LLM reasoning core with external agent calls; knowledge-graph (Mind-Map) memory component
Collaboration
Multi-agent orchestration where specialized LLMs handle sub-tasks
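The "execute" half of the Coding agent can be approximated by running generated code in a separate interpreter process with a timeout. This is a minimal sketch, assuming the generated code arrives as a string; the function name `run_code` and the returned dictionary shape are hypothetical, and a child process is not a security sandbox.

```python
import subprocess
import sys
from typing import Dict, Union


def run_code(code: str, timeout_s: float = 10.0) -> Dict[str, Union[bool, str]]:
    """Run a snippet in a separate interpreter process and capture its output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }


good = run_code("print(sum(range(10)))")
bad = run_code("raise ValueError('boom')")
print(good["stdout"], bad["ok"])
```

Returning stderr alongside stdout lets the coding agent see the traceback and retry, rather than handing a silent failure back to the reasoner.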

Optimization Features

Token Efficiency
Use agentic calls to break long reasoning into multiple LLM calls; supports >32k token context
Infra Optimization
Route coding to a dedicated executor (Python 3.11) to avoid blocking the reasoner
System Optimization
Iterative query refinement (max 3 iterations); select top-20 pages from Bing, then top-10 rerank average
Inference Optimization

Assign task-specific models to agents to save cost (e.g., small model for summarization, Claude for

Use rerank thresholding (0.7) to limit pages for RAG
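The search settings listed above (top-20 fetch, top-10 rerank, 0.7 threshold, at most 3 refinement rounds) can be combined as in the sketch below. How the original system combines them is not spelled out here, so this is one plausible reading and an assumption: fetch 20 pages, keep the 10 best by rerank score, and refine the query while the average of those scores stays under 0.7. The token-overlap scorer, the in-memory corpus, and the `stub_search` and `stub_refine` functions are toy stand-ins for a real reranker, the Bing API, and an LLM query rewriter.

```python
from typing import Callable, List, Tuple

CORPUS: List[str] = [
    "mind map knowledge graph memory for llm reasoning",
    "knowledge graph memory clusters entities and relations",
    "recipe for tomato soup",
]


def overlap_score(query: str, page: str) -> float:
    """Toy relevance score in [0, 1]: fraction of query tokens found in the page."""
    query_tokens = set(query.lower().split())
    page_tokens = set(page.lower().split())
    return len(query_tokens & page_tokens) / len(query_tokens) if query_tokens else 0.0


def search_with_rerank(
    query: str,
    search: Callable[[str, int], List[str]],
    refine: Callable[[str], str],
    score: Callable[[str, str], float] = overlap_score,
    fetch_k: int = 20,
    keep_k: int = 10,
    threshold: float = 0.7,
    max_iters: int = 3,
) -> Tuple[str, List[str]]:
    """Return the final query and the pages whose rerank score meets the threshold."""
    for attempt in range(max_iters):
        pages = search(query, fetch_k)
        ranked = sorted(pages, key=lambda page: score(query, page), reverse=True)[:keep_k]
        scores = [score(query, page) for page in ranked]
        kept = [page for page, s in zip(ranked, scores) if s >= threshold]
        average = sum(scores) / len(scores) if scores else 0.0
        # Stop when results look good on average, or when the iteration budget is spent.
        if average >= threshold or attempt == max_iters - 1:
            return query, kept
        query = refine(query)
    return query, []


def stub_search(query: str, k: int) -> List[str]:
    return CORPUS[:k]  # stand-in for a web search call


def stub_refine(query: str) -> str:
    return "knowledge graph memory"  # stand-in for an LLM query rewrite


final_query, kept_pages = search_with_rerank("graph memory thing", stub_search, stub_refine)
print(final_query, kept_pages)
```

In this toy run the first query scores no page at 0.7 or above, so it is refined; the refined query keeps the two graph-memory pages and drops the soup recipe, and the loop stops at the third iteration because the average is still pulled below 0.7 by the irrelevant page.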

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Humanity's Last Exam (dataset name), GPQA (dataset name), GAIA (dataset name), FreshWiki (dataset name)

Risks & Boundaries

Limitations

High compute and inference latency from sequential agent calls and web retrieval

No built-in source credibility verification; vulnerable to bad web content

When Not To Use

Low-latency, real-time systems where added agent latency is unacceptable

Safety-critical decisions without human verification or fact-checking

Failure Modes

Hallucinated or incorrect web sources integrated into final reasoning

Excessive or inappropriate tool calls when many tools are available

Core Entities

Models

DeepSeek-R1, DeepSeek-V3, QwQ-32B, o1, o3-mini, Claude-3.5-sonnet, GPT-4o

Metrics

Accuracy, ROUGE-1, ROUGE-L, Entity Recall, Human evaluation scores (Interest/Organization/Relevance/Coverage), Win rate (Werewolf game)

Datasets

Humanity's Last Exam, GPQA, GAIA, FreshWiki

Benchmarks

Humanity's Last Exam, GPQA, GAIA, FreshWiki deep research