Call web search, code execution and a 'Mind‑Map' memory agent to make LLMs do long, research-style reasoning

Overview

Decision Snapshot: Needs Validation

The paper provides multiple benchmark wins and human evaluations, but the approach adds compute, depends on external sources, and uses some proprietary models, so expect engineering work before full deployment.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge‑intensive tasks by ~10 percentage points, enabling faster research and automation at the cost of higher compute and external data reliance.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

Agentic Reasoning is a system that lets a reasoning LLM call three specialized agents (Web-Search, Coding, and a Mind-Map knowledge-graph memory) within a single reasoning chain. The pipeline breaks queries into sub-queries, reranks web results, runs code via a coding LLM, and stores structured context in a graph so the LLM stays coherent over long, tool-heavy reasoning. On the evaluated benchmarks the approach raises accuracy by about 10 to 14 percentage points over the base model and sets a new public SOTA on several knowledge‑intensive tasks. Trade-offs: higher compute and latency, plus reliance on the quality of external sources.

Problem Statement

Large reasoning LLMs still fail on open-ended, knowledge‑intensive tasks that need web research, repeated verification, calculations, and long logical chains. Pure internal reasoning or single retrieval steps lose context and make errors; we need a structured way for LLMs to use external tools and retain reasoning memory.

Main Contribution

Agentic Reasoning: a lightweight pipeline that lets a reasoning LLM call external LLM-based agents (Web-Search, Coding, Mind‑Map) during inference.

Mind‑Map: a knowledge-graph memory that converts reasoning chains into entities and relations, clusters them, and supplies concise context to agents and the reasoner.
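To make the Mind-Map idea concrete, here is a minimal sketch of a graph memory that stores entity/relation triples, groups them, and returns short context lines. It is an illustration under stated assumptions, not the authors' implementation: the `MindMap` class and its method names are hypothetical, triples are supplied by hand rather than extracted by an LLM, and connected components stand in for whatever clustering the paper uses.

```python
from collections import defaultdict


class MindMap:
    """Toy knowledge-graph memory holding (subject, relation, object) triples."""

    def __init__(self):
        # entity -> set of (relation, other_entity) edges
        self.edges = defaultdict(set)

    def add(self, subject, relation, obj):
        # Store the edge in both directions so either entity can reach the other.
        self.edges[subject].add((relation, obj))
        self.edges[obj].add((f"inverse:{relation}", subject))

    def clusters(self):
        """Group entities into connected components (a stand-in for clustering)."""
        seen, groups = set(), []
        for start in list(self.edges):
            if start in seen:
                continue
            stack, group = [start], set()
            while stack:
                node = stack.pop()
                if node in group:
                    continue
                group.add(node)
                stack.extend(other for _, other in self.edges[node])
            seen |= group
            groups.append(group)
        return groups

    def context(self, entity):
        """Return concise one-line facts about an entity, ready for a prompt."""
        facts = [
            f"{entity} {relation} {other}"
            for relation, other in self.edges[entity]
            if not relation.startswith("inverse:")
        ]
        return sorted(facts)


memory = MindMap()
memory.add("aspirin", "inhibits", "COX-1")
memory.add("COX-1", "produces", "thromboxane")
memory.add("Bing", "returns", "web pages")
```

In a fuller system, a reasoner or agent would call `memory.context(...)` to pull only the relevant facts into its prompt instead of replaying the whole reasoning chain.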

Key Findings

Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, improving the base model by 14.4 percentage points.

Numbers: 23.8% (Agentic w/ DeepSeek-R1) vs 9.4% (DeepSeek-R1); +14.4

Practical Use: If you augment a strong reasoning model with these agents, expect double-digit accuracy gains on complex exam-like tasks; useful for research automation and expert QA.

Evidence Ref: Table 1

On GPQA (graduate-level science QA) Agentic Reasoning with DeepSeek-R1 achieved 81.2% overall, about 9.7 points above the base model.

Numbers: 81.2% (Ours w/ DeepSeek-R1) vs 71.5% (DeepSeek-R1); +9.7

Practical Use: For difficult fact- and reasoning-heavy QA, integrating search+code+structured memory can materially improve correctness on evaluated benchmarks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	23.8%	9.4% (DeepSeek-R1)	+14.4	Humanity's Last Exam	Agentic Reasoning w/ DeepSeek-R1 outperforms base model	Table 1
Accuracy	81.2%	71.5% (DeepSeek-R1)	+9.7	GPQA (Physics/Chemistry/Biology aggregated)	Ours w/ DeepSeek-R1 achieves new SOTA on GPQA	Table 2
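The Delta column is an absolute difference in percentage points, which is easy to confuse with a relative gain. The short check below recomputes both from the table values; the `deltas` helper is a hypothetical name, and the relative figure is derived here for context rather than reported in the source.

```python
# Accuracy values copied from the Results table: (agentic, baseline), in percent.
results = {
    "Humanity's Last Exam": (23.8, 9.4),
    "GPQA": (81.2, 71.5),
}


def deltas(agentic, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(agentic - baseline, 1)
    relative = round(100 * (agentic - baseline) / baseline, 1)
    return absolute, relative


for name, (agentic, baseline) in results.items():
    absolute, relative = deltas(agentic, baseline)
    print(f"{name}: +{absolute} points ({relative}% relative to baseline)")
```

The absolute gains match the table (+14.4 and +9.7). The relative gain is much larger on Humanity's Last Exam because the baseline there is low.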

What To Try In 7 Days

Prototype a simple agentic pipeline: add web-search, code execution, and a small structured memory to your reasoning model.

Implement query breakdown + rerank (threshold ~0.7) for search and compare with plain RAG.

Run an ablation: start with web-search only, then add coding and a simple graph memory to measure marginal gains and latency trade-offs.
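The ablation in the last step can be sketched as a small harness that runs each pipeline variant over the same questions and records accuracy and latency. Everything here is an assumption for illustration: `run_ablation` is a hypothetical helper, each pipeline is any callable that maps a question to an answer string, and exact-match scoring is a simplification of real benchmark grading.

```python
import time


def run_ablation(pipelines, dataset):
    """Compare pipeline variants.

    pipelines: dict of name -> callable(question) -> answer string
    dataset:   list of (question, gold_answer) pairs
    """
    report = {}
    for name, answer_fn in pipelines.items():
        correct = 0
        started = time.perf_counter()
        for question, gold in dataset:
            # Exact match after normalization; swap in a real grader as needed.
            if answer_fn(question).strip().lower() == gold.strip().lower():
                correct += 1
        elapsed = time.perf_counter() - started
        report[name] = {
            "accuracy": correct / len(dataset),
            "seconds_per_question": elapsed / len(dataset),
        }
    return report
```

Registering variants in the order "search only", "search + coding", "search + coding + memory" makes the marginal gain and the added latency of each agent visible side by side.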

Agent Features

Memory

Knowledge-graph Mind-Map: entity/relation store and cluster summaries

Planning

Dynamic tool invocation during reasoning

Iterative query refinement and reranking

Tool Use

Web-Search agent (query breakdown + rerank + RAG)

Coding agent (code gen + execute)

Mind-Map agent (graph construction + cluster summaries)
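Dynamic tool invocation can be pictured as a loop in which the reasoner's output is scanned for a tool request, the matching agent is called, and the result is appended to the transcript. The sketch below is hypothetical throughout: the `<call:name>...</call>` tag format, the `run_with_tools` function, and the `max_calls` cap are illustrative choices, not the tokens or limits used in the paper.

```python
import re

# Hypothetical tool-request format: <call:tool_name>payload</call>
TOOL_CALL = re.compile(r"<call:(\w+)>(.*?)</call>", re.DOTALL)


def run_with_tools(reasoner, tools, question, max_calls=5):
    """Alternate between the reasoner and tools until it answers without a call.

    reasoner: callable(transcript) -> text, possibly containing one tool request
    tools:    dict of tool name -> callable(payload) -> result string
    """
    transcript = question
    for _ in range(max_calls):
        output = reasoner(transcript)
        match = TOOL_CALL.search(output)
        if match is None:
            return output  # no tool requested: treat as the final answer
        name, payload = match.group(1), match.group(2).strip()
        handler = tools.get(name)
        result = handler(payload) if handler else f"unknown tool: {name}"
        transcript += f"\n{output}\n[{name} result] {result}"
    # Call budget exhausted: ask for an answer from what has been gathered.
    return reasoner(transcript + "\nAnswer now without further tool calls.")
```

Capping the number of calls is one simple guard against runaway tool use; a real system would likely also log each call for later inspection.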

Frameworks

Hugging Face agents (for comparison)

LangChain (ablation)

Graph-RAG style retrieval

Is Agentic

Yes

Architectures

LLM reasoning core with external agent calls

Knowledge-graph (Mind-Map) memory component

Collaboration

Multi-agent orchestration where specialized LLMs handle sub-tasks

Optimization Features

Token Efficiency

Use agentic calls to break long reasoning into multiple LLM calls; supports >32k token context

Infra Optimization

Route coding to a dedicated executor (Python 3.11) to avoid blocking the reasoner
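One way to route generated code to a separate executor is to run it in its own interpreter process with a timeout, so a crash or an infinite loop cannot take down the reasoner. This is a minimal sketch, not the paper's executor: `run_python` is a hypothetical helper, it uses whichever Python interpreter is running the script, and a plain subprocess is not a security sandbox.

```python
import subprocess
import sys


def run_python(code, timeout_s=10):
    """Run a code string in a separate interpreter and capture its output."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout.strip(),
        "stderr": completed.stderr.strip(),
    }
```

The call still blocks the calling thread until the child finishes or times out; submitting `run_python` to a worker pool would let the reasoner continue in the meantime.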

System Optimization

Iterative query refinement (max 3 iterations)

Select top-20 pages from Bing then top-10 rerank average
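The two settings above can be read as a single loop: fetch up to 20 pages, rerank them, average the scores of the best 10, and refine the query if that average is too low, stopping after 3 rounds. That reading is an interpretation, and the sketch below is hypothetical: `iterative_search`, the injected `search`, `score`, and `refine` callables, and the use of 0.7 as the stopping threshold are all assumptions made for illustration.

```python
def iterative_search(question, search, score, refine,
                     max_iters=3, fetch_k=20, keep_k=10, threshold=0.7):
    """Refine the query until the top reranked pages look relevant enough.

    search(query) -> list of pages
    score(question, page) -> relevance in [0, 1]
    refine(question, query) -> new query string
    Returns (best_average_score, best_pages).
    """
    query = question
    best = (0.0, [])
    for _ in range(max_iters):
        pages = search(query)[:fetch_k]
        # Rerank by relevance to the original question, keep the top keep_k.
        scored = sorted(
            ((score(question, page), page) for page in pages),
            key=lambda item: item[0],
            reverse=True,
        )[:keep_k]
        if scored:
            average = sum(s for s, _ in scored) / len(scored)
            if average >= best[0]:
                best = (average, [page for _, page in scored])
            if average >= threshold:
                break  # results are good enough, stop refining
        query = refine(question, query)
    return best
```

Passing the search engine, reranker, and query rewriter in as callables keeps the loop testable without network access.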

Inference Optimization

Assign task-specific models to agents to save cost (e.g., small model for summarization, Claude for

Use rerank thresholding (0.7) to limit pages for RAG
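Rerank thresholding amounts to scoring each retrieved page against the query and dropping anything below the cutoff before it reaches RAG. In the sketch below, `filter_by_rerank` and `overlap_score` are hypothetical names, and the token-overlap scorer is only a toy stand-in for a real reranker model.

```python
def overlap_score(query, page):
    """Toy relevance score: fraction of query tokens that appear in the page."""
    query_tokens = set(query.lower().split())
    page_tokens = set(page.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & page_tokens) / len(query_tokens)


def filter_by_rerank(query, pages, score=overlap_score, threshold=0.7):
    """Keep pages scoring at or above the threshold, most relevant first."""
    scored = [(score(query, page), page) for page in pages]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in kept]
```

A higher threshold trims more pages and shortens the RAG context, at the risk of discarding a page that held the answer, so the cutoff is worth tuning per task.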

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/theworldofagents/Agentic-Reasoning

Data URLs

Humanity's Last Exam (dataset name)

GPQA (dataset name)

GAIA (dataset name)

FreshWiki (dataset name)

Risks & Boundaries

Limitations

High compute and inference latency from sequential agent calls and web retrieval

No built-in source credibility verification; vulnerable to bad web content

When Not To Use

Low-latency, real-time systems where added agent latency is unacceptable

Safety-critical decisions without human verification or fact-checking

Failure Modes

Hallucinated or incorrect web sources integrated into final reasoning

Excessive or inappropriate tool calls when many tools are available

Core Entities

Models

DeepSeek-R1, DeepSeek-V3, QwQ-32B, o1, o3-mini, Claude-3.5-sonnet, GPT-4o

Metrics

Accuracy, ROUGE-1, ROUGE-L, Entity Recall, Human evaluation scores (Interest/Organization/Relevance/Coverage), Win rate (Werewolf game)

Datasets

Humanity's Last Exam, GPQA, GAIA, FreshWiki

Benchmarks

Humanity's Last Exam, GPQA, GAIA, FreshWiki deep research
