Call web search, code execution and a 'Mind‑Map' memory agent to make LLMs do long, research-style reasoning

February 7, 2025 | 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

5

Authors

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin

Why It Matters For Business

Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge‑intensive tasks by ~10 percentage points, enabling faster research and automation at the cost of higher compute and external data reliance.

Summary TLDR

Agentic Reasoning is a system that lets a reasoning LLM call three specialized agents (Web-Search, Coding, and a Mind-Map knowledge-graph memory) during a single reasoning chain. The pipeline breaks queries into sub-queries, reranks web results, runs code via a coding LLM, and stores structured context in a graph so the LLM can stay coherent over long, tool-heavy reasoning. On the evaluated benchmarks the approach raises accuracy by about 10–14 percentage points over the base model and sets a new public SOTA on several knowledge‑intensive tasks. Trade-offs: higher compute and latency, and reliance on the quality of external sources.
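
The control flow described above can be sketched as a small dispatch loop. This is only an illustration under stated assumptions: the `<call:...>` tag format, the function name `run_agentic_chain`, and the `max_calls` cap are hypothetical and are not taken from the paper or its code release.

```python
import re
from typing import Callable, Dict

# Hypothetical tag format for tool requests; the real system's special
# tokens are not specified in this summary.
TOOL_CALL = re.compile(r"<call:(search|code|mindmap)>(.*?)</call>", re.S)


def run_agentic_chain(
    reasoner: Callable[[str], str],
    agents: Dict[str, Callable[[str], str]],
    question: str,
    max_calls: int = 8,  # assumed safety cap, not a value from the source
) -> str:
    """Let the reasoner extend its chain, pausing whenever it requests a tool."""
    context = question
    for _ in range(max_calls):
        step = reasoner(context)
        match = TOOL_CALL.search(step)
        if match is None:
            return step  # no tool requested: treat this step as the final answer
        tool, payload = match.group(1), match.group(2).strip()
        result = agents[tool](payload)
        # Keep the reasoning up to the call, then append the agent's result
        # so the next reasoning step can build on it.
        context += f"\n{step[:match.end()]}\n<result>{result}</result>"
    return reasoner(context + "\nGive the final answer now.")
```

Passing the reasoner and agents in as plain callables keeps the loop independent of any particular model API, which makes it easy to test with stubs before wiring in real models.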

Problem Statement

Large reasoning LLMs still fail on open-ended, knowledge‑intensive tasks that need web research, repeated verification, calculations, and long logical chains. Pure internal reasoning or single retrieval steps lose context and make errors; we need a structured way for LLMs to use external tools and retain reasoning memory.

Main Contribution

Agentic Reasoning: a lightweight pipeline that lets a reasoning LLM call external LLM-based agents (Web-Search, Coding, Mind‑Map) during inference.

Mind‑Map: a knowledge-graph memory that converts reasoning chains into entities and relations, clusters them, and supplies concise context to agents and the reasoner.

A Web-Search agent design combining query breakdown, reranking, RAG, and Mind‑Map context that outperforms prior search-in-reasoning variants.

Extensive ablations showing the three tools (search, coding, Mind‑Map) are more effective together than large generic toolboxes.

Benchmarks and human evals showing large gains on expert QA and deep research tasks; code released on GitHub.
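
To make the Mind-Map idea concrete, here is a minimal sketch of a triple store that groups entities and returns the facts relevant to one entity. It rests on several assumptions: triples are supplied by hand (in the described system an LLM extracts them from the reasoning chain), connected components stand in for whatever clustering the authors use, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


class MindMap:
    """Toy knowledge-graph memory: stores triples and groups related entities."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []
        self.neighbors: Dict[str, Set[str]] = defaultdict(set)

    def add(self, head: str, relation: str, tail: str) -> None:
        self.triples.append((head, relation, tail))
        self.neighbors[head].add(tail)
        self.neighbors[tail].add(head)

    def clusters(self) -> List[Set[str]]:
        """Group entities by connected component (a simple stand-in for clustering)."""
        seen: Set[str] = set()
        groups: List[Set[str]] = []
        for start in list(self.neighbors):
            if start in seen:
                continue
            stack, group = [start], set()
            while stack:
                node = stack.pop()
                if node in group:
                    continue
                group.add(node)
                stack.extend(self.neighbors[node] - group)
            seen |= group
            groups.append(group)
        return groups

    def context_for(self, entity: str) -> List[str]:
        """Return the stored facts from the cluster that contains `entity`."""
        group = next((g for g in self.clusters() if entity in g), set())
        return [f"{h} {r} {t}" for h, r, t in self.triples if h in group]
```

A caller would add triples as reasoning proceeds and pass `context_for(...)` output to an agent or back to the reasoner instead of the full chain, which is the "concise context" role described above.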

Key Findings

Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, improving the base model by 14.4 percentage points.

Numbers: 23.8% (Agentic w/ DeepSeek-R1) vs 9.4% (DeepSeek-R1); +14.4

On GPQA (graduate-level science QA) Agentic Reasoning with DeepSeek-R1 achieved 81.2% overall, about 9.7 points above the base model.

Numbers: 81.2% (Ours w/ DeepSeek-R1) vs 71.5% (DeepSeek-R1); +9.7

Agentic Reasoning sets a new public SOTA on GAIA (average 66.13) and surpasses OpenAI Deep Research on Level 1 and Level 2 tasks while narrowing the Level 3 gap to about 2.3 points; its overall average still trails OpenAI Deep Research (67.36).

Numbers: 66.13 (Ours) vs 67.36 (OpenAI Deep Research) average; Level 3 gap = 2.26

In deep-research long-form generation (FreshWiki), our method improved ROUGE-1 and entity recall versus other RAG/search baselines.

Numbers: ROUGE-1 54.1 (Ours) vs 47.93 (STORM); Entity recall 18.77 vs 15.43

The Mind‑Map memory strongly improves long-chain coherence and strategic reasoning (Werewolf): win rate 72% with Mind‑Map vs 36% without.

Numbers: 72% (with Mind‑Map) vs 36% (without Mind‑Map); +36pp

A small, focused toolset (web search + coding + Mind‑Map) outperformed large toolboxes; adding more tools often degraded performance.

Numbers: Ablation shows the smaller set beats the Hugging Face 7-tool and LangChain 109-tool setups (Figure 3)
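
The gains above are stated in percentage points, which is not the same as relative improvement. The short sketch below recomputes both from the reported value/baseline pairs; the relative figures are derived here for illustration and are not reported in the source, and the helper name `gains` is hypothetical.

```python
# (value, baseline) pairs copied from the findings above, in percent.
results = {
    "Humanity's Last Exam": (23.8, 9.4),
    "GPQA": (81.2, 71.5),
    "Werewolf win rate": (72.0, 36.0),
}


def gains(value: float, baseline: float) -> tuple:
    """Return (percentage-point gain, relative gain in percent)."""
    pp = round(value - baseline, 1)
    relative = round(100 * (value - baseline) / baseline, 1)
    return pp, relative


for name, (value, baseline) in results.items():
    pp, relative = gains(value, baseline)
    print(f"{name}: +{pp} pp, {relative}% relative to baseline")
```

The distinction matters when comparing benchmarks: a 14.4-point gain on a 9.4% baseline more than doubles the score, while a 9.7-point gain on a 71.5% baseline is a much smaller relative change.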

Results

Accuracy (Humanity's Last Exam)

Value: 23.8%

Baseline: 9.4% (DeepSeek-R1)

Accuracy (GPQA)

Value: 81.2%

Baseline: 71.5% (DeepSeek-R1)

GAIA average score

Value: 66.13

Baseline: 67.36 (OpenAI Deep Research)

FreshWiki ROUGE-1

Value: 54.1

Baseline: 47.93 (STORM)

Werewolf game win rate

Value: 72%

Baseline: 36% (same model without Mind-Map)

Who Should Care

What To Try In 7 Days

Prototype a simple agentic pipeline: add web-search, code execution, and a small structured memory to your reasoning model.

Implement query breakdown + rerank (threshold ~0.7) for search and compare with plain RAG.

Run an ablation: start with web-search only, then add coding and a simple graph memory to measure marginal gains and latency trade-offs.
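
One way to run the suggested ablation is a small harness that scores each configuration on the same question set and reports accuracy, latency, and the marginal gain over the previous configuration. This is a sketch under assumptions: each pipeline is a plain callable from question to answer, correctness is exact string match (real evaluations usually need something more forgiving), and the function and field names are hypothetical.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


def run_ablation(
    configs: Dict[str, Callable[[str], str]],
    dataset: Sequence[Tuple[str, str]],  # (question, gold answer) pairs
) -> List[dict]:
    """Evaluate configurations in insertion order and report marginal gains."""
    rows: List[dict] = []
    for name, pipeline in configs.items():
        start = time.perf_counter()
        correct = sum(pipeline(q).strip() == gold for q, gold in dataset)
        elapsed = time.perf_counter() - start
        rows.append({
            "config": name,
            "accuracy": correct / len(dataset),
            "sec_per_question": elapsed / len(dataset),
        })
    # Marginal gain in percentage points over the preceding configuration.
    for prev, row in zip([None] + rows[:-1], rows):
        row["gain_pp"] = (
            None if prev is None
            else round(100 * (row["accuracy"] - prev["accuracy"]), 1)
        )
    return rows
```

Listing the configurations in the order search only, search plus coding, search plus coding plus memory makes each `gain_pp` the marginal contribution of the newly added component.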

Agent Features

Memory

  • Knowledge-graph Mind-Map: entity/relation store and cluster summaries

Planning

  • Dynamic tool invocation during reasoning
  • Iterative query refinement and reranking

Tool Use

  • Web-Search agent (query breakdown + rerank + RAG)
  • Coding agent (code gen + execute)
  • Mind-Map agent (graph construction + cluster summaries)

Frameworks

  • Hugging Face agents (for comparison)
  • LangChain (ablation)
  • Graph-RAG style retrieval

Is Agentic

true

Architectures

  • LLM reasoning core with external agent calls
  • Knowledge-graph (Mind-Map) memory component

Collaboration

  • Multi-agent orchestration where specialized LLMs handle sub-tasks

Optimization Features

Token Efficiency

  • Use agentic calls to break long reasoning into multiple LLM calls; supports >32k token context

Infra Optimization

  • Route coding to a dedicated executor (Python 3.11) to avoid blocking the reasoner
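
Routing generated code to a separate executor can be approximated with a child Python process and a timeout. This is a minimal sketch, not the authors' executor: the function name and the 10-second default are assumptions, `sys.executable` reuses whichever interpreter runs the caller (swap in a pinned Python 3.11 binary if needed), and a subprocess alone is not a security sandbox.

```python
import subprocess
import sys


def run_generated_code(source: str, timeout_s: float = 10.0) -> dict:
    """Run code in a separate Python process and capture its output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

Returning stderr alongside stdout lets the coding agent see the traceback and attempt a fix rather than silently failing.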

System Optimization

  • Iterative query refinement (max 3 iterations)
  • Select top-20 pages from Bing then top-10 rerank average
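
The capped refinement loop can be sketched as follows. Only the limit of 3 iterations comes from the text above; `search`, `is_sufficient`, and `refine` are hypothetical callables standing in for the search backend and the LLM judgments.

```python
from typing import Callable, List


def refine_and_search(
    question: str,
    search: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    refine: Callable[[str, str, List[str]], str],
    max_iters: int = 3,
) -> List[str]:
    """Search, check whether the evidence suffices, and refine the query if not."""
    query = question
    evidence: List[str] = []
    for _ in range(max_iters):
        evidence.extend(search(query))
        if is_sufficient(question, evidence):
            break
        # Rewrite the query using what has been found so far.
        query = refine(question, query, evidence)
    return evidence
```

The hard cap bounds latency and cost even when the sufficiency check never passes.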

Inference Optimization

  • Assign task-specific models to agents to save cost (e.g., small model for summarization, Claude for
  • Use rerank thresholding (0.7) to limit pages for RAG
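
Rerank thresholding reduces to a filter over scored pages. The sketch below assumes a reranker that returns relevance scores in [0, 1]; the 0.7 threshold comes from the text, while the function name and the `top_k` cap are illustrative assumptions.

```python
from typing import List, Tuple


def select_pages(
    scored_pages: List[Tuple[str, float]],  # (url, rerank score in [0, 1])
    threshold: float = 0.7,
    top_k: int = 10,  # assumed cap on pages passed to RAG
) -> List[str]:
    """Keep pages scoring at or above the threshold, best first, capped at top_k."""
    kept = [(url, score) for url, score in scored_pages if score >= threshold]
    kept.sort(key=lambda page: page[1], reverse=True)
    return [url for url, _ in kept[:top_k]]
```

For example, `select_pages([("a", 0.9), ("b", 0.65), ("c", 0.71)])` keeps `a` and `c` and drops `b`, so fewer low-relevance pages reach the RAG step.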

Reproducibility

Data Urls

  • Humanity's Last Exam (dataset name)
  • GPQA (dataset name)
  • GAIA (dataset name)
  • FreshWiki (dataset name)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High compute and inference latency from sequential agent calls and web retrieval
  • No built-in source credibility verification; vulnerable to bad web content
  • Dependence on the quality of agent models (coding/search) and on underlying reasoning LLM
  • Risk of cascading hallucinations despite Mind-Map; needs fact-checking for high-stakes use

When Not To Use

  • Low-latency, real-time systems where added agent latency is unacceptable
  • Safety-critical decisions without human verification or fact-checking
  • Resource-constrained deployments where extra agent compute is infeasible

Failure Modes

  • Hallucinated or incorrect web sources integrated into final reasoning
  • Excessive or inappropriate tool calls when many tools are available
  • Mind-Map construction errors that preserve incorrect intermediate claims
  • Retrieval or rerank failures leading to missing critical evidence

Core Entities

Models

  • DeepSeek-R1
  • DeepSeek-V3
  • QwQ-32B
  • o1
  • o3-mini
  • Claude-3.5-sonnet
  • GPT-4o

Metrics

  • Accuracy
  • ROUGE-1
  • ROUGE-L
  • Entity Recall
  • Human evaluation scores (Interest/Organization/Relevance/Coverage)
  • Win rate (Werewolf game)

Datasets

  • Humanity's Last Exam
  • GPQA
  • GAIA
  • FreshWiki

Benchmarks

  • Humanity's Last Exam
  • GPQA
  • GAIA
  • FreshWiki deep research