Hierarchical multi-agent research agent that compresses long context, routes subtasks to specialized tools, and self-corrects failures.

January 27, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Yuxuan Cai, Xinyi Lai, Peng Yuan, Weiting Liu, Huajian Li, Mingda Li, Xinghua Wang, Shengxie Zheng, Yanchao Hao, Yuyang Yin, Zheng Wei

Links

Abstract / PDF

Why It Matters For Business

Modular orchestration and structured memory reduce failed web/data tasks and improve first-run correctness, lowering human oversight and accelerating research pipelines.

Summary TLDR

Yunque DeepResearch is a hierarchical, modular multi-agent framework for long-horizon research tasks. It groups interactions into sub-goal memory units, routes subtasks to an atomic pool of tools and sub-agents (browser GUI, data analysis), and adds a Supervisor that interrupts and prunes failing trajectories. On four agentic benchmarks it reports Pass@1 scores of 62.5 on BrowseComp, 75.9 on BrowseComp-ZH, 78.6 on GAIA, and 51.7 on Humanity's Last Exam (HLE), and it shows model-agnostic gains (e.g., +10.0 on BrowseComp for Gemini 3 Pro). The authors state that the code is open source; the paper includes implementation details and ablations.

Problem Statement

Current deep-research agents struggle with three practical problems: 1) context noise and information overload in long-horizon tasks; 2) fragile execution that causes cascading errors and loops; 3) rigid architectures that block integrating specialized tools and sub-agents.

Main Contribution

A hierarchical multi-agent architecture that decouples planning from execution via a Main Agent, Context Manager, Atomic Capability Pool, and Supervisor.

Sub-goal-driven structured memory that folds multi-round traces into concise memory units and switches context between detailed traces and compressed summaries.

Atomic Capability Pool with specialized sub-agents (Browser-Use GUI, Data Analysis) and a strict one-tool-per-turn interface.

A Supervisor module that detects anomalies, prunes invalid traces, and forces adaptive interrupts and recovery.

Empirical evaluation showing strong gains on BrowseComp, BrowseComp-ZH, GAIA, and Humanity's Last Exam; open-source release promised.

Key Findings

Yunque achieves the top Pass@1 on three of the four benchmarks.

Numbers: BrowseComp 62.5; BrowseComp-ZH 75.9; HLE 51.7

Yunque is second on GAIA and improves base model performance.

Numbers: GAIA 78.6; Gemini 3 Pro gain +10.0 on BrowseComp, +4.8 on GAIA

Structured memory strongly affects browsing results.

Numbers: removing memory: -10.4 on BrowseComp, -7.4 on BrowseComp-ZH

Supervisor reduces execution fragility.

Numbers: removing Supervisor: -8.7 on GAIA, -10.5 on BrowseComp-ZH (reported declines)

Results

Pass@1 (BrowseComp)

Value: 62.5

Baseline: best prior reported ~60.2 (Kimi K2 Thinking; other baselines vary)

Pass@1 (BrowseComp-ZH)

Value: 75.9

Baseline: prior baselines reported in the mid-60s

Pass@1 (GAIA)

Value: 78.6

Baseline: MiroFlow reported 82.4* (higher); others lower

Pass@1 (Humanity's Last Exam)

Value: 51.7

Baseline: many baselines lower (e.g., Gemini DeepResearch reported 26.9*)

Who Should Care

What To Try In 7 Days

Add sub-goal memory: fold completed subtasks into compact summaries to cut context noise.

Introduce a lightweight Supervisor to detect and prune failed traces before retries.

Prototype a Browser + Data Analysis sub-agent pair with a one-tool-per-turn constraint.

Agent Features

Memory

  • sub-goal-driven structured memory units (4-tuple)
  • dynamic folding and compression of completed sub-goals
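
The folding behavior described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary only says that each memory unit is a 4-tuple, so the four field names (sub_goal, summary, evidence, status) and the ContextManager methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryUnit:
    # Hypothetical field names: the source only states that a unit is a 4-tuple.
    sub_goal: str
    summary: str
    evidence: list[str]
    status: str  # e.g. "done" or "failed"


@dataclass
class ContextManager:
    folded: list[MemoryUnit] = field(default_factory=list)
    active_goal: str = ""
    active_trace: list[str] = field(default_factory=list)

    def start(self, sub_goal: str) -> None:
        """Begin a new sub-goal with an empty detailed trace."""
        self.active_goal = sub_goal
        self.active_trace = []

    def record(self, step: str) -> None:
        """Append one raw interaction step to the active sub-goal."""
        self.active_trace.append(step)

    def fold(self, summary: str, evidence: list[str], status: str = "done") -> None:
        """Replace the active trace with one compact memory unit."""
        self.folded.append(MemoryUnit(self.active_goal, summary, evidence, status))
        self.active_goal = ""
        self.active_trace = []

    def render(self) -> str:
        """Build the context: summaries for folded goals, full trace for the active one."""
        lines = [f"[{u.status}] {u.sub_goal}: {u.summary}" for u in self.folded]
        if self.active_goal:
            lines.append(f"[active] {self.active_goal}")
            lines.extend(f"  {step}" for step in self.active_trace)
        return "\n".join(lines)
```

In this sketch the summary text is passed in by the caller; in a real agent it would presumably be produced by the model when a sub-goal completes.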

Planning

  • dynamic task decomposition into sub-goals
  • adaptive routing to tools or specialized sub-agents

Tool Use

  • Browser-Use GUI Agent (interactive web actions)
  • Data Analysis Agent (code generation + execution)
  • one-tool-per-turn execution constraint
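
One way to enforce the one-tool-per-turn constraint is to validate the model's proposed calls before anything executes. The sketch below assumes a hypothetical call format (a dict with "name" and "args" keys); the source does not specify the actual interface.

```python
class ToolCallError(ValueError):
    """Raised when a turn does not contain exactly one well-formed tool call."""


def select_single_call(proposed_calls: list[dict]) -> dict:
    """Return the only tool call of this turn, or raise if the constraint is violated."""
    if len(proposed_calls) != 1:
        raise ToolCallError(
            f"expected exactly 1 tool call per turn, got {len(proposed_calls)}"
        )
    call = proposed_calls[0]
    # The "name"/"args" keys are an assumed schema, used only for illustration.
    if "name" not in call or "args" not in call:
        raise ToolCallError("tool call must have 'name' and 'args'")
    return call
```

Rejecting a multi-call turn outright, rather than silently running the first call, keeps each step small enough to be checked before the next one starts.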

Frameworks

  • open-source implementation (claimed)
  • composable Atomic Capability Pool for plugging new tools
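
A composable pool of this kind is often built as a name-to-callable registry. The following is a sketch under that assumption; CapabilityPool, register, and dispatch are hypothetical names, and the data_analysis function is a stand-in rather than a real sub-agent.

```python
from typing import Callable


class CapabilityPool:
    """Registry that maps capability names to callables (tools or sub-agents)."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., str]] = {}

    def register(self, name: str):
        """Decorator that plugs a new capability into the pool."""
        def decorator(fn: Callable[..., str]) -> Callable[..., str]:
            if name in self._tools:
                raise ValueError(f"capability already registered: {name}")
            self._tools[name] = fn
            return fn
        return decorator

    def names(self) -> list[str]:
        return sorted(self._tools)

    def dispatch(self, name: str, **args) -> str:
        """Route one call to the named capability."""
        if name not in self._tools:
            raise KeyError(f"unknown capability: {name}")
        return self._tools[name](**args)


pool = CapabilityPool()


@pool.register("data_analysis")
def data_analysis(code: str) -> str:
    # Stand-in: a real sub-agent would generate and execute code in a sandbox.
    return f"would run {len(code)} chars of code"
```

With this shape, adding a tool means registering one more function, and the Main Agent's routing logic does not need to change.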

Is Agentic

true

Architectures

  • hierarchical multi-agent (Main Agent + sub-agents)
  • centralized orchestration with atomic capability pool

Collaboration

  • Main Agent delegates to specialized sub-agents
  • Supervisor intervenes with interrupts and recovery
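
The Supervisor's detect-and-prune role can be illustrated with two simple heuristics: a repeated-action check and a count of trailing failures. These heuristics, the Step type, and the thresholds are assumptions for illustration; the summary does not describe how the actual module detects anomalies.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    ok: bool


def detect_loop(trace: list[Step], window: int = 3) -> bool:
    """True when the last `window` steps all repeat the same action."""
    if len(trace) < window:
        return False
    recent = trace[-window:]
    return all(step.action == recent[0].action for step in recent)


def prune_failed(trace: list[Step]) -> list[Step]:
    """Drop failed steps so they do not pollute later reasoning."""
    return [step for step in trace if step.ok]


def supervise(trace: list[Step], window: int = 3, max_failures: int = 2):
    """Return (interrupt, cleaned_trace) for the current trajectory."""
    tail_failures = 0
    for step in reversed(trace):
        if step.ok:
            break
        tail_failures += 1
    interrupt = detect_loop(trace, window) or tail_failures >= max_failures
    cleaned = prune_failed(trace) if interrupt else list(trace)
    return interrupt, cleaned
```

When supervise returns an interrupt, the caller would hand the cleaned trace back to the Main Agent to replan instead of retrying the same action.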

Optimization Features

Token Efficiency

  • PDF paging and incremental markdown conversion to avoid injecting full long docs
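
Paging with incremental conversion might look like the sketch below, which converts only the pages the agent asks for and caches the result. It assumes the document is already split into raw page strings and that a converter callable is supplied by the caller; PagedDocument and its parameters are hypothetical, and no specific PDF library is implied.

```python
from typing import Callable


class PagedDocument:
    """Serves a long document in chunks, converting pages to markdown on demand."""

    def __init__(self, raw_pages: list[str], to_markdown: Callable[[str], str],
                 page_size: int = 2) -> None:
        self._raw = raw_pages
        self._convert = to_markdown
        self._cache: dict[int, str] = {}
        self.page_size = page_size

    @property
    def num_chunks(self) -> int:
        # Ceiling division: the last chunk may hold fewer pages.
        return (len(self._raw) + self.page_size - 1) // self.page_size

    def read(self, chunk: int) -> str:
        """Return one chunk as markdown, converting only pages not seen before."""
        if not 0 <= chunk < self.num_chunks:
            raise IndexError(f"chunk {chunk} out of range")
        start = chunk * self.page_size
        parts = []
        for i in range(start, min(start + self.page_size, len(self._raw))):
            if i not in self._cache:
                self._cache[i] = self._convert(self._raw[i])
            parts.append(self._cache[i])
        header = f"[chunk {chunk + 1}/{self.num_chunks}]"
        return "\n\n".join([header, *parts])
```

The chunk header tells the agent how much of the document remains, so it can decide whether to request the next chunk instead of receiving the full text up front.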

System Optimization

  • one-tool-per-turn policy to decompose complex actions
  • context complexity reduction from O(t) to O(n) via sub-goal folding
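
The O(t) to O(n) claim can be made concrete with a rough size argument. This is a sketch under assumptions the summary does not state: t is read as the number of interaction turns so far, n as the number of sub-goals, s_i as the record of turn i, m_j as the folded summary of sub-goal j, and k as a bound on the turns spent inside one sub-goal. Each |s_i| and |m_j| is assumed to be bounded by a constant.

```latex
% Without folding: every turn record stays in context.
|C_t^{\text{raw}}| = \sum_{i=1}^{t} |s_i| = O(t)

% With folding: the n-1 completed sub-goals are summaries;
% only the active sub-goal keeps its detailed trace.
|C_t^{\text{fold}}| = \sum_{j=1}^{n-1} |m_j| + \sum_{i \in \text{active}} |s_i| = O(n + k)
```

When k is bounded, the folded context is O(n). Since a sub-goal typically spans several turns, n is smaller than t, which is where the saving would come from.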

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Ablations on specialized sub-agents are mainly limited to GAIA and may not generalize to all domains.
  • No systematic measurement of token consumption, latency, or compute cost reported.
  • Claims of open-source code are made but links and licenses are not provided in the text.

When Not To Use

  • Latency-sensitive production services where interactive browsing would add unacceptable delay.
  • Scenarios where the underlying LLM cannot reliably perform precise tool invocation.
  • Domain-specific tasks not covered by the evaluated benchmarks without further tuning.

Failure Modes

  • Cascading errors from malformed tool calls or repeated bad actions.
  • Context pollution where failed traces bias future reasoning.
  • Dependency on the backbone model's ability to execute tool calls accurately.

Core Entities

Models

  • Gemini 3 Pro
  • DeepSeek-V3.2
  • Kimi K2 Thinking
  • GLM 4.7
  • GPT-5 High
  • Claude-4.5-Sonnet
  • OpenAI-o3

Metrics

  • Pass@1
  • Pass@N
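
The summary does not define these metrics, so the sketch below uses a common reading as an assumption: a task counts as passed at N if any of its first N attempts is correct, and Pass@1 is the N = 1 case. The function name and the input layout (task id mapped to a list of per-attempt outcomes) are hypothetical.

```python
def pass_at_n(results: dict[str, list[bool]], n: int) -> float:
    """Percentage of tasks with at least one correct answer in the first n attempts."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(any(attempts[:n]) for attempts in results.values())
    return 100.0 * solved / len(results)


# Toy data for illustration only; these are not results from the paper.
toy_results = {
    "task-1": [True, False, False],
    "task-2": [False, True, False],
    "task-3": [False, False, False],
    "task-4": [False, False, True],
}
```

On the toy data, pass_at_n(toy_results, 1) counts only task-1, while pass_at_n(toy_results, 3) also counts task-2 and task-4.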

Datasets

  • GAIA
  • BrowseComp
  • BrowseComp-ZH
  • Humanity's Last Exam

Benchmarks

  • GAIA
  • BrowseComp
  • BrowseComp-ZH
  • Humanity's Last Exam