Hierarchical multi-agent research agent that compresses long context, routes subtasks to specialized tools, and self-corrects failures.

January 27, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper includes benchmark results and ablations that support the design claims, but some ablations are limited to GAIA, and system latency and token costs are not measured.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yuxuan Cai, Xinyi Lai, Peng Yuan, Weiting Liu, Huajian Li, Mingda Li, Xinghua Wang, Shengxie Zheng, Yanchao Hao, Yuyang Yin, Zheng Wei

Why It Matters For Business

Modular orchestration and structured memory reduce failed web/data tasks and improve first-run correctness, lowering human oversight and accelerating research pipelines.

Who Should Care

Summary TLDR

Yunque DeepResearch is a hierarchical, modular multi-agent framework for long-horizon research tasks. It groups interactions into sub-goal memory units, routes subtasks to an Atomic Capability Pool of tools and sub-agents (browser GUI, data analysis), and adds a Supervisor that interrupts and prunes failing trajectories. On four agentic benchmarks it reports Pass@1 scores of 62.5 on BrowseComp, 75.9 on BrowseComp-ZH, 78.6 on GAIA, and 51.7 on Humanity's Last Exam (HLE), and shows model-agnostic gains (e.g., +10.0 on BrowseComp for Gemini 3 Pro). The authors state that the code is open source; implementation details and ablations are included.

Problem Statement

Current deep-research agents struggle with three practical problems: 1) context noise and information overload in long-horizon tasks; 2) fragile execution that causes cascading errors and loops; 3) rigid architectures that block integrating specialized tools and sub-agents.

Main Contribution

A hierarchical multi-agent architecture that decouples planning from execution via a Main Agent, Context Manager, Atomic Capability Pool, and Supervisor.

Sub-goal-driven structured memory that folds multi-round traces into concise memory units and switches context between detailed traces and compressed summaries.
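The folding idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `MemoryUnit`, `ContextManager`, and their methods are hypothetical names, the four fields are only a guess at what a memory unit might hold (the source says "4-tuple" without listing the fields), and the summarizer is a stub where a real system would presumably call an LLM.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryUnit:
    """One compressed record per finished sub-goal (field names are illustrative)."""
    sub_goal: str
    summary: str
    evidence: List[str]
    status: str  # e.g. "done" or "failed"


@dataclass
class ContextManager:
    units: List[MemoryUnit] = field(default_factory=list)
    active_goal: Optional[str] = None
    active_trace: List[str] = field(default_factory=list)

    def start(self, sub_goal: str) -> None:
        self.active_goal = sub_goal
        self.active_trace = []

    def record(self, step: str) -> None:
        # Detailed steps are kept only for the sub-goal in progress.
        self.active_trace.append(step)

    def fold(self, status: str = "done") -> MemoryUnit:
        if self.active_goal is None:
            raise RuntimeError("no active sub-goal to fold")
        # Stub summarizer (assumption): keep only the last step of the trace.
        summary = self.active_trace[-1] if self.active_trace else ""
        unit = MemoryUnit(self.active_goal, summary, self.active_trace[-1:], status)
        self.units.append(unit)
        self.active_goal, self.active_trace = None, []
        return unit

    def render(self) -> str:
        # Context = compressed summaries + full trace of the active sub-goal.
        lines = [f"[{u.status}] {u.sub_goal}: {u.summary}" for u in self.units]
        if self.active_goal is not None:
            lines.append(f"[active] {self.active_goal}")
            lines.extend(f"  {s}" for s in self.active_trace)
        return "\n".join(lines)
```

Calling `start`, several `record` calls, then `fold` leaves one summary line in `render()` for the finished sub-goal, while the next sub-goal again shows its full trace until it is folded.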

Key Findings

Yunque achieves top Pass@1 on three browsing/data benchmarks.

Numbers: BrowseComp 62.5; BrowseComp-ZH 75.9; HLE 51.7

Practical Use: Use Yunque's architecture to improve single-run correctness on web and data tasks compared to standard ReAct pipelines.

Evidence Ref: Table 1 (Main Results)

Yunque is second on GAIA and improves base model performance.

Numbers: GAIA 78.6; Gemini 3 Pro gain +10.0 on BrowseComp, +4.8 on GAIA

Practical Use: Framework-level design (not only model choice) can unlock substantial accuracy gains across backbones; try the framework layer before swapping LLMs.

Evidence Ref: Table 1 and Section 4.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (BrowseComp) | 62.5 | Best prior reported ~60.2 (Kimi K2 thinking / other baselines vary) | +~10.0 vs Gemini 3 Pro ReAct baseline | BrowseComp | Main Results Table 1 | Table 1
Pass@1 (BrowseComp-ZH) | 75.9 | Prior baselines reported mid-60s | +8.8 vs Gemini 3 Pro ReAct (per Table 2) | BrowseComp-ZH | Main Results Table 1 | Table 1

What To Try In 7 Days

Add sub-goal memory: fold completed subtasks into compact summaries to cut context noise.

Introduce a lightweight Supervisor to detect and prune failed traces before retries.

Prototype a Browser + Data Analysis sub-agent pair with a one-tool-per-turn constraint.
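For the Supervisor item, a lightweight first prototype could be purely rule-based. The sketch below is an assumption about how such a check might look, not the paper's Supervisor: the `Step` and `Supervisor` names, the two detection rules, and the default thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str
    ok: bool


class Supervisor:
    """Flags a trajectory when errors or repeated actions pile up (illustrative rules)."""

    def __init__(self, max_errors: int = 2, max_repeats: int = 3) -> None:
        self.max_errors = max_errors
        self.max_repeats = max_repeats

    def should_interrupt(self, trace: List[Step]) -> bool:
        # Rule 1: the last `max_errors` steps all failed.
        tail = trace[-self.max_errors:]
        if len(tail) == self.max_errors and all(not s.ok for s in tail):
            return True
        # Rule 2: the same action was issued `max_repeats` times in a row (a loop).
        tail = trace[-self.max_repeats:]
        return len(tail) == self.max_repeats and len({s.action for s in tail}) == 1

    def prune(self, trace: List[Step]) -> List[Step]:
        # Drop failed steps so they do not bias the retry.
        return [s for s in trace if s.ok]
```

A driver loop would call `should_interrupt` after each step and, when it returns true, restart the sub-goal from `prune(trace)` instead of the full history.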

Agent Features

Memory: sub-goal-driven structured memory units (4-tuple); dynamic folding and compression of completed sub-goals
Planning: dynamic task decomposition into sub-goals; adaptive routing to tools or specialized sub-agents
Tool Use: Browser-Use GUI Agent (interactive web actions); Data Analysis Agent (code generation + execution); one-tool-per-turn execution constraint
Frameworks: open-source implementation (claimed); composable Atomic Capability Pool for plugging new tools
Is Agentic: Yes

Architectures: hierarchical multi-agent (Main Agent + sub-agents); centralized orchestration with atomic capability pool
Collaboration: Main Agent delegates to specialized sub-agents; Supervisor intervenes with interrupts and recovery
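One way to picture the composable capability pool together with the one-tool-per-turn constraint is a small registry that refuses turns with more than one call. This is a sketch under assumptions: `CapabilityPool`, `register`, and `execute_turn` are hypothetical names, and the two registered capabilities are stubs rather than real browser or code-execution agents.

```python
from typing import Callable, Dict, List, Tuple

Capability = Callable[[str], str]


class CapabilityPool:
    """Registry mapping a capability name to a callable tool or sub-agent."""

    def __init__(self) -> None:
        self._caps: Dict[str, Capability] = {}

    def register(self, name: str, fn: Capability) -> None:
        # New tools plug in here without changing the orchestrator.
        self._caps[name] = fn

    def names(self) -> List[str]:
        return sorted(self._caps)

    def execute_turn(self, calls: List[Tuple[str, str]]) -> str:
        # One-tool-per-turn: reject a turn that requests zero or several calls.
        if len(calls) != 1:
            raise ValueError(f"expected exactly 1 call per turn, got {len(calls)}")
        name, arg = calls[0]
        if name not in self._caps:
            raise KeyError(f"unknown capability: {name}")
        return self._caps[name](arg)


# Stub capabilities (assumptions); real ones would drive a browser or run code.
pool = CapabilityPool()
pool.register("browser", lambda task: f"browsed: {task}")
pool.register("data_analysis", lambda task: f"analyzed: {task}")
```

Rejecting multi-call turns at the pool boundary keeps each tool result attributable to a single action, which makes it easier for a supervising component to locate the step that failed.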

Optimization Features

Token Efficiency: PDF paging and incremental markdown conversion to avoid injecting full long docs
System Optimization: one-tool-per-turn policy to decompose complex actions; context complexity reduction from O(t) to O(n) via sub-goal folding
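The O(t) to O(n) claim can be illustrated with simple counting. The sketch assumes that t is the total number of interaction steps and n is the number of sub-goals, which is an interpretation of the notation rather than something the summary states; the step counts are made up for illustration.

```python
from typing import List


def context_entries_unfolded(steps_per_goal: List[int]) -> int:
    # Without folding, every past step stays in context: grows with t.
    return sum(steps_per_goal)


def context_entries_folded(steps_per_goal: List[int]) -> int:
    # With folding, each completed sub-goal costs one summary entry;
    # only the last (active) sub-goal keeps its detailed steps.
    if not steps_per_goal:
        return 0
    completed = len(steps_per_goal) - 1
    return completed + steps_per_goal[-1]


steps = [12, 8, 15, 5]  # hypothetical step counts for four sub-goals
print(context_entries_unfolded(steps))  # 40
print(context_entries_folded(steps))    # 8  (3 summaries + 5 active steps)
```

Under this reading, the folded context grows with the number of sub-goals plus the length of the active trace, not with the full interaction history.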

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Ablations on specialized sub-agents are mainly limited to GAIA and may not generalize to all domains.

No systematic measurement of token consumption, latency, or compute cost reported.

When Not To Use

Latency-sensitive production services where interactive browsing would add unacceptable delay.

Scenarios where the underlying LLM cannot reliably perform precise tool invocation.

Failure Modes

Cascading errors from malformed tool calls or repeated bad actions.

Context pollution where failed traces bias future reasoning.

Core Entities

Models

Gemini 3 Pro, DeepSeek-V3.2, Kimi K2 Thinking, GLM 4.7, GPT-5 High, Claude-4.5-Sonnet, OpenAI-o3

Metrics

Pass@1, Pass@N
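The summary does not define these metrics. Under the common reading, Pass@1 counts a task as solved only if the first attempt is correct, and Pass@N counts it as solved if any of N attempts is correct. The sketch below follows that reading; the function names and the sample outcomes are hypothetical and are not data from the paper.

```python
from typing import List


def pass_at_1(attempts: List[List[bool]]) -> float:
    """Percentage of tasks whose first attempt is correct."""
    return 100.0 * sum(a[0] for a in attempts) / len(attempts)


def pass_at_n(attempts: List[List[bool]]) -> float:
    """Percentage of tasks with at least one correct attempt."""
    return 100.0 * sum(any(a) for a in attempts) / len(attempts)


# Hypothetical outcomes: 4 tasks, 3 attempts each.
runs = [
    [True, True, False],
    [False, True, False],
    [False, False, False],
    [True, False, True],
]
print(pass_at_1(runs))  # 50.0
print(pass_at_n(runs))  # 75.0
```

The gap between the two numbers shows how much of a system's success depends on retries, which is why single-run correctness is reported separately.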

Datasets

GAIA, BrowseComp, BrowseComp-ZH, Humanity's Last Exam

Benchmarks

GAIA, BrowseComp, BrowseComp-ZH, Humanity's Last Exam