Hierarchical multi-agent research agent that compresses long context, routes subtasks to specialized tools, and self-corrects failures.

Overview

Decision Snapshot: Needs Validation

The paper includes benchmark results and ablations that support the design claims, but some ablations are limited to GAIA and system latency/token costs are not measured.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yuxuan Cai, Xinyi Lai, Peng Yuan, Weiting Liu, Huajian Li, Mingda Li, Xinghua Wang, Shengxie Zheng, Yanchao Hao, Yuyang Yin, Zheng Wei

Links

Abstract / PDF

Why It Matters For Business

Modular orchestration and structured memory reduce failed web/data tasks and improve first-run correctness, lowering human oversight and accelerating research pipelines.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

Yunque DeepResearch is a hierarchical, modular multi-agent framework for long-horizon research tasks. It groups interactions into sub-goal memory units, routes subtasks to an atomic pool of tools and sub-agents (browser GUI, data analysis), and adds a Supervisor that interrupts and prunes failing trajectories. On four agentic benchmarks it reports the following Pass@1 scores: BrowseComp 62.5, BrowseComp-ZH 75.9, GAIA 78.6, and HLE 51.7. It also shows model-agnostic gains (e.g., +10.0 on BrowseComp for Gemini 3 Pro). The authors state that the code is open source, and the paper includes implementation details and ablations.

Problem Statement

Current deep-research agents struggle with three practical problems: 1) context noise and information overload in long-horizon tasks; 2) fragile execution that causes cascading errors and loops; 3) rigid architectures that block integrating specialized tools and sub-agents.

Main Contribution

A hierarchical multi-agent architecture that decouples planning from execution via a Main Agent, Context Manager, Atomic Capability Pool, and Supervisor.

Sub-goal-driven structured memory that folds multi-round traces into concise memory units and switches context between detailed traces and compressed summaries.
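The folding idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class names, the four fields of `MemoryUnit`, and the caller-supplied `summary` (which a real system would presumably generate with an LLM) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryUnit:
    """Compressed record of one finished sub-goal (field names are assumed)."""
    sub_goal: str
    status: str            # e.g. "done" or "failed"
    summary: str           # short digest that replaces the raw trace
    evidence: list[str]    # key facts or URLs worth keeping


@dataclass
class ContextManager:
    """Hypothetical context manager: summaries for old sub-goals, raw trace for the active one."""
    folded: list[MemoryUnit] = field(default_factory=list)
    active_goal: Optional[str] = None
    active_trace: list[str] = field(default_factory=list)

    def start(self, sub_goal: str) -> None:
        self.active_goal = sub_goal
        self.active_trace = []

    def record(self, step: str) -> None:
        self.active_trace.append(step)

    def fold(self, status: str, summary: str, evidence: list[str]) -> MemoryUnit:
        """Replace the detailed trace of the active sub-goal with one memory unit."""
        if self.active_goal is None:
            raise ValueError("no active sub-goal to fold")
        unit = MemoryUnit(self.active_goal, status, summary, evidence)
        self.folded.append(unit)
        self.active_goal, self.active_trace = None, []
        return unit

    def render(self) -> str:
        """Build the prompt context from folded summaries plus the active trace."""
        lines = [f"[{u.status}] {u.sub_goal}: {u.summary}" for u in self.folded]
        if self.active_goal is not None:
            lines.append(f"[active] {self.active_goal}")
            lines.extend(f"  {s}" for s in self.active_trace)
        return "\n".join(lines)
```

Calling `fold` when a sub-goal completes keeps only one line per finished sub-goal in the rendered context, while the sub-goal in progress still exposes its full step-by-step trace.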

Key Findings

Yunque achieves top Pass@1 on three browsing/data benchmarks.

Numbers: BrowseComp 62.5; BrowseComp-ZH 75.9; HLE 51.7

Practical Use: Use Yunque's architecture to improve single-run correctness on web and data tasks compared to standard ReAct pipelines.

Evidence Ref: Table 1 (Main Results)

Yunque is second on GAIA and improves base model performance.

Numbers: GAIA 78.6; Gemini 3 Pro gain +10.0 on BrowseComp, +4.8 on GAIA

Practical Use: Framework-level design (not only model choice) can unlock substantial accuracy gains across backbones; try the framework layer before swapping LLMs.

Evidence Ref: Table 1 and Section 4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (BrowseComp)	62.5	Best prior reported ~60.2 (Kimi K2 Thinking; other baselines vary)	about +10.0 vs Gemini 3 Pro ReAct baseline	BrowseComp	Main Results Table 1	Table 1
Pass@1 (BrowseComp-ZH)	75.9	Prior baselines reported mid-60s	+8.8 vs Gemini 3 Pro ReAct (per Table 2)	BrowseComp-ZH	Main Results Table 1	Table 1

What To Try In 7 Days

Add sub-goal memory: fold completed subtasks into compact summaries to cut context noise.

Introduce a lightweight Supervisor to detect and prune failed traces before retries.

Prototype a Browser + Data Analysis sub-agent pair with a one-tool-per-turn constraint.
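A lightweight Supervisor of the kind suggested above might start as a simple rule-based check. The sketch below is an assumption-laden illustration: the `Step` record, the repeat threshold, and the rule "interrupt when the same failing action recurs" are hypothetical choices, not the paper's actual detection logic.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Step:
    action: str   # e.g. 'open("https://example.org")'
    ok: bool      # whether the tool call succeeded
    output: str   # raw tool output or error message


class Supervisor:
    """Hypothetical supervisor: flags loops of failing actions and prunes failed steps."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats

    def should_interrupt(self, trace: list[Step]) -> bool:
        """True once any failing action has occurred max_repeats times."""
        failures = Counter(s.action for s in trace if not s.ok)
        return any(n >= self.max_repeats for n in failures.values())

    def prune(self, trace: list[Step]) -> list[Step]:
        """Drop failed steps so they do not pollute the context of the retry."""
        return [s for s in trace if s.ok]
```

In use, the orchestration loop would call `should_interrupt` after every step and, when it fires, restart the sub-goal from `prune(trace)` rather than from the full history.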

Agent Features

Memory

Sub-goal-driven structured memory units (4-tuple)

Dynamic folding and compression of completed sub-goals

Planning

Dynamic task decomposition into sub-goals

Adaptive routing to tools or specialized sub-agents

Tool Use

Browser-Use GUI Agent (interactive web actions)

Data Analysis Agent (code generation + execution)

One-tool-per-turn execution constraint

Frameworks

Open-source implementation (claimed)

Composable Atomic Capability Pool for plugging in new tools
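One way to picture a pluggable capability pool with a one-tool-per-turn rule is a small registry plus a dispatcher that rejects batched calls. Everything named here (`CapabilityPool`, `register`, `dispatch`, the string-in/string-out signature) is a hypothetical interface chosen for illustration; the real framework's API may look quite different.

```python
from typing import Callable, Dict

# Assumed signature: a capability takes a task description and returns text.
Capability = Callable[[str], str]


class CapabilityPool:
    """Hypothetical registry of tools and sub-agents, addressed by name."""

    def __init__(self) -> None:
        self._caps: Dict[str, Capability] = {}

    def register(self, name: str, fn: Capability) -> None:
        if name in self._caps:
            raise ValueError(f"capability already registered: {name}")
        self._caps[name] = fn

    def names(self) -> list[str]:
        return sorted(self._caps)

    def dispatch(self, calls: list[tuple[str, str]]) -> str:
        """Run exactly one capability per turn; reject batched or unknown calls."""
        if len(calls) != 1:
            raise ValueError("one-tool-per-turn: expected exactly one call")
        name, arg = calls[0]
        if name not in self._caps:
            raise KeyError(f"unknown capability: {name}")
        return self._caps[name](arg)


# Stand-in sub-agents; real ones would drive a browser or execute code.
pool = CapabilityPool()
pool.register("browser", lambda task: f"browsed: {task}")
pool.register("data_analysis", lambda task: f"analyzed: {task}")
```

Adding a new tool is then a single `register` call, and the planner never needs to know how a capability is implemented.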

Is Agentic

Yes

Architectures

Hierarchical multi-agent (Main Agent + sub-agents)

Centralized orchestration with atomic capability pool

Collaboration

Main Agent delegates to specialized sub-agents

Supervisor intervenes with interrupts and recovery

Optimization Features

Token Efficiency

PDF paging and incremental markdown conversion to avoid injecting full long docs
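Paging can be illustrated without any PDF library by assuming the page texts have already been extracted. The helper below is a sketch under that assumption; the function name, the chunk size, and the "## Page N" heading format are illustrative choices, not details from the paper.

```python
from typing import Iterator


def page_windows(pages: list[str], pages_per_chunk: int = 2) -> Iterator[tuple[int, str]]:
    """Yield (first_page_number, markdown) chunks instead of one giant document.

    `pages` is assumed to hold already-extracted page text, one string per page.
    """
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be >= 1")
    for start in range(0, len(pages), pages_per_chunk):
        chunk = pages[start:start + pages_per_chunk]
        body = "\n\n".join(
            f"## Page {start + i + 1}\n\n{text.strip()}"
            for i, text in enumerate(chunk)
        )
        yield start + 1, body
```

Because the function is a generator, the agent can request the next window only when the current one did not answer the question, so unread pages never enter the context.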

System Optimization

One-tool-per-turn policy to decompose complex actions

Context complexity reduction from O(t) to O(n) via sub-goal folding
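The O(t) versus O(n) claim can be made concrete with a small counting exercise. This summary does not define the symbols, so the sketch assumes t is the total number of interaction steps and n is the number of sub-goals; the function and its counting rule are illustrative only.

```python
def context_items(steps_per_goal: list[int], fold: bool) -> int:
    """Count context entries while the last sub-goal is still active.

    Assumes each raw step and each folded summary costs one entry.
    """
    if not steps_per_goal:
        return 0
    *finished, active = steps_per_goal
    if fold:
        # One summary per finished sub-goal, plus the active sub-goal's raw steps.
        return len(finished) + active
    # Without folding, every raw step from every sub-goal stays in context.
    return sum(finished) + active


# Four sub-goals of ten steps each.
unfolded = context_items([10, 10, 10, 10], fold=False)  # 40 entries
folded = context_items([10, 10, 10, 10], fold=True)     # 13 entries
```

Under these assumptions the unfolded context grows with every step, while the folded context grows by one entry per completed sub-goal plus whatever the active sub-goal currently needs.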

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Risks & Boundaries

Limitations

Ablations on specialized sub-agents are mainly limited to GAIA and may not generalize to all domains.

No systematic measurement of token consumption, latency, or compute cost reported.

When Not To Use

Latency-sensitive production services where interactive browsing would add unacceptable delay.

Scenarios where the underlying LLM cannot reliably perform precise tool invocation.

Failure Modes

Cascading errors from malformed tool calls or repeated bad actions.

Context pollution where failed traces bias future reasoning.

Core Entities

Models

Gemini 3 Pro, DeepSeek-V3.2, Kimi K2 Thinking, GLM 4.7, GPT-5 High, Claude-4.5-Sonnet, OpenAI-o3

Metrics

Pass@1, Pass@N
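For readers unfamiliar with these metrics, the sketch below uses a common reading: Pass@1 is the share of tasks solved on the first attempt, and Pass@N is the share solved by at least one of the first N attempts. The paper's exact evaluation protocol is not described in this summary, so treat these definitions and the function names as assumptions.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """Percentage of tasks whose first attempt is correct."""
    return 100.0 * sum(attempts[0] for attempts in results) / len(results)


def pass_at_n(results: list[list[bool]], n: int) -> float:
    """Percentage of tasks where any of the first n attempts is correct."""
    return 100.0 * sum(any(attempts[:n]) for attempts in results) / len(results)


# One inner list per task, one boolean per attempt (made-up data).
results = [
    [True, False],
    [False, True],
    [False, False],
    [True, True],
]
```

With this toy data, `pass_at_1(results)` is 50.0 and `pass_at_n(results, 2)` is 75.0, which shows why Pass@N is never lower than Pass@1.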

Datasets

GAIA, BrowseComp, BrowseComp-ZH, Humanity's Last Exam

Benchmarks

GAIA, BrowseComp, BrowseComp-ZH, Humanity's Last Exam
