Hierarchical Agents Papers — Parsed & Scored for Practitioners

Autonomously collect a single rollout to train a NeRF for rendering, mapping and navigation

0.60

0.50

0.40

5

AutoNeRF can automate 3D scene capture for robot deployment, cutting manual data collection and enabling safe simulation-based finetuning of navigation policies from a single short rollout.

Key finding

Modular exploration trained for obstacle/viewpoint coverage yields better RGB rendering than Frontier or E2E RL.

Numbers: PSNR 25.56 (Ours obs.) vs 19.75 (Frontier) on uniform scene poses
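
To put the PSNR gap above in perspective: PSNR is a logarithmic function of mean squared error, so a difference in dB corresponds to a ratio of MSE values. The sketch below uses the standard definition, PSNR = 10·log10(MAX² / MSE). The function names are illustrative, and it assumes both figures were computed with the same peak value, which the summary does not state.

```python
import math

def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak pixel value."""
    if mse <= 0:
        raise ValueError("mse must be positive (PSNR is unbounded for identical images)")
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_from_psnr_gap(psnr_high: float, psnr_low: float) -> float:
    """How many times larger the MSE behind the lower PSNR is (same peak value assumed)."""
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)

# Gap between the two reported figures: 25.56 dB vs 19.75 dB.
ratio = mse_ratio_from_psnr_gap(25.56, 19.75)
print(f"MSE ratio: {ratio:.2f}x")
```

Under that assumption, the 5.81 dB gap means the Frontier renders carry roughly 3.8 times the mean squared error of the modular policy's renders.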

Agent-E: hierarchical web agent with DOM denoising and change-observation — 73.2% on WebVoyager

0.45

0.60

0.50

2

A hierarchical agent with DOM denoising and action feedback raises generic web automation success to ~73% and gives actionable signals (self-aware failures) that support safe fallbacks and learning pipelines.

Key finding

Agent-E reached 73.2% task success on the WebVoyager benchmark.

Numbers: 73.2% overall success (WebVoyager)

CH-MARL: hierarchical multi-agent RL with real-time constraint enforcement to cut emissions and balance costs in maritime logistics

0.40

0.60

1

CH-MARL offers a practical way to meet emission caps while coordinating many vessels; it can reduce fuel-related emissions and help comply with regulations at modest engineering cost, but needs pilot testing and constraint tuning before real deployment.

Key finding

CH-MARL variants delivered lower cumulative emissions in the digital twin compared to the baseline.

Numbers: Run A 4.7304 → Run D 4.07152 (−0.6589, −13.9%)

Use temporal contrastive embeddings + goal-conditioned policies to transfer multi-agent skills and generate sub-goals

0.40

0.60

0.70

1

If you reuse pre-trained goal-conditioned policies and learn temporal sub-goals, you can drastically cut simulator training costs and reach equal or better coordinated performance on new multi-agent tasks.

Key finding

Large sample savings: method reaches convergence faster than baselines on Overcooked transfers.

Numbers: Average 4.6× faster convergence than fastest baselines (reported)

AnyTool: GPT-4 agent that searches 16k+ APIs via hierarchical retrieval and self-reflection

0.60

0.70

0.50

1

AnyTool automates picking and calling hundreds of real APIs without extra model training, so teams can prototype API-heavy automation faster; expect higher success rates but plan for high token and API-call costs.

Key finding

AnyTool substantially improves real-task pass rates over prior systems.

Numbers: +35.4% average pass rate vs ToolLLM on ToolBench (as reported)

Use a hierarchical graph of LLM 'thoughts' to improve retrieval and reduce hallucinations

0.60

0.50

1

HGOT reduces fact errors by structuring multi-step retrieval and scoring evidence; this can raise factual accuracy in verification and QA systems with modest engineering and tuning.

Key finding

HGOT increases FEVER exact-match (EM) accuracy versus baselines.

Numbers: HGOT+Sampling EM 61.50% vs Retrieve-then-Read 58.35% (Overall)

Let a CEO→Manager→Worker hierarchy auto-write better prompts and improve zero-shot LLM outputs

0.60

0.50

1

HMAW automates prompt tuning without training and boosts response quality across varied tasks, letting teams improve outputs quickly while avoiding dataset-specific finetuning.

Key finding

Average preference score across five tasks increases by 30.7 percentage points.

Numbers: Avg pref: 69.2% (HMAW) vs 38.5% (no prompt); +30.7 pts

CoAct: a two-tier global planner + local executor that improves long-horizon web task success

0.40

0.60

0.30

1

A simple global/local agent split can reduce looped interactions and improve automation success on multi-step web tasks, making web automation more robust with modest engineering effort.

Key finding

CoAct raises average task success vs ReAct on WebArena.

Numbers: Avg SR: ReAct 9.4% → CoAct 13.8% (+4.4pp, +47%)
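
The two figures in parentheses above measure different things: "+4.4pp" is the absolute change in success rate, while "+47%" is the change relative to the ReAct baseline. A minimal sketch of the arithmetic follows; the helper name `improvement` is illustrative.

```python
from typing import Tuple

def improvement(baseline: float, new: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    absolute_pp = new - baseline
    relative_pct = 100.0 * absolute_pp / baseline
    return absolute_pp, relative_pct

abs_pp, rel_pct = improvement(9.4, 13.8)
print(f"+{abs_pp:.1f}pp, +{rel_pct:.0f}%")  # prints "+4.4pp, +47%"
```

When the baseline is low, as it is here, a large relative gain can still leave the absolute success rate modest, so it helps to read both numbers together.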

Keep agent context small indefinitely by storing task state as files, giving more stable long-run behavior for research workflows

0.60

0.70

0.60

0

If your workflows involve long document processing or multi-step knowledge work, state management matters more than raw model size. A file-centric agent design can make smaller, cheaper models far more reliable over long runs and reduce costly re-runs.

Key finding

InfiAgent (gpt-oss-20b) scores 41.45 on DeepResearch using no task-specific fine-tuning.

Numbers: DeepResearch overall = 41.45 (Table 2)
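
The file-centric idea described above can be illustrated with a toy loop in which task state lives in a JSON file and each step only sees a bounded slice of it. This is a sketch under stated assumptions, not InfiAgent's implementation: the file layout, the function names (`load_state`, `build_context`, `run_step`), and the `max_items` window are all hypothetical, and the model call is replaced by a placeholder.

```python
import json
import tempfile
from pathlib import Path

def load_state(path: Path) -> dict:
    """Read task state from disk, or start fresh if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"done": []}

def build_context(state: dict, max_items: int = 3) -> str:
    """Build a bounded context: a count plus only the latest few completed steps."""
    recent = state["done"][-max_items:]
    return f"completed={len(state['done'])}; recent={recent}"

def run_step(path: Path, step_name: str) -> str:
    """One agent step: load state, act on a small context, persist the result."""
    state = load_state(path)
    context = build_context(state)
    # A real agent would call a model with `context` here; this sketch only records the step.
    state["done"].append(step_name)
    path.write_text(json.dumps(state))
    return context

with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "state.json"
    contexts = [run_step(state_file, f"step-{i}") for i in range(50)]
    final_state = load_state(state_file)

print(len(final_state["done"]), max(len(c) for c in contexts))
```

The full history accumulates on disk while the context passed to each step stays roughly constant in size, which is the property the summary credits for stable long runs.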

NEXUSSUM: a three-agent LLM pipeline that converts dialogue, chunks scenes, and iteratively compresses to summarize books, movies, and TV

0.70

0.60

0

NEXUSSUM lets teams produce accurate, length-controlled summaries of very long narratives without model finetuning, improving semantic quality and factuality for products like content discovery, synopsis generation, and archival indexing.

Key finding

NEXUSSUM achieves large semantic gains on long narratives, especially books.

Numbers: BookSum: +30.0% BERTScore (F1) vs CachED

HiMPo: learn feudal (multi-level) message-passing policies and use upper-level advantage signals to train lower levels

0.60

0.65

0.40

0

If your product uses many cooperating agents (robot fleets, multi-robot exploration, distributed sensors), HiMPo offers a way to combine hierarchical planning with local message-passing. That reduces greedy/short-term behaviour and improves coordination without handcrafting low-level rewards.

Key finding

HiMPPO sustains coordinated, non-greedy team strategies on a hard cooperative foraging task when baselines fail.

Numbers: LBFwS: 10 agents; experiments averaged over 8 runs; on LBFwS-Hard only HiMPPO avoided greedy individual play (Fig.2).

Make each agent update 'anticipate' the other agents' simultaneous updates to speed up coordination

0.45

0.75

0.60

0

If your product uses cooperative multi-agent learning (robot teams, traffic control, game AI), KPG can meaningfully improve coordination and success rates at the cost of extra compute. The trade-off is faster convergence and higher task success in exchange for roughly 25–30% extra runtime with the practical default (k=2).

Key finding

KPG with finite k improves empirical performance across multiple cooperative benchmarks.

Numbers: K2-FACMAC: +114% (MAMuJoCo), +98% (SMAC) vs FACMAC (Table 1).

STRATEGIST: LLMs learn and refine high-level strategies with bi-level tree search and self-play

0.60

0.75

0.60

0

STRATEGIST shows you can get usable, human-competitive strategies from LLMs without labeled training data by pairing LLM-written strategy text with search and simulated self-play, speeding prototyping of strategic agents and negotiation systems.

Key finding

STRATEGIST generated higher-quality value heuristics and dialogue guides than four LLM self-improvement baselines on the evaluated games.

Numbers: GOPS value heuristic: +1.5 ±0.99 vs best baseline 0.092 ±0.67; Avalon Merlin guide: 0.88 ±0.063 vs baseline ≤0.62 (Table

CityEQA-EC benchmark plus PMA: a hierarchical LLM agent that explores simulated cities to answer open‑vocabulary questions

0.30

0.65

0.35

0

CityEQA-EC and PMA provide a practical testbed for building drone/UAV perception and urban-inspection agents that use language-guided planning and map memory, reducing search time and distance vs naive exploration.

Key finding

CityEQA-EC contains 1,412 validated tasks across six task types.

Numbers: 1,412 tasks (final dataset)

Use a central controller + simple planner and Options to make multi-agent Q‑learning learn faster on grid tasks

0.30

0.50

0.40

0

Decomposing multi-agent tasks and adding a cheap planner can speed learning with simple algorithms, reducing training time and compute for small structured problems.

Key finding

Q‑learning with Options produced the highest average reward in test runs compared to plain Q‑learning and random policy.
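
For readers unfamiliar with Options, the sketch below shows the standard SMDP Q-learning update from the options framework, where a temporally extended action runs for k steps and its value is updated once on termination. It is an assumption that the paper uses this exact update; the state and option names, `alpha`, and `gamma` are illustrative.

```python
from collections import defaultdict

def option_q_update(q, state, option, rewards, next_state, options,
                    alpha=0.1, gamma=0.9):
    """Apply one SMDP Q-learning update after an option ran for len(rewards) steps."""
    k = len(rewards)
    # Discounted return accumulated while the option was executing.
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Bootstrap from the best option available in the state where the option ended.
    best_next = max(q[(next_state, o)] for o in options)
    target = discounted + (gamma ** k) * best_next
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]

q = defaultdict(float)  # unseen (state, option) pairs default to 0.0
options = ["go_to_door", "go_to_key"]
q[("room_b", "go_to_key")] = 1.0

# The option ran for three primitive steps and earned a reward only on the last one.
new_value = option_q_update(q, "room_a", "go_to_door", [0.0, 0.0, 1.0], "room_b", options)
print(round(new_value, 4))
```

Because one update covers several primitive steps, value information travels across the grid in fewer updates than with one-step Q-learning, which is one common explanation for faster learning with options.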

Replace slow online search with learned hierarchical agents to cut responder reallocation time from minutes to fractions of a second

0.70

0.60

0.70

0

Switching from search-based planners to learned hierarchical agents delivers sub-second reallocation decisions and modestly shorter ambulance response times on real-city data, enabling practical real-time deployment and lower operational latency.

Key finding

Decision latency cut from minutes to fractions of a second.

Numbers: 0.22s per decision vs 3 min (≈180s) for MCTS

A multi-agent RL leaf sequencer that reconstructs fluence maps and speeds optimizer convergence

0.60

0.70

0.60

0

RLS can shorten planning iterations and produce executable plans faster by replacing an iterative leaf sequencer, potentially cutting planning time and compute cost in automated radiotherapy pipelines.

Key finding

RLS reduces fluence reconstruction error on head-and-neck data.

Numbers: HNd MNSE: PORIx 0.219 → RLS 0.149 (−0.070)

Use hierarchical contrastive consensus to give decentralized agents an emergent global signal and improve multi-robot cooperation

0.60

0

HC-MARL gives decentralized robots a cheap, training-time way to infer group context without runtime communication. This improves task speed and coordination, which reduces mission time and energy in multi-robot systems.

Key finding

HC-MARL raises episode rewards in Navigation tasks compared with MAPPO/HAPPO.

Numbers: ≈20% higher reward (3 agents); ≈35% higher reward (10 agents)

Use modal logic + Kripke belief states to constrain LMs and produce verifiable autonomous diagnostics

0.40

0.65

0.50

0

Formal symbolic rules combined with LMs reduce risky hallucinations in diagnostics, making automated troubleshooting auditable and safer for critical infrastructure.

Key finding

The system produced correct end-to-end diagnoses on all simulated test scenarios.

Numbers: 3/3 scenarios solved (cascading, direct causal, confounded)

StackPlanner: centralized coordinator + active task stack + reusable experience memory for stable long-horizon multi-agent collaboration

0.60

0

Active memory control and reusable experience reduce error propagation in multi-agent workflows, improving reliability and reuse across tasks so teams get better multi-step outputs with fewer retries.

Key finding

StackPlanner yields higher F1 than prior agentic RL baselines on multi-hop QA.

Numbers: 2Wiki F1 32.92% (Ours, Qwen2.5-3B) vs 29.55% (ARPO); +3.37 pts

A hierarchical multi-agent ESG analyst plus a 3-level benchmark built from 310 corporate sustainability reports

0.60

0.50

0.60

0

Automating professional-grade ESG audits requires retrieval, web research, and domain tools; a specialized agent produces more verifiable and visualization-rich reports than off-the-shelf LLMs, improving auditability and decision support.

Key finding

ESGAgent achieves 84.15% overall accuracy on Level 1–2 tasks, outperforming Gemini-3-flash (80.89%).

Numbers: Total Acc 84.15% vs 80.89% (Table 3)

Use RL over editable outlines to plan and draft long scientific texts with better structure and citation fidelity

0.50

0.70

0.60

0

Modeling writing as iterative edit planning improves global structure and citation fidelity. This can cut human editing time and make automated drafting more controllable for long documents.

Key finding

Fine-tuned models using the OutlineForge pipeline beat one-shot outline baselines on evaluated survey generation.

Numbers: Phi-3.8B F1=0.422 vs SurveyForge F1≈0.313 (200 steps)

Hierarchical Cognitive Caching lets an agent sustain multi-day ML experiments and improve results

0.50

0.70

0.60

0

Hierarchical caching cuts token costs and raises success rates for long-running ML automation, reducing expensive manual cycles and accelerating model development.

Key finding

ML-Master 2.0 achieves a 56.44% average medal rate on MLE-Bench under a 24-hour budget.

Numbers: 56.44% avg medal rate (MLE-Bench, 24h)

AutoRefine: automatically extract reusable skills and subagents from past runs to continually improve LLM agents

0.60

0.65

0.60

0

AutoRefine automates turning past successful runs into reusable procedures and small specialized agents, which lowers action counts and improves success on complex planning—reducing latency and operation cost for task-heavy, tool-using products.

Key finding

AutoRefine improves success rates across diverse benchmarks.

Numbers: ALFWorld 98.4% ±1.5, ScienceWorld 70.4% ±1.9, TravelPlanner 27.1% ±2.4