Overview
Modeling subtasks as a DAG is a clear practical step to reduce coordination errors; success depends on LLM quality, agent count, and API design.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 55%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.
Who Should Care
Summary TLDR
The paper introduces VillagerBench, a Minecraft benchmark with three multi-agent scenarios (construction, farm-to-table cooking, escape rooms), and VillagerAgent, a framework that decomposes each task into a directed acyclic graph (DAG) of subtasks and assigns them to LLM-driven base agents. On the benchmark, VillagerAgent with GPT-4 has a lower hallucination-driven failure rate on cooking tasks than AgentVerse (18.2% vs 44.4%), a lower average token cost (1.79 vs 10.3), and higher completion scores in cooking and in Overcooked-AI transfer tests. Gains are scoped to the evaluated Minecraft scenarios; scaling beyond ~8 agents and handling agents with varied abilities remain limitations. Code is publicly available.
Problem Statement
Existing multi-agent LLM systems struggle when tasks require mixed sequential and parallel steps, spatial/causal/temporal constraints, and dynamic role changes. We need a benchmark and a coordination method that explicitly models inter-subtask dependencies so agents can plan and synchronize correctly.
Main Contribution
VillagerBench: a Minecraft benchmark with three scenarios testing spatial, causal, and temporal dependencies.
VillagerAgent: a DAG-based multi-agent framework with Task Decomposer, Agent Controller, State Manager, and Base Agents.
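As a rough illustration of the DAG idea, the sketch below schedules subtasks in rounds so that no subtask starts before its dependencies finish. It is not the VillagerAgent implementation: the `schedule` function, the subtask names, and the agent names are hypothetical, and only the standard-library `graphlib` module (Python 3.9+) is used.

```python
from graphlib import TopologicalSorter


def schedule(dependencies, agents):
    """Assign subtasks to agents round by round, respecting DAG order.

    dependencies maps each subtask to the set of subtasks it depends on.
    Returns a list of rounds; each round is a list of (agent, subtask) pairs.
    """
    if not agents:
        raise ValueError("at least one agent is required")
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    rounds = []
    pending = []  # subtasks that are ready but still waiting for a free agent
    while sorter.is_active() or pending:
        pending.extend(sorted(sorter.get_ready()))
        batch, pending = pending[:len(agents)], pending[len(agents):]
        rounds.append(list(zip(agents, batch)))
        for task in batch:
            sorter.done(task)  # unlocks subtasks that depended on this one
    return rounds


# Hypothetical cooking task: two independent gathering steps, then two
# sequential steps that depend on them.
deps = {
    "harvest_wheat": set(),
    "collect_egg": set(),
    "bake_cake": {"harvest_wheat", "collect_egg"},
    "serve": {"bake_cake"},
}
plan = schedule(deps, ["alice", "bob"])
for number, assignments in enumerate(plan, start=1):
    print(number, assignments)
```

With two agents, the two gathering subtasks run in parallel in the first round, while `bake_cake` and `serve` each wait for their prerequisites. A single central scheduler of this kind is one way to avoid agents negotiating handovers among themselves.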
Key Findings
VillagerAgent cuts hallucination-driven failures on cooking tasks to 18.2%, compared with 44.4% for AgentVerse.
VillagerAgent lowers average token cost by about 5.8× (1.79 vs 10.3) while improving scores.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Cooking task failure rate (hallucination-driven) | VillagerAgent 18.2% vs AgentVerse 44.4% | AgentVerse | −26.2 pp | Farm-to-Table Cooking | AgentVerse shows hallucination in discussion stage causing false handovers; VillagerAgent centralizes decomposition to avoid this. | Figure 5; Section 4.2 |
| Average Token Cost | VillagerAgent avg 1.79 vs AgentVerse avg 10.3 | AgentVerse | ≈5.8× lower | VillagerBench (averaged difficulties) | Table 4 reports tokens and computed token cost per difficulty level. | Table 4 |
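The Delta column can be rechecked from the Value column. The short sketch below recomputes both deltas; the helper names are hypothetical, and the inputs are the figures reported in the table.

```python
def percentage_point_delta(value, baseline):
    """Difference between two percentages, in percentage points."""
    return round(value - baseline, 1)


def cost_ratio(baseline, value):
    """How many times larger the baseline cost is than the new cost."""
    return baseline / value


# Figures taken from the Results table above.
failure_delta = percentage_point_delta(18.2, 44.4)  # VillagerAgent vs AgentVerse
token_ratio = cost_ratio(10.3, 1.79)                # AgentVerse vs VillagerAgent

print(f"failure rate delta: {failure_delta} pp")
print(f"token cost ratio: {token_ratio:.2f}x lower")
```

The failure-rate delta is a difference in percentage points, not a relative reduction, and the token-cost ratio of roughly 5.75 rounds to the ≈5.8× shown in the table.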
What To Try In 7 Days
Run VillagerAgent on a small two-agent task to compare hallucination rates vs your current agent pipeline.
Measure token cost per meaningful action when using a centralized task decomposer vs peer negotiation.
Limit agent count and test 2–4 agents to find the sweet spot for your task before scaling up.
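For the token-cost measurement above, one possible metric is total tokens divided by the number of meaningful actions. The sketch below assumes a hypothetical log format (one dict per LLM call with `tokens` and `meaningful` keys), and the sample numbers are placeholders, not results from the paper.

```python
def tokens_per_action(log):
    """Total tokens spent divided by the number of meaningful actions.

    log is a list of dicts with keys "tokens" (int) and "meaningful" (bool).
    Returns float("inf") if no meaningful action was produced.
    """
    total_tokens = sum(step["tokens"] for step in log)
    actions = sum(1 for step in log if step["meaningful"])
    if actions == 0:
        return float("inf")
    return total_tokens / actions


# Placeholder logs for illustration only (not measured data).
centralized = [
    {"tokens": 400, "meaningful": True},
    {"tokens": 350, "meaningful": True},
]
negotiation = [
    {"tokens": 900, "meaningful": False},  # discussion turn, no action taken
    {"tokens": 800, "meaningful": False},  # discussion turn, no action taken
    {"tokens": 300, "meaningful": True},
]

print(tokens_per_action(centralized))
print(tokens_per_action(negotiation))
```

Counting discussion turns in the numerator but not the denominator is a deliberate choice here: it makes negotiation overhead visible in a single number. What counts as a "meaningful" action has to be defined per task.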
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Low overall task completion rates in hard scenarios due to benchmark complexity.
Performance drops when scaling beyond ~8 agents because of context and coordination overhead.
When Not To Use
Real-world safety-critical systems without formal guarantees on hallucinations.
Large swarms (>8 agents) where communication overhead and context length grow unchecked.
Failure Modes
Hallucinations in agent discussion leading to false actions.
Resource competition and bottlenecks as agent count increases.

