VillagerAgent: using DAGs to coordinate LLM agents, with a new VillagerBench benchmark in Minecraft

June 9, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Modeling subtasks as a DAG is a clear practical step to reduce coordination errors; success depends on LLM quality, agent count, and API design.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 55%

Production readiness: 40%

Novelty: 60%

Authors

Yubo Dong, Xukun Zhu, Zhengzhe Pan, Linchao Zhu, Yi Yang

Why It Matters For Business

Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.

Summary TLDR

The paper introduces VillagerBench, a new Minecraft benchmark with three multi-agent scenarios (construction, farm-to-table cooking, escape rooms), and VillagerAgent, a framework that decomposes tasks into a directed acyclic graph (DAG) to assign subtasks to LLM-driven base agents. On the benchmark, VillagerAgent + GPT-4 reduces hallucinations vs AgentVerse (18.2% vs 44.4%), cuts token cost (avg 1.79 vs 10.3), and achieves higher completion scores in cooking and Overcooked-AI transfer tests. Gains are scoped to the evaluated Minecraft scenarios; scaling beyond ~8 agents and varied agent abilities remain limitations. Code is publicly available.

Problem Statement

Existing multi-agent LLM systems struggle when tasks require mixed sequential and parallel steps, spatial/causal/temporal constraints, and dynamic role changes. We need a benchmark and a coordination method that explicitly models inter-subtask dependencies so agents can plan and synchronize correctly.

Main Contribution

VillagerBench: a Minecraft benchmark with three scenarios testing spatial, causal, and temporal dependencies.

VillagerAgent: a DAG-based multi-agent framework with Task Decomposer, Agent Controller, State Manager, and Base Agents.
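
As a rough sketch of the DAG idea (not the authors' implementation), subtasks can be stored as a mapping from each subtask to the subtasks it depends on. Python's standard-library `graphlib.TopologicalSorter` can then yield them in waves, where every subtask in a wave has all of its prerequisites finished and could be handed to a different agent. The subtask names and the `parallel_waves` helper are hypothetical, loosely modeled on a cooking scenario.

```python
from graphlib import TopologicalSorter

# Hypothetical subtask graph: each key depends on the subtasks in its set.
DEPENDENCIES = {
    "harvest_wheat": set(),
    "collect_milk": set(),
    "bake_bread": {"harvest_wheat"},
    "serve_meal": {"bake_bread", "collect_milk"},
}


def parallel_waves(dependencies):
    """Group subtasks into waves; subtasks within a wave are independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {wave}")
```

In this toy graph, harvesting and milk collection land in the same wave and can proceed in parallel, while baking and serving must wait for their prerequisites.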

Key Findings

VillagerAgent cuts hallucination-driven failures vs AgentVerse on cooking tasks.

Numbers: Failure rate 18.2% (VillagerAgent) vs 44.4% (AgentVerse)

Practical Use: Use a centralized DAG decomposer to reduce mistaken plans and avoid cascading execution errors in multi-agent workflows.

Evidence Ref: Figure 5

VillagerAgent greatly lowers token-based cost while improving scores.

Numbers: Avg Token Cost 1.79 (VillagerAgent) vs 10.3 (AgentVerse)

Practical Use: Expect lower billing per effective score when using DAG-driven task assignment, even if prompts are slightly longer.

Evidence Ref: Table 4

Results

Metric: Cooking task failure rate (hallucination-driven)
Value: VillagerAgent 18.2% vs AgentVerse 44.4%
Baseline: AgentVerse
Delta: −26.2 pp
Split / Dataset: Farm-to-Table Cooking
Evidence: AgentVerse shows hallucination in discussion stage causing false handovers; VillagerAgent centralizes decomposition to avoid this.
Evidence Ref: Figure 5; Section 4.2

Metric: Average Token Cost
Value: VillagerAgent avg 1.79 vs AgentVerse avg 10.3
Baseline: AgentVerse
Delta: ≈5.8× lower
Split / Dataset: VillagerBench (averaged difficulties)
Evidence: Table 4 reports tokens and computed token cost per difficulty level.
Evidence Ref: Table 4
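
The two Delta values above follow directly from the reported figures. The short check below reproduces them; the variable names are illustrative and only the four input numbers come from the results.

```python
# Figures as reported in the results above.
villager_failure_pct = 18.2
agentverse_failure_pct = 44.4
villager_token_cost = 1.79
agentverse_token_cost = 10.3

# Percentage-point difference in failure rate (negative means fewer failures).
failure_delta_pp = round(villager_failure_pct - agentverse_failure_pct, 1)

# How many times lower the average token cost is.
cost_ratio = round(agentverse_token_cost / villager_token_cost, 1)

print(failure_delta_pp)  # -26.2
print(cost_ratio)        # 5.8
```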

What To Try In 7 Days

Run VillagerAgent on a small two-agent task to compare hallucination rates vs your current agent pipeline.

Measure token cost per meaningful action when using a centralized task decomposer vs peer negotiation.

Limit agent count and test 2–4 agents to find the sweet spot for your task before scaling up.
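
One way to run the comparisons suggested above is to log every episode in a common format and summarize it per pipeline. The sketch below assumes a hypothetical `RunRecord` log format and a hypothetical `summarize` helper. Neither comes from the paper or its code, and "tokens per successful action" is an illustrative stand-in rather than the paper's token-cost metric.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One benchmark episode; every field is a hypothetical log format."""
    pipeline: str
    tokens_used: int
    actions_succeeded: int
    failed_by_hallucination: bool


def summarize(records):
    """Per pipeline: hallucination failure rate and tokens per successful action."""
    summary = {}
    for name in sorted({r.pipeline for r in records}):
        runs = [r for r in records if r.pipeline == name]
        total_tokens = sum(r.tokens_used for r in runs)
        total_actions = sum(r.actions_succeeded for r in runs)
        failures = sum(r.failed_by_hallucination for r in runs)
        summary[name] = {
            "runs": len(runs),
            "hallucination_failure_rate": failures / len(runs),
            # None when no action succeeded, to avoid dividing by zero.
            "tokens_per_action": total_tokens / total_actions if total_actions else None,
        }
    return summary


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured results.
    demo = [
        RunRecord("centralized_dag", 1200, 12, False),
        RunRecord("centralized_dag", 1500, 10, True),
        RunRecord("peer_negotiation", 4000, 8, True),
    ]
    for name, stats in summarize(demo).items():
        print(name, stats)
```

Running the same small task set at 2, 3, and 4 agents and comparing these summaries gives a quick read on where added agents stop paying off.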

Agent Features

Memory
Agent state as long-term memory; action history as short-term memory
Planning
LLM-driven task decomposition; zero-shot chain-of-thought for subtask generation
Tool Use
API-based Base Agents (placeBlock, mine, craft, etc.)
Frameworks
VillagerAgent (Task Decomposer, Agent Controller, State Manager, Base Agents)
Is Agentic

Yes

Architectures
DAG-based Task Graph; Central Agent Controller + Base Agents
Collaboration
Centralized task assignment; parallel execution of independent subtasks
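
The memory and collaboration features listed above can be illustrated with a toy controller. This is a minimal sketch under stated assumptions, not the framework's API: the `BaseAgent` class, its fields, the five-action history limit, and the round-robin policy in `assign_round_robin` are all hypothetical, and subtasks run sequentially here even though independent ones could run in parallel.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BaseAgent:
    name: str
    # Long-term memory: the agent's current state (hypothetical keys).
    state: dict = field(default_factory=dict)
    # Short-term memory: only the most recent actions are retained.
    action_history: deque = field(default_factory=lambda: deque(maxlen=5))

    def execute(self, subtask):
        """Pretend to run a subtask and record it in both memories."""
        self.action_history.append(subtask)
        self.state["last_subtask"] = subtask
        return f"{self.name} finished {subtask}"


def assign_round_robin(ready_subtasks, agents):
    """Centralized assignment: give each ready subtask to the next agent in turn."""
    return [
        (agents[index % len(agents)], subtask)
        for index, subtask in enumerate(ready_subtasks)
    ]


if __name__ == "__main__":
    team = [BaseAgent("alice"), BaseAgent("bob")]
    # These subtasks are assumed to be mutually independent.
    for agent, subtask in assign_round_robin(["mine_stone", "collect_wood", "craft_tool"], team):
        print(agent.execute(subtask))
```

A real controller would also weigh each agent's state when choosing an assignee; round-robin is used here only to keep the example short.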

Optimization Features

Token Efficiency
Trades slightly more prompt tokens for much lower token cost per score
System Optimization
Single prompt set reused across scenarios improves prompt transferability
Inference Optimization
Lower token cost per scored result via structured decomposition

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Low overall task completion rates in hard scenarios due to benchmark complexity.

Performance drops when scaling beyond ~8 agents because of context and coordination overhead.

When Not To Use

Real-world safety-critical systems without formal guarantees on hallucinations.

Large swarms (>8 agents) where communication overhead and context length grow unchecked.

Failure Modes

Hallucinations in agent discussion leading to false actions.

Resource competition and bottlenecks as agent count increases.

Core Entities

Models

GPT-4-1106-preview, Gemini-Pro, GLM-4

Metrics

Completion (C), Efficiency (E), Balance (B), View Hit Rate (VHR), Agent Contribution Rate (ACR)

Datasets

VillagerBench

Benchmarks

VillagerBench, Overcooked-AI

Context Entities

Models

Voyager, MindAgent, MetaGPT

Metrics

Collaboration Score (CoS)

Benchmarks

Overcooked-AI