VillagerAgent: using DAGs to coordinate LLM agents, plus VillagerBench, a new Minecraft benchmark

Overview

Decision Snapshot: Needs Validation

Modeling subtasks as a DAG is a clear practical step to reduce coordination errors; success depends on LLM quality, agent count, and API design.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 55%

Production readiness: 40%

Novelty: 60%

Authors

Yubo Dong, Xukun Zhu, Zhengzhe Pan, Linchao Zhu, Yi Yang

Why It Matters For Business

Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO

Summary TLDR

The paper introduces VillagerBench, a new Minecraft benchmark with three multi-agent scenarios (construction, farm-to-table cooking, escape rooms), and VillagerAgent, a framework that decomposes tasks into a directed acyclic graph (DAG) to assign subtasks to LLM-driven base agents. On the benchmark, VillagerAgent + GPT-4 reduces hallucinations vs AgentVerse (18.2% vs 44.4%), cuts token cost (avg 1.79 vs 10.3), and achieves higher completion scores in cooking and Overcooked-AI transfer tests. Gains are scoped to the evaluated Minecraft scenarios; scaling beyond ~8 agents and varied agent abilities remain limitations. Code is publicly available.

Problem Statement

Existing multi-agent LLM systems struggle when tasks require mixed sequential and parallel steps, spatial/causal/temporal constraints, and dynamic role changes. We need a benchmark and a coordination method that explicitly models inter-subtask dependencies so agents can plan and synchronize correctly.

Main Contribution

VillagerBench: a Minecraft benchmark with three scenarios testing spatial, causal, and temporal dependencies.

VillagerAgent: a DAG-based multi-agent framework with Task Decomposer, Agent Controller, State Manager, and Base Agents.
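
The DAG idea can be illustrated with a minimal sketch. It is not the VillagerAgent implementation: the `Subtask` class, the `ready_subtasks` helper, and the cooking-style subtask names are all hypothetical, chosen only to show how explicit dependencies determine which subtasks may be assigned next.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One node of the task graph (hypothetical structure, not the paper's)."""
    name: str
    depends_on: set[str] = field(default_factory=set)  # names of prerequisite subtasks


def ready_subtasks(graph: dict[str, Subtask], done: set[str]) -> list[str]:
    """Return unfinished subtasks whose prerequisites are all complete, sorted by name."""
    return sorted(
        name
        for name, task in graph.items()
        if name not in done and task.depends_on <= done
    )


# Illustrative cooking-style task: two independent gathering steps feed one crafting step.
graph = {
    "harvest_wheat": Subtask("harvest_wheat"),
    "collect_egg": Subtask("collect_egg"),
    "craft_cake": Subtask("craft_cake", {"harvest_wheat", "collect_egg"}),
}

first_batch = ready_subtasks(graph, done=set())
second_batch = ready_subtasks(graph, done={"harvest_wheat", "collect_egg"})
```

In this sketch the two gathering subtasks can be handed to different agents at once, while `craft_cake` stays blocked until both are done.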

Key Findings

VillagerAgent cuts hallucination-driven failures vs AgentVerse on cooking tasks.

Numbers: Failure rate 18.2% (VillagerAgent) vs 44.4% (AgentVerse)

Practical Use: Use a centralized DAG decomposer to reduce mistaken plans and avoid cascading execution errors in multi-agent workflows.

Evidence Ref: Figure 5

VillagerAgent greatly lowers token-based cost while improving scores.

Numbers: Avg Token Cost 1.79 (VillagerAgent) vs 10.3 (AgentVerse)

Practical Use: Expect lower billing per effective score when using DAG-driven task assignment, even if prompts are slightly longer.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Cooking task failure rate (hallucination-driven)	VillagerAgent 18.2% vs AgentVerse 44.4%	AgentVerse	−26.2 pp	Farm-to-Table Cooking	AgentVerse shows hallucination in discussion stage causing false handovers; VillagerAgent centralizes decomposition to avoid this.	Figure 5; Section 4.2
Average Token Cost	VillagerAgent avg 1.79 vs AgentVerse avg 10.3	AgentVerse	≈5.8× lower	VillagerBench (averaged difficulties)	Table 4 reports tokens and computed token cost per difficulty level.	Table 4
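
The Delta column can be re-derived from the reported values. The snippet below only redoes that arithmetic; the variable names are ours, and the inputs are the four numbers quoted in the table.

```python
# Values as reported in the table above.
villager_fail, agentverse_fail = 18.2, 44.4   # failure rate, percent
villager_cost, agentverse_cost = 1.79, 10.3   # average token cost

# Difference in percentage points (negative means VillagerAgent fails less often).
delta_pp = round(villager_fail - agentverse_fail, 1)

# How many times lower VillagerAgent's average token cost is.
cost_ratio = round(agentverse_cost / villager_cost, 1)

print(f"failure-rate delta: {delta_pp} pp, token-cost ratio: {cost_ratio}x")
```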

What To Try In 7 Days

Run VillagerAgent on a small two-agent task to compare hallucination rates vs your current agent pipeline.

Measure token cost per meaningful action when using a centralized task decomposer vs peer negotiation.

Limit agent count and test 2–4 agents to find the sweet spot for your task before scaling up.
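
For the token-cost comparison, one possible bookkeeping helper is sketched below. "Tokens per successful action" is our assumed reading of "token cost per meaningful action", not the paper's Token Cost metric, and the run totals are placeholders to be replaced with your own logs, not measurements.

```python
def tokens_per_action(total_tokens: int, successful_actions: int) -> float:
    """Tokens spent per successful action; infinite if nothing succeeded."""
    if successful_actions <= 0:
        return float("inf")
    return total_tokens / successful_actions


# Placeholder totals (NOT real results): (total tokens, successful actions) per setup.
runs = {
    "centralized_decomposer": (12_000, 30),
    "peer_negotiation": (45_000, 25),
}

costs = {name: tokens_per_action(*totals) for name, totals in runs.items()}
cheapest = min(costs, key=costs.get)
```

Tracking the same ratio while varying the agent count from 2 to 4 gives a simple way to spot where extra agents stop paying for themselves.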

Agent Features

Memory

Agent state as long-term memory

Action history as short-term memory
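
A rough sketch of this memory split follows, assuming a plain dictionary for agent state and a bounded window of recent actions. The `AgentMemory` class and the window size of 5 are our assumptions; the summary does not say how VillagerAgent stores either kind of memory.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Long-term memory: persistent agent state (position, inventory, ...).
    state: dict = field(default_factory=dict)
    # Short-term memory: only the most recent actions (window size is an assumption).
    recent_actions: deque = field(default_factory=lambda: deque(maxlen=5))

    def record(self, action: str, **state_updates) -> None:
        """Append an action to the short-term window and merge any state changes."""
        self.recent_actions.append(action)
        self.state.update(state_updates)


memory = AgentMemory()
for step in range(7):
    memory.record(f"action_{step}", position=step)
```

After seven recorded steps the window holds only the last five actions, while the state keeps the latest value of each field.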

Planning

LLM-driven task decomposition

Zero-shot chain-of-thought for subtask generation

Tool Use

API-based Base Agents (placeBlock, mine, craft, etc.)
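
One way to picture API-based tool use is a small dispatch table that maps tool names to functions and rejects anything it does not recognize. The tool names come from the list above, but the signatures, return strings, and the `dispatch` helper are hypothetical stand-ins, not the VillagerAgent API.

```python
from typing import Callable


# Stand-in tools: real ones would call into the Minecraft environment.
def place_block(block: str, x: int, y: int, z: int) -> str:
    return f"placed {block} at ({x}, {y}, {z})"


def mine(block: str) -> str:
    return f"mined {block}"


def craft(item: str, count: int = 1) -> str:
    return f"crafted {count} x {item}"


TOOLS: dict[str, Callable[..., str]] = {
    "placeBlock": place_block,
    "mine": mine,
    "craft": craft,
}


def dispatch(call: dict) -> str:
    """Run a tool call of the form {"name": ..., "args": {...}} produced by an LLM."""
    name = call.get("name")
    tool = TOOLS.get(name)
    if tool is None:
        # Unknown tool names are reported back instead of being executed.
        return f"error: unknown tool {name!r}"
    try:
        return tool(**call.get("args", {}))
    except TypeError as exc:
        # Wrong or missing arguments surface as an error message, not a crash.
        return f"error: bad arguments for {name}: {exc}"


result = dispatch({"name": "mine", "args": {"block": "stone"}})
```

Returning errors as strings lets the calling agent see and correct a malformed call on its next step.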

Frameworks

VillagerAgent (Task Decomposer, Agent Controller, State Manager, Base Agents)

Is Agentic

Yes

Architectures

DAG-based Task Graph

Central Agent Controller + Base Agents

Collaboration

Centralized task assignment

Parallel execution of independent subtasks
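
Centralized assignment with parallel execution can be sketched with Python's standard `graphlib.TopologicalSorter` and a thread pool standing in for agents. This is a simplified wave-by-wave scheduler under our own assumptions: `run_in_waves`, the `execute` callback, and the subtask names are hypothetical, and VillagerAgent's actual controller may schedule differently.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_in_waves(dependencies: dict[str, set[str]], execute, max_agents: int = 2) -> list[list[str]]:
    """Run subtasks wave by wave; each wave contains only mutually independent subtasks.

    `dependencies` maps each subtask to the set of subtasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        while sorter.is_active():
            wave = sorted(sorter.get_ready())  # everything currently unblocked
            list(pool.map(execute, wave))      # run the wave in parallel and wait for it
            sorter.done(*wave)                 # unblock dependents
            waves.append(wave)
    return waves


# Illustrative dependencies: crafting needs both gathered ingredients.
deps = {
    "harvest_wheat": set(),
    "collect_egg": set(),
    "craft_cake": {"harvest_wheat", "collect_egg"},
}
waves = run_in_waves(deps, execute=lambda name: name)
```

Waiting for a whole wave keeps the sketch short; a scheduler that releases each dependent as soon as its own prerequisites finish would keep agents busier.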

Optimization Features

Token Efficiency

Trades slightly more prompt tokens for much lower token cost per score

System Optimization

Single prompt set reused across scenarios improves prompt transferability

Inference Optimization

Lower token cost per scored result via structured decomposition

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/cnsdqd-dyb/VillagerAgent

Data URLs

https://github.com/cnsdqd-dyb/VillagerAgent

Risks & Boundaries

Limitations

Low overall task completion rates in hard scenarios due to benchmark complexity.

Performance drops when scaling beyond ~8 agents because of context and coordination overhead.

When Not To Use

Real-world safety-critical systems without formal guarantees on hallucinations.

Large swarms (>8 agents) where communication and context length grow uncontrolled.

Failure Modes

Hallucinations in agent discussion leading to false actions.

Resource competition and bottlenecks as agent count increases.

Core Entities

Models

GPT-4-1106-preview, Gemini-Pro, GLM-4

Metrics

Completion (C), Efficiency (E), Balance (B), View Hit Rate (VHR), Agent Contribution Rate (ACR)

Datasets

VillagerBench

Benchmarks

VillagerBench, Overcooked-AI

Context Entities

Models

Voyager, MindAgent, MetaGPT

Metrics

Collaboration Score (CoS)

Benchmarks

Overcooked-AI
