VillagerAgent: using DAGs to coordinate LLM agents, plus VillagerBench, a new Minecraft benchmark

June 9, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.55

Citation Count

6

Authors

Yubo Dong, Xukun Zhu, Zhengzhe Pan, Linchao Zhu, Yi Yang

Why It Matters For Business

Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.

Summary TLDR

The paper introduces VillagerBench, a new Minecraft benchmark with three multi-agent scenarios (construction, farm-to-table cooking, escape rooms), and VillagerAgent, a framework that decomposes tasks into a directed acyclic graph (DAG) and assigns the resulting subtasks to LLM-driven base agents. On the benchmark, VillagerAgent with GPT-4 has fewer hallucination-driven failures on cooking tasks than AgentVerse (18.2% vs 44.4%), a lower average token cost (1.79 vs 10.3), and higher completion scores in cooking and in Overcooked-AI transfer tests. Gains are scoped to the evaluated simulated scenarios; scaling beyond ~8 agents and coordinating agents with differing abilities remain limitations. Code is publicly available.

Problem Statement

Existing multi-agent LLM systems struggle when tasks require mixed sequential and parallel steps, spatial/causal/temporal constraints, and dynamic role changes. We need a benchmark and a coordination method that explicitly models inter-subtask dependencies so agents can plan and synchronize correctly.

Main Contribution

VillagerBench: a Minecraft benchmark with three scenarios testing spatial, causal, and temporal dependencies.

VillagerAgent: a DAG-based multi-agent framework with Task Decomposer, Agent Controller, State Manager, and Base Agents.

Empirical evaluation showing VillagerAgent outperforms prior frameworks (AgentVerse, ProAgent) on the benchmark and transfers to Overcooked-AI.

Open-source implementation on GitHub.

Key Findings

VillagerAgent cuts hallucination-driven failures vs AgentVerse on cooking tasks.

Numbers: Failure rate 18.2% (VillagerAgent) vs 44.4% (AgentVerse)

VillagerAgent greatly lowers token-based cost while improving scores.

Numbers: Avg Token Cost 1.79 (VillagerAgent) vs 10.3 (AgentVerse)

GPT-4 paired with VillagerAgent produced the best benchmark performance.

Numbers: Cooking C: 73.75% (2-agent) and 85.26% (3-agent) with GPT-4

Adding agents helps up to a point, then harms performance.

Numbers: Construction task peaks, then drops: e.g., Task 0 C falls from 100% (1-4 agents) to 66.63% (8 agents)

Heterogeneous agent abilities reduced coordination effectiveness.

Numbers: Farm-to-table C: 56.67% (same abilities) vs 36.67% (diverse abilities)

Results

Cooking task failure rate (hallucination-driven)

Value: VillagerAgent 18.2% vs AgentVerse 44.4%

Baseline: AgentVerse

Average Token Cost

Value: VillagerAgent avg 1.79 vs AgentVerse avg 10.3

Baseline: AgentVerse

Cooking completion (C) with GPT-4

Value: 73.75% (2 agents) and 85.26% (3 agents)

Baseline: AgentVerse GPT 29.75%

Construction completion (C) with GPT-4

Value: 36.45% (2 agents) and 52.17% (3 agents)

Baseline: Gemini-Pro 8.12%

Agent count effect on performance

Value: Peak performance at moderate agent numbers; declines above 4–8 agents
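
The two headline comparisons above can be restated in relative terms. The short calculation below uses only the reported figures (18.2% vs 44.4% failure rate, 1.79 vs 10.3 average token cost); the helper name `relative_reduction` is illustrative, and the unit of the token-cost figure is not stated in this summary, so only the ratio is computed.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction of the baseline that was removed, e.g. 0.5 means halved."""
    return (baseline - value) / baseline


# Reported cooking-task failure rates (percent): AgentVerse vs VillagerAgent.
failure_drop = relative_reduction(44.4, 18.2)

# Reported average token cost: AgentVerse divided by VillagerAgent.
token_ratio = 10.3 / 1.79

print(f"Relative drop in failure rate: {failure_drop:.1%}")  # 59.0%
print(f"Token cost ratio: {token_ratio:.2f}x")               # 5.75x
```

In other words, the reported failure rate falls by a little under 60% relative to AgentVerse, and the reported average token cost is between five and six times lower.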

Who Should Care

What To Try In 7 Days

Run VillagerAgent on a small two-agent task to compare hallucination rates vs your current agent pipeline.

Measure token cost per meaningful action when using a centralized task decomposer vs peer negotiation.

Limit agent count and test 2–4 agents to find the sweet spot for your task before scaling up.
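
One way to run the token-cost comparison suggested above is to log, per episode, how many tokens were consumed and how many actions actually advanced the task. The sketch below assumes such logs already exist; `tokens_per_action` is a hypothetical helper, "meaningful action" is whatever your own pipeline counts as progress, and the numbers are placeholders rather than measurements. It is not the paper's token-cost metric.

```python
def tokens_per_action(runs):
    """Average tokens spent per meaningful action.

    runs: list of (tokens_used, meaningful_actions) tuples, one per episode.
    Returns float('inf') when no meaningful action was taken at all.
    """
    total_tokens = sum(tokens for tokens, _ in runs)
    total_actions = sum(actions for _, actions in runs)
    if total_actions == 0:
        return float("inf")
    return total_tokens / total_actions


# Placeholder logs (NOT real results): one list per coordination style.
centralized_runs = [(1200, 6), (900, 4)]
negotiation_runs = [(4000, 5), (3500, 5)]

print("centralized:", tokens_per_action(centralized_runs))
print("negotiation:", tokens_per_action(negotiation_runs))
```

Pooling tokens and actions across episodes before dividing keeps a single short episode from dominating the average.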

Agent Features

Memory

  • Agent state as long-term memory
  • Action history as short-term memory
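
The two memory roles listed above can be pictured with a small data structure. This is a minimal sketch, not VillagerAgent's implementation: the class name `AgentMemory`, its methods, and the five-entry history window are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Long-term memory: the agent's current state, kept for the whole episode.
    state: dict = field(default_factory=dict)
    # Short-term memory: only the most recent actions (window size is assumed).
    history: deque = field(default_factory=lambda: deque(maxlen=5))

    def record(self, action, result, **state_updates):
        """Store an action outcome and fold any state changes into the state."""
        self.history.append((action, result))
        self.state.update(state_updates)

    def to_prompt(self):
        """Render both memories as text that could be placed in an LLM prompt."""
        lines = [f"state: {self.state}"]
        lines += [f"did {action} -> {result}" for action, result in self.history]
        return "\n".join(lines)


memory = AgentMemory()
for step in range(7):
    memory.record(f"mine block {step}", "ok", blocks_mined=step + 1)

print(memory.to_prompt())
```

After seven recorded actions, the state still reflects everything that happened, while the history holds only the five most recent actions, which keeps the prompt length bounded.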

Planning

  • LLM-driven task decomposition
  • Zero-shot chain-of-thought for subtask generation

Tool Use

  • API-based Base Agents (placeBlock, mine, craft, etc.)

Frameworks

  • VillagerAgent (Task Decomposer, Agent Controller, State Manager, Base Agents)

Is Agentic

true

Architectures

  • DAG-based Task Graph
  • Central Agent Controller + Base Agents

Collaboration

  • Centralized task assignment
  • Parallel execution of independent subtasks
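
Centralized assignment over a task DAG can be sketched with Python's standard `graphlib` module. This is an illustration of the general idea only: the `schedule` function, the subtask names, and the agent names are hypothetical, and the real framework uses an LLM to build and update the graph rather than a fixed dictionary.

```python
from graphlib import TopologicalSorter


def schedule(dependencies, agents):
    """Assign subtasks to agents in rounds while respecting the DAG.

    dependencies: dict mapping each subtask to the set of its prerequisites.
    agents: list of agent names; each agent takes at most one subtask per round.
    Returns a list of rounds, each a dict of agent -> subtask.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    rounds, pending = [], []
    while sorter.is_active():
        # Subtasks whose prerequisites are all done become ready to assign.
        pending.extend(sorted(sorter.get_ready()))
        batch, pending = pending[:len(agents)], pending[len(agents):]
        rounds.append(dict(zip(agents, batch)))
        # Treat the whole batch as finished before the next round starts.
        sorter.done(*batch)
    return rounds


# Hypothetical cooking-style task graph.
deps = {
    "gather_wheat": set(),
    "gather_egg": set(),
    "craft_bread": {"gather_wheat"},
    "bake_cake": {"gather_wheat", "gather_egg"},
}

for i, assignment in enumerate(schedule(deps, ["alice", "bob"]), start=1):
    print(f"round {i}: {assignment}")
```

With two agents, the independent gathering subtasks run in the same round and the dependent crafting subtasks follow in the next one; with a single agent the same graph takes four rounds. The sketch assumes every subtask succeeds and takes one round, which a real controller would have to relax.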

Optimization Features

Token Efficiency

  • Trades slightly more prompt tokens for much lower token cost per score

System Optimization

  • Single prompt set reused across scenarios improves prompt transferability

Inference Optimization

  • Lower token cost per scored result via structured decomposition

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Low overall task completion rates in hard scenarios due to benchmark complexity.
  • Performance drops when scaling beyond ~8 agents because of context and coordination overhead.
  • Agents with diverse APIs perform worse without extra coordination logic.
  • Results are evaluated only in simulated Minecraft/Overcooked environments, not physical robotics.

When Not To Use

  • Real-world safety-critical systems without formal guarantees on hallucinations.
  • Large swarms (>8 agents) where communication and context length grow uncontrolled.
  • Tasks requiring strict real-time guarantees or hardware-level control.

Failure Modes

  • Hallucinations in agent discussion leading to false actions.
  • Resource competition and bottlenecks as agent count increases.
  • Long prompts and context causing LLM timeouts or degraded reasoning.

Core Entities

Models

  • GPT-4-1106-preview
  • Gemini-Pro
  • GLM-4

Metrics

  • Completion (C)
  • Efficiency (E)
  • Balance (B)
  • View Hit Rate (VHR)
  • Agent Contribution Rate (ACR)

Datasets

  • VillagerBench

Benchmarks

  • VillagerBench
  • Overcooked-AI

Context Entities

Models

  • Voyager
  • MindAgent
  • MetaGPT

Metrics

  • Collaboration Score (CoS)

Benchmarks

  • Overcooked-AI