Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.55
Citation Count
6
Why It Matters For Business
Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.
Summary TLDR
The paper introduces VillagerBench, a new Minecraft benchmark with three multi-agent scenarios (construction, farm-to-table cooking, escape rooms), and VillagerAgent, a framework that decomposes tasks into a directed acyclic graph (DAG) and assigns the resulting subtasks to LLM-driven base agents. On the benchmark, VillagerAgent with GPT-4 shows a lower hallucination rate than AgentVerse (18.2% vs 44.4%), a lower average token cost (1.79 vs 10.3), and higher completion scores in cooking and in Overcooked-AI transfer tests. The gains are limited to the evaluated Minecraft scenarios; scaling beyond ~8 agents and handling agents with varied abilities remain open limitations. Code is publicly available.
Problem Statement
Existing multi-agent LLM systems struggle when tasks require mixed sequential and parallel steps, spatial/causal/temporal constraints, and dynamic role changes. We need a benchmark and a coordination method that explicitly models inter-subtask dependencies so agents can plan and synchronize correctly.
Main Contribution
VillagerBench: a Minecraft benchmark with three scenarios testing spatial, causal, and temporal dependencies.
VillagerAgent: a DAG-based multi-agent framework with Task Decomposer, Agent Controller, State Manager, and Base Agents.
Empirical evaluation showing VillagerAgent outperforms prior frameworks (AgentVerse, ProAgent) on the benchmark and transfers to Overcooked-AI.
Open-source implementation on GitHub.
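The DAG idea behind the Task Decomposer can be illustrated with a minimal sketch. This is not the paper's implementation: the subtask names and the `parallel_batches` helper are hypothetical, and Python's standard `graphlib` module stands in for whatever graph logic VillagerAgent actually uses. Each batch holds subtasks whose prerequisites are all complete, so they could run in parallel.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical subtasks for a cooking-style scenario. Each key maps to the
# set of subtasks that must finish before it can start.
dependencies = {
    "harvest_wheat": set(),
    "collect_egg": set(),
    "craft_bread": {"harvest_wheat"},
    "bake_cake": {"harvest_wheat", "collect_egg"},
    "serve_meal": {"craft_bread", "bake_cake"},
}


def parallel_batches(deps):
    """Group subtasks into batches whose prerequisites are already done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


batches = parallel_batches(dependencies)
for index, batch in enumerate(batches, start=1):
    print(f"batch {index}: {batch}")
```

Subtasks inside one batch have no dependencies on each other, which is what lets a controller hand them to different agents at the same time. A cyclic graph is rejected up front with `CycleError` rather than deadlocking later.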
Key Findings
VillagerAgent cuts hallucination-driven failures vs AgentVerse on cooking tasks (18.2% vs 44.4%).
VillagerAgent lowers average token cost (1.79 vs 10.3) while improving scores.
GPT-4 paired with VillagerAgent produced the best benchmark performance.
Adding agents helps up to a point; beyond roughly 8 agents, performance drops.
Heterogeneous agent abilities reduced coordination effectiveness.
Results
Cooking task failure rate (hallucination-driven): 18.2% (VillagerAgent) vs 44.4% (AgentVerse)
Average Token Cost: 1.79 (VillagerAgent) vs 10.3 (compared framework)
Cooking completion (C) with GPT-4
Construction completion (C) with GPT-4
Agent count effect on performance
What To Try In 7 Days
Run VillagerAgent on a small two-agent task to compare hallucination rates vs your current agent pipeline.
Measure token cost per meaningful action when using a centralized task decomposer vs peer negotiation.
Limit agent count and test 2–4 agents to find the sweet spot for your task before scaling up.
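The measurements suggested above could be collected with a small helper like the one below. It is only a sketch under assumptions: the `RunLog` fields and the `summarize` function are hypothetical, and the numbers in the example are made-up placeholders, not results from the paper. It groups logged runs by agent count and reports tokens per meaningful action and the share of actions flagged as hallucinated.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """One logged run (hypothetical schema, not from VillagerAgent)."""
    agents: int        # number of agents used in the run
    total_tokens: int  # prompt + completion tokens consumed
    actions: int       # meaningful actions executed
    hallucinated: int  # actions flagged as invalid or hallucinated


def summarize(runs):
    """Aggregate runs per agent count into two comparison metrics."""
    totals = {}
    for run in runs:
        bucket = totals.setdefault(
            run.agents, {"tokens": 0, "actions": 0, "hallucinated": 0}
        )
        bucket["tokens"] += run.total_tokens
        bucket["actions"] += run.actions
        bucket["hallucinated"] += run.hallucinated
    return {
        agents: {
            "tokens_per_action": b["tokens"] / b["actions"],
            "hallucination_rate": b["hallucinated"] / b["actions"],
        }
        for agents, b in sorted(totals.items())
        if b["actions"] > 0  # skip agent counts with no recorded actions
    }


# Placeholder numbers for illustration only.
runs = [
    RunLog(agents=2, total_tokens=1000, actions=10, hallucinated=1),
    RunLog(agents=2, total_tokens=3000, actions=10, hallucinated=3),
    RunLog(agents=4, total_tokens=6000, actions=20, hallucinated=2),
]
summary = summarize(runs)
for agents, metrics in summary.items():
    print(agents, metrics)
```

Running the same summary over logs from your current pipeline and from a DAG-based setup gives a like-for-like comparison at each agent count.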
Agent Features
Memory
- Agent state as long-term memory
- Action history as short-term memory
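The two memory roles listed above can be sketched as follows. This is an illustrative assumption, not the framework's actual data model: the `AgentMemory` class, its fields, and the history length of 3 are hypothetical. The agent state dictionary persists across the episode, while the action history is a bounded queue that forgets older entries.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Long-term memory: latest known agent state (keys are hypothetical).
    state: dict = field(default_factory=dict)
    # Short-term memory: only the most recent actions are retained.
    history: deque = field(default_factory=lambda: deque(maxlen=3))

    def record(self, action: str, **state_updates) -> None:
        """Store an action and merge any resulting state changes."""
        self.history.append(action)
        self.state.update(state_updates)

    def prompt_context(self) -> str:
        """Render both memories as a compact string for an LLM prompt."""
        recent = "; ".join(self.history) or "none"
        return f"state={self.state} recent_actions=[{recent}]"


memory = AgentMemory()
memory.record("mine oak_log", position=(3, 64, -2))
memory.record("craft planks", inventory={"planks": 4})
memory.record("placeBlock planks")
memory.record("craft stick", inventory={"planks": 2, "stick": 4})
print(memory.prompt_context())
```

Bounding the history keeps the prompt short, which matters given that long prompts are listed among the failure modes.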
Planning
- LLM-driven task decomposition
- Zero-shot chain-of-thought for subtask generation
Tool Use
- API-based Base Agents (placeBlock, mine, craft, etc.)
Frameworks
- VillagerAgent (Task Decomposer, Agent Controller, State Manager, Base Agents)
Is Agentic
true
Architectures
- DAG-based Task Graph
- Central Agent Controller + Base Agents
Collaboration
- Centralized task assignment
- Parallel execution of independent subtasks
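A rough sketch of how centralized assignment and parallel execution fit together is shown below. It assumes a simplified round-based model that is not taken from the paper: `assign_rounds`, the agent names, and the subtask names are all hypothetical. In each round the controller gives every free agent at most one subtask whose prerequisites are finished.

```python
def assign_rounds(deps, agents):
    """Return a list of rounds; each round maps agent name -> subtask."""
    if not agents:
        raise ValueError("at least one agent is required")
    remaining = {task: set(pre) for task, pre in deps.items()}
    done = set()
    rounds = []
    while remaining:
        # A subtask is ready once all of its prerequisites are done.
        ready = sorted(t for t, pre in remaining.items() if pre <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        # zip stops at the shorter list: one subtask per agent per round.
        assignment = dict(zip(agents, ready))
        rounds.append(assignment)
        for task in assignment.values():
            done.add(task)
            del remaining[task]
    return rounds


# Hypothetical construction-style subtasks and their prerequisites.
deps = {
    "dig_clay": [],
    "chop_wood": [],
    "smelt_bricks": ["dig_clay", "chop_wood"],
    "build_wall": ["smelt_bricks"],
}
two_agent_plan = assign_rounds(deps, ["alice", "bob"])
one_agent_plan = assign_rounds(deps, ["alice"])
for number, assignment in enumerate(two_agent_plan, start=1):
    print(f"round {number}: {assignment}")
```

With two agents the independent subtasks share the first round, so the plan takes three rounds instead of four. Once the dependency chain becomes sequential, the extra agent sits idle, which is one intuition for why adding agents stops helping.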
Optimization Features
Token Efficiency
- Trades slightly more prompt tokens for much lower token cost per score
System Optimization
- Single prompt set reused across scenarios improves prompt transferability
Inference Optimization
- Lower token cost per scored result via structured decomposition
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Low overall task completion rates in hard scenarios due to benchmark complexity.
- Performance drops when scaling beyond ~8 agents because of context and coordination overhead.
- Agents with diverse APIs perform worse without extra coordination logic.
- Results are evaluated only in simulated Minecraft/Overcooked environments, not physical robotics.
When Not To Use
- Real-world safety-critical systems without formal guarantees on hallucinations.
- Large swarms (>8 agents) where communication overhead and context length grow unchecked.
- Tasks requiring strict real-time guarantees or hardware-level control.
Failure Modes
- Hallucinations in agent discussion leading to false actions.
- Resource competition and bottlenecks as agent count increases.
- Long prompts and context causing LLM timeouts or degraded reasoning.
Core Entities
Models
- GPT-4-1106-preview
- Gemini-Pro
- GLM-4
Metrics
- Completion (C)
- Efficiency (E)
- Balance (B)
- View Hit Rate (VHR)
- Agent Contribution Rate (ACR)
Datasets
- VillagerBench
Benchmarks
- VillagerBench
- Overcooked-AI
Context Entities
Models
- Voyager
- MindAgent
- MetaGPT
Metrics
- Collaboration Score (CoS)
Benchmarks
- Overcooked-AI

