Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
10
Why It Matters For Business
LLMs can act as zero-shot coordination partners for tasks where the environment dictates the correct action (logistics routing, scripted multi-robot tasks), cutting training time. They are unreliable when the task requires partner modeling or multi-step joint planning.
Summary TLDR
The paper introduces LLM-Coordination, a benchmark with two tasks: multi-turn Agentic Coordination (agents act inside games) and single-turn CoordinationQA (198 multiple-choice edge-case questions). Tested LLMs (GPT-4-turbo, GPT-4o, GPT-3.5-turbo, Mixtral) can match or beat RL on environment-driven coordination (Overcooked), are robust to unseen partners (zero-shot cross-play), but perform poorly when tasks require deep Theory-of-Mind (ToM) or joint planning (Hanabi and Joint Planning questions). Simple reasoning steps—explicit ToM inference and answer verification—reduce catastrophic mistakes and improve scores. Code is available. Practical takeaway: LLMs are promising for coordination when
Problem Statement
We lack a focused, comparative test of how current LLMs perform as coordination agents in pure-cooperation games. The paper asks: can LLMs act directly inside coordination environments, how do they compare to MARL baselines, and which component skills (environment reading, predicting partners' beliefs, joint planning) limit performance?
Main Contribution
A new LLM-Coordination benchmark with two settings: Agentic Coordination (LLMs act in four pure-coordination games) and CoordinationQA (198 multiple-choice edge-case questions).
A holistic empirical comparison of LLM-based agents versus multi-agent RL baselines in self-play and cross-play (zero-shot) scenarios.
A focused analysis that isolates three component skills—Environment Comprehension, Theory of Mind reasoning, and Joint Planning—and shows where LLMs succeed or fail.
Key Findings
LLM agents match or exceed RL on environment-driven Overcooked layouts.
LLMs fall far short of RL in Hanabi, a game needing deep partner-belief reasoning.
CoordinationQA shows strengths in environment reading but weaknesses in ToM and joint planning.
Explicit ToM reasoning and answer verification reduce catastrophic failures in Hanabi.
LLM agents are robust in zero-shot cross-play with unseen partners.
Results
Overcooked score (Asymmetric Advantages layout)
Hanabi score
CollabEscape capture rate / avg turns
Accuracy
Who Should Care
What To Try In 7 Days
Plug a strong LLM (e.g., GPT-4-turbo) into a simple environment-driven coordination task and evaluate as a zero-shot partner.
Add a short verification step that rejects actions violating hard safety/rules to reduce catastrophic errors.
Create a small CoordinationQA-style test (edge cases) to measure environment understanding vs partner-modeling before deployment.
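A small CoordinationQA-style check can be scored per skill category, so that environment understanding and partner modeling are reported separately. The following is a minimal sketch under stated assumptions, not the benchmark's own tooling: the `MCQ` type, the category names, the toy questions, and the `choose` callable (which would wrap a call to the model under test) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    category: str   # hypothetical labels: "environment", "theory_of_mind", "joint_planning"
    prompt: str
    options: tuple  # answer choices shown to the model
    answer: int     # index of the correct option


def accuracy_by_category(questions, choose):
    """Return {category: accuracy}; choose(question) must return an option index."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.category] += 1
        if choose(q) == q.answer:
            correct[q.category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy questions invented for illustration only.
QUESTIONS = [
    MCQ("environment", "Which counter holds the onion?", ("left", "right"), 0),
    MCQ("environment", "Is the pot full?", ("yes", "no"), 1),
    MCQ("theory_of_mind", "What does your partner need next?", ("a plate", "an onion"), 0),
    MCQ("joint_planning", "What should you do next?", ("wait", "fetch a plate"), 1),
]


def always_first(question):
    """Stand-in for a model call: always picks the first option."""
    return 0


if __name__ == "__main__":
    print(accuracy_by_category(QUESTIONS, always_first))
```

Replacing `always_first` with a function that prompts the model and parses its chosen option gives a per-category score that can be compared before and after prompt changes.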
Agent Features
Memory
- Long-term (game rules/procedures)
- Working memory (current state text)
- Episodic memory (previous actions)
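One way to picture how the three memory types could feed a single prompt is sketched below. This is an illustrative assumption rather than the paper's implementation: the `AgentMemory` class, its field names, the history cap, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    rules: str                                    # long-term: game rules/procedures
    state: str = ""                               # working: current state as text
    history: list = field(default_factory=list)   # episodic: previous actions
    max_history: int = 5                          # assumed cap; must be positive

    def record(self, action: str) -> None:
        """Append an action and keep only the most recent max_history entries."""
        self.history.append(action)
        self.history = self.history[-self.max_history:]

    def to_prompt(self) -> str:
        """Assemble one text prompt from the three memory types."""
        past = "\n".join(f"- {a}" for a in self.history) or "- (none)"
        return (
            f"Rules:\n{self.rules}\n\n"
            f"Current state:\n{self.state}\n\n"
            f"Previous actions:\n{past}\n\n"
            "Next action:"
        )


if __name__ == "__main__":
    memory = AgentMemory(rules="Deliver soups to score.", state="Pot has 2 onions.")
    memory.record("pick up onion")
    print(memory.to_prompt())
```

Capping the episodic history keeps the prompt length bounded over long episodes; the right cap depends on the model's context window and the game.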
Planning
- Single-step action selection
- Explicit Theory-of-Mind reasoning step
- Answer verification before action
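The three planning steps can be chained into one decision function, sketched below under stated assumptions. The `llm` argument is a hypothetical callable that maps a prompt string to a reply string, and `is_safe`, `fallback`, and the prompt wording are invented for illustration. Verification here is a programmatic rule check; the paper's verification step may work differently, for example by prompting the model to check its own answer.

```python
def choose_action(llm, state, legal_actions, is_safe, fallback, max_tries=3):
    """Pick one action: ToM inference, then propose, then verify before acting.

    llm: hypothetical callable, prompt string -> reply string.
    is_safe: callable (state, action) -> bool encoding hard rules.
    fallback: action returned if no proposal passes verification.
    """
    # Step 1: explicit Theory-of-Mind inference about the partner.
    partner_intent = llm(
        f"State:\n{state}\nWhat does your partner believe and intend to do next?"
    )

    feedback = ""
    for _ in range(max_tries):
        # Step 2: single-step action selection, conditioned on the ToM guess.
        proposal = llm(
            f"State:\n{state}\n"
            f"Partner's likely intent: {partner_intent}\n"
            f"Legal actions: {', '.join(legal_actions)}\n"
            f"{feedback}Reply with exactly one legal action."
        ).strip()

        # Step 3: verification before the action reaches the environment.
        if proposal in legal_actions and is_safe(state, proposal):
            return proposal
        feedback = f"Rejected '{proposal}': illegal or unsafe. "

    return fallback


if __name__ == "__main__":
    # Scripted stand-in for a model: ToM reply, one illegal proposal, one legal one.
    replies = iter(["Partner wants to play red 1.", "play card 9", "give hint"])
    action = choose_action(
        lambda prompt: next(replies),
        state="toy state",
        legal_actions=["give hint", "discard"],
        is_safe=lambda state, action: True,
        fallback="discard",
    )
    print(action)
```

Returning a conservative fallback after repeated rejections means a hallucinated action never reaches the environment, at the cost of occasionally wasting a turn.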
Tool Use
- Grounding module to map language to game actions
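A grounding module of this kind can be approximated with plain string matching, as in the sketch below. It is an assumption about how such a mapping might work, not the paper's code: the function name `ground`, the three-stage matching order, and the 0.6 similarity cutoff are all illustrative choices.

```python
import difflib


def ground(llm_output, legal_actions, cutoff=0.6):
    """Map free-text model output to one legal action, or None if nothing fits."""
    text = llm_output.strip().lower()
    by_lower = {action.lower(): action for action in legal_actions}

    # 1. Exact match (case-insensitive).
    if text in by_lower:
        return by_lower[text]

    # 2. A legal action quoted inside a longer reply; prefer the longest match.
    mentioned = [key for key in by_lower if key in text]
    if mentioned:
        return by_lower[max(mentioned, key=len)]

    # 3. Fuzzy match to tolerate small typos in the reply.
    close = difflib.get_close_matches(text, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


if __name__ == "__main__":
    actions = ["pick up onion", "place onion in pot", "wait"]
    print(ground("I will pick up onion.", actions))
    print(ground("dance", actions))
```

Substring matching is deliberately naive: a reply such as "do not wait" would still ground to "wait", so a stricter output format or a verification step is advisable in practice.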
Frameworks
- ReAct
- Self-Verification
- Self-Consistency
- Cognitive Architectures for Language Agents
Is Agentic
true
Architectures
- LLM-based agent with cognitive architecture scaffold
Collaboration
- Self-play evaluation
- Cross-play / Zero-shot coordination
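The difference between the two evaluation modes can be made concrete with a small harness: self-play pairs an agent with a copy of itself, while cross-play pairs it with a different partner it has not coordinated with before. The sketch below is illustrative only; `evaluate`, the `play_episode` callable, and the toy scoring rule are hypothetical and stand in for a real game environment.

```python
from itertools import product
from statistics import mean


def evaluate(agents, play_episode, episodes=3):
    """Return (self_play, cross_play) dicts of mean scores keyed by (name, partner).

    agents: dict mapping a name to an agent object.
    play_episode: hypothetical callable (agent, partner) -> numeric episode score.
    """
    self_play, cross_play = {}, {}
    for (name_a, a), (name_b, b) in product(agents.items(), repeat=2):
        score = mean(play_episode(a, b) for _ in range(episodes))
        bucket = self_play if name_a == name_b else cross_play
        bucket[(name_a, name_b)] = score
    return self_play, cross_play


if __name__ == "__main__":
    # Toy stand-in: agents are plain integers and identical partners score higher.
    toy_agents = {"llm": 1, "rl": 2}
    sp, cp = evaluate(toy_agents, lambda a, b: 10 if a == b else 4, episodes=2)
    print("self-play:", sp)
    print("cross-play:", cp)
```

A large gap between an agent's self-play and cross-play scores suggests it relies on conventions that only its own copies share.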
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High latency and compute for large LLMs make them unsuitable for real-time tasks.
- Prompts and procedural memory must be configured manually to get good behavior.
- CoordinationQA was manually curated, limiting scalability and introducing selection bias.
When Not To Use
- Real-time systems that need low-latency decisions.
- Tasks that require deep partner belief modeling or tight error margins.
- Resource-constrained deployments where LLM compute is infeasible.
Failure Modes
- Hallucinated actions that break game rules and cause catastrophic loss (Hanabi bombs).
- Poor joint planning leading to worse-than-random decisions on multi-step coordination.
- High latency causing missed action windows in time-sensitive environments.
Core Entities
Models
- GPT-4-turbo
- GPT-4o
- GPT-3.5-turbo
- Mixtral-8x7B
- PPO
- PBT
- BAD
- SAD
- Off-Belief Learning (OBL)
- Behavior Cloning
- HSP
- PPO_BC
Metrics
- Overcooked score (points per delivery)
- Hanabi score (cards played)
- Success rate (capture/escape)
- Average turns
- Accuracy
Datasets
- CoordinationQA (198 MCQs, 66 scenarios)
- Overcooked-AI layouts
- Hanabi Challenge
- CollabCapture
- CollabEscape
Benchmarks
- LLM-Coordination Benchmark
- CoordinationQA Suite

