Overview
LLMs can be used off-the-shelf for environment-driven coordination, but they need an extra verification step and are too slow and costly for real-time, safety-critical multi-agent deployment.
Citations: 10
Evidence Strength: 0.75
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
LLMs can act as zero-shot coordination partners for tasks where the environment dictates the correct action (logistics routing, scripted multi-robot tasks), which cuts training time. They are unreliable, however, when partner modeling or multi-step joint planning is required.
Who Should Care
Summary TLDR
The paper introduces LLM-Coordination, a benchmark with two tasks: multi-turn Agentic Coordination (agents act inside games) and single-turn CoordinationQA (198 multiple-choice edge-case questions). Tested LLMs (GPT-4-turbo, GPT-4o, GPT-3.5-turbo, Mixtral) can match or beat RL on environment-driven coordination (Overcooked), are robust to unseen partners (zero-shot cross-play), but perform poorly when tasks require deep Theory-of-Mind (ToM) or joint planning (Hanabi and Joint Planning questions). Simple reasoning steps—explicit ToM inference and answer verification—reduce catastrophic mistakes and improve scores. Code is available. Practical takeaway: LLMs are promising for coordination when
Problem Statement
We lack a focused, comparative test of how current LLMs perform as coordination agents in pure-cooperation games. The paper asks: can LLMs act directly inside coordination environments, how do they compare to MARL baselines, and which component skills (environment reading, predicting partners' beliefs, joint planning) limit performance?
Main Contribution
A new LLM-Coordination benchmark with two settings: Agentic Coordination (LLMs act in four pure-coordination games) and CoordinationQA (198 multiple-choice edge-case questions).
A holistic empirical comparison of LLM-based agents versus multi-agent RL baselines in self-play and cross-play (zero-shot) scenarios.
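The difference between the self-play and cross-play settings named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `play`, `self_play_score`, and `cross_play_score` are hypothetical names, not the benchmark's API, and the scores in `TOY_SCORES` are made-up placeholders rather than results from the paper.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical episode runner: takes two agent names, returns the team score.
PlayFn = Callable[[str, str], float]

def self_play_score(agent: str, play: PlayFn, episodes: int = 3) -> float:
    """Average score when the agent is paired with a copy of itself."""
    return mean(play(agent, agent) for _ in range(episodes))

def cross_play_score(agent: str, partners: Sequence[str], play: PlayFn,
                     episodes: int = 3) -> float:
    """Average score when the agent is paired with partners it has not seen before."""
    return mean(play(agent, p) for p in partners for _ in range(episodes))

# Placeholder scores for illustration only; NOT taken from the paper.
TOY_SCORES = {("A", "A"): 10, ("A", "B"): 6, ("A", "C"): 8}

def toy_play(a: str, b: str) -> float:
    return TOY_SCORES[(a, b)]

sp = self_play_score("A", toy_play)
xp = cross_play_score("A", ["B", "C"], toy_play)
print(f"self-play={sp}, cross-play={xp}")
```

A large gap between the two numbers would suggest the agent has adapted to its own conventions rather than to the task, which is what a zero-shot cross-play comparison is meant to expose.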
Key Findings
LLM agents match or exceed RL on environment-driven Overcooked layouts.
LLMs fall far short of RL in Hanabi, a game needing deep partner-belief reasoning.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overcooked score (Asymmetric Advantages layout) | GPT-4-turbo 260 ±11.55 | PBT 190.1 ±8.64 | ≈ +69.9 | Overcooked AA layout | GPT-4-turbo outperforms PBT in AA layout | Table 1 |
| Hanabi score | GPT-4-turbo 13.33 ±0.88 | Off-Belief Learning 24.10 ±0.01 | ≈ -10.8 | Hanabi Challenge | LLM lags behind RL baselines in Hanabi | Table 3 |
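The Delta column is the LLM score minus the baseline score. A minimal Python check of that arithmetic, using only the means from the table above (the ± terms are not propagated, and the `results` and `delta` names are illustrative):

```python
# Mean scores copied from the Results table; the ± terms are ignored here.
results = {
    "Overcooked AA": {"llm": 260.0, "baseline": 190.1},
    "Hanabi": {"llm": 13.33, "baseline": 24.10},
}

def delta(llm: float, baseline: float) -> float:
    """Signed difference: positive means the LLM scored above the baseline."""
    return round(llm - baseline, 2)

deltas = {name: delta(**row) for name, row in results.items()}
for name, value in deltas.items():
    print(f"{name}: {value:+.2f}")
```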
What To Try In 7 Days
Plug a strong LLM (e.g., GPT-4-turbo) into a simple environment-driven coordination task and evaluate as a zero-shot partner.
Add a short verification step that rejects actions violating hard safety/rules to reduce catastrophic errors.
Create a small CoordinationQA-style test (edge cases) to measure environment understanding vs partner-modeling before deployment.
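The verification step suggested above could look roughly like the following Python sketch. It assumes the environment can list its legal actions as strings and that the model is wrapped in a `propose` callable that accepts optional feedback. Both are assumptions made for illustration; `choose_action` and `stub_propose` are hypothetical names, not the paper's implementation.

```python
from typing import Callable, Optional, Sequence

# Hypothetical wrapper around an LLM call: receives feedback text (or None)
# and returns the action the model proposes.
ProposeFn = Callable[[Optional[str]], str]

def choose_action(propose: ProposeFn, legal_actions: Sequence[str],
                  max_retries: int = 2, fallback: Optional[str] = None) -> str:
    """Return a proposed action only if it is legal; otherwise retry, then fall back."""
    legal = list(legal_actions)
    if not legal:
        raise ValueError("no legal actions available")
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        action = propose(feedback)
        if action in legal:
            return action
        # Tell the model why the action was rejected before asking again.
        feedback = f"'{action}' is not a legal action; choose one of {legal}"
    # All attempts failed: use the caller's safe default if it is legal.
    return fallback if fallback in legal else legal[0]

# Stub proposer standing in for a model: first reply is illegal, second is legal.
replies = iter(["play card 9", "discard card 1"])

def stub_propose(feedback: Optional[str]) -> str:
    return next(replies)

chosen = choose_action(stub_propose, ["play card 0", "discard card 1", "hint red"])
print(chosen)
```

Falling back to the first legal action is an arbitrary choice in this sketch; a real deployment would likely pick a domain-specific safe action instead.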
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Risks & Boundaries
Limitations
The high latency and compute cost of large LLMs make them unsuitable for real-time tasks.
Prompts and procedural memory must be configured manually to get good behavior.
When Not To Use
Real-time systems that need low-latency decisions.
Tasks that require deep partner belief modeling or tight error margins.
Failure Modes
Hallucinated actions that break game rules and cause catastrophic loss (Hanabi bombs).
Poor joint planning leading to worse-than-random decisions on multi-step coordination.

