A benchmark showing LLMs can coordinate by reading environments but struggle with modeling partners' beliefs and joint planning

Overview

Decision Snapshot: Needs Validation

LLMs can be used off-the-shelf for environment-heavy coordination but require extra verification and are too slow and costly for real-time, safety-critical multi-agent deployment.

Citations: 10

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Saaket Agashe, Yue Fan, Anthony Reyna, Xin Eric Wang

Links

Abstract / PDF / Code

Why It Matters For Business

LLMs can act as zero-shot coordination partners for tasks where the environment dictates the correct action (logistics routing, scripted multi-robot tasks), cutting training time; but they are unreliable when partner modeling or multi-step joint planning is required.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

The paper introduces LLM-Coordination, a benchmark with two tasks: multi-turn Agentic Coordination (agents act inside games) and single-turn CoordinationQA (198 multiple-choice edge-case questions). Tested LLMs (GPT-4-turbo, GPT-4o, GPT-3.5-turbo, Mixtral) can match or beat RL on environment-driven coordination (Overcooked), are robust to unseen partners (zero-shot cross-play), but perform poorly when tasks require deep Theory-of-Mind (ToM) or joint planning (Hanabi and Joint Planning questions). Simple reasoning steps—explicit ToM inference and answer verification—reduce catastrophic mistakes and improve scores. Code is available. Practical takeaway: LLMs are promising for coordination when

Problem Statement

We lack a focused, comparative test of how current LLMs perform as coordination agents in pure-cooperation games. The paper asks: can LLMs act directly inside coordination environments, how do they compare to MARL baselines, and which component skills (environment reading, predicting partners' beliefs, joint planning) limit performance?

Main Contribution

A new LLM-Coordination benchmark with two settings: Agentic Coordination (LLMs act in four pure-coordination games) and CoordinationQA (198 multiple-choice edge-case questions).

A holistic empirical comparison of LLM-based agents versus multi-agent RL baselines in self-play and cross-play (zero-shot) scenarios.

Key Findings

LLM agents match or exceed RL on environment-driven Overcooked layouts.

Numbers: GPT-4-turbo: 260 (AA layout) vs PBT: 190 (Table 1)

Practical Use: Use LLM agents for coordination tasks dominated by observable environment state and role assignment; expect competitive performance without game-specific training.

Evidence Ref: Table 1

LLMs fall far short of RL in Hanabi, a game needing deep partner-belief reasoning.

Numbers: GPT-4-turbo: 13.33 ±0.88 vs RL baselines ≈24 (Table 3)

Practical Use: Avoid deploying current LLM agents where implicit partner beliefs and tight error margins matter; RL or specialized methods are still better for such tasks.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overcooked score (Asymmetric Advantages layout)	GPT-4-turbo 260 ±11.55	PBT 190.1 ±8.64	+~70	Overcooked AA layout	GPT-4-turbo outperforms PBT in AA layout	Table 1
Hanabi score	GPT-4-turbo 13.33 ±0.88	Off-Belief Learning 24.10 ±0.01	-~10.8	Hanabi Challenge	LLM lags behind RL baselines in Hanabi	Table 3
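The Delta column can be reproduced from the reported means. The short sketch below recomputes it; the dictionary layout and the `delta` helper are illustrative names, not part of the paper's code, and only the four mean values come from the table above.

```python
# Reported means from the Results table (Table 1 and Table 3 of the paper).
# The structure and names here are hypothetical, chosen for illustration.
results = {
    "Overcooked AA": {"llm": 260.0, "baseline": 190.1},
    "Hanabi": {"llm": 13.33, "baseline": 24.10},
}


def delta(llm: float, baseline: float) -> float:
    """Absolute difference, positive when the LLM agent scores higher."""
    return round(llm - baseline, 2)


deltas = {name: delta(r["llm"], r["baseline"]) for name, r in results.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

The differences come out to roughly +70 for Overcooked and roughly -10.8 for Hanabi, matching the rounded values in the table.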

What To Try In 7 Days

Plug a strong LLM (e.g., GPT-4-turbo) into a simple environment-driven coordination task and evaluate as a zero-shot partner.

Add a short verification step that rejects actions violating hard safety/rules to reduce catastrophic errors.

Create a small CoordinationQA-style test (edge cases) to measure environment understanding vs partner-modeling before deployment.
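A small CoordinationQA-style check could be scored per skill category so that environment understanding and partner modeling are reported separately. The sketch below is one way to do that; the `QAItem` fields and the category labels are assumptions made for illustration and do not reflect the benchmark's actual file format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    category: str   # hypothetical labels: "environment", "partner_beliefs", "joint_planning"
    correct: str    # gold option label, e.g. "B"
    predicted: str  # option label returned by the model under test


def accuracy_by_category(items):
    """Return {category: fraction of items answered correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in items:
        totals[item.category] += 1
        # Compare option labels case-insensitively and ignore stray whitespace.
        if item.predicted.strip().upper() == item.correct.strip().upper():
            hits[item.category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Made-up answers, only to show the shape of the output.
sample = [
    QAItem("environment", "A", "A"),
    QAItem("environment", "B", "b"),
    QAItem("partner_beliefs", "C", "A"),
    QAItem("joint_planning", "D", "D"),
    QAItem("joint_planning", "A", "C"),
]
scores = accuracy_by_category(sample)
```

A large gap between the environment score and the other two categories would suggest the task leans on skills where the paper reports LLMs to be weak.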

Agent Features

Memory

Long-term (game rules/procedures)

Working memory (current state text)

Episodic memory (previous actions)

Planning

Single-step action selection

Explicit Theory-of-Mind reasoning step

Answer verification before action
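The three planning features can be combined into a single decision step. The sketch below is a minimal illustration, not the authors' implementation: `llm` is any callable that maps a prompt string to a reply string, the prompts are invented, and the verification here is a plain rule check against a list of legal actions rather than an LLM-based self-verification.

```python
from typing import Callable, Sequence


def choose_action(
    llm: Callable[[str], str],
    state: str,
    legal_actions: Sequence[str],
    safe_default: str,
    max_attempts: int = 3,
) -> str:
    # Step 1: explicit Theory-of-Mind inference about the partner.
    partner_intent = llm(
        f"State: {state}\nWhat does my partner most likely intend to do next?"
    )
    # Step 2: propose a single action, then verify it before acting.
    for _ in range(max_attempts):
        proposal = llm(
            f"State: {state}\nPartner intent: {partner_intent}\n"
            f"Choose exactly one action from: {', '.join(legal_actions)}"
        ).strip()
        if proposal in legal_actions:  # hard rule check (assumed verifier)
            return proposal
    # Every proposal was rejected: fall back to a caller-chosen safe action.
    return safe_default


def make_scripted_llm(replies):
    """Toy stand-in for a model: returns the given replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


scripted = make_scripted_llm(
    ["partner wants a clue", "play card 9", "discard card 1"]
)
action = choose_action(
    scripted,
    "two cards in hand",
    ["play card 1", "discard card 1", "give clue"],
    safe_default="give clue",
)
```

In this toy run the first proposal names a card that does not exist, so it is rejected and the second proposal is used instead.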

Tool Use

Grounding module to map language to game actions
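A grounding step of this kind can be approximated with fuzzy string matching. The sketch below assumes the environment exposes its legal actions as strings; the `ground` function and the cutoff value are illustrative choices, and the paper's own grounding module may work differently.

```python
import difflib
from typing import Optional, Sequence


def ground(text: str, legal_actions: Sequence[str], cutoff: float = 0.6) -> Optional[str]:
    """Map free-form model output to one legal action, or None if nothing is close."""
    normalized = text.strip().lower()
    lookup = {a.lower(): a for a in legal_actions}
    if normalized in lookup:  # exact match after normalization
        return lookup[normalized]
    # Otherwise take the closest legal action above the similarity cutoff.
    matches = difflib.get_close_matches(normalized, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


actions = ["pick up onion", "place onion in pot"]
exact = ground("Pick up onion", actions)
fuzzy = ground("pick up onions", actions)
unknown = ground("dance", actions)
```

Returning `None` for unmatched text lets the caller re-prompt or fall back instead of executing a guessed action.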

Frameworks

ReAct, Self-Verification, Self-Consistency, Cognitive Architectures for Language Agents

Is Agentic

Yes

Architectures

LLM-based agent with cognitive architecture scaffold

Collaboration

Self-play evaluation, Cross-play / Zero-shot coordination
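The difference between the two evaluation modes can be made concrete with a pairing matrix: self-play pairs an agent with a copy of itself, while cross-play pairs it with a partner it has not seen before. The sketch below assumes a hypothetical `play_episode(agent_a, agent_b)` function that returns a score; the toy agents are stand-ins, not the benchmark's agents.

```python
from itertools import product
from statistics import mean


def evaluate(agents: dict, play_episode, episodes: int = 3):
    """Score every ordered pairing, then summarize self-play and cross-play.

    Needs at least two agents, otherwise there are no cross-play pairs.
    """
    scores = {}
    for a, b in product(agents, repeat=2):
        scores[(a, b)] = mean(
            play_episode(agents[a], agents[b]) for _ in range(episodes)
        )
    self_play = mean(v for (a, b), v in scores.items() if a == b)
    cross_play = mean(v for (a, b), v in scores.items() if a != b)
    return scores, self_play, cross_play


# Toy setup: each "agent" is just a convention id. Partners that share a
# convention coordinate well; mismatched partners score lower.
def toy_episode(agent_a, agent_b):
    return 10 if agent_a == agent_b else 4


toy_agents = {"x": 0, "y": 1}
pair_scores, self_play_score, cross_play_score = evaluate(toy_agents, toy_episode)
```

A large drop from the self-play score to the cross-play score indicates agents that rely on conventions their partners do not share, which is the failure zero-shot coordination tests are meant to expose.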

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/eric-ai-lab/llm_coordination

Risks & Boundaries

Limitations

High latency and compute for large LLMs make them unsuitable for real-time tasks.

Prompt and procedural memory require manual configuration to get good behavior.

When Not To Use

Real-time systems that need low-latency decisions.

Tasks that require deep partner belief modeling or tight error margins.

Failure Modes

Hallucinated actions that break game rules and cause catastrophic loss (Hanabi bombs).

Poor joint planning leading to worse-than-random decisions on multi-step coordination.

Core Entities

Models

GPT-4-turbo, GPT-4o, GPT-3.5-turbo, Mixtral-8x7B, PPO, PBT, BAD, SAD, Off-Belief Learning, Behavior Cloning, HSP, PPO_BC, OBL

Metrics

Overcooked score (points per delivery), Hanabi score (cards played), Success rate (capture/escape), Average turns, Accuracy

Datasets

CoordinationQA (198 MCQs, 66 scenarios), Overcooked-AI layouts, Hanabi Challenge, CollabCapture, CollabEscape

Benchmarks

LLM-Coordination Benchmark, CoordinationQA Suite
