A benchmark showing LLMs can coordinate by reading environments but struggle with partners' beliefs and joint planning

October 5, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

LLMs can be used off-the-shelf for environment-heavy coordination but require extra verification and are too slow and costly for real-time, safety-critical multi-agent deployment.

Citations: 10

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Saaket Agashe, Yue Fan, Anthony Reyna, Xin Eric Wang

Links

Abstract / PDF / Code

Why It Matters For Business

LLMs can act as zero-shot coordination partners for tasks where the environment dictates the correct action (logistics routing, scripted multi-robot tasks), cutting training time; but they are unreliable when partner modeling or multi-step joint planning is required.

Who Should Care

Summary TLDR

The paper introduces LLM-Coordination, a benchmark with two tasks: multi-turn Agentic Coordination (agents act inside games) and single-turn CoordinationQA (198 multiple-choice edge-case questions). The tested LLMs (GPT-4-turbo, GPT-4o, GPT-3.5-turbo, Mixtral) match or beat RL baselines on environment-driven coordination (Overcooked) and are robust to unseen partners (zero-shot cross-play), but they perform poorly when tasks require deep Theory-of-Mind (ToM) reasoning or joint planning (Hanabi and the Joint Planning questions). Two simple reasoning steps, explicit ToM inference and answer verification, reduce catastrophic mistakes and improve scores. Code is available. Practical takeaway: LLMs are promising for coordination when…

Problem Statement

We lack a focused, comparative test of how current LLMs perform as coordination agents in pure-cooperation games. The paper asks: can LLMs act directly inside coordination environments, how do they compare to MARL baselines, and which component skills (environment reading, predicting partners' beliefs, joint planning) limit performance?

Main Contribution

A new LLM-Coordination benchmark with two settings: Agentic Coordination (LLMs act in four pure-coordination games) and CoordinationQA (198 multiple-choice edge-case questions).

A holistic empirical comparison of LLM-based agents versus multi-agent RL baselines in self-play and cross-play (zero-shot) scenarios.
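The difference between self-play and cross-play evaluation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `play_episode` stands in for a real game rollout, and the agent names, the `CONVENTION` table, and the toy scores are invented. None of it is the benchmark's API or its results.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]

def evaluate_pairings(agents: List[str],
                      play_episode: Callable[[str, str], float]) -> Scores:
    """Run one episode for every ordered pairing of agents, including self-pairings."""
    return {(a, b): play_episode(a, b) for a, b in product(agents, repeat=2)}

def self_play_mean(scores: Scores) -> float:
    """Average score when an agent is paired with a copy of itself."""
    return mean(s for (a, b), s in scores.items() if a == b)

def cross_play_mean(scores: Scores) -> float:
    """Average score when an agent is paired with a different, unseen partner."""
    return mean(s for (a, b), s in scores.items() if a != b)

# Toy stand-in for a real rollout (hypothetical): each agent follows a fixed
# convention, and a pairing scores well only if the conventions are compatible.
CONVENTION = {"rl_seed_0": "left", "rl_seed_1": "right", "adaptive": "any"}

def toy_episode(a: str, b: str) -> float:
    ca, cb = CONVENTION[a], CONVENTION[b]
    return 10.0 if ca == cb or "any" in (ca, cb) else 2.0

scores = evaluate_pairings(list(CONVENTION), toy_episode)
print("self-play mean:", self_play_mean(scores))
print("cross-play mean:", cross_play_mean(scores))
```

A large gap between the two means would suggest that agents rely on conventions learned with a specific partner; a small gap would suggest zero-shot robustness.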

Key Findings

LLM agents match or exceed RL on environment-driven Overcooked layouts.

Numbers: GPT-4-turbo: 260 (AA layout) vs PBT: 190 (Table 1)

Practical Use: Use LLM agents for coordination tasks dominated by observable environment state and role assignment; expect competitive performance without game-specific training.

Evidence Ref: Table 1

LLMs fall far short of RL in Hanabi, a game needing deep partner-belief reasoning.

Numbers: GPT-4-turbo: 13.33 ±0.88 vs RL baselines ≈24 (Table 3)

Practical Use: Avoid deploying current LLM agents where implicit partner beliefs and tight error margins matter; RL or specialized methods are still better for such tasks.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overcooked score (Asymmetric Advantages layout) | GPT-4-turbo 260 ±11.55 | PBT 190.1 ±8.64 | ≈ +70 | Overcooked AA layout | GPT-4-turbo outperforms PBT in AA layout | Table 1
Hanabi score | GPT-4-turbo 13.33 ±0.88 | Off-Belief Learning 24.10 ±0.01 | ≈ -10.8 | Hanabi Challenge | LLM lags behind RL baselines in Hanabi | Table 3
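The Delta column is the reported value minus the baseline. A short sketch of that arithmetic, using only the means from the two rows above, follows; the `delta` helper is a hypothetical name, and the ± spreads are ignored.

```python
def delta(value: float, baseline: float) -> float:
    """Difference between a reported mean and its baseline mean, rounded to 2 places."""
    return round(value - baseline, 2)

# Means taken from the Results rows; the ± spreads are not used here.
overcooked_delta = delta(260.0, 190.1)   # GPT-4-turbo vs PBT, Overcooked AA layout
hanabi_delta = delta(13.33, 24.10)       # GPT-4-turbo vs Off-Belief Learning, Hanabi

print(overcooked_delta)  # 69.9
print(hanabi_delta)      # -10.77
```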

What To Try In 7 Days

Plug a strong LLM (e.g., GPT-4-turbo) into a simple environment-driven coordination task and evaluate as a zero-shot partner.

Add a short verification step that rejects actions violating hard safety/rules to reduce catastrophic errors.

Create a small CoordinationQA-style test (edge cases) to measure environment understanding vs partner-modeling before deployment.
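The third item could start from something like the sketch below: a handful of multiple-choice questions tagged by skill, scored per skill so that environment understanding and partner modeling are reported separately. The skill labels, the `Question` type, the sample questions, and the `always_a` stand-in for a model call are all hypothetical; this is not the CoordinationQA data or its official scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Question:
    skill: str                 # assumed labels: "environment", "partner_beliefs", "joint_planning"
    prompt: str
    options: Dict[str, str]    # option letter -> option text
    answer: str                # correct option letter

def accuracy_by_skill(questions: List[Question],
                      choose: Callable[[Question], str]) -> Dict[str, float]:
    """Fraction of questions answered correctly, grouped by skill."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        total[q.skill] += 1
        if choose(q).strip().upper() == q.answer:
            correct[q.skill] += 1
    return {skill: correct[skill] / total[skill] for skill in total}

# Invented sample questions, for illustration only.
SAMPLE = [
    Question("environment", "Which counter holds the only clean plate?",
             {"A": "Left counter", "B": "Right counter"}, "A"),
    Question("partner_beliefs", "Your partner just hinted 'red'. What do they expect?",
             {"A": "Discard", "B": "Play the red card"}, "B"),
    Question("joint_planning", "Who should fetch onions so the soup finishes fastest?",
             {"A": "You", "B": "Partner"}, "A"),
    Question("joint_planning", "Which door should you hold so the partner can pass?",
             {"A": "North", "B": "South"}, "B"),
]

def always_a(q: Question) -> str:
    """Stand-in for a model call: always answers 'A'."""
    return "A"

report = accuracy_by_skill(SAMPLE, always_a)
print(report)
```

In practice `always_a` would be replaced by a function that prompts the model and parses the chosen option letter. A constant-answer baseline like this one is still worth keeping as a sanity check on the question set.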

Agent Features

Memory
Long-term (game rules/procedures), Working memory (current state text), Episodic memory (previous actions)
Planning
Single-step action selection, Explicit Theory-of-Mind reasoning step, Answer verification before action
Tool Use
Grounding module to map language to game actions
Frameworks
ReAct, Self-Verification, Self-Consistency, Cognitive Architectures for Language Agents
Is Agentic

Yes

Architectures
LLM-based agent with cognitive architecture scaffold
Collaboration
Self-play evaluation, Cross-play / Zero-shot coordination
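The features listed above (three kinds of memory, a Theory-of-Mind step, verification before acting, and a grounding module) can be wired together roughly as in the sketch below. It is a sketch under stated assumptions, not the paper's implementation: `llm` is any text-in, text-out callable, the prompts are invented, and verification here is a plain legality check with a fallback action rather than an LLM self-verification pass.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class AgentMemory:
    rules: str                                         # long-term: game rules/procedures
    state: str = ""                                    # working: current state text
    history: List[str] = field(default_factory=list)   # episodic: previous actions

def ground(text: str, legal_actions: List[str]) -> Optional[str]:
    """Map free-form model output to a legal action, or None if nothing matches."""
    normalized = text.strip().lower()
    for action in legal_actions:
        if action.lower() == normalized:
            return action
    return None

def act(memory: AgentMemory, legal_actions: List[str],
        llm: Callable[[str], str], fallback: str) -> str:
    """One decision step: infer partner intent, propose an action, verify, then act."""
    if fallback not in legal_actions:
        raise ValueError("fallback must itself be a legal action")
    context = (f"Rules: {memory.rules}\nState: {memory.state}\n"
               f"Previous actions: {memory.history}\n")
    # Explicit Theory-of-Mind step (prompt wording is an assumption).
    partner_intent = llm(context + "What does your partner likely intend next?")
    proposal = llm(context + f"Partner intent: {partner_intent}\n"
                   f"Legal actions: {legal_actions}\nChoose one action.")
    # Verification (assumed rule-based): reject anything that is not a legal action.
    action = ground(proposal, legal_actions)
    if action is None:
        action = fallback
    memory.history.append(action)
    return action

# Usage with a fake model that proposes an illegal move.
def fake_llm(prompt: str) -> str:
    return "Play Card 9" if "Choose one action" in prompt else "Partner wants a hint"

memory = AgentMemory(rules="Toy card game", state="You hold two cards.")
legal = ["play card 1", "give hint", "discard card 1"]
chosen = act(memory, legal, fake_llm, fallback="give hint")
print(chosen)  # give hint
```

Falling back to a known-safe action keeps an illegal proposal from reaching the environment. A real deployment might instead re-prompt the model with the rejection reason.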

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

High latency and compute for large LLMs make them unsuitable for real-time tasks.

Prompts and procedural memory must be configured by hand to get good behavior.

When Not To Use

Real-time systems that need low-latency decisions.

Tasks that require deep partner belief modeling or tight error margins.

Failure Modes

Hallucinated actions that break game rules and cause catastrophic loss (Hanabi bombs).

Poor joint planning leading to worse-than-random decisions on multi-step coordination.

Core Entities

Models

GPT-4-turbo, GPT-4o, GPT-3.5-turbo, Mixtral-8x7B, PPO, PBT, BAD, SAD, Off-Belief Learning, Behavior Cloning, HSP, PPO_BC, OBL

Metrics

Overcooked score (points per delivery), Hanabi score (cards played), Success rate (capture/escape), Average turns, Accuracy

Datasets

CoordinationQA (198 MCQs, 66 scenarios), Overcooked-AI layouts, Hanabi Challenge, CollabCapture, CollabEscape

Benchmarks

LLM-Coordination Benchmark, CoordinationQA Suite