A benchmark showing LLMs can coordinate by reading environments but struggle to model partners' beliefs and plan jointly

October 5, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

10

Authors

Saaket Agashe, Yue Fan, Anthony Reyna, Xin Eric Wang


Why It Matters For Business

LLMs can act as zero-shot coordination partners for tasks where the environment dictates the correct action (logistics routing, scripted multi-robot tasks), which cuts training time. They are unreliable when partner modeling or multi-step joint planning is required.

Summary TLDR

The paper introduces LLM-Coordination, a benchmark with two tasks: multi-turn Agentic Coordination (agents act inside games) and single-turn CoordinationQA (198 multiple-choice edge-case questions). The tested LLMs (GPT-4-turbo, GPT-4o, GPT-3.5-turbo, Mixtral) can match or beat RL on environment-driven coordination (Overcooked) and are robust to unseen partners (zero-shot cross-play), but they perform poorly when tasks require deep Theory-of-Mind (ToM) reasoning or joint planning (Hanabi and the Joint Planning questions). Two simple reasoning steps, explicit ToM inference and answer verification, reduce catastrophic mistakes and improve scores. Code is available. Practical takeaway: LLMs are promising for coordination when the environment dictates the correct action, but unreliable when partner modeling or multi-step joint planning is required.

Problem Statement

We lack a focused, comparative test of how current LLMs perform as coordination agents in pure-cooperation games. The paper asks: can LLMs act directly inside coordination environments, how do they compare to MARL baselines, and which component skills (environment reading, predicting partners' beliefs, joint planning) limit performance?

Main Contribution

A new LLM-Coordination benchmark with two settings: Agentic Coordination (LLMs act in four pure-coordination games) and CoordinationQA (198 multiple-choice edge-case questions).

A holistic empirical comparison of LLM-based agents versus multi-agent RL baselines in self-play and cross-play (zero-shot) scenarios.

A focused analysis that isolates three component skills—Environment Comprehension, Theory of Mind reasoning, and Joint Planning—and shows where LLMs succeed or fail.

Key Findings

LLM agents match or exceed RL on environment-driven Overcooked layouts.

Numbers: GPT-4-turbo: 260 (AA layout) vs PBT: 190 (Table 1)

LLMs fall far short of RL in Hanabi, a game needing deep partner-belief reasoning.

Numbers: GPT-4-turbo: 13.33 ±0.88 vs RL baselines ≈24 (Table 3)

CoordinationQA shows strengths in environment reading but weaknesses in ToM and joint planning.

Numbers: GPT-4-turbo: >80% Environment Comprehension, best Joint Planning <40% (Figure 3)

Explicit ToM reasoning and answer verification reduce catastrophic failures in Hanabi.

Numbers: Hanabi score: 4.33 → 13.33 with ToM+Verification; Bomb rate 1.00 → 0.00 (Table 6)

LLM agents are robust in zero-shot cross-play with unseen partners.

Numbers: GPT-4-turbo cross-play w/ OBL: 15.00 vs SAD cross-play: 11.33 (Table 5)

Results

Overcooked score (Asymmetric Advantages layout)

Value: GPT-4-turbo 260 ±11.55

Baseline: PBT 190.1 ±8.64

Hanabi score

Value: GPT-4-turbo 13.33 ±0.88

Baseline: Off-Belief Learning 24.10 ±0.01

CollabEscape capture rate / avg turns

Value: GPT-4-turbo capture 0.83, avg turns 4.60

Baseline: Greedy baseline capture 0.00

CoordinationQA accuracy (Environment Comprehension)

Value: GPT-4-turbo >80%

Baseline: Random baseline

CoordinationQA accuracy (Joint Planning)

Value: Best LLM <40%

Baseline: Random baseline

Who Should Care

What To Try In 7 Days

Plug a strong LLM (e.g., GPT-4-turbo) into a simple environment-driven coordination task and evaluate as a zero-shot partner.

Add a short verification step that rejects actions violating hard safety/rules to reduce catastrophic errors.

Create a small CoordinationQA-style test (edge cases) to measure environment understanding vs partner-modeling before deployment.
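
A small CoordinationQA-style test can be scored per skill so that environment understanding and partner modeling are reported separately. The sketch below is one possible way to do that, not the paper's evaluation code; the `Question` class, the skill labels, and `accuracy_by_skill` are hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One multiple-choice edge case (hypothetical schema, not the paper's)."""
    skill: str      # e.g. "environment", "theory_of_mind", "joint_planning"
    prompt: str     # text shown to the model
    options: tuple  # answer texts, labelled A, B, C, ... in order
    answer: str     # label of the correct option, e.g. "B"


def accuracy_by_skill(questions, predictions):
    """Return {skill: fraction correct}; predictions[i] answers questions[i]."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.skill] += 1
        # Compare option labels case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == question.answer.upper():
            correct[question.skill] += 1
    return {skill: correct[skill] / total[skill] for skill in total}
```

Reporting one accuracy per skill, rather than a single overall number, keeps a high environment-reading score from hiding a weak joint-planning score.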

Agent Features

Memory

  • Long-term (game rules/procedures)
  • Working memory (current state text)
  • Episodic memory (previous actions)
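
The three memory types above can be pictured as three text sources that are combined into one prompt on every turn. The following is a minimal sketch under that assumption; the `AgentMemory` class, its field names, the prompt layout, and the five-action history window are all illustrative choices, not details taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical container for the three memory types listed above."""
    rules: str                 # long-term: game rules and procedures
    state_text: str = ""       # working: current state described in text
    # Episodic: only the most recent actions are kept (window size is assumed).
    history: deque = field(default_factory=lambda: deque(maxlen=5))

    def observe(self, state_text: str) -> None:
        """Overwrite working memory with the latest state description."""
        self.state_text = state_text

    def record(self, action: str) -> None:
        """Append an action; the oldest one drops out once the window is full."""
        self.history.append(action)

    def build_prompt(self) -> str:
        """Join the three memories into a single prompt string."""
        past = "\n".join(f"- {a}" for a in self.history) or "- (none yet)"
        return (
            f"Rules:\n{self.rules}\n\n"
            f"Previous actions:\n{past}\n\n"
            f"Current state:\n{self.state_text}\n\n"
            "Next action:"
        )
```

Bounding the episodic window keeps the prompt length roughly constant over a long game, at the cost of forgetting older actions.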

Planning

  • Single-step action selection
  • Explicit Theory-of-Mind reasoning step
  • Answer verification before action
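
The verification step can be approximated as a propose-then-check loop: a candidate action is accepted only if it is legal and passes a safety check, otherwise the model is asked again. The sketch below assumes the caller supplies `propose` (a wrapper around the LLM) and `is_safe` (a rule check); both callbacks, the function name, and the retry limit are hypothetical, and the paper's own verification prompt is not reproduced here.

```python
from typing import Callable, Iterable, List, Optional


def choose_verified_action(
    propose: Callable[[str, List[str]], str],
    is_safe: Callable[[str, str], bool],
    state: str,
    legal_actions: Iterable[str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first proposed action that is legal and safe, else None.

    propose(state, rejected) is an assumed LLM wrapper that sees the actions
    already rejected this turn; is_safe(state, action) is an assumed rule check.
    """
    legal = list(legal_actions)
    rejected: List[str] = []
    for _ in range(max_attempts):
        action = propose(state, rejected)
        if action in legal and is_safe(state, action):
            return action
        # Remember the failure so the next proposal can avoid repeating it.
        rejected.append(action)
    return None  # the caller decides on a fallback, e.g. a known-safe default
```

Returning `None` instead of a guessed action leaves the fallback policy explicit, which matters in games like Hanabi where a single bad play is costly.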

Tool Use

  • Grounding module to map language to game actions
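
A grounding module has to turn free-form model text into one of the actions the game accepts. One simple way to sketch this is fuzzy string matching with Python's standard `difflib`; this is an illustrative stand-in, not the mechanism the paper describes, and `ground_action` and the `cutoff` default are assumptions.

```python
import difflib
from typing import Optional, Sequence


def ground_action(
    llm_output: str,
    legal_actions: Sequence[str],
    cutoff: float = 0.6,
) -> Optional[str]:
    """Map free text to the closest legal action, or None if nothing is close."""
    text = llm_output.strip().lower()
    by_lower = {action.lower(): action for action in legal_actions}
    if text in by_lower:
        return by_lower[text]  # exact match, ignoring case and whitespace
    # Otherwise take the single best fuzzy match at or above the cutoff.
    matches = difflib.get_close_matches(text, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None
```

For example, `ground_action("Place onion in pot.", ["place onion in pot", "pick up plate"])` resolves to `"place onion in pot"`, while unrelated text returns `None` so the caller can re-prompt.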

Frameworks

  • ReAct
  • Self-Verification
  • Self-Consistency
  • Cognitive Architectures for Language Agents

Is Agentic

true

Architectures

  • LLM-based agent with cognitive architecture scaffold

Collaboration

  • Self-play evaluation
  • Cross-play / Zero-shot coordination
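
Self-play pairs an agent with a copy of itself, while cross-play pairs it with a different, previously unseen partner. The sketch below shows one way to compute both from a pairwise score table; `play_episode` is a hypothetical caller-supplied function that runs one game and returns its score, and the function names and episode count are assumptions.

```python
from itertools import product
from statistics import mean


def cross_play_matrix(agents, play_episode, episodes=3):
    """Average score for every ordered pair of agents.

    agents: dict mapping a name to a policy object.
    play_episode(policy_a, policy_b, seed) -> float is assumed to run one game.
    """
    scores = {}
    for (name_a, a), (name_b, b) in product(agents.items(), repeat=2):
        scores[(name_a, name_b)] = mean(
            play_episode(a, b, seed) for seed in range(episodes)
        )
    return scores


def summarize(scores):
    """Split the table into self-play (diagonal) and cross-play (off-diagonal)."""
    self_play = [v for (a, b), v in scores.items() if a == b]
    cross_play = [v for (a, b), v in scores.items() if a != b]
    if not cross_play:
        raise ValueError("cross-play needs at least two distinct agents")
    return {"self_play": mean(self_play), "cross_play": mean(cross_play)}
```

A large gap between the two averages suggests an agent has adapted to its own conventions rather than to partners in general.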

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High latency and compute for large LLMs make them unsuitable for real-time tasks.
  • Prompt and procedural memory require manual configuration to get good behavior.
  • CoordinationQA was manually curated, limiting scalability and introducing selection bias.

When Not To Use

  • Real-time systems that need low-latency decisions.
  • Tasks that require deep partner belief modeling or tight error margins.
  • Resource-constrained deployments where LLM compute is infeasible.

Failure Modes

  • Hallucinated actions that break game rules and cause catastrophic loss (Hanabi bombs).
  • Poor joint planning leading to worse-than-random decisions on multi-step coordination.
  • High latency causing missed action windows in time-sensitive environments.

Core Entities

Models

  • GPT-4-turbo
  • GPT-4o
  • GPT-3.5-turbo
  • Mixtral-8x7B
  • PPO
  • PBT
  • BAD
  • SAD
  • Off-Belief Learning (OBL)
  • Behavior Cloning
  • HSP
  • PPO_BC

Metrics

  • Overcooked score (points per delivery)
  • Hanabi score (cards played)
  • Success rate (capture/escape)
  • Average turns
  • Accuracy

Datasets

  • CoordinationQA (198 MCQs, 66 scenarios)
  • Overcooked-AI layouts
  • Hanabi Challenge
  • CollabCapture
  • CollabEscape

Benchmarks

  • LLM-Coordination Benchmark
  • CoordinationQA Suite