Use natural-language instructions + LLM priors to steer multi‑agent RL toward human-friendly equilibria

Overview

Decision Snapshot: Ready For Pilot

The method is experimentally validated on a toy game and Hanabi with human tests; it is practical when actions/observations can be textified, but needs engineering for real-world grounding and has added LLM cost.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 60%

Authors

Hengyuan Hu, Dorsa Sadigh

Why It Matters For Business

You can steer multi-agent systems to human-friendly conventions without costly human behavior datasets; showing the agent's instruction to users sharply improves team performance and trust.

Who Should Care

ML Engineer, Product Manager, Engineering Lead

Summary TLDR

The paper introduces instructRL: use an LLM to convert a human instruction plus a short language description of the current observation into a prior policy, then regularize RL with that prior so agents converge to equilibria humans expect. The method is evaluated on a toy 'Say-Select' game and the Hanabi benchmark. instructRL reliably yields human-like conventions, keeps self-play scores competitive with vanilla RL, and, when humans are shown the training instruction, dramatically improves human-AI coordination in a user study.
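The prior construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt template, the `score_fn` interface, the `temperature` parameter, and the toy instruction and actions are all assumptions made for the example. A real setup would replace `toy_score` with a call that returns an LLM's score for the action text.

```python
import math
from typing import Callable, Dict, List


def build_prompt(instruction: str, observation: str, action: str) -> str:
    """Assemble a text prompt. The template is illustrative, not the paper's."""
    return (
        f"Instruction: {instruction}\n"
        f"Observation: {observation}\n"
        f"Action: {action}"
    )


def llm_prior(
    instruction: str,
    observation: str,
    actions: List[str],
    score_fn: Callable[[str], float],
    temperature: float = 1.0,
) -> Dict[str, float]:
    """Turn per-action scores into a probability distribution with a softmax."""
    scores = [
        score_fn(build_prompt(instruction, observation, a)) / temperature
        for a in actions
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for an LLM score; favors one hard-coded action."""
    return 2.0 if prompt.endswith("select ball 1") else 0.0


# Made-up instruction, observation, and action descriptions.
prior = llm_prior(
    instruction="If your partner says 1, select ball 1.",
    observation="Your partner said 1.",
    actions=["select ball 1", "select ball 2", "quit"],
    score_fn=toy_score,
)
print(prior)
```

The resulting dictionary is a distribution over textified actions that a regularized RL update can consume.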

Problem Statement

Multi-agent RL can converge to many equally optimal equilibria. Without human data, learned policies often use conventions that are hard for people to follow. The paper asks: can humans simply tell an AI the convention they want in plain language and have RL converge to that equilibrium?

Main Contribution

Propose instructRL: build an LLM-conditioned prior from a natural-language instruction plus short language observations, then regularize RL toward that prior.

Two implementations: instructQ (Q-learning + log-prior) and instructPPO (PPO + KL penalty).
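A minimal sketch of the two regularizers, assuming the forms the names suggest: a log-prior bonus added to Q-values for instructQ, and a KL term added to the PPO loss for instructPPO. The function names, the KL direction (policy relative to prior), and the small `eps` constant are assumptions for illustration; the paper may differ in detail.

```python
import math
from typing import Sequence


def instructq_action(
    q_values: Sequence[float], prior: Sequence[float], lam: float, eps: float = 1e-8
) -> int:
    """Pick argmax_a [Q(a) + lam * log prior(a)]; ties go to the lowest index."""
    scores = [q + lam * math.log(p + eps) for q, p in zip(q_values, prior)]
    return max(range(len(scores)), key=scores.__getitem__)


def kl_divergence(
    policy: Sequence[float], prior: Sequence[float], eps: float = 1e-8
) -> float:
    """KL(policy || prior) for two discrete distributions over the same actions."""
    return sum(
        pi * math.log((pi + eps) / (p + eps))
        for pi, p in zip(policy, prior)
        if pi > 0
    )


def instructppo_loss(
    ppo_loss: float, policy: Sequence[float], prior: Sequence[float], lam: float
) -> float:
    """Add a KL penalty toward the prior to an already computed PPO loss."""
    return ppo_loss + lam * kl_divergence(policy, prior)


# Two actions are equally good under Q; the prior prefers the second one.
q_values = [1.00, 1.00, 0.20]
prior = [0.10, 0.80, 0.10]

greedy = instructq_action(q_values, prior, lam=0.0)   # plain argmax over Q
steered = instructq_action(q_values, prior, lam=0.5)  # prior breaks the tie
print(greedy, steered)
```

The example mirrors the setting the paper targets: when several actions are equally optimal, the prior decides which convention the agent settles on.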

Key Findings

In the Say-Select toy game, instructQ reliably converged to the intended human-like equilibrium.

Numbers: 10/10 random seeds converged to the instructed policy

Practical Use: If you can describe actions and observations in text, adding an LLM prior can steer RL to the convention humans want with high reliability in small multi-agent problems.

Evidence Ref: Section 5.1, Figure 3, Appendix A.1

On Hanabi, instructRL achieves self-play performance similar to vanilla RL while producing different hinting conventions.

Numbers: Self-play scores ≈ 23.8–24.25 across methods (Table 3)

Practical Use: Biasing learning toward a verbalized convention does not meaningfully reduce game performance on the tested benchmark, so you can trade convention choice without large score loss.

Evidence Ref: Table 3, Section 5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hanabi self-play scores (mean ± SE)	Q-learning 23.96 ±0.05; InstructQ(color) 23.78 ±0.05; InstructPPO(color) 24.25 ±0.03	Q-learning	Differences within ~0.5 points	Hanabi 2-player	Table 3 shows self-play and intra-AXP results	Table 3
Human-AI coordination score (mean ± SE)	Q-learning 9.80 ±3.35; InstructQ w/o instruction 7.80 ±3.23; InstructQ with instruction 18.70 ±2.18	Q-learning	+8.9 points vs Q-learning when instruction shown	Human evaluation (10 participants)	Table 4 human study results	Table 4
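The deltas in the table can be rechecked with a few lines of arithmetic. The sketch below only uses the means reported in the rows above; the dictionary keys are labels chosen for this example.

```python
# Mean scores copied from the results table above.
self_play = {
    "Q-learning": 23.96,
    "InstructQ(color)": 23.78,
    "InstructPPO(color)": 24.25,
}
human = {
    "Q-learning": 9.80,
    "InstructQ w/o instruction": 7.80,
    "InstructQ with instruction": 18.70,
}

# Spread between the best and worst self-play mean.
self_play_spread = round(max(self_play.values()) - min(self_play.values()), 2)

# Difference of each human-study mean from the Q-learning baseline.
human_deltas = {
    name: round(score - human["Q-learning"], 2) for name, score in human.items()
}

print(self_play_spread)
print(human_deltas)
```

The self-play spread stays under the "~0.5 points" stated in the table, and the with-instruction delta matches the reported +8.9.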

What To Try In 7 Days

Prototype a small environment where actions and observations can be expressed as text and generate LLM action priors.

Add a lightweight KL or log-prior regularizer to an existing RL agent and test if it converges to the desired convention.

At deployment, expose the short agent instruction to human partners and measure coordination metrics and subjective trust.
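For the measurement step, a small helper that reports scores in the same mean ± SE form as the paper's tables may be enough to start. This is a sketch using the standard sample standard error; the score list is made-up placeholder data, not study results.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_se(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the sample mean and standard error (stdev / sqrt(n))."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


# Placeholder game scores from a hypothetical pilot with five participants.
pilot_scores = [20, 18, 22, 16, 19]
avg, se = mean_and_se(pilot_scores)
print(f"{avg:.2f} ± {se:.2f}")
```

Running the same helper on sessions with and without the instruction shown gives directly comparable numbers.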

Agent Features

Memory

Short-term action-observation history (task-defined)

Planning

Regularized policy towards LLM prior

Tool Use

LLMs as action priors

OBL initialization

Frameworks

instructRL (instructQ, instructPPO)

Off-Belief Learning (OBL)

Is Agentic

Yes

Architectures

Q-learning

PPO

LLM-conditioned prior

Collaboration

Multi-agent coordination

Human-AI teaming

Optimization Features

Token Efficiency

Cache LLM logits for repeated observation-action pairs
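One way to realize this caching is a small wrapper keyed on the (observation, action) text pair. The sketch below assumes a hypothetical `score_fn(observation, action)` callable standing in for the LLM call; the class name and the hit/miss counters are illustrative.

```python
from typing import Callable, Dict, Tuple


class CachedScorer:
    """Memoize LLM scores so each (observation, action) pair is queried once."""

    def __init__(self, score_fn: Callable[[str, str], float]) -> None:
        self._score_fn = score_fn
        self._cache: Dict[Tuple[str, str], float] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, observation: str, action: str) -> float:
        key = (observation, action)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._score_fn(observation, action)
        return self._cache[key]


def fake_llm_score(observation: str, action: str) -> float:
    """Hypothetical stand-in for an expensive LLM call."""
    return float(len(observation) + len(action))


scorer = CachedScorer(fake_llm_score)
first = scorer("partner hinted red", "play card 1")
second = scorer("partner hinted red", "play card 1")  # served from the cache
print(scorer.hits, scorer.misses)
```

Caching pays off when the textified observation space is small enough that the same pairs recur across rollouts.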

Infra Optimization

Parallel rollout workers used for training

Training Optimization

Fine-tune from OBL level-1 instead of training from scratch

Anneal regularization weight λ during training
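Annealing λ can be sketched with a simple schedule. The source does not state the schedule shape or its endpoints, so the linear form and the start and end values below are assumptions chosen for illustration.

```python
def linear_anneal(step: int, total_steps: int, start: float, end: float) -> float:
    """Linearly interpolate from start to end, clamped outside [0, total_steps]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


# Hypothetical schedule: decay the regularization weight from 0.5 to 0.0.
for step in (0, 50, 100, 200):
    lam = linear_anneal(step, total_steps=100, start=0.5, end=0.0)
    print(step, lam)
```

A decaying weight lets the prior shape early exploration while leaving later updates driven mainly by reward.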

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Requires mapping actions and observations into concise text descriptions; not straightforward for continuous control.

LLM priors are coarse and can reflect LLM biases; strong regularization may harm optimality.

When Not To Use

Actions cannot be represented meaningfully in language (e.g., raw continuous joint commands).

When low-latency or zero-API-cost operation is required and LLM calls are infeasible.

Failure Modes

LLM gives wrong priors for relevant states and a high regularization weight forces suboptimal conventions.

Humans misinterpret the instruction at deployment, leading to worse coordination than opaque policies.
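The first failure mode can be made concrete with a small worked example. Assuming a log-prior regularizer of the form Q(a) + λ log p(a), the sketch below computes the λ at which a wrong prior overrides the better action. The numbers and function names are made up for illustration.

```python
import math
from typing import Sequence


def flip_threshold(
    q_best: float, q_other: float, prior_best: float, prior_other: float
) -> float:
    """Return the lam above which the prior-favored action beats the better one.

    Solves q_best + lam * log(prior_best) = q_other + lam * log(prior_other).
    """
    if q_best <= q_other or prior_other <= prior_best:
        raise ValueError("expects a better action that the prior disfavors")
    return (q_best - q_other) / (math.log(prior_other) - math.log(prior_best))


def regularized_choice(
    q_values: Sequence[float], prior: Sequence[float], lam: float
) -> int:
    """Pick argmax_a [Q(a) + lam * log prior(a)]."""
    scores = [q + lam * math.log(p) for q, p in zip(q_values, prior)]
    return max(range(len(scores)), key=scores.__getitem__)


q_values = [1.0, 0.8]     # action 0 is truly better
wrong_prior = [0.1, 0.9]  # the prior mistakenly favors action 1

threshold = flip_threshold(1.0, 0.8, 0.1, 0.9)
below = regularized_choice(q_values, wrong_prior, lam=0.5 * threshold)
above = regularized_choice(q_values, wrong_prior, lam=2.0 * threshold)
print(threshold, below, above)
```

Below the threshold the agent keeps the better action; above it, the regularizer forces the convention the prior prefers.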

Core Entities

Models

GPT-J-6B

text-davinci-003 (GPT-3.5)

Metrics

Self-play score (mean ± SE)

Intra-AXP (intra-algorithm cross-play)

Human game score (mean ± SE)

Game lost (games lost due to 3 strikes)

Datasets

Hanabi benchmark (2-player)

Say-Select (toy environment)

Benchmarks

Hanabi

Context Entities

Models

LLMs used as prior (any autoregressive text LM)

Metrics

Conditional action matrices (p(a_{t+1}|a_t))

Knowledge-of-card statistics when card played
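A conditional action matrix p(a_{t+1}|a_t) can be estimated by counting consecutive action pairs in logged episodes and normalizing each row. The sketch below assumes actions are encoded as integer indices; the function name and the sample episodes are illustrative, not taken from the paper.

```python
from typing import List, Sequence


def conditional_action_matrix(
    episodes: Sequence[Sequence[int]], num_actions: int
) -> List[List[float]]:
    """Estimate p(next action | current action) from action index sequences."""
    counts = [[0] * num_actions for _ in range(num_actions)]
    for episode in episodes:
        # Pair every action with the one that follows it in the same episode.
        for current, following in zip(episode, episode[1:]):
            counts[current][following] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for actions that never occur as a predecessor stay all zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up episodes over three actions (0, 1, 2).
episodes = [[0, 1, 0, 1], [0, 2, 1]]
matrix = conditional_action_matrix(episodes, num_actions=3)
for row in matrix:
    print(row)
```

Comparing these matrices across agents is one way to see whether two policies with similar scores follow different conventions.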

Datasets

Human evaluation participants (10 people)

Benchmarks

Toy coordination tasks (Say-Select)
