Use natural-language instructions + LLM priors to steer multi‑agent RL toward human-friendly equilibria

April 13, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method is experimentally validated on a toy game and on Hanabi, including human tests. It is practical when actions and observations can be expressed as text, but it needs engineering work for real-world grounding and adds LLM inference cost.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 60%

Authors

Hengyuan Hu, Dorsa Sadigh

Why It Matters For Business

You can steer multi-agent systems to human-friendly conventions without costly human behavior datasets; showing the agent's instruction to users sharply improves team performance and trust.

Summary TLDR

The paper introduces instructRL: an LLM converts a human instruction plus a short language description of the current observation into a prior policy, and RL is then regularized toward that prior so that agents converge to the equilibria humans expect. The method is evaluated on a toy 'Say-Select' game and the Hanabi benchmark. instructRL reliably yields human-like conventions, keeps self-play scores competitive, and, when humans are shown the training instruction, dramatically improves human-AI coordination in a user study.
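The prior construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the prompt template, the names `build_prompt` and `prior_from_scores`, the temperature parameter, and the example scores are all assumptions, and the per-action scores stand in for whatever log-likelihoods a real LLM would return.

```python
import math

def build_prompt(instruction, observation_text, action_text):
    """Hypothetical prompt template: instruction + observation + candidate action."""
    return (
        f"Instruction: {instruction}\n"
        f"Observation: {observation_text}\n"
        f"Candidate action: {action_text}\n"
    )

def prior_from_scores(action_scores, temperature=1.0):
    """Softmax over per-action scores (assumed here to be LLM log-likelihoods)."""
    scaled = {a: s / temperature for a, s in action_scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {a: math.exp(v - top) for a, v in scaled.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Made-up scores for illustration; a real system would score each prompt with an LLM.
scores = {"hint color": -0.5, "hint rank": -2.5, "discard": -3.0}
prior = prior_from_scores(scores)
```

The resulting `prior` is a probability distribution over the textified actions; a lower temperature makes it sharper, which would bias learning more strongly toward the instructed convention.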

Problem Statement

Multi-agent RL can converge to many equally optimal equilibria. Without human data, learned policies often use conventions that are hard for people to follow. The paper asks: can humans simply tell an AI the convention they want in plain language and have RL converge to that equilibrium?

Main Contribution

Propose instructRL: build an LLM-conditioned prior from a natural-language instruction plus short language observations, then regularize RL toward that prior.

Two implementations: instructQ (Q-learning + log-prior) and instructPPO (PPO + KL penalty).
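The two regularizers named above can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's code: the function names are hypothetical, the KL direction is assumed to be KL(policy || prior), and where exactly the log-prior term enters the Q-learning update is not specified in this summary, so it is shown only at action selection.

```python
import math

def instructq_action(q_values, prior, lam):
    """Pick the action maximizing Q(a) + lam * log prior(a) (log-prior bonus)."""
    eps = 1e-8  # avoids log(0) for actions the prior rules out
    return max(q_values, key=lambda a: q_values[a] + lam * math.log(prior[a] + eps))

def kl_divergence(policy, prior):
    """KL(policy || prior); assumes prior[a] > 0 wherever policy[a] > 0."""
    return sum(p * math.log(p / prior[a]) for a, p in policy.items() if p > 0)

def instructppo_loss(ppo_loss, policy, prior, lam):
    """Add a KL penalty toward the LLM prior to an already computed PPO loss."""
    return ppo_loss + lam * kl_divergence(policy, prior)

# Illustrative numbers only: two nearly tied actions, prior prefers color hints.
q_values = {"hint_color": 1.00, "hint_rank": 1.05}
prior = {"hint_color": 0.9, "hint_rank": 0.1}
unregularized = instructq_action(q_values, prior, lam=0.0)
regularized = instructq_action(q_values, prior, lam=0.25)
```

With `lam=0.0` the slightly higher Q-value wins; with `lam=0.25` the log-prior term tips the nearly tied choice toward the instructed action, which is the intended effect when several equilibria score about the same.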

Key Findings

In the Say-Select toy game, instructQ reliably converged to the intended human-like equilibrium.

Numbers: 10/10 random seeds converged to the instructed policy

Practical Use: If you can describe actions and observations in text, adding an LLM prior can steer RL to the convention humans want with high reliability in small multi-agent problems.

Evidence Ref: Section 5.1, Figure 3, Appendix A.1

On Hanabi, instructRL achieves self-play performance similar to vanilla RL while producing different hinting conventions.

Numbers: Self-play scores ≈ 23.8 to 24.25 across methods (Table 3)

Practical Use: Biasing learning toward a verbalized convention does not meaningfully reduce game performance on the tested benchmark, so you can choose the convention without a large score loss.

Evidence Ref: Table 3, Section 5.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hanabi self-play scores (mean ± SE) | Q-learning 23.96 ± 0.05; InstructQ(color) 23.78 ± 0.05; InstructPPO(color) 24.25 ± 0.03 | Q-learning | Differences within ~0.5 points | Hanabi 2-player | Table 3 shows self-play and intra-AXP results | Table 3
Human-AI coordination score (mean ± SE) | Q-learning 9.80 ± 3.35; InstructQ w/o instruction 7.80 ± 3.23; InstructQ with instruction 18.70 ± 2.18 | Q-learning | +8.9 points vs Q-learning when instruction shown | Human evaluation (10 participants) | Table 4 human study results | Table 4
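As a rough way to read the human-study delta against its uncertainty, the reported means and standard errors can be combined as below. This is a back-of-the-envelope sketch, not an analysis from the paper: it assumes the two conditions are independent samples, which may not hold if the same participants played with both agents, and the function name is hypothetical.

```python
import math

def delta_with_se(mean_a, se_a, mean_b, se_b):
    """Difference of two means and its standard error, assuming independent samples."""
    delta = mean_a - mean_b
    se_delta = math.sqrt(se_a ** 2 + se_b ** 2)
    return delta, se_delta

# Reported values: InstructQ with instruction vs. the Q-learning baseline.
delta, se = delta_with_se(18.70, 2.18, 9.80, 3.35)
print(f"delta = {delta:.2f}, SE of delta = {se:.2f}, ratio = {delta / se:.2f}")
```

Under that independence assumption the gain is a bit over two standard errors, which is suggestive for a 10-participant study but not a substitute for a proper significance test.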

What To Try In 7 Days

Prototype a small environment where actions and observations can be expressed as text and generate LLM action priors.

Add a lightweight KL or log-prior regularizer to an existing RL agent and test if it converges to the desired convention.

At deployment, expose the short agent instruction to human partners and measure coordination metrics and subjective trust.
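The first two steps above can be tried on a very small example before touching a real environment. The sketch below uses a hypothetical one-step matching game (not the paper's Say-Select game): two independent Q-learners each pick action 0 or 1 and are rewarded only when they match, so both conventions are equally optimal. The prior values, `lam`, learning rate, and exploration rate are arbitrary illustrative choices.

```python
import math
import random

def greedy(q, prior, lam):
    """Pick the action maximizing Q(a) + lam * log prior(a)."""
    return max(q, key=lambda a: q[a] + lam * math.log(prior[a]))

def train(prior, lam, seed, steps=2000, lr=0.1, eps=0.1):
    """Two independent Q-learners in a hypothetical one-step matching game."""
    rng = random.Random(seed)
    qs = [{0: 0.0, 1: 0.0}, {0: 0.0, 1: 0.0}]
    for _ in range(steps):
        acts = []
        for q in qs:
            if rng.random() < eps:
                acts.append(rng.choice([0, 1]))     # explore uniformly
            else:
                acts.append(greedy(q, prior, lam))  # exploit, biased by the prior
        # Both (0, 0) and (1, 1) are optimal; mismatches earn nothing.
        reward = 1.0 if acts[0] == acts[1] else 0.0
        for q, a in zip(qs, acts):
            q[a] += lr * (reward - q[a])
    # Report each agent's final greedy (regularized) action.
    return tuple(greedy(q, prior, lam) for q in qs)

# A prior favoring action 1 stands in for "the convention the instruction asks for".
results = [train({0: 0.2, 1: 0.8}, lam=0.5, seed=s) for s in range(5)]
print(results)
```

Counting how many seeds end on the instructed convention mirrors the "10/10 random seeds" style of check reported for the toy game, and swapping the prior should flip which convention the agents settle on.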

Agent Features

Memory
Short-term action-observation history (task-defined)
Planning
Regularized policy towards LLM prior
Tool Use
LLMs as action priors; OBL initialization
Frameworks
instructRL (instructQ, instructPPO); Off-Belief Learning (OBL)
Is Agentic

Yes

Architectures
Q-learning; PPO; LLM-conditioned prior
Collaboration
Multi-agent coordination; Human-AI teaming

Optimization Features

Token Efficiency
Cache LLM logits for repeated observation-action pairs
Infra Optimization
Parallel rollout workers used for training
Training Optimization
Fine-tune from OBL level-1 instead of training from scratch; Anneal regularization weight λ during training
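Two of the optimizations listed above, caching LLM outputs and annealing λ, can be sketched as follows. Everything here is illustrative: `query_llm_logit` is a hypothetical stand-in that returns a placeholder number instead of calling a model, and the linear schedule and its default endpoints are assumptions, since this summary does not state the schedule shape.

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the (pretend) expensive LLM call runs

def query_llm_logit(observation_text, action_text):
    """Hypothetical stand-in for an expensive LLM call; returns a placeholder score."""
    CALLS["n"] += 1
    return -float(len(observation_text) + len(action_text))  # not a real logit

@lru_cache(maxsize=None)
def cached_logit(observation_text, action_text):
    """Memoize scores so repeated observation-action pairs hit the LLM only once."""
    return query_llm_logit(observation_text, action_text)

def annealed_lambda(step, total_steps, lam_start=0.25, lam_end=0.0):
    """Linearly interpolate the regularization weight from lam_start to lam_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)

first = cached_logit("partner hinted red", "play card 1")
second = cached_logit("partner hinted red", "play card 1")  # served from the cache
```

Caching works here because the textified observation and action are plain strings and therefore hashable; it helps most when the language description of an observation is short and recurs often.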

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires mapping actions and observations into concise text descriptions; not straightforward for continuous control.

LLM priors are coarse and can reflect LLM biases; strong regularization may harm optimality.

When Not To Use

Actions cannot be represented meaningfully in language (e.g., raw continuous joint commands).

When low-latency or zero-API-cost operation is required and LLM calls are infeasible.

Failure Modes

LLM gives wrong priors for relevant states and a high regularization weight forces suboptimal conventions.

Humans misinterpret the instruction at deployment, leading to worse coordination than opaque policies.

Core Entities

Models

GPT-J-6B; text-davinci-003 (GPT-3.5)

Metrics

Self-play score (mean ± SE); Intra-AXP (intra-algorithm cross-play); Human game score (mean ± SE); Game lost (games lost due to 3 strikes)
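The "mean ± SE" and intra-AXP metrics above can be computed along these lines. This is a generic sketch with hypothetical function names: it assumes the standard error uses the sample standard deviation (n - 1), and that intra-AXP averages scores over pairs of independently trained seeds of the same algorithm, excluding each seed paired with itself.

```python
import math
import statistics

def mean_and_se(scores):
    """Mean and standard error, using the sample standard deviation (n - 1)."""
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se

def intra_axp(score_matrix):
    """Average score over pairs of different seeds of the same algorithm.

    score_matrix[i][j] is assumed to hold the mean score when seed i plays with seed j;
    the diagonal (self-play) is excluded.
    """
    n = len(score_matrix)
    off_diagonal = [score_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(off_diagonal) / len(off_diagonal)

# Illustrative numbers only, not results from the paper.
mean, se = mean_and_se([1, 2, 3, 4])
cross_play = intra_axp([[24, 10], [12, 23]])
```

A large gap between the diagonal (self-play) and the off-diagonal average suggests that different seeds learned incompatible conventions.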

Datasets

Hanabi benchmark (2-player); Say-Select (toy environment)

Benchmarks

Hanabi

Context Entities

Models

LLMs used as prior (any autoregressive text LM)

Metrics

Conditional action matrices p(a_{t+1} | a_t); Knowledge-of-card statistics when a card is played
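A conditional action matrix of the form p(a_{t+1} | a_t) can be estimated from game logs by counting consecutive action pairs and normalizing each row. The sketch below is a generic estimator with a hypothetical function name and made-up action labels; it does not reproduce the paper's action encoding.

```python
from collections import Counter, defaultdict

def conditional_action_matrix(action_sequences):
    """Estimate p(next action | previous action) from lists of consecutive actions."""
    counts = defaultdict(Counter)
    for seq in action_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    matrix = {}
    for prev, row in counts.items():
        total = sum(row.values())
        matrix[prev] = {nxt: c / total for nxt, c in row.items()}
    return matrix

# One toy game log; in a two-player game consecutive actions alternate between players.
games = [["hint", "play", "hint", "play", "discard"]]
matrix = conditional_action_matrix(games)
```

Comparing these matrices across training methods is one way to see whether two agents with similar scores follow different conventions, for example whether a hint is usually followed by a play.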

Datasets

Human evaluation participants (10 people)

Benchmarks

Toy coordination tasks (Say-Select)