Overview
The method is validated experimentally on a toy game and on Hanabi, including a human study. It is practical when actions and observations can be expressed as text, but it needs engineering for real-world grounding and adds LLM inference cost.
Citations: 10
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can steer multi-agent systems to human-friendly conventions without costly human behavior datasets; showing the agent's instruction to users sharply improves team performance and trust.
Summary TLDR
The paper introduces instructRL: use an LLM to convert a human instruction plus a short language description of the current observation into a prior policy, then regularize RL with that prior so agents converge to the equilibria humans expect. The method is evaluated on a toy 'Say-Select' game and the Hanabi benchmark. instructRL reliably yields human-like conventions, keeps self-play scores competitive with vanilla RL, and, when humans are shown the training instruction, raises human-AI coordination scores in a 10-participant user study (18.70 vs. 9.80 for Q-learning).
Problem Statement
Multi-agent RL can converge to many equally optimal equilibria. Without human data, learned policies often use conventions that are hard for people to follow. The paper asks: can humans simply tell an AI the convention they want in plain language and have RL converge to that equilibrium?
Main Contribution
Propose instructRL: build an LLM-conditioned prior from a natural-language instruction plus short language observations, then regularize RL toward that prior.
Two implementations: instructQ (Q-learning + log-prior) and instructPPO (PPO + KL penalty).
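The two variants can be illustrated with a minimal sketch. It is one plausible reading of "Q-learning + log-prior" and "PPO + KL penalty", not the paper's code: the weight `lam`, the smoothing constant `eps`, the KL direction, and both function names are assumptions introduced here.

```python
import math

def regularized_greedy_action(q_values, prior_probs, lam=0.25, eps=1e-8):
    """Pick argmax_a [Q(s, a) + lam * log p_prior(a)].

    `lam` (regularization weight) and `eps` (log smoothing) are illustrative
    assumptions, not values taken from the paper.
    """
    scores = [q + lam * math.log(p + eps) for q, p in zip(q_values, prior_probs)]
    return max(range(len(scores)), key=scores.__getitem__)

def kl_to_prior(policy_probs, prior_probs, eps=1e-8):
    """KL(policy || prior): the kind of term a PPO-style variant could penalize.

    The direction of the KL is an assumption made for this sketch.
    """
    return sum(
        pi * math.log((pi + eps) / (p + eps))
        for pi, p in zip(policy_probs, prior_probs)
        if pi > 0
    )

# Actions 0 and 1 have equal Q-values, i.e. two equally good conventions.
q = [1.0, 1.0, 0.2]
prior = [0.1, 0.8, 0.1]  # hypothetical LLM prior favoring action 1

with_prior = regularized_greedy_action(q, prior)          # prior breaks the tie
without_prior = regularized_greedy_action(q, prior, 0.0)  # plain greedy choice
penalty = kl_to_prior([0.5, 0.5], [0.9, 0.1])             # positive when they differ
```

With equal Q-values the plain greedy rule falls back on index order, while the log-prior term moves the choice to the action the prior favors, which is the tie-breaking role the regularizer plays among equally optimal equilibria.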
Key Findings
In the Say-Select toy game, instructQ reliably converged to the intended human-like equilibrium.
On Hanabi, instructRL achieves self-play performance similar to vanilla RL while producing different hinting conventions.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Hanabi self-play scores (mean ± SE) | Q-learning 23.96 ±0.05; InstructQ(color) 23.78 ±0.05; InstructPPO(color) 24.25 ±0.03 | Q-learning | Differences within ~0.5 points | Hanabi 2-player | Table 3 shows self-play and intra-AXP results | Table 3 |
| Human-AI coordination score (mean ± SE) | Q-learning 9.80 ±3.35; InstructQ w/o instruction 7.80 ±3.23; InstructQ with instruction 18.70 ±2.18 | Q-learning | +8.9 points vs Q-learning when instruction shown | Human evaluation (10 participants) | Table 4 human study results | Table 4 |
What To Try In 7 Days
Prototype a small environment where actions and observations can be expressed as text and generate LLM action priors.
Add a lightweight KL or log-prior regularizer to an existing RL agent and test if it converges to the desired convention.
At deployment, expose the short agent instruction to human partners and measure coordination metrics and subjective trust.
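A first prototype of the prior-generation step might look like the sketch below. It assumes the LLM can return a scalar score for each textified action; `stub_score` is a hypothetical stand-in for that call, and `build_prior`, the prompt layout, and `temperature` are likewise illustrative choices rather than anything specified by the paper.

```python
import math
from typing import Callable, List

def build_prior(
    instruction: str,
    observation: str,
    actions: List[str],
    score_fn: Callable[[str], float],
    temperature: float = 1.0,
) -> List[float]:
    """Turn per-action scores into a prior over actions with a softmax."""
    prompts = [
        f"Instruction: {instruction}\nObservation: {observation}\nAction: {a}"
        for a in actions
    ]
    logits = [score_fn(p) / temperature for p in prompts]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stub_score(prompt: str) -> float:
    """Hypothetical stand-in for an LLM score; replace with a real model call.

    It only inspects the action text and favors actions mentioning 'color'.
    """
    action_text = prompt.rsplit("Action: ", 1)[1]
    return 2.0 if "color" in action_text else 0.0

actions = ["hint color red", "hint rank 1", "discard card 0"]
prior = build_prior(
    instruction="Prefer color hints.",
    observation="Partner holds a playable red card.",
    actions=actions,
    score_fn=stub_score,
)
```

Keeping the scorer behind a plain callable makes it easy to swap the stub for a real model and to cache its outputs, which matters given the added LLM cost noted above.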
Agent Features
Is Agentic
Yes
Risks & Boundaries
Limitations
Requires mapping actions and observations into concise text descriptions; not straightforward for continuous control.
LLM priors are coarse and can reflect LLM biases; strong regularization may harm optimality.
When Not To Use
Actions cannot be represented meaningfully in language (e.g., raw continuous joint commands).
Low-latency or zero-API-cost operation is required, making LLM calls infeasible.
Failure Modes
LLM gives wrong priors for relevant states and a high regularization weight forces suboptimal conventions.
Humans misinterpret the instruction at deployment, leading to worse coordination than opaque policies.
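The first failure mode can be made concrete under one assumption: that the agent acts on an additive score of the form Q(s, a) + lam * log p_prior(a). In that case there is a threshold weight above which a wrong prior overrides the better action. The function name and the numbers below are hypothetical.

```python
import math

def flip_threshold(q_best, q_other, p_best, p_other):
    """Weight above which the prior-favored action beats the higher-Q action.

    Assumes the score Q + lam * log p. Returns None when the prior does not
    favor the lower-Q action, so no positive weight can flip the choice.
    """
    q_gap = q_best - q_other
    log_prior_gap = math.log(p_other) - math.log(p_best)
    if log_prior_gap <= 0:
        return None
    return q_gap / log_prior_gap

# The best action has Q = 1.0, but a (wrong) prior gives it only 0.1 probability.
threshold = flip_threshold(q_best=1.0, q_other=0.6, p_best=0.1, p_other=0.8)
# When the prior agrees with the Q-values, no weight causes a flip.
no_flip = flip_threshold(q_best=1.0, q_other=0.6, p_best=0.8, p_other=0.1)
```

Computing this threshold for a few states where the prior is known to be unreliable gives a rough upper bound to stay under when tuning the regularization weight.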

