Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
10
Why It Matters For Business
You can steer multi-agent systems toward human-friendly conventions without costly datasets of human behavior; showing users the instruction the agent was trained with sharply improves team performance and trust.
Summary TLDR
The paper introduces instructRL, which uses an LLM to convert a human instruction plus a short language description of the current observation into a prior policy, then regularizes RL toward that prior so agents converge to equilibria humans expect. It is evaluated on a toy 'Say-Select' game and the Hanabi benchmark. instructRL reliably yields human-like conventions, keeps self-play scores competitive, and, when humans are shown the training instruction, dramatically improves human-AI coordination in a user study.
Problem Statement
Multi-agent RL can converge to many equally optimal equilibria. Without human data, learned policies often use conventions that are hard for people to follow. The paper asks: can humans simply tell an AI the convention they want in plain language and have RL converge to that equilibrium?
Main Contribution
Propose instructRL: build an LLM-conditioned prior from a natural-language instruction plus short language observations, then regularize RL toward that prior.
Two implementations: instructQ (Q-learning + log-prior) and instructPPO (PPO + KL penalty).
Empirical demos: converges to human-like policies in a toy Say-Select game and produces distinct, instruction-following strategies in Hanabi.
Human study showing that revealing the training instruction to human partners sharply improves coordination.
Analysis of robustness to imperfect LLM priors, noisy prompts, and preliminary test-time adaptation experiments.
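The two regularizers named above can be illustrated with a minimal sketch. It assumes discrete actions, a prior already given as a probability list, and a regularization weight `lam`; the function names, the KL direction (policy relative to prior), and the epsilon smoothing are illustrative assumptions, not the paper's code.

```python
import math

EPS = 1e-12  # assumed smoothing constant to avoid log(0)

def instructq_action(q_values, prior, lam):
    """Greedy action under Q(a) + lam * log prior(a) (log-prior regularization)."""
    scores = [q + lam * math.log(p + EPS) for q, p in zip(q_values, prior)]
    return max(range(len(scores)), key=scores.__getitem__)

def kl_divergence(policy, prior):
    """KL(policy || prior) for two discrete distributions of equal length."""
    return sum(p * math.log((p + EPS) / (q + EPS))
               for p, q in zip(policy, prior) if p > 0)

def instructppo_loss(ppo_loss, policy, prior, lam):
    """PPO loss plus a KL penalty that pulls the policy toward the prior."""
    return ppo_loss + lam * kl_divergence(policy, prior)

# Two actions tie on Q-value; the prior breaks the tie toward action 1.
q_values = [1.0, 1.0, 0.9]
prior = [0.1, 0.8, 0.1]
unregularized = instructq_action(q_values, prior, lam=0.0)
regularized = instructq_action(q_values, prior, lam=0.5)
```

With `lam=0.0` the tie is broken arbitrarily (here, by index order); with a positive `lam` the action the prior favors wins, which is the mechanism by which the instruction selects among otherwise equivalent equilibria.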
Key Findings
In the Say-Select toy game, instructQ reliably converged to the intended human-like equilibrium.
On Hanabi, instructRL achieves self-play performance similar to vanilla RL while producing different hinting conventions.
Showing humans the agent's instruction dramatically improved human-AI coordination in Hanabi.
instructRL tolerates imperfect LLM priors and some LLM noise without major policy drift.
Results
Hanabi self-play scores (mean ± SE)
Human-AI coordination score (mean ± SE)
Robustness to imperfect LLM prior
Who Should Care
What To Try In 7 Days
Prototype a small environment where actions and observations can be expressed as text and generate LLM action priors.
Add a lightweight KL or log-prior regularizer to an existing RL agent and test if it converges to the desired convention.
At deployment, expose the short agent instruction to human partners and measure coordination metrics and subjective trust.
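A first prototype of the LLM action prior might look like the sketch below. It assumes a scoring function that returns a log-likelihood-style score for an action description given a prompt; `toy_score` is a hypothetical word-overlap stand-in for a real LLM call, and the prompt template, `beta` scale, and action names are likewise assumptions made for illustration.

```python
import math

def build_prompt(instruction, observation_text):
    """Combine the instruction and a text observation (assumed template)."""
    return f"Instruction: {instruction}\nObservation: {observation_text}\nAction:"

def action_prior(instruction, observation_text, actions, score_fn, beta=1.0):
    """Softmax over per-action scores, giving a prior over action descriptions."""
    prompt = build_prompt(instruction, observation_text)
    logits = [beta * score_fn(prompt, a) for a in actions]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}

def toy_score(prompt, action):
    """Hypothetical stand-in for an LLM: counts action words found in the prompt."""
    text = prompt.lower()
    return float(sum(1 for word in action.lower().split() if word in text))

actions = ["hint color", "hint rank", "discard"]
prior = action_prior(
    "Prefer to hint rank.",
    "Your partner holds a playable card.",
    actions,
    toy_score,
)
best_action = max(prior, key=prior.get)
```

Swapping `toy_score` for a real model call leaves the rest unchanged, so the regularizer can be tested against the stub before any API cost is incurred.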
Agent Features
Memory
- Short-term action-observation history (task-defined)
Planning
- Policy regularized toward the LLM prior
Tool Use
- LLMs as action priors
- OBL initialization
Frameworks
- instructRL (instructQ, instructPPO)
- Off-Belief Learning (OBL)
Is Agentic
true
Architectures
- Q-learning
- PPO
- LLM-conditioned prior
Collaboration
- Multi-agent coordination
- Human-AI teaming
Optimization Features
Token Efficiency
- Cache LLM logits for repeated observation-action pairs
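One simple way to realize this caching is a dictionary keyed by the prompt and action text, sketched below. The `CachedScorer` class and the `misses` counter are hypothetical names for illustration; `score_fn` stands for whatever function queries the LLM.

```python
class CachedScorer:
    """Wraps a scoring function and memoizes results per (prompt, action) pair."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.misses = 0  # number of times the underlying scorer was actually called

    def __call__(self, prompt, action):
        key = (prompt, action)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score_fn(prompt, action)
        return self._cache[key]

# Stand-in scorer (assumption): a real setup would query an LLM here.
scorer = CachedScorer(lambda prompt, action: float(len(action)))
first = scorer("Observation: partner hinted red", "play card 1")
second = scorer("Observation: partner hinted red", "play card 1")
```

The second call is served from the cache, so repeated observation-action pairs during rollouts cost a single LLM query.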
Infra Optimization
- Parallel rollout workers used for training
Training Optimization
- Fine-tune from OBL level-1 instead of training from scratch
- Anneal regularization weight λ during training
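The annealing schedule is not specified in this summary. As one possible shape, the sketch below decays λ linearly from a starting value to a final value; the linear form, the function name, and the example values are all assumptions.

```python
def annealed_lambda(step, total_steps, lam_start, lam_end):
    """Linearly interpolate the regularization weight over training (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return lam_start + frac * (lam_end - lam_start)

# Example: strong pull toward the prior early on, weaker later.
schedule = [annealed_lambda(s, 100, lam_start=0.5, lam_end=0.1) for s in (0, 50, 100)]
```

A decaying weight lets the prior choose the convention early while leaving the agent freer to optimize reward later in training.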
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires mapping actions and observations into concise text descriptions; not straightforward for continuous control.
- LLM priors are coarse and can reflect LLM biases; strong regularization may harm optimality.
- Instruction is fixed during RL training; no active online human feedback loop was used.
- Evaluation is limited to a toy environment and 2-player Hanabi.
When Not To Use
- Actions cannot be represented meaningfully in language (e.g., raw continuous joint commands).
- Low-latency or zero-API-cost operation is required and LLM calls are infeasible.
- You must adapt instantly to unknown partners at test time without fine-tuning; test-time adaptation was not solved here.
Failure Modes
- LLM gives wrong priors for relevant states and a high regularization weight forces suboptimal conventions.
- Humans misinterpret the instruction at deployment, leading to worse coordination than opaque policies.
- Post-hoc addition of priors (no fine-tuning) yields poor self-play and limited coordination gains.
Core Entities
Models
- GPT-J-6B
- text-davinci-003 (GPT-3.5)
Metrics
- Self-play score (mean ± SE)
- Intra-AXP (intra-algorithm cross-play)
- Human game score (mean ± SE)
- Game lost (games lost due to 3 strikes)
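For reference, the "mean ± SE" and cross-play figures can be computed as in the sketch below. It assumes SE is the sample standard deviation divided by the square root of the number of games, and that intra-AXP averages scores over pairs of independently trained agents of the same algorithm (the off-diagonal entries of a seed-by-seed score matrix). The scores shown are made-up placeholders, not results from the paper.

```python
import math

def mean_and_se(scores):
    """Return (mean, standard error) using the sample standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)

def intra_axp(score_matrix):
    """Average score over pairs of different seeds (off-diagonal entries)."""
    n = len(score_matrix)
    pairs = [score_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(pairs) / len(pairs)

# Placeholder numbers for illustration only.
mean, se = mean_and_se([20.0, 22.0, 24.0])
cross_play = intra_axp([[24.0, 20.0], [18.0, 23.0]])
```

A large gap between self-play (the diagonal) and intra-AXP suggests that different training runs settle on incompatible conventions.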
Datasets
- Hanabi benchmark (2-player)
- Say-Select (toy environment)
Benchmarks
- Hanabi
Context Entities
Models
- LLMs used as prior (any autoregressive text LM)
Metrics
- Conditional action matrices p(a_{t+1} | a_t)
- Knowledge-of-card statistics when card played
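A conditional action matrix p(a_{t+1} | a_t) can be estimated from logged games by counting consecutive action pairs, as sketched below. The coarse action labels and the episode format (a list of action names per game, alternating between players) are assumptions for illustration.

```python
from collections import Counter

def conditional_action_matrix(episodes, actions):
    """Estimate p(next action | previous action) from consecutive action pairs."""
    counts = {a: Counter() for a in actions}
    for episode in episodes:
        for prev, nxt in zip(episode, episode[1:]):
            counts[prev][nxt] += 1
    matrix = {}
    for a in actions:
        total = sum(counts[a].values())
        # Rows with no observations are left as all zeros.
        matrix[a] = {b: (counts[a][b] / total if total else 0.0) for b in actions}
    return matrix

actions = ["hint", "play", "discard"]
episodes = [["hint", "play", "hint", "play"], ["hint", "discard"]]
matrix = conditional_action_matrix(episodes, actions)
```

In a turn-based two-player game, each row shows how the partner tends to respond to a given action, which makes differences in convention between agents visible at a glance.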
Datasets
- Human evaluation participants (10 people)
Benchmarks
- Toy coordination tasks (Say-Select)

