Use natural-language instructions + LLM priors to steer multi‑agent RL toward human-friendly equilibria

April 13, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

10

Authors

Hengyuan Hu, Dorsa Sadigh

Why It Matters For Business

You can steer multi-agent systems to human-friendly conventions without costly human behavior datasets; showing the agent's instruction to users sharply improves team performance and trust.

Summary TLDR

The paper introduces instructRL: an LLM converts a human instruction plus a short language description of the current observation into a prior policy, and RL is regularized toward that prior so agents converge to equilibria humans expect. The method is evaluated on a toy 'Say-Select' game and the Hanabi benchmark. instructRL reliably yields human-like conventions, keeps self-play scores competitive, and, when humans are shown the training instruction, dramatically improves human-AI coordination in a user study.

Problem Statement

Multi-agent RL can converge to many equally optimal equilibria. Without human data, learned policies often use conventions that are hard for people to follow. The paper asks: can humans simply tell an AI the convention they want in plain language and have RL converge to that equilibrium?

Main Contribution

Propose instructRL: build an LLM-conditioned prior from a natural-language instruction plus short language observations, then regularize RL toward that prior.

Two implementations: instructQ (Q-learning + log-prior) and instructPPO (PPO + KL penalty).
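
The log-prior variant can be illustrated with a minimal sketch: a weighted log-prior is added to the Q-values before the greedy action is taken. The function name `regularized_greedy_action`, the weight `lam`, and the toy numbers are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def regularized_greedy_action(q_values, prior_probs, lam, eps=1e-8):
    """Return argmax_a [Q(s, a) + lam * log p_prior(a | s)].

    q_values and prior_probs are per-action arrays for a single state.
    eps guards against log(0) when the prior assigns zero probability.
    """
    q_values = np.asarray(q_values, dtype=float)
    prior_probs = np.asarray(prior_probs, dtype=float)
    scores = q_values + lam * np.log(prior_probs + eps)
    return int(np.argmax(scores))

# Hypothetical state with two equally valued actions; the prior prefers action 1.
q = [1.0, 1.0]
prior = [0.1, 0.9]
print(regularized_greedy_action(q, prior, lam=0.5))  # 1: the prior breaks the tie
print(regularized_greedy_action(q, prior, lam=0.0))  # 0: plain argmax picks the first index
```

When several actions have equal value, the prior decides which one is chosen; when one action has a clearly higher Q-value, it can still win against the prior if `lam` is small enough.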

Empirical demos: converges to human-like policies in a toy Say-Select game and produces distinct, instruction-following strategies in Hanabi.

Human study showing that revealing the training instruction to humans sharply improves coordination.

Analysis of robustness to imperfect LLM priors, noisy prompts, and preliminary test-time adaptation experiments.

Key Findings

In the Say-Select toy game, instructQ reliably converged to the intended human-like equilibrium.

Numbers: 10/10 random seeds converged to the instructed policy

On Hanabi, instructRL achieves self-play performance similar to vanilla RL while producing different hinting conventions.

Numbers: Self-play scores ≈ 23.8–24.25 across methods (Table 3)

Showing humans the agent's instruction dramatically improved human-AI coordination in Hanabi.

Numbers: Human mean score rose from 9.80 ± 3.35 (Q-learning) to 18.70 ± 2.18 (instructQ with instruction shown)

instructRL tolerates imperfect LLM priors and some LLM noise without major policy drift.

Numbers: Simple prompt errors ≈ 3% → cross-play drop 0.05 (0.21%); stable up to ~10–15% random prior noise (Figure 10, Table 6)

Results

Hanabi self-play scores (mean ± SE)

Value: Q-learning 23.96 ± 0.05; InstructQ(color) 23.78 ± 0.05; InstructPPO(color) 24.25 ± 0.03

Baseline: Q-learning

Human-AI coordination score (mean ± SE)

Value: Q-learning 9.80 ± 3.35; InstructQ w/o instruction 7.80 ± 3.23; InstructQ with instruction 18.70 ± 2.18

Baseline: Q-learning

Robustness to imperfect LLM prior

Value: Simple prompts: ~3% LLM error → cross-play drop 0.05 (0.21%); withstands ~10–15% random prior noise before notable drop

Baseline: original-inst (prompt-engineered)

Who Should Care

What To Try In 7 Days

Prototype a small environment where actions and observations can be expressed as text and generate LLM action priors.
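
One way to prototype the prior is to score each candidate action against the instruction and the observation text, then normalize the scores with a softmax. The sketch below makes several assumptions: the prompt layout, the helper names, and in particular `stub_llm_score`, which is a hypothetical stand-in for a real LLM call and only checks for a keyword.

```python
import math

def build_prompt(instruction, observation_text, action_text):
    """Assemble a text prompt; this layout is an assumption, not the paper's."""
    return (f"Instruction: {instruction}\n"
            f"Observation: {observation_text}\n"
            f"Candidate action: {action_text}")

def prior_from_scores(scores, temperature=1.0):
    """Numerically stable softmax over per-action scores."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def stub_llm_score(prompt):
    """Hypothetical placeholder for an LLM score: favors color hints."""
    action_line = prompt.splitlines()[-1]
    return 2.0 if "color" in action_line else 0.0

def action_prior(instruction, observation_text, actions, score_fn, temperature=1.0):
    """Return one prior probability per candidate action."""
    scores = [score_fn(build_prompt(instruction, observation_text, a)) for a in actions]
    return prior_from_scores(scores, temperature)

actions = ["hint color red", "hint rank 3", "discard card 1"]
probs = action_prior("Prefer hinting colors over ranks.",
                     "Partner holds a red card.", actions, stub_llm_score)
for action, p in zip(actions, probs):
    print(f"{action}: {p:.3f}")
```

In a real prototype, `score_fn` would be replaced by whatever the chosen LLM exposes, for example the log-probability of an affirmative answer.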

Add a lightweight KL or log-prior regularizer to an existing RL agent and test if it converges to the desired convention.
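
A KL regularizer can be as small as one extra term in the loss. The sketch below computes KL(policy || prior) for a single state and adds it to an existing policy loss; the function names, the KL direction, and the weight `lam` are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def kl_divergence(policy_probs, prior_probs, eps=1e-8):
    """KL(policy || prior) for one state's discrete action distributions."""
    p = np.asarray(policy_probs, dtype=float)
    q = np.asarray(prior_probs, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def regularized_loss(policy_loss, policy_probs, prior_probs, lam):
    """Add a weighted KL penalty that pulls the policy toward the prior."""
    return policy_loss + lam * kl_divergence(policy_probs, prior_probs)

policy = [0.5, 0.5]
prior = [0.9, 0.1]
print(kl_divergence(policy, policy))                      # 0.0 when the distributions match
print(regularized_loss(1.0, policy, prior, lam=0.3) > 1.0)  # True: disagreement is penalized
```

A quick convergence check is to track this KL term during training: if the agent adopts the instructed convention, it should shrink on the states where the prior is informative.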

At deployment, expose the short agent instruction to human partners and measure coordination metrics and subjective trust.

Agent Features

Memory

  • Short-term action-observation history (task-defined)

Planning

  • Policy regularized toward the LLM prior

Tool Use

  • LLMs as action priors
  • OBL initialization

Frameworks

  • instructRL (instructQ, instructPPO)
  • Off-Belief Learning (OBL)

Is Agentic

true

Architectures

  • Q-learning
  • PPO
  • LLM-conditioned prior

Collaboration

  • Multi-agent coordination
  • Human-AI teaming

Optimization Features

Token Efficiency

  • Cache LLM logits for repeated observation-action pairs
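
Caching can be sketched with a memoized lookup keyed on the observation and action text. Here `expensive_llm_logit` is a hypothetical placeholder for the real LLM query, and the call counter exists only to show that repeated pairs do not trigger a second call.

```python
from functools import lru_cache

CALL_COUNT = {"n": 0}  # counts how often the (placeholder) LLM is actually queried

def expensive_llm_logit(observation_text, action_text):
    """Hypothetical stand-in for an LLM query; returns a dummy number."""
    CALL_COUNT["n"] += 1
    return float(len(observation_text) - len(action_text))

@lru_cache(maxsize=None)
def cached_llm_logit(observation_text, action_text):
    """Memoize logits so each (observation, action) pair is queried once."""
    return expensive_llm_logit(observation_text, action_text)

cached_llm_logit("partner hinted red", "play card 1")
cached_llm_logit("partner hinted red", "play card 1")  # served from the cache
print(CALL_COUNT["n"])  # 1
```

This works because the text descriptions are hashable strings; an unbounded cache is reasonable only when the number of distinct observation descriptions is small.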

Infra Optimization

  • Parallel rollout workers used for training

Training Optimization

  • Fine-tune from OBL level-1 instead of training from scratch
  • Anneal regularization weight λ during training
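
An annealing schedule for λ might look like the following. The linear shape, the function name, and the start and end values are assumptions; the summary only states that λ is annealed, not how.

```python
def annealed_lambda(step, total_steps, lam_start, lam_end):
    """Linearly interpolate the regularization weight from lam_start to lam_end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the schedule stay at the endpoints.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)

for step in (0, 50, 100, 150):
    print(step, round(annealed_lambda(step, 100, lam_start=0.5, lam_end=0.1), 3))
```

Starting with a larger weight lets the prior select the convention early, while a smaller final weight leaves room for the reward signal to refine the policy.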

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires mapping actions and observations into concise text descriptions; not straightforward for continuous control.
  • LLM priors are coarse and can reflect LLM biases; strong regularization may harm optimality.
  • Instruction is fixed during RL training; no active online human feedback loop was used.
  • Evaluations limited to a toy environment and 2‑player Hanabi.

When Not To Use

  • When actions cannot be represented meaningfully in language (e.g., raw continuous joint commands).
  • When low-latency or zero-API-cost operation is required and LLM calls are infeasible.
  • When you must adapt instantly to unknown partners at test time without fine-tuning; test-time adaptation was not solved here.

Failure Modes

  • LLM gives wrong priors for relevant states and a high regularization weight forces suboptimal conventions.
  • Humans misinterpret the instruction at deployment, leading to worse coordination than opaque policies.
  • Post-hoc addition of priors (no fine-tuning) yields poor self-play and limited coordination gains.

Core Entities

Models

  • GPT-J-6B
  • text-davinci-003 (GPT-3.5)

Metrics

  • Self-play score (mean ± SE)
  • Intra-AXP (intra-algorithm cross-play)
  • Human game score (mean ± SE)
  • Game lost (games lost due to 3 strikes)

Datasets

  • Hanabi benchmark (2-player)
  • Say-Select (toy environment)

Benchmarks

  • Hanabi

Context Entities

Models

  • LLMs used as prior (any autoregressive text LM)

Metrics

  • Conditional action matrices (p(a_{t+1}|a_t))
  • Knowledge-of-card statistics when card played
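
A conditional action matrix p(a_{t+1}|a_t) can be estimated by counting consecutive action pairs in logged games and normalizing each row. The sketch below assumes actions are logged as integer indices per episode; the function name and the toy episodes are hypothetical.

```python
import numpy as np

def conditional_action_matrix(episodes, num_actions):
    """Estimate p(a_{t+1} | a_t) from sequences of integer action indices.

    Row i holds the empirical distribution over the action that follows
    action i; rows for actions that never occur as a predecessor stay zero.
    """
    counts = np.zeros((num_actions, num_actions))
    for actions in episodes:
        for prev, nxt in zip(actions[:-1], actions[1:]):
            counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Hypothetical logs: two short episodes over three actions.
episodes = [[0, 1, 0, 1], [0, 2]]
print(conditional_action_matrix(episodes, num_actions=3))
```

Comparing these matrices across agents gives a compact view of whether two policies respond to the same action with the same convention.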

Datasets

  • Human evaluation participants (10 people)

Benchmarks

  • Toy coordination tasks (Say-Select)