STRATEGIST: LLMs learn and refine high-level strategies with bi-level tree search and self-play

August 20, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.75

Cost Impact Score

0.6

Citation Count

0

Authors

Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu

Links

Abstract / PDF

Why It Matters For Business

STRATEGIST shows you can get usable, human-competitive strategies from LLMs without labeled training data by pairing LLM-written strategy text with search and simulated self-play, speeding prototyping of strategic agents and negotiation systems.

Summary TLDR

STRATEGIST is a bi-level framework that uses an LLM to propose and iteratively improve human-readable high-level strategies, and uses a low-level Monte Carlo Tree Search (MCTS) executor to refine and evaluate those strategies via population self-play. It improves win-rates and strategy quality in two adversarial games (GOPS and Resistance: Avalon), outperforms several LLM-only self-improvement baselines and traditional RL baselines under matched simulation budgets, and produces dialogue guides that help LLM agents conceal identity and coordinate in social-deduction settings.

Problem Statement

LLMs generalize well but struggle to learn detailed, multi-step policies in adversarial multi-agent games. Directly asking an LLM for per-move actions is expensive and brittle. We need a way to let LLMs propose compact strategies and improve them efficiently without supervised training data.

Main Contribution

A bi-level, non-parametric framework (STRATEGIST) that represents high-level strategies as interpretable text/code and refines them via LLM-driven idea generation plus low-level MCTS refinement.

A modular improvement loop that stores candidate ‘improvement ideas’ in a priority queue and tests them via population-based self-play, avoiding training data.

Empirical evaluation on two adversarial games (GOPS and Resistance: Avalon) showing better performance than several LLM self-improvement baselines and competitive or superior results vs RL baselines and human players.

Key Findings

STRATEGIST generated higher-quality value heuristics and dialogue guides than four LLM self-improvement baselines on the evaluated games.

Numbers: GOPS value heuristic: +1.5 ±0.99 vs best baseline 0.092 ±0.67; Avalon Merlin guide: 0.88 ±0.063 vs baseline ≤0.62 (Table

Population-based self-play produced stronger feedback than LLM critics or fixed-opponent trajectories on the evaluated games.

Numbers: GOPS point diff: 0.87 ±1.54 (population) vs -0.27 ±1.10 (LLM-critic); Avalon winrate: 0.88 ±0.064 vs 0.37 ±0.063 (Table

STRATEGIST beat RL baselines given the same simulated-episode budget and far fewer training transition steps.

Numbers: GOPS vs DeepRole: +1.14 ±0.09 point diff (STRATEGIST) using 320 self-play eps and 100 transition steps vs DeepRole's ~3,

STRATEGIST achieves human-competitive results in Avalon but shows different strengths and weaknesses.

Numbers: Human win rate 0.367 ±0.089 vs STRATEGIST 0.333 ±0.061 (Table 1); humans rated STRATEGIST higher at concealment and adap

STRATEGIST's low-level MCTS refinement scales with search budget and amplifies high-level strategy improvements.

Numbers: Improved value heuristics show larger performance gains as MCTS search budget increases (Figure 6).

Results

GOPS value heuristic (final gameplay score)

Value: 1.5 ± 0.99

Baseline: Best-first search 0.092 ± 0.67

Avalon value heuristic (winrate)

Value: 0.59 ± 0.11

Baseline: BFS 0.50 ± 0.085

Merlin dialogue guide (effectiveness score)

Value: 0.88 ± 0.063

Baseline: Greedy 0.62 ± 0.13

Human vs STRATEGIST win rate (Avalon)

Value: STRATEGIST 0.333 ± 0.061

Baseline: Human 0.367 ± 0.089

Head-to-head vs LLM baselines (winrate)

Value: STRATEGIST vs ReAct 52.5 ± 2.5; vs ReCon 61.1 ± 5.5

Baseline: ReAct 47.5 ± 2.5; ReCon 38.9 ± 5.5

Feedback method comparison (GOPS point diff)

Value: Population self-play 0.87 ± 1.54

Baseline: LLM-critic -0.27 ± 1.10

Who Should Care

What To Try In 7 Days

Reproduce a small-scale demo: have an LLM generate a simple value heuristic for a two-player toy game and run MCTS to evaluate it.

Implement an idea-queue: collect LLM-proposed edits as separate items and test them incrementally via self-play.

Run population-based self-play rather than single-opponent evaluation to get more robust improvement signals.

Agent Features

Memory

  • Population of strategies (strategy library) used for self-play

Planning

  • High-level strategy search via LLM revisions
  • Low-level Monte Carlo Tree Search (MCTS) refinement

Tool Use

  • LLM as strategy author and discriminator
  • MCTS as executor and shaping evaluator

Frameworks

  • Idea queue + UCB bandit sampling
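
To make the idea queue concrete, here is a minimal sketch of a queue whose entries are picked with a UCB1-style rule. It is an illustration under assumptions, not the paper's implementation: the `Idea` and `IdeaQueue` names, the `exploration` parameter, and the use of the textbook UCB1 bonus are all hypothetical choices, and the "gain" fed back would come from whatever self-play evaluation you run.

```python
import math
from dataclasses import dataclass


@dataclass
class Idea:
    """One LLM-proposed improvement idea plus its observed track record."""
    text: str
    trials: int = 0
    total_gain: float = 0.0

    @property
    def mean_gain(self) -> float:
        return self.total_gain / self.trials if self.trials else 0.0


class IdeaQueue:
    """Hypothetical idea queue sampled with a textbook UCB1 rule."""

    def __init__(self, exploration: float = 1.0):
        self.ideas: list[Idea] = []
        self.exploration = exploration

    def add(self, text: str) -> Idea:
        idea = Idea(text)
        self.ideas.append(idea)
        return idea

    def select(self) -> Idea:
        # Try every idea once before trusting any estimate.
        for idea in self.ideas:
            if idea.trials == 0:
                return idea
        total = sum(idea.trials for idea in self.ideas)
        # Mean observed gain plus an exploration bonus that shrinks with trials.
        return max(
            self.ideas,
            key=lambda i: i.mean_gain
            + self.exploration * math.sqrt(2 * math.log(total) / i.trials),
        )

    def update(self, idea: Idea, gain: float) -> None:
        # `gain` is the measured change in performance after applying the idea.
        idea.trials += 1
        idea.total_gain += gain


queue = IdeaQueue()
bid_high = queue.add("Bid higher on high-value prizes")
save_low = queue.add("Save low cards for low-value prizes")
queue.update(queue.select(), gain=1.0)   # first untried idea
queue.update(queue.select(), gain=-1.0)  # second untried idea
print(queue.select().text)
```

Once both ideas have one trial each, their exploration bonuses are equal, so the idea with the higher observed gain is selected next. As more trials accumulate, rarely tested ideas regain priority through the bonus term.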

Is Agentic

true

Architectures

  • bi-level tree search (high-level LLM strategy + low-level MCTS)

Collaboration

  • Round-robin population self-play for evaluation
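
Round-robin population self-play can be sketched on a small GOPS game as follows. This is a rough illustration with several assumptions: tied bids discard the prize (one common GOPS variant), the three policies are hypothetical hand-written stand-ins for LLM-generated strategies, and the score reported is the mean point difference per game.

```python
import itertools
import random


def play_gops(policy_a, policy_b, n_cards=6, rng=None):
    """Play one GOPS game and return player A's point difference.

    Assumed variant: when both bids are equal, the prize is discarded.
    """
    rng = rng or random.Random()
    prizes = list(range(1, n_cards + 1))
    rng.shuffle(prizes)
    hand_a = set(range(1, n_cards + 1))
    hand_b = set(range(1, n_cards + 1))
    score_a = score_b = 0
    for prize in prizes:
        # Both players bid simultaneously, seeing only the public state.
        bid_a = policy_a(prize, sorted(hand_a), sorted(hand_b), rng)
        bid_b = policy_b(prize, sorted(hand_b), sorted(hand_a), rng)
        hand_a.remove(bid_a)
        hand_b.remove(bid_b)
        if bid_a > bid_b:
            score_a += prize
        elif bid_b > bid_a:
            score_b += prize
    return score_a - score_b


# Hypothetical policies standing in for LLM-written strategies.
def random_policy(prize, hand, opp_hand, rng):
    return rng.choice(hand)


def match_prize_policy(prize, hand, opp_hand, rng):
    return prize  # always available: each prize value appears exactly once


def lowest_first_policy(prize, hand, opp_hand, rng):
    return hand[0]  # hand is sorted, so this is the smallest card


def round_robin(population, games_per_pair=200, seed=0):
    """Return each strategy's mean point difference against the others."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in population}
    counts = {name: 0 for name in population}
    for (name_a, pol_a), (name_b, pol_b) in itertools.combinations(
        population.items(), 2
    ):
        for _ in range(games_per_pair):
            diff = play_gops(pol_a, pol_b, rng=rng)
            totals[name_a] += diff
            totals[name_b] -= diff
            counts[name_a] += 1
            counts[name_b] += 1
    return {name: totals[name] / counts[name] for name in population}


population = {
    "random": random_policy,
    "match_prize": match_prize_policy,
    "lowest_first": lowest_first_policy,
}
print(round_robin(population))
```

Because every strategy is scored against the whole population rather than one fixed opponent, a change that only exploits a single rival is less likely to look like an improvement.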

Optimization Features

Token Efficiency

  • Trade-off: STRATEGIST uses more tokens per round than simple baselines (Table 5 shows higher tokens

System Optimization

  • Idea queue and UCB sampling to focus improvements

Training Optimization

  • No parametric training for strategy text; selective training of RL baselines used only for comparison

Inference Optimization

  • Low-level policy refinement via MCTS with adjustable search budget
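
A compact way to see how a value heuristic plugs into MCTS with an adjustable budget is the sketch below. It is not the paper's executor: it assumes a toy sequential game (take 1 to 3 stones from a pile, taking the last stone wins) rather than GOPS or Avalon, uses plain UCT, and `heuristic_value` is a hypothetical stand-in for an LLM-written heuristic.

```python
import math


def legal_moves(stones):
    """Toy game: take 1, 2, or 3 stones; whoever takes the last stone wins."""
    return [m for m in (1, 2, 3) if m <= stones]


def heuristic_value(stones):
    """Hypothetical stand-in for an LLM-written value heuristic.

    Returns an estimate in [-1, 1] for the player about to move.
    """
    return -1.0 if stones % 4 == 0 else 1.0


class Node:
    def __init__(self, stones):
        self.stones = stones
        self.children = {}  # move -> Node
        self.visits = 0
        self.total = 0.0    # summed value, from the view of the player to move


def simulate(node, c=1.4):
    """Run one simulation; return its value for the player to move at `node`."""
    if node.stones == 0:
        value = -1.0  # the opponent just took the last stone
    elif node.visits == 0:
        value = heuristic_value(node.stones)  # heuristic replaces a rollout
    else:
        untried = [m for m in legal_moves(node.stones) if m not in node.children]
        if untried:
            child = Node(node.stones - untried[0])
            node.children[untried[0]] = child
        else:
            # UCT: a child's value is from the opponent's view, so negate it.
            child = max(
                node.children.values(),
                key=lambda ch: -ch.total / ch.visits
                + c * math.sqrt(math.log(node.visits) / ch.visits),
            )
        value = -simulate(child, c)
    node.visits += 1
    node.total += value
    return value


def best_move(stones, budget):
    """Pick the most-visited root move after `budget` simulations (budget >= 2)."""
    root = Node(stones)
    for _ in range(budget):
        simulate(root)
    return max(root.children, key=lambda m: root.children[m].visits)


print(best_move(5, budget=2000))
```

The `budget` argument is the knob that corresponds to the search budget: more simulations let the tree correct a weak heuristic, while a better heuristic lets a small budget go further.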

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High variance across individual runs; population feedback is noisy in multi-player settings (Section C).
  • Results shown only on two adversarial games; not tested on non-adversarial single-agent domains.
  • LLM generation noise can affect idea quality and repeatability; the paper mitigates this with seeded functions and multiple runs.

When Not To Use

  • When you can afford massive parametric RL training with large datasets and compute; in that regime RL may approximate value functions directly.
  • When the task is single-step or deterministic and does not benefit from strategic abstraction and search.

Failure Modes

  • Overfitting strategies to the population of learned opponents instead of general opponents.
  • LLM-proposed ideas that sound plausible but worsen gameplay; noisy feedback can amplify bad edits.
  • High dependence on MCTS compute budget: low budgets reduce gains from improved strategies.

Core Entities

Models

  • GPT-3.5
  • GPT-4

Metrics

  • Win rate
  • Point difference
  • Improvement score (idea)
  • Tokens per round

Datasets

  • Game of Pure Strategy (GOPS)
  • Resistance: Avalon (multi-agent dialogue game)

Benchmarks

  • Human vs STRATEGIST games (Avalon)
  • Head-to-head play between generated strategies (GOPS, Avalon)

Context Entities

Models

  • AlphaGo-style MCTS + value network (baseline)
  • DeepRole (counterfactual regret + value net)

Metrics

  • Standard error on win rates
  • Search budget (MCTS rollouts)
  • # of simulated episodes
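
For readers reproducing the ± figures on win rates, one common estimator is the binomial standard error. Whether the paper uses exactly this estimator is an assumption; the function name below is hypothetical.

```python
import math


def win_rate_with_se(wins: int, games: int) -> tuple[float, float]:
    """Return (win rate, binomial standard error) for `wins` out of `games`."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    # Standard error of a sample proportion: sqrt(p * (1 - p) / n).
    return p, math.sqrt(p * (1 - p) / games)


rate, se = win_rate_with_se(wins=20, games=60)
print(f"{rate:.3f} ± {se:.3f}")
```

The error shrinks with the square root of the number of games, so halving it takes roughly four times as many simulated episodes.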

Datasets

  • Simulated self-play trajectories (internal)
  • Human evaluation sessions (Avalon)

Benchmarks

  • LLM-critic feedback
  • Line search / Greedy / BFS improvement baselines