STRATEGIST: LLMs learn and refine high-level strategies with bi-level tree search and self-play

August 20, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

This is a strong prototype: clear empirical gains on two games and human tests. Expect engineering work to scale the idea to real-world workflows and to reduce variance in noisy multi-agent feedback.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 75%

Authors

Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu

Why It Matters For Business

STRATEGIST shows that LLMs can produce usable, human-competitive strategies without labeled training data when LLM-written strategy text is paired with search and simulated self-play. This speeds up prototyping of strategic agents and negotiation systems.

Summary TLDR

STRATEGIST is a bi-level framework that uses an LLM to propose and iteratively improve human-readable high-level strategies, and uses a low-level Monte Carlo Tree Search (MCTS) executor to refine and evaluate those strategies via population self-play. It improves win-rates and strategy quality in two adversarial games (GOPS and Resistance: Avalon), outperforms several LLM-only self-improvement baselines and traditional RL baselines under matched simulation budgets, and produces dialogue guides that help LLM agents conceal identity and coordinate in social-deduction settings.

Problem Statement

LLMs generalize well but struggle to learn detailed, multi-step policies in adversarial multi-agent games. Directly asking an LLM for per-move actions is expensive and brittle. We need a way to let LLMs propose compact strategies and improve them efficiently without supervised training data.

Main Contribution

A bi-level, non-parametric framework (STRATEGIST) that represents high-level strategies as interpretable text/code and refines them via LLM-driven idea generation plus low-level MCTS refinement.

A modular improvement loop that stores candidate ‘improvement ideas’ in a priority queue and tests them via population-based self-play, avoiding training data.
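The improvement loop described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: a strategy is reduced to a single number, `propose_ideas` is a hypothetical stand-in for the LLM call, and `TARGET`, `quality`, and `self_play_score` are invented purely so the loop has something to optimize. Ideas sit in a priority queue keyed by their last observed gain, and a candidate joins the population only if it scores better in self-play than the strategy it was derived from.

```python
import heapq
import random

TARGET = 3.0  # hidden optimum of the toy game (assumption for illustration)

def quality(strategy):
    # Toy stand-in for playing strength: closer to TARGET is stronger.
    return -abs(strategy - TARGET)

def self_play_score(strategy, population):
    # Fraction of population members that `strategy` beats head-to-head.
    return sum(quality(strategy) > quality(o) for o in population) / len(population)

def propose_ideas(rng, n=4):
    # Hypothetical stand-in for an LLM proposing edits; an idea is a numeric shift.
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def improve(rounds=30, seed=0):
    rng = random.Random(seed)
    population = [0.0]   # start from one weak strategy
    queue = []           # max-heap via negated priority: (-gain, tiebreak, idea)
    counter = 0          # unique tiebreak so ideas themselves are never compared
    for _ in range(rounds):
        # New ideas enter the queue with a neutral priority of 0.
        for idea in propose_ideas(rng):
            heapq.heappush(queue, (0.0, counter, idea))
            counter += 1
        # Try the idea with the highest recorded gain.
        _, _, idea = heapq.heappop(queue)
        base = max(population, key=lambda s: self_play_score(s, population))
        candidate = base + idea
        gain = self_play_score(candidate, population) - self_play_score(base, population)
        if gain > 0:
            population.append(candidate)
        # Re-queue the idea with its observed gain as the new priority.
        heapq.heappush(queue, (-gain, counter, idea))
        counter += 1
    return population

population = improve()
best = max(population, key=quality)
```

Ideas that helped are retried first, ideas that hurt sink to the bottom of the queue, and no gradient or labeled data is involved at any point.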

Key Findings

STRATEGIST generated higher-quality value heuristics and dialogue guides than four LLM self-improvement baselines on the evaluated games.

Numbers: GOPS value heuristic: +1.5 ±0.99 vs best baseline 0.092 ±0.67; Avalon Merlin guide: 0.88 ±0.063 vs baseline ≤0.62 (Table

Practical Use: Use STRATEGIST's idea-queue + population self-play to get bigger improvements from LLM-driven strategy edits than greedy or line-search approaches on multi-turn games.

Evidence Ref: Table 2

Population-based self-play produced stronger feedback than LLM critics or fixed-opponent trajectories on the evaluated games.

Numbers: GOPS point diff: 0.87 ±1.54 (population) vs -0.27 ±1.10 (LLM-critic); Avalon winrate: 0.88 ±0.064 vs 0.37 ±0.063 (Table

Practical Use: When improving strategy text, simulate a diverse population of learned strategies rather than relying solely on LLM critique or a single fixed opponent to get more reliable improvement signals.

Evidence Ref: Table 4
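Population-based feedback of this kind can be sketched as a round-robin over a small strategy pool. The example below is an illustration under stated assumptions, not the paper's evaluation harness: it uses a simplified 6-card GOPS in which tied bids discard the prize, and the three strategies (`match_prize`, `one_up`, `random_bid`) are hand-written placeholders for LLM-generated ones.

```python
import itertools
import random

N_CARDS = 6  # 6-card GOPS; tie handling below is an assumption

def play_gops(strat_a, strat_b, rng):
    """Play one game and return player A's point difference."""
    hand_a = list(range(1, N_CARDS + 1))
    hand_b = list(range(1, N_CARDS + 1))
    prizes = list(range(1, N_CARDS + 1))
    rng.shuffle(prizes)
    diff = 0
    for prize in prizes:
        bid_a = strat_a(prize, hand_a, hand_b, rng)
        bid_b = strat_b(prize, hand_b, hand_a, rng)
        hand_a.remove(bid_a)
        hand_b.remove(bid_b)
        if bid_a > bid_b:
            diff += prize
        elif bid_b > bid_a:
            diff -= prize
        # Equal bids: the prize is discarded (simplifying assumption).
    return diff

def match_prize(prize, my_hand, opp_hand, rng):
    return prize  # bid the card equal to the prize value

def one_up(prize, my_hand, opp_hand, rng):
    # Bid one above the prize if possible, otherwise dump the lowest card.
    return prize + 1 if prize + 1 in my_hand else min(my_hand)

def random_bid(prize, my_hand, opp_hand, rng):
    return rng.choice(my_hand)

POPULATION = {"match_prize": match_prize, "one_up": one_up, "random_bid": random_bid}

def round_robin(population, games=200, seed=0):
    """Mean point difference per game for each strategy against all others."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in population}
    for (name_a, a), (name_b, b) in itertools.combinations(population.items(), 2):
        for _ in range(games):
            diff = play_gops(a, b, rng)
            totals[name_a] += diff
            totals[name_b] -= diff
    games_per_strategy = games * (len(population) - 1)
    return {name: total / games_per_strategy for name, total in totals.items()}

scores = round_robin(POPULATION)
```

Because every strategy is scored against the whole pool, a strategy that only exploits one opponent (as `one_up` does against `match_prize`) does not automatically rank first.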

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GOPS value heuristic (final gameplay score) | 1.5 ± 0.99 | Best-first search (BFS) 0.092 ± 0.67 | ≈ +1.4 points | 6-card GOPS (Table 2) | Table 2 shows STRATEGIST 1.5 ± 0.99 vs BFS 0.092 ± 0.67 | Table 2
Avalon value heuristic (win rate) | 0.59 ± 0.11 | BFS 0.50 ± 0.085 | ≈ +0.09 win rate | Avalon (Table 2) | Table 2 Avalon VH: STRATEGIST 0.59 ± 0.11 vs BFS 0.50 ± 0.085 | Table 2
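The Delta column can be recomputed directly from the Value and Baseline figures in the rows above. The short check below also divides each delta by a combined error width, assuming the ± values are independent standard errors; the summary does not state this, so treat the ratio as a rough sanity check rather than a significance test.

```python
import math

# (value, value_err, baseline, baseline_err), copied from the Results rows
rows = {
    "GOPS value heuristic": (1.5, 0.99, 0.092, 0.67),
    "Avalon value heuristic": (0.59, 0.11, 0.50, 0.085),
}

def delta_report(value, value_err, baseline, baseline_err):
    delta = value - baseline
    # Combined width under the independence assumption named above.
    combined = math.hypot(value_err, baseline_err)
    return round(delta, 3), round(delta / combined, 2)

report = {name: delta_report(*vals) for name, vals in rows.items()}
for name, (delta, ratio) in report.items():
    print(f"{name}: delta={delta}, delta/combined_err={ratio}")
```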

What To Try In 7 Days

Reproduce a small-scale demo: have an LLM generate a simple value heuristic for a two-player toy game and run MCTS to evaluate it.

Implement an idea-queue: collect LLM-proposed edits as separate items and test them incrementally via self-play.

Run population-based self-play rather than single-opponent evaluation to get more robust improvement signals.
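For the first item above, a small-scale demo might look like the following sketch. It is not the STRATEGIST code: the game is single-pile Nim (take 1 to 3 stones, taking the last stone wins), chosen only because it fits in a few lines, and `value_heuristic` is a hand-written stand-in for a heuristic an LLM would generate. The search is a basic UCT-style MCTS that evaluates new leaves with the heuristic instead of random rollouts.

```python
import math

def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

def value_heuristic(stones):
    # Stand-in for an LLM-written heuristic (assumption): estimated value
    # for the player about to move. Multiples of 4 are losing in this game.
    return -1.0 if stones % 4 == 0 else 1.0

class Node:
    def __init__(self, stones):
        self.stones = stones
        self.children = {}    # move -> Node
        self.visits = 0
        self.value_sum = 0.0  # from the perspective of the player to move here

def simulate(node, c=1.4):
    """Run one MCTS iteration from `node`; return its value for the player to move."""
    if node.stones == 0:
        value = -1.0  # the previous player took the last stone and won
    elif node.visits == 0:
        value = value_heuristic(node.stones)  # leaf evaluation, no rollout
    else:
        def ucb(move):
            child = node.children.get(move)
            if child is None:
                return math.inf  # try every move at least once
            mean = -child.value_sum / child.visits  # child's value is the opponent's
            return mean + c * math.sqrt(math.log(node.visits) / child.visits)
        move = max(legal_moves(node.stones), key=ucb)
        if move not in node.children:
            node.children[move] = Node(node.stones - move)
        value = -simulate(node.children[move], c)
    node.visits += 1
    node.value_sum += value
    return value

def best_move(stones, iterations=1000):
    root = Node(stones)
    for _ in range(iterations):
        simulate(root)
    # Pick the most-visited move, the usual MCTS recommendation rule.
    return max(root.children, key=lambda m: root.children[m].visits)

move = best_move(5)
```

From a pile of 5 the winning move is to take 1, leaving the opponent a multiple of 4. Swapping in a different `value_heuristic` and comparing the resulting agents head-to-head is the smallest version of the evaluate-then-revise cycle.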

Agent Features

Memory
Population of strategies (strategy library) used for self-play
Planning
High-level strategy search via LLM revisions
Low-level Monte Carlo Tree Search (MCTS) refinement
Tool Use
LLM as strategy author and discriminator
MCTS as executor and shaping evaluator
Frameworks
Idea queue + UCB bandit sampling
Is Agentic

Yes

Architectures
bi-level tree search (high-level LLM strategy + low-level MCTS)
Collaboration
Round-robin population self-play for evaluation

Optimization Features

Token Efficiency

Trade-off: STRATEGIST uses more tokens per round than simple baselines (Table 5 shows higher tokens

System Optimization
Idea queue and UCB sampling to focus improvements
Training Optimization

No parametric training for strategy text; selective training of RL baselines used only for comparison

Inference Optimization
Low-level policy refinement via MCTS with adjustable search budget
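The UCB sampling listed under System Optimization can be illustrated with the standard UCB1 rule applied to a handful of candidate ideas. This is a generic bandit sketch, not the paper's exact scoring: the idea names, their `true_gain` values, and the Gaussian noise model are all hypothetical and exist only to simulate noisy self-play feedback.

```python
import math
import random

def ucb1_pick(stats, total_pulls, c=1.4):
    """Pick the idea with the highest UCB1 score; untried ideas go first."""
    def score(idea):
        pulls, mean = stats[idea]
        if pulls == 0:
            return math.inf
        return mean + c * math.sqrt(math.log(total_pulls) / pulls)
    return max(stats, key=score)

def run_bandit(true_gain, rounds=500, noise=0.5, seed=0):
    rng = random.Random(seed)
    stats = {idea: (0, 0.0) for idea in true_gain}  # idea -> (pulls, mean gain)
    for t in range(1, rounds + 1):
        idea = ucb1_pick(stats, t)
        # Simulated noisy improvement score (stand-in for self-play feedback).
        reward = true_gain[idea] + rng.gauss(0.0, noise)
        pulls, mean = stats[idea]
        pulls += 1
        mean += (reward - mean) / pulls  # incremental mean update
        stats[idea] = (pulls, mean)
    return stats

# Hypothetical ideas with made-up average gains.
true_gain = {"idea_a": 0.8, "idea_b": 0.0, "idea_c": -0.5}
stats = run_bandit(true_gain)
most_tested = max(stats, key=lambda idea: stats[idea][0])
```

The exploration bonus shrinks as an idea accumulates trials, so the simulation budget drifts toward ideas whose measured gains hold up while weaker ideas still get occasional re-checks.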

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

High variance across individual runs; population feedback is noisy in multi-player settings (Section C).

Results shown only on two adversarial games; not tested on non-adversarial single-agent domains.

When Not To Use

When you can afford massive parametric RL training with large datasets and compute, since RL may then approximate value functions directly.

When the task is single-step or deterministic and does not benefit from strategic abstraction and search.

Failure Modes

Overfitting strategies to the population of learned opponents instead of general opponents.

LLM-proposed ideas that sound plausible but worsen gameplay; noisy feedback can amplify bad edits.

Core Entities

Models

GPT-3.5
GPT-4

Metrics

Win rate
Point difference
Improvement score (idea)
Tokens per round

Datasets

Game of Pure Strategy (GOPS)
Resistance: Avalon (multi-agent dialogue game)

Benchmarks

Human vs STRATEGIST games (Avalon)
Head-to-head play between generated strategies (GOPS, Avalon)

Context Entities

Models

AlphaGo-style MCTS + value network (baseline)
DeepRole (counterfactual regret + value net)

Metrics

Standard error on win rates
Search budget (MCTS rollouts)
Number of simulated episodes

Datasets

Simulated self-play trajectories (internal)
Human evaluation sessions (Avalon)

Benchmarks

LLM-critic feedback
Line search / Greedy / BFS improvement baselines