STRATEGIST: LLMs learn and refine high-level strategies with bi-level tree search and self-play

Overview

Decision Snapshot: Needs Validation

This is a strong prototype: clear empirical gains on two games and human tests. Expect engineering work to scale the idea to real-world workflows and to reduce variance in noisy multi-agent feedback.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 75%

Authors

Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu

Links

Abstract / PDF / Code

Why It Matters For Business

STRATEGIST shows you can get usable, human-competitive strategies from LLMs without labeled training data by pairing LLM-written strategy text with search and simulated self-play, speeding prototyping of strategic agents and negotiation systems.

Who Should Care

Product Manager, ML Engineer, Founder, CTO

Summary TLDR

STRATEGIST is a bi-level framework that uses an LLM to propose and iteratively improve human-readable high-level strategies, and uses a low-level Monte Carlo Tree Search (MCTS) executor to refine and evaluate those strategies via population self-play. It improves win-rates and strategy quality in two adversarial games (GOPS and Resistance: Avalon), outperforms several LLM-only self-improvement baselines and traditional RL baselines under matched simulation budgets, and produces dialogue guides that help LLM agents conceal identity and coordinate in social-deduction settings.

Problem Statement

LLMs generalize well but struggle to learn detailed, multi-step policies in adversarial multi-agent games. Directly asking an LLM for per-move actions is expensive and brittle. We need a way to let LLMs propose compact strategies and improve them efficiently without supervised training data.

Main Contribution

A bi-level, non-parametric framework (STRATEGIST) that represents high-level strategies as interpretable text/code and refines them via LLM-driven idea generation plus low-level MCTS refinement.

A modular improvement loop that stores candidate ‘improvement ideas’ in a priority queue and tests them via population-based self-play, avoiding training data.
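One way to picture the idea queue, combined with the UCB bandit sampling listed under Frameworks, is the sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation: the `Idea` and `IdeaQueue` names, the exploration constant, and the choice of a plain list scored with UCB1 (rather than a literal heap-based priority queue) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Idea:
    text: str                # LLM-proposed improvement idea (hypothetical format)
    pulls: int = 0           # how many times this idea has been tried
    total_gain: float = 0.0  # summed score change observed when applying it

    @property
    def mean_gain(self) -> float:
        return self.total_gain / self.pulls if self.pulls else 0.0


class IdeaQueue:
    """Stores candidate ideas and picks the next one to test with UCB1."""

    def __init__(self, exploration: float = 1.0):
        self.ideas = []
        self.exploration = exploration  # assumed constant, tune per task
        self.total_pulls = 0

    def add(self, text: str) -> None:
        self.ideas.append(Idea(text))

    def ucb(self, idea: Idea) -> float:
        if idea.pulls == 0:
            return math.inf  # untried ideas are always tested first
        bonus = self.exploration * math.sqrt(math.log(self.total_pulls) / idea.pulls)
        return idea.mean_gain + bonus

    def select(self) -> Idea:
        return max(self.ideas, key=self.ucb)

    def update(self, idea: Idea, gain: float) -> None:
        # `gain` would come from self-play evaluation of the edited strategy.
        idea.pulls += 1
        idea.total_gain += gain
        self.total_pulls += 1
```

In a full loop, `select()` would choose which idea the LLM applies to a strategy, and `update()` would record how much the edited strategy improved in self-play, so ideas that keep paying off are sampled more often while untested ones still get a turn.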

Key Findings

STRATEGIST generated higher-quality value heuristics and dialogue guides than four LLM self-improvement baselines on the evaluated games.

Numbers: GOPS value heuristic: +1.5 ±0.99 vs best baseline 0.092 ±0.67; Avalon Merlin guide: 0.88 ±0.063 vs baseline ≤0.62 (Table

Practical Use: Use STRATEGIST's idea-queue + population self-play to get bigger improvements from LLM-driven strategy edits than greedy or line-search approaches on multi-turn games.

Evidence Ref: Table 2

Population-based self-play produced stronger feedback than LLM critics or fixed-opponent trajectories on the evaluated games.

Numbers: GOPS point diff: 0.87 ±1.54 (population) vs -0.27 ±1.10 (LLM-critic); Avalon winrate: 0.88 ±0.064 vs 0.37 ±0.063 (Table

Practical Use: When improving strategy text, simulate a diverse population of learned strategies rather than relying solely on LLM critique or a single fixed opponent to get more reliable improvement signals.

Evidence Ref: Table 4
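Population-based feedback of this kind can be sketched as a round-robin tournament in which every strategy plays every other one and is scored by its average point difference. The code below is a toy illustration only: the `Strategy` type, `play_match`, and the Gaussian "strength" model are hypothetical stand-ins for real game simulation, and the population must contain at least two strategies.

```python
import itertools
import random
from typing import Callable, Dict

# Hypothetical strategy type: returns a noisy performance draw for one game.
Strategy = Callable[[random.Random], float]


def play_match(a: Strategy, b: Strategy, rng: random.Random) -> float:
    """Toy stand-in for one game: point difference from a's perspective."""
    return a(rng) - b(rng)


def round_robin_scores(
    population: Dict[str, Strategy], games_per_pair: int, seed: int = 0
) -> Dict[str, float]:
    """Average point difference of each strategy against the rest of the population."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in population}
    counts = {name: 0 for name in population}
    for (name_a, a), (name_b, b) in itertools.combinations(population.items(), 2):
        for _ in range(games_per_pair):
            diff = play_match(a, b, rng)
            totals[name_a] += diff
            totals[name_b] -= diff  # zero-sum: b's result mirrors a's
            counts[name_a] += 1
            counts[name_b] += 1
    return {name: totals[name] / counts[name] for name in population}


if __name__ == "__main__":
    population = {
        "weak": lambda rng: rng.gauss(0.0, 1.0),
        "mid": lambda rng: rng.gauss(0.5, 1.0),
        "strong": lambda rng: rng.gauss(1.0, 1.0),
    }
    print(round_robin_scores(population, games_per_pair=200))
```

Averaging over many opponents and many games is what dampens the noise of any single matchup; with a single fixed opponent the score reflects only how well a strategy exploits that one opponent.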

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GOPS value heuristic (final gameplay score)	1.5 ± 0.99	Best-first search 0.092 ± 0.67	≈ +1.4 points	6-card GOPS (Table 2)	Table 2 shows STRATEGIST 1.5 ±0.99 vs BFS 0.092 ±0.67	Table 2
Avalon value heuristic (winrate)	0.59 ± 0.11	BFS 0.50 ± 0.085	≈ +0.09 winrate	Avalon (Table 2)	Table 2 Avalon VH: STRATEGIST 0.59 ±0.11 vs BFS 0.50 ±0.085	Table 2
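The deltas in the table can be checked directly from the reported means. The short script below recomputes them and also tests whether the ± bands of STRATEGIST and the baseline overlap. This summary does not state what the ± values represent, so treating them as symmetric bands around the mean is an assumption, and the overlap check is only a rough indication of run-to-run noise, not a significance test.

```python
# (metric, STRATEGIST mean, STRATEGIST ±, baseline mean, baseline ±),
# copied from the Results table above.
rows = [
    ("GOPS value heuristic (points)", 1.5, 0.99, 0.092, 0.67),
    ("Avalon value heuristic (winrate)", 0.59, 0.11, 0.50, 0.085),
]


def summarize(rows):
    """Return {metric: (delta, bands_overlap)} for each table row."""
    out = {}
    for name, mean, err, base, base_err in rows:
        delta = round(mean - base, 3)
        # Assumption: ± is a symmetric band around each mean.
        overlap = (mean - err) <= (base + base_err) and (base - base_err) <= (mean + err)
        out[name] = (delta, overlap)
    return out


summary = summarize(rows)
for name, (delta, overlap) in summary.items():
    print(f"{name}: delta={delta}, bands overlap={overlap}")
# GOPS value heuristic (points): delta=1.408, bands overlap=True
# Avalon value heuristic (winrate): delta=0.09, bands overlap=True
```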

What To Try In 7 Days

Reproduce a small-scale demo: have an LLM generate a simple value heuristic for a two-player toy game and run MCTS to evaluate it.

Implement an idea-queue: collect LLM-proposed edits as separate items and test them incrementally via self-play.

Run population-based self-play rather than single-opponent evaluation to get more robust improvement signals.
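As a starting point for the first item above, the sketch below plugs a hand-written stand-in for an LLM-generated value heuristic into a small MCTS that evaluates leaf nodes with the heuristic instead of random rollouts. Everything concrete here is an assumption made for illustration: the toy game (a Nim variant, not GOPS or Avalon), the function names, the UCB constant `c`, and the random opponent used for scoring. The `budget` argument plays the role of an adjustable search budget.

```python
import math
import random


def legal_moves(stones):
    """Toy game (an assumption): take 1-3 stones; taking the last stone wins."""
    return [k for k in (1, 2, 3) if k <= stones]


def heuristic_value(stones):
    """Stand-in for an LLM-written heuristic: value for the player to move."""
    return 1.0 if stones % 4 != 0 else -1.0


class Node:
    def __init__(self, stones):
        self.stones = stones
        self.children = {}    # move -> Node
        self.visits = 0
        self.value_sum = 0.0  # from the view of the player to move at this node


def mcts_move(stones, value_fn, budget, c=1.4, seed=0):
    """Pick a move with MCTS, using value_fn at leaves instead of rollouts.

    Requires stones >= 1 and budget >= 1.
    """
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(budget):
        node, path = root, [root]
        while node.stones > 0:
            untried = [m for m in legal_moves(node.stones) if m not in node.children]
            if untried:  # expansion: add one new child and stop descending
                move = rng.choice(untried)
                node.children[move] = Node(node.stones - move)
                node = node.children[move]
                path.append(node)
                break
            parent = node  # selection: UCB1, negating the child's (opponent's) value
            node = max(
                parent.children.values(),
                key=lambda ch: -ch.value_sum / ch.visits
                + c * math.sqrt(math.log(parent.visits) / ch.visits),
            )
            path.append(node)
        # Leaf evaluation: a player facing 0 stones has already lost.
        value = -1.0 if node.stones == 0 else value_fn(node.stones)
        for visited in reversed(path):  # backpropagate, flipping sign each ply
            visited.visits += 1
            visited.value_sum += value
            value = -value
    return max(root.children, key=lambda m: root.children[m].visits)


def win_rate_vs_random(value_fn, games, budget, start=10, seed=0):
    """Play MCTS (moving first) against a uniformly random opponent."""
    rng = random.Random(seed)
    wins = 0
    for g in range(games):
        stones, our_turn = start, True
        while stones > 0:
            if our_turn:
                stones -= mcts_move(stones, value_fn, budget, seed=g)
            else:
                stones -= rng.choice(legal_moves(stones))
            our_turn = not our_turn
        wins += 0 if our_turn else 1  # whoever just moved took the last stone
    return wins / games
```

Calling `win_rate_vs_random(heuristic_value, games=20, budget=100)` gives one number per heuristic, so two candidate heuristics can be compared under the same search budget. Swapping the random opponent for a population of other heuristics would move this toward the third item in the list.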

Agent Features

Memory

Population of strategies (strategy library) used for self-play

Planning

High-level strategy search via LLM revisions

Low-level Monte Carlo Tree Search (MCTS) refinement

Tool Use

LLM as strategy author and discriminator

MCTS as executor and shaping evaluator

Frameworks

Idea queue + UCB bandit sampling

Is Agentic

Yes

Architectures

bi-level tree search (high-level LLM strategy + low-level MCTS)

Collaboration

Round-robin population self-play for evaluation

Optimization Features

Token Efficiency

Trade-off: STRATEGIST uses more tokens per round than simple baselines (Table 5 shows higher tokens

System Optimization

Idea queue and UCB sampling to focus improvements

Training Optimization

No parametric training for strategy text; selective training of RL baselines used only for comparison

Inference Optimization

Low-level policy refinement via MCTS with adjustable search budget

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://llm-strategist.github.io

Risks & Boundaries

Limitations

High variance across individual runs; population feedback is noisy in multi-player settings (Section C).

Results shown only on two adversarial games; not tested on non-adversarial single-agent domains.

When Not To Use

When you can afford massive parametric RL training with large datasets and compute — RL may approximate value functions directly.

When the task is single-step or deterministic and does not benefit from strategic abstraction and search.

Failure Modes

Overfitting strategies to the population of learned opponents instead of general opponents.

LLM-proposed ideas that sound plausible but worsen gameplay; noisy feedback can amplify bad edits.

Core Entities

Models

GPT-3.5

GPT-4

Metrics

Win rate

Point difference

Improvement score (idea)

Tokens per round

Datasets

Game of Pure Strategy (GOPS)

Resistance: Avalon (multi-agent dialogue game)

Benchmarks

Human vs STRATEGIST games (Avalon)

Head-to-head play between generated strategies (GOPS, Avalon)

Context Entities

Models

AlphaGo-style MCTS + value network (baseline)

DeepRole (counterfactual regret + value net)

Metrics

Standard error on win rates

Search budget (MCTS rollouts)

Number of simulated episodes

Datasets

Simulated self-play trajectories (internal)

Human evaluation sessions (Avalon)

Benchmarks

LLM-critic feedback

Line search / Greedy / BFS improvement baselines
