Overview
Production Readiness
0.6
Novelty Score
0.75
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
STRATEGIST shows that LLMs can produce usable, human-competitive strategies without labeled training data when LLM-written strategy text is paired with search and simulated self-play. This can speed up prototyping of strategic agents and negotiation systems.
Summary TLDR
STRATEGIST is a bi-level framework that uses an LLM to propose and iteratively improve human-readable high-level strategies, and uses a low-level Monte Carlo Tree Search (MCTS) executor to refine and evaluate those strategies via population self-play. It improves win rates and strategy quality in two adversarial games (GOPS and Resistance: Avalon), outperforms several LLM-only self-improvement baselines and traditional RL baselines under matched simulation budgets, and produces dialogue guides that help LLM agents conceal identity and coordinate in social-deduction settings.
Problem Statement
LLMs generalize well but struggle to learn detailed, multi-step policies in adversarial multi-agent games. Directly asking an LLM for per-move actions is expensive and brittle. We need a way to let LLMs propose compact strategies and improve them efficiently without supervised training data.
Main Contribution
A bi-level, non-parametric framework (STRATEGIST) that represents high-level strategies as interpretable text/code and refines them via LLM-driven idea generation plus low-level MCTS refinement.
A modular improvement loop that stores candidate ‘improvement ideas’ in a priority queue and tests them via population-based self-play, avoiding training data.
Empirical evaluation on two adversarial games (GOPS and Resistance: Avalon) showing better performance than several LLM self-improvement baselines and competitive or superior results vs RL baselines and human players.
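The idea queue described above can be sketched as a small priority queue. This is a minimal illustration, not the paper's implementation: the class name `IdeaQueue`, its methods, the example idea texts, and the score-update rule are all hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedIdea:
    # heapq is a min-heap, so the score is negated to pop the best idea first.
    neg_score: float
    order: int                       # tie-breaker: earlier ideas win ties
    text: str = field(compare=False)


class IdeaQueue:
    """Priority queue of improvement ideas keyed by estimated score (hypothetical API)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, text, score):
        heapq.heappush(self._heap, QueuedIdea(-score, next(self._counter), text))

    def pop_best(self):
        item = heapq.heappop(self._heap)
        return item.text, -item.neg_score

    def __len__(self):
        return len(self._heap)


# Illustrative ideas and scores; a real run would get these from an LLM and self-play.
queue = IdeaQueue()
queue.push("bid higher on high-value prize cards", 0.10)
queue.push("save the top card for the final round", 0.40)
queue.push("mirror the opponent's last bid", 0.25)

best_text, best_score = queue.pop_best()
# Assumed update rule: average the old estimate with a newly measured gain of 0.05,
# then put the idea back so it can be reconsidered later.
queue.push(best_text, 0.5 * best_score + 0.5 * 0.05)
```

Keeping ideas as separate queue items means each one can be tested and re-scored on its own, rather than being folded into a single monolithic prompt revision.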
Key Findings
STRATEGIST generated higher-quality value heuristics and dialogue guides than four LLM self-improvement baselines on the evaluated games.
Population-based self-play produced stronger feedback than LLM critics or fixed-opponent trajectories on the evaluated games.
STRATEGIST beat RL baselines given the same simulated-episode budget, while using far fewer training transition steps.
STRATEGIST achieves human-competitive results in Avalon, but its strengths and weaknesses differ from those of human players.
STRATEGIST's low-level MCTS refinement scales with search budget and amplifies high-level strategy improvements.
Results
GOPS value heuristic (final gameplay score)
Avalon value heuristic (win rate)
Merlin dialogue guide (effectiveness score)
Human vs STRATEGIST win rate (Avalon)
Head-to-head vs LLM baselines (win rate)
Feedback method comparison (GOPS point diff)
Who Should Care
What To Try In 7 Days
Reproduce a small-scale demo: have an LLM generate a simple value heuristic for a two-player toy game and run MCTS to evaluate it.
Implement an idea-queue: collect LLM-proposed edits as separate items and test them incrementally via self-play.
Run population-based self-play rather than single-opponent evaluation to get more robust improvement signals.
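As a starting point for the steps above, the sketch below plays a simplified GOPS and scores a small population by round-robin self-play. It rests on several assumptions: tied bids discard the prize, the three hand-written bidding rules stand in for LLM-generated strategies, and MCTS is left out entirely. All function names are hypothetical.

```python
import itertools
import random


def play_gops(strategy_a, strategy_b, n_cards=6, rng=None):
    """Play one simplified GOPS game and return (score_a, score_b).

    Simplification (assumption): a tied bid discards the prize.
    """
    rng = rng or random.Random()
    prizes = list(range(1, n_cards + 1))
    rng.shuffle(prizes)
    hand_a = set(range(1, n_cards + 1))
    hand_b = set(range(1, n_cards + 1))
    score_a = score_b = 0
    for prize in prizes:
        bid_a = strategy_a(prize, hand_a, hand_b, rng)
        bid_b = strategy_b(prize, hand_b, hand_a, rng)
        hand_a.remove(bid_a)
        hand_b.remove(bid_b)
        if bid_a > bid_b:
            score_a += prize
        elif bid_b > bid_a:
            score_b += prize
    return score_a, score_b


# Stand-in strategies with signature (prize, my_hand, opp_hand, rng) -> card to bid.
def random_bid(prize, my_hand, opp_hand, rng):
    return rng.choice(sorted(my_hand))


def match_prize(prize, my_hand, opp_hand, rng):
    return prize if prize in my_hand else min(my_hand)


def one_above(prize, my_hand, opp_hand, rng):
    return prize + 1 if prize + 1 in my_hand else min(my_hand)


def round_robin(population, games_per_pair=200, seed=0):
    """Return each strategy's mean point difference against the rest of the population."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in population}
    counts = {name: 0 for name in population}
    for (name_a, strat_a), (name_b, strat_b) in itertools.combinations(population.items(), 2):
        for _ in range(games_per_pair):
            score_a, score_b = play_gops(strat_a, strat_b, rng=rng)
            totals[name_a] += score_a - score_b
            totals[name_b] += score_b - score_a
            counts[name_a] += 1
            counts[name_b] += 1
    return {name: totals[name] / counts[name] for name in population}


population = {"random_bid": random_bid, "match_prize": match_prize, "one_above": one_above}
scores = round_robin(population)
for name, diff in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean point difference {diff:+.2f}")
```

Scoring against the whole population rather than one fixed opponent makes it harder for a strategy to look good by exploiting a single weakness, which is the motivation given for population-based feedback.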
Agent Features
Memory
- Population of strategies (strategy library) used for self-play
Planning
- High-level strategy search via LLM revisions
- Low-level Monte Carlo Tree Search (MCTS) refinement
Tool Use
- LLM as strategy author and discriminator
- MCTS as executor and shaping evaluator
Frameworks
- Idea queue + UCB bandit sampling
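One way to read "UCB bandit sampling" over an idea queue is the standard UCB1 rule, sketched below. The paper's exact scoring formula may differ; the idea names and win probabilities are made up for illustration, and `ucb1_pick` is a hypothetical helper.

```python
import math
import random


def ucb1_pick(stats, c=math.sqrt(2)):
    """Return the idea with the highest UCB1 score.

    stats maps idea -> [times_tried, total_reward]. Untried ideas are picked first.
    """
    for idea, (tries, _) in stats.items():
        if tries == 0:
            return idea
    total_tries = sum(tries for tries, _ in stats.values())

    def score(idea):
        tries, total = stats[idea]
        # Mean reward plus an exploration bonus that shrinks as the idea is tried more.
        return total / tries + c * math.sqrt(math.log(total_tries) / tries)

    return max(stats, key=score)


# Simulated feedback: each idea "wins" with a made-up probability.
rng = random.Random(0)
true_win_rate = {"idea_a": 0.2, "idea_b": 0.5, "idea_c": 0.8}
stats = {idea: [0, 0.0] for idea in true_win_rate}
for _ in range(300):
    idea = ucb1_pick(stats)
    reward = 1.0 if rng.random() < true_win_rate[idea] else 0.0
    stats[idea][0] += 1
    stats[idea][1] += reward

for idea, (tries, total) in stats.items():
    print(f"{idea}: tried {tries} times, mean reward {total / tries:.2f}")
```

The exploration bonus lets a noisy but promising idea get retested instead of being discarded after one unlucky evaluation.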
Is Agentic
true
Architectures
- bi-level tree search (high-level LLM strategy + low-level MCTS)
Collaboration
- Round-robin population self-play for evaluation
Optimization Features
Token Efficiency
- Trade-off: STRATEGIST uses more tokens per round than simple baselines (Table 5 shows higher tokens
System Optimization
- Idea queue and UCB sampling to focus improvements
Training Optimization
- No parametric training for strategy text; RL baselines were trained only for comparison
Inference Optimization
- Low-level policy refinement via MCTS with adjustable search budget
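To make the "adjustable search budget" concrete, here is a compact UCT-style MCTS with a `budget` parameter. It is a sketch under stated assumptions: the game is single-pile Nim (take 1 to 3 stones, whoever takes the last stone wins) rather than GOPS or Avalon, and leaves are evaluated by random rollouts where STRATEGIST would use an LLM-written value heuristic. All names are hypothetical.

```python
import math
import random


class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left; the player to move takes 1-3
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who moved INTO this node


def rollout(stones, rng):
    """Random playout; return 1.0 if the player to move at `stones` wins, else 0.0."""
    mover_is_start_player = True
    while stones > 0:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return 1.0 if mover_is_start_player else 0.0
        mover_is_start_player = not mover_is_start_player
    return 0.0  # no stones left: the previous player took the last one


def mcts_best_move(stones, budget, rng, c=1.4):
    """Run `budget` MCTS iterations from `stones` and return the most-visited move."""
    if stones < 1 or budget < 1:
        raise ValueError("need at least one stone and one iteration")
    root = Node(stones)
    for _ in range(budget):
        node = root
        # Selection: descend through fully expanded nodes using the UCT score.
        while not node.untried and node.children:
            parent = node
            node = max(
                parent.children,
                key=lambda ch: ch.wins / ch.visits
                + c * math.sqrt(math.log(parent.visits) / ch.visits),
            )
        # Expansion: add one unexplored child, if any.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # Simulation: outcome for the player to move at `node`.
        result = rollout(node.stones, rng)
        # Backpropagation: flip the reward at each level, since players alternate.
        reward = 1.0 - result
        while node is not None:
            node.visits += 1
            node.wins += reward
            reward = 1.0 - reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move


print(mcts_best_move(3, budget=500, rng=random.Random(0)))
```

Raising `budget` gives the search more iterations to separate good moves from bad ones, which is the knob the low-level refinement exposes.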
Reproducibility
Code Urls
Open Source Status
- partial
Risks & Boundaries
Limitations
- High variance across individual runs; population feedback is noisy in multi-player settings (Section C).
- Results shown only on two adversarial games; not tested on non-adversarial single-agent domains.
- LLM generation noise can affect idea quality and repeatability; the paper mitigates this with seeded functions and multiple runs.
When Not To Use
- When you can afford massive parametric RL training with large datasets and compute; in that regime RL may approximate value functions directly.
- When the task is single-step or deterministic and does not benefit from strategic abstraction and search.
Failure Modes
- Overfitting strategies to the population of learned opponents instead of general opponents.
- LLM-proposed ideas that sound plausible but worsen gameplay; noisy feedback can amplify bad edits.
- High dependence on MCTS compute budget: low budgets reduce gains from improved strategies.
Core Entities
Models
- GPT-3.5
- GPT-4
Metrics
- Win rate
- Point difference
- Improvement score (idea)
- Tokens per round
Datasets
- Game of Pure Strategy (GOPS)
- Resistance: Avalon (multi-agent dialogue game)
Benchmarks
- Human vs STRATEGIST games (Avalon)
- Head-to-head play between generated strategies (GOPS, Avalon)
Context Entities
Models
- AlphaGo-style MCTS + value network (baseline)
- DeepRole (counterfactual regret + value net)
Metrics
- Standard error on win rates
- Search budget (MCTS rollouts)
- Number of simulated episodes
Datasets
- Simulated self-play trajectories (internal)
- Human evaluation sessions (Avalon)
Benchmarks
- LLM-critic feedback
- Line search / Greedy / BFS improvement baselines

