Overview
This is a strong prototype, with clear empirical gains on two games and in human tests. Expect engineering work to scale the idea to real-world workflows and to reduce variance from noisy multi-agent feedback.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 75%
Why It Matters For Business
STRATEGIST shows that LLMs can produce usable, human-competitive strategies without labeled training data when LLM-written strategy text is paired with search and simulated self-play. This can speed up prototyping of strategic agents and negotiation systems.
Who Should Care
Summary TLDR
STRATEGIST is a bi-level framework. At the high level, an LLM proposes and iteratively improves human-readable strategies; at the low level, a Monte Carlo Tree Search (MCTS) executor refines and evaluates those strategies via population self-play. It improves win rates and strategy quality in two adversarial games (GOPS and Resistance: Avalon) and outperforms several LLM-only self-improvement baselines and traditional RL baselines under matched simulation budgets. It also produces dialogue guides that help LLM agents conceal identity and coordinate in social-deduction settings.
Problem Statement
LLMs generalize well but struggle to learn detailed, multi-step policies in adversarial multi-agent games. Directly asking an LLM for per-move actions is expensive and brittle. We need a way to let LLMs propose compact strategies and improve them efficiently without supervised training data.
Main Contribution
A bi-level, non-parametric framework (STRATEGIST) that represents high-level strategies as interpretable text/code and refines them via LLM-driven idea generation plus low-level MCTS refinement.
A modular improvement loop that stores candidate ‘improvement ideas’ in a priority queue and tests them via population-based self-play, avoiding training data.
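The idea-queue loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `IdeaQueue` class, the `improve` function, the re-queueing of an idea with its measured gain, and the toy integer "strategy" are all assumptions made for the example. In a real setup, `apply_idea` would call an LLM to edit the strategy text and `evaluate` would run self-play.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Idea:
    neg_score: float               # heapq is a min-heap, so store the negated score
    order: int                     # insertion order breaks ties deterministically
    text: str = field(compare=False)


class IdeaQueue:
    """Priority queue of improvement ideas, highest score first (hypothetical helper)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, text, score):
        heapq.heappush(self._heap, Idea(-score, next(self._counter), text))

    def pop_best(self):
        idea = heapq.heappop(self._heap)
        return idea.text, -idea.neg_score

    def __len__(self):
        return len(self._heap)


def improve(strategy, queue, apply_idea, evaluate, rounds):
    """Try the most promising idea each round; keep it only if the score improves."""
    best_score = evaluate(strategy)
    for _ in range(rounds):
        if len(queue) == 0:
            break
        text, _prior = queue.pop_best()
        candidate = apply_idea(strategy, text)
        gain = evaluate(candidate) - best_score
        if gain > 0:
            strategy, best_score = candidate, best_score + gain
        # Assumption: re-queue the idea with its measured gain as the new priority.
        queue.push(text, gain)
    return strategy, best_score


# Toy stand-ins: the "strategy" is an integer and the best possible value is 3.
queue = IdeaQueue()
queue.push("inc", 0.0)
queue.push("dec", 0.0)
final_strategy, final_score = improve(
    strategy=0,
    queue=queue,
    apply_idea=lambda s, idea: s + 1 if idea == "inc" else s - 1,
    evaluate=lambda s: -abs(s - 3),
    rounds=5,
)
```

Keeping ideas as separate queue items means each edit is tested on its own, so a harmful idea sinks in priority without blocking the others.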
Key Findings
STRATEGIST generated higher-quality value heuristics and dialogue guides than four LLM self-improvement baselines on the evaluated games.
Population-based self-play produced stronger feedback than LLM critics or fixed-opponent trajectories on the evaluated games.
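As a rough illustration of population-based feedback, the sketch below scores every strategy by round-robin play against all the others. The function name `population_scores`, the `play` callback signature, and the numeric "strength" strategies are assumptions for the example; the toy `play` is deterministic, whereas real game outcomes would be noisy.

```python
import itertools
import random
from statistics import mean


def population_scores(population, play, games_per_pair=10, seed=0):
    """Average outcome of each strategy against every other member.

    `play(a, b, rng)` is a hypothetical callback returning +1 if `a` wins,
    -1 if `b` wins, and 0 for a draw.
    """
    rng = random.Random(seed)
    outcomes = {name: [] for name in population}
    for a, b in itertools.combinations(population, 2):
        for _ in range(games_per_pair):
            result = play(population[a], population[b], rng)
            outcomes[a].append(result)
            outcomes[b].append(-result)
    return {name: mean(results) for name, results in outcomes.items()}


def toy_play(a, b, rng):
    # Toy game: the higher "strength" always wins; rng is unused here but
    # kept in the signature so a stochastic game can use it.
    return (a > b) - (a < b)


scores = population_scores({"weak": 1, "mid": 2, "strong": 3}, toy_play)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Averaging over the whole population, rather than a single fixed opponent, makes the score less dependent on the quirks of any one opponent.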
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GOPS value heuristic (final gameplay score) | 1.5 ± 0.99 | Best-first search 0.092 ± 0.67 | ≈ +1.4 points | 6-card GOPS (Table 2) | Table 2 shows STRATEGIST 1.5 ±0.99 vs BFS 0.092 ±0.67 | Table 2 |
| Avalon value heuristic (win rate) | 0.59 ± 0.11 | BFS 0.50 ± 0.085 | ≈ +0.09 win rate | Avalon (Table 2) | Table 2 Avalon VH: STRATEGIST 0.59 ±0.11 vs BFS 0.50 ±0.085 | Table 2 |
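
The deltas in the table can be checked with a few lines of arithmetic. The snippet also reports whether the ± ranges overlap. Note that the summary does not say what the ± values denote (for example standard deviation or standard error), so the overlap check is only a rough sanity check under the assumption of a symmetric spread, not a significance test.

```python
# (STRATEGIST mean, spread, baseline mean, spread), copied from the table above.
results = {
    "GOPS": (1.5, 0.99, 0.092, 0.67),
    "Avalon": (0.59, 0.11, 0.50, 0.085),
}

deltas = {}
overlaps = {}
for name, (ours, ours_pm, base, base_pm) in results.items():
    deltas[name] = round(ours - base, 3)
    # Two intervals overlap when each lower bound is below the other's upper bound.
    overlaps[name] = (ours - ours_pm <= base + base_pm) and (base - base_pm <= ours + ours_pm)

for name in results:
    print(f"{name}: delta = {deltas[name]}, ranges overlap = {overlaps[name]}")
```

In both rows the ranges overlap, which is consistent with the high run-to-run variance noted under Limitations.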
What To Try In 7 Days
Reproduce a small-scale demo: have an LLM generate a simple value heuristic for a two-player toy game and run MCTS to evaluate it.
Implement an idea-queue: collect LLM-proposed edits as separate items and test them incrementally via self-play.
Run population-based self-play rather than single-opponent evaluation to get more robust improvement signals.
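A small starting point for the first step could look like the sketch below. It is not the paper's setup: `value_heuristic` is a hand-written stand-in for what an LLM might propose, a one-ply lookahead is used as a simplified substitute for MCTS, and ties are assumed to discard the prize (GOPS variants differ on this). All function names are hypothetical.

```python
import random
from statistics import mean


def value_heuristic(my_score, opp_score, my_hand, opp_hand, prizes_left):
    # Stand-in for an LLM-written heuristic: current lead plus the remaining
    # prize pool weighted by relative hand strength.
    total = sum(my_hand) + sum(opp_hand)
    share = sum(my_hand) / total if total else 0.5
    return (my_score - opp_score) + sum(prizes_left) * (2 * share - 1)


def lookahead_policy(prize, my_score, opp_score, my_hand, opp_hand, prizes_left, rng):
    # One-ply search: score each bid by the average heuristic value over
    # every card the opponent could play in reply.
    def expected_value(bid):
        values = []
        for opp_bid in opp_hand:
            mine = my_score + (prize if bid > opp_bid else 0)
            theirs = opp_score + (prize if opp_bid > bid else 0)
            values.append(value_heuristic(
                mine, theirs,
                [c for c in my_hand if c != bid],
                [c for c in opp_hand if c != opp_bid],
                prizes_left,
            ))
        return sum(values) / len(values)

    return max(my_hand, key=expected_value)


def random_policy(prize, my_score, opp_score, my_hand, opp_hand, prizes_left, rng):
    return rng.choice(my_hand)


def play_gops(policy_a, policy_b, n_cards=6, seed=0):
    """Play one game and return player A's score minus player B's score."""
    rng = random.Random(seed)
    prizes = list(range(1, n_cards + 1))
    rng.shuffle(prizes)
    hand_a = list(range(1, n_cards + 1))
    hand_b = list(range(1, n_cards + 1))
    score_a = score_b = 0
    while prizes:
        prize = prizes.pop()
        bid_a = policy_a(prize, score_a, score_b, hand_a, hand_b, prizes, rng)
        bid_b = policy_b(prize, score_b, score_a, hand_b, hand_a, prizes, rng)
        hand_a.remove(bid_a)
        hand_b.remove(bid_b)
        if bid_a > bid_b:
            score_a += prize
        elif bid_b > bid_a:
            score_b += prize
        # Tie: the prize is discarded (assumed rule).
    return score_a - score_b


# Average margin of the heuristic player against a random opponent.
mean_margin = mean(play_gops(lookahead_policy, random_policy, seed=s) for s in range(200))
print(f"mean score margin vs random over 200 games: {mean_margin:.2f}")
```

Swapping `value_heuristic` for an LLM-generated function, and the random opponent for a population of earlier heuristics, would move this toy closer to the loop the paper describes.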
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Trade-off: STRATEGIST uses more tokens per round than simple baselines (Table 5 shows higher tokens
System Optimization
Training Optimization
No parametric training for strategy text; selective training of RL baselines used only for comparison
Inference Optimization
Reproducibility
Code URLs
Risks & Boundaries
Limitations
High variance across individual runs; population feedback is noisy in multi-player settings (Section C).
Results shown only on two adversarial games; not tested on non-adversarial single-agent domains.
When Not To Use
When you can afford massive parametric RL training with large datasets and compute, since RL may then approximate value functions directly.
When the task is single-step or deterministic and does not benefit from strategic abstraction and search.
Failure Modes
Overfitting strategies to the population of learned opponents instead of general opponents.
LLM-proposed ideas that sound plausible but worsen gameplay; noisy feedback can amplify bad edits.

