TextStarCraft II: a text-based StarCraft II benchmark and a Chain-of-Summarization (CoS) method that helps LLMs plan in real time

December 19, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The system is a solid research prototype: CoS adds practical techniques for cutting LLM latency and context size, and code and data are released, but scripted micro-control and non-visual input limit production use in full-game settings.

Citations: 10

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Yuqiao Wu, Runji Lin, Haifeng Zhang, Jun Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TextStarCraft II and CoS show that LLMs can handle high-level, time-sensitive strategy where visual micro-control is scripted; this enables low-cost experimentation with strategic agents and rapid prototyping of language-driven decision systems.

Summary TLDR

The authors build TextStarCraft II, a text-only StarCraft II environment, and introduce Chain of Summarization (CoS), which combines single-frame summarization, multi-frame summarization, and action extraction so that LLMs can make timely macro decisions. They test closed-source and fine-tuned open LLMs, show that careful prompts and training-data filtering improve win rates against the built-in AI, and demonstrate a fine-tuned Qwen-1.8B model that plays at a Gold-player level against humans. Code and data are released.

Problem Statement

There is no standard benchmark for measuring LLMs on real-time strategic decision-making. StarCraft II is a demanding RTS testbed, but existing interfaces lack natural-language support and fast LLM-friendly summarization mechanisms. The paper creates a text interface and a summarization-based agent loop to let LLMs act at macro strategic timescales.

Main Contribution

TextStarCraft II: a text-based SC2 environment that converts observations to text and accepts language actions.

Chain of Summarization (CoS): single-frame and multi-frame summarization plus action extraction to let LLMs plan every K frames.
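
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: `K`, the observation fields (`time`, `minerals`, `supply_used`, `supply_cap`), and the `fake_llm` stub are all hypothetical, and the multi-frame step here is plain concatenation, which is a simplification chosen for the example.

```python
from collections import deque
from typing import Callable, Optional

K = 4  # hypothetical number of frames per planning period


def summarize_frame(obs: dict) -> str:
    """Single-frame summarization: compress one raw observation to a short line."""
    return (f"t={obs['time']}s minerals={obs['minerals']} "
            f"supply={obs['supply_used']}/{obs['supply_cap']}")


def summarize_period(frame_summaries: list) -> str:
    """Multi-frame summarization (simplified): join K per-frame lines into one prompt."""
    return "Recent game state:\n" + "\n".join(frame_summaries)


def cos_step(buffer: deque, obs: dict, call_llm: Callable[[str], str]) -> Optional[str]:
    """Buffer one observation; every K frames, query the LLM and return its reply."""
    buffer.append(summarize_frame(obs))
    if len(buffer) < K:
        return None  # not enough frames yet, so no LLM call this frame
    prompt = summarize_period(list(buffer))
    buffer.clear()
    return call_llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always proposes the same action."""
    return "Decision: <BUILD SUPPLY DEPOT>"


buffer = deque()
replies = []
for t in range(8):
    obs = {"time": t, "minerals": 50 * t, "supply_used": 12 + t, "supply_cap": 15}
    replies.append(cos_step(buffer, obs, fake_llm))
```

With `K = 4` and eight frames, the stub model is called twice (after frames 3 and 7); every other frame returns `None`, which is where scripted micro-control would keep running on its own.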

Key Findings

Closed-source LLMs using full CoS beat the level-5 built-in AI in just over half of the reported trials.

Numbers: GPT-4: 12/20 wins; GPT3.5-turbo-16k: 11/20 wins (Table 1)

Practical Use: Use off-the-shelf high-capability LLMs with CoS for macro-level RTS decision-making; expect moderate win rates vs hard built-in bots without micro-control improvements.

Evidence Ref: Table 1; Section 5.1

Fine-tuning data quality strongly affects performance: training on high-APU wins boosted win rate.

Numbers: Top 25% APU wins → 54/100; full dataset → 28/100 (Table 2)

Practical Use: When fine-tuning LLMs for gameplay, filter logs by a performance metric (APU) to amplify successful strategies instead of using all data.

Evidence Ref: Table 2; Section 5.1
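
One way the filtering step could look in practice is sketched below. The record fields (`id`, `won`, `apu`) and the sample values are invented for illustration, and the sketch assumes the 25% cut is taken over winning games only; the source does not spell out whether the percentile is computed over wins or over all games.

```python
def top_apu_wins(games: list, fraction: float = 0.25) -> list:
    """Keep only winning games, then the top `fraction` of them ranked by APU."""
    wins = [g for g in games if g["won"]]
    wins.sort(key=lambda g: g["apu"], reverse=True)  # highest APU first
    if not wins:
        return []
    keep = max(1, int(len(wins) * fraction))  # always keep at least one game
    return wins[:keep]


# Hypothetical game logs: 8 wins and 2 losses with made-up APU values.
games = [
    {"id": 1, "won": True, "apu": 0.91},
    {"id": 2, "won": True, "apu": 0.55},
    {"id": 3, "won": False, "apu": 0.97},
    {"id": 4, "won": True, "apu": 0.78},
    {"id": 5, "won": True, "apu": 0.62},
    {"id": 6, "won": True, "apu": 0.85},
    {"id": 7, "won": False, "apu": 0.40},
    {"id": 8, "won": True, "apu": 0.70},
    {"id": 9, "won": True, "apu": 0.66},
    {"id": 10, "won": True, "apu": 0.59},
]

selected = top_apu_wins(games)
selected_ids = [g["id"] for g in selected]
```

Note that game 3 has the highest APU in the sample but is a loss, so it is excluded before ranking; only games 1 and 6 survive the cut.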

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Win Rate (GPT-4 vs lv5 built-in AI) | 12/20 | - | - | Table 1 (full CoS) | GPT-4 achieved 12 wins in 20 games using full CoS | Table 1
Win Rate (GPT3.5-turbo-16k vs lv5 built-in AI) | 11/20 | - | - | Table 1 (full CoS) | GPT3.5-turbo-16k achieved 11 wins in 20 games using full CoS | Table 1
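
Because each result above rests on only 20 games, it can help to put an uncertainty range around the win rates. The sketch below uses the standard Wilson score interval at roughly 95% confidence (`z = 1.96`); this analysis is an addition for the reader, not something reported in the paper.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half


# Win counts taken from the two rows above.
results = {"GPT-4": (12, 20), "GPT3.5-turbo-16k": (11, 20)}
intervals = {name: wilson_interval(w, n) for name, (w, n) in results.items()}

for name, (low, high) in intervals.items():
    wins, games = results[name]
    print(f"{name}: {wins / games:.0%} win rate, interval {low:.2f} to {high:.2f}")
```

Both intervals span 0.5, so at this sample size the two models cannot be cleanly separated from each other or from an even match against the level-5 AI.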

What To Try In 7 Days

Run TextStarCraft II locally and replay a few games to inspect raw text observations and L1/L2 summaries.

Implement single-frame summarization rules to compress observations into concise inputs for an LLM.

Fine-tune a small open LLM on a top-APU subset of good-game logs and evaluate win rate vs built-in AI.
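
For the single-frame summarization step, a rule-based compressor might start out like the sketch below. The observation layout (`game_time`, `resources`, `units`, `enemy_units`) is a hypothetical stand-in; the real TextStarCraft II observation format should be inspected first and the rules adapted to it.

```python
def compress_observation(obs: dict) -> str:
    """Rule-based single-frame summary: keep counts, drop empty or zero entries."""
    parts = [f"time {obs['game_time']}"]

    resources = obs.get("resources", {})
    if resources:
        parts.append("resources " + ", ".join(f"{k}={v}" for k, v in resources.items()))

    # Rule: units with a zero count carry no information for the LLM, so skip them.
    units = {name: n for name, n in obs.get("units", {}).items() if n > 0}
    if units:
        parts.append("units " + ", ".join(f"{n} {name}" for name, n in sorted(units.items())))

    # Rule: only mention enemies when something has actually been scouted.
    enemies = {name: n for name, n in obs.get("enemy_units", {}).items() if n > 0}
    if enemies:
        parts.append("enemy " + ", ".join(f"{n} {name}" for name, n in sorted(enemies.items())))

    return "; ".join(parts)


raw = {
    "game_time": "02:30",
    "resources": {"minerals": 350, "gas": 100},
    "units": {"Probe": 22, "Zealot": 0, "Stalker": 3},
    "enemy_units": {},
}
summary = compress_observation(raw)
# summary == "time 02:30; resources minerals=350, gas=100; units 22 Probe, 3 Stalker"
```

Keeping the rules this explicit makes it easy to compare the compressed line against the raw text observation while replaying games.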

Agent Features

Memory
Short-term raw observation queue (K frames); Summarization cache (period summaries)
Planning
Multi-frame planning (period summaries every K frames); Action queue scheduling
Tool Use
python-sc2; regex-based action extractor; rule-based micro-action scripts
Frameworks
Chain of Summarization (CoS); Chain-of-Thought (CoT) used inside CoS
Is Agentic

Yes

Architectures
LLM-driven macro policy + rule-based micro-scripts; CoS summarization pipeline
Collaboration
Human-AI matches; Expert evaluation (double-blind)
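
A regex-based action extractor of the kind listed under Tool Use could be sketched as follows. The angle-bracket reply format and the `KNOWN_ACTIONS` set are assumptions made for this example, not the format or action space used by the released code.

```python
import re

# Hypothetical action vocabulary; the real environment defines its own action space.
KNOWN_ACTIONS = {"BUILD PYLON", "TRAIN PROBE", "BUILD GATEWAY"}

# Assumed reply convention: each action appears in upper case inside angle brackets.
ACTION_PATTERN = re.compile(r"<([A-Z][A-Z ]*)>")


def extract_actions(llm_text: str) -> tuple:
    """Return (valid, rejected) action names found in the LLM reply, in order."""
    candidates = [match.strip() for match in ACTION_PATTERN.findall(llm_text)]
    valid = [a for a in candidates if a in KNOWN_ACTIONS]
    rejected = [a for a in candidates if a not in KNOWN_ACTIONS]
    return valid, rejected


reply = "Decisions:\n1: <TRAIN PROBE>\n2: <BUILD PYLON>\n3: <SUMMON DRAGON>"
valid, rejected = extract_actions(reply)
```

Returning the rejected names separately makes it possible to log or count proposals that the extractor cannot map to a real action instead of silently dropping them.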

Optimization Features

Token Efficiency
Single-frame compression to reduce input size
System Optimization
Action queue to batch K actions between LLM calls
Training Optimization
Fine-tuning on filtered high-APU wins
Inference Optimization
Multi-frame summarization to reduce LLM call frequency
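
The action-queue idea can be illustrated with a small sketch: the LLM is consulted once per period of K frames, and the queue hands out one action per frame in between. The class name, the `NO_OP` fallback, and the placeholder plan strings are hypothetical.

```python
from collections import deque


class ActionQueue:
    """FIFO buffer that releases one planned action per game frame."""

    def __init__(self, fallback: str = "NO_OP"):
        self._queue = deque()
        self._fallback = fallback

    def refill(self, actions) -> None:
        """Append a batch of actions produced by one LLM call."""
        self._queue.extend(actions)

    def next_action(self) -> str:
        """Pop the oldest action, or return the fallback when the queue is empty."""
        return self._queue.popleft() if self._queue else self._fallback


def llm_calls_needed(total_frames: int, k: int) -> int:
    """LLM calls required when planning once every k frames (ceiling division)."""
    return -(-total_frames // k)


K = 3
queue = ActionQueue()
executed = []
for frame in range(7):
    if frame % K == 0:
        # Stand-in for an LLM call that returns K actions for the coming period.
        queue.refill([f"plan{frame // K}-step{i}" for i in range(K)])
    executed.append(queue.next_action())
```

In this toy run, seven frames need only three planning calls, which is the source of the call-frequency reduction: per-frame cost is amortized over K frames.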

Risks & Boundaries

Limitations

Relies on rule-based micro-scripts for unit-level control; not an end-to-end visual/micro solution.

Text-only observations omit pixel/vision data, reducing fidelity vs full SC2 agents.

When Not To Use

When you need end-to-end visual micro-management (use full RL or visual agents).

When micro-level reaction time or precise unit control is required.

Failure Modes

Hallucinated or infeasible action proposals from the LLM that the action extractor cannot map.

Overfit to dataset artifacts (fine-tuned models adopted a single repetitive strategy).

Core Entities

Models

GPT-4, GPT3.5-turbo-16k, Gemini-Pro, Claude2.1, GLM4, Llama2-70B, ChatGLM3-6B, Qwen-1.8B, Qwen-7B, Llama2-7B

Metrics

Win Rate, Population Block Ratio (PBR), Resource Utilization Ratio (RUR), Average Population Utilization (APU), Technology Rate (TR)

Datasets

TextStarCraft II interaction logs, Wins subset (APU filtered), Full dataset (all games)

Benchmarks

TextStarCraft II