TextStarCraft II: a text-based StarCraft II benchmark and a Chain-of-Summarization (CoS) method that helps LLMs plan in real time

Overview

Decision Snapshot: Needs Validation

The system is a solid research prototype: CoS adds practical techniques for reducing latency and context length, and code and data are released, but scripted micro-control and non-visual input limit production use in full-game settings.

Citations: 10

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Yuqiao Wu, Runji Lin, Haifeng Zhang, Jun Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TextStarCraft II and CoS show that LLMs can handle high-level, time-sensitive strategy where visual micro-control is scripted; this enables low-cost experimentation with strategic agents and rapid prototyping of language-driven decision systems.

Who Should Care

ML Engineer, Data Scientist, Product Manager

Summary TLDR

The authors build TextStarCraft II, a text-only StarCraft II environment, and introduce Chain of Summarization (CoS), which combines single-frame summarization, multi-frame summarization, and action extraction so that LLMs can make timely macro decisions. They test closed-source and fine-tuned open LLMs, show that careful prompts and training-data filtering improve win rates against the built-in AI, and demonstrate a fine-tuned Qwen-1.8B model that plays at a Gold-player level against humans. Code and data are released.

Problem Statement

There is no standard benchmark for measuring LLMs on real-time strategic decision-making. StarCraft II is a demanding RTS testbed, but existing interfaces lack natural-language support and fast LLM-friendly summarization mechanisms. The paper creates a text interface and a summarization-based agent loop to let LLMs act at macro strategic timescales.

Main Contribution

TextStarCraft II: a text-based SC2 environment that converts observations to text and accepts language actions.

Chain of Summarization (CoS): single-frame and multi-frame summarization plus action extraction to let LLMs plan every K frames.
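The CoS cadence described above (summarize each frame, then consult the LLM once every K frames) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `run_cos_loop`, `summarize_frame`, and `summarize_period`, the dictionary observation format, and the value of `K` are all hypothetical, and `llm` is any callable that maps a prompt string to a reply string.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List

K = 10  # assumed planning period in frames; the source only calls it "K"


def summarize_frame(obs: Dict[str, int]) -> str:
    """Single-frame step: turn one observation into a compact text line."""
    return ", ".join(f"{name}={value}" for name, value in sorted(obs.items()))


def summarize_period(frame_lines: List[str]) -> str:
    """Multi-frame step: join the buffered frame lines into one LLM prompt."""
    numbered = [f"frame {i + 1}: {line}" for i, line in enumerate(frame_lines)]
    return "\n".join(numbered)


def run_cos_loop(
    observations: Iterable[Dict[str, int]],
    llm: Callable[[str], str],
    k: int = K,
) -> List[str]:
    """Call the LLM once per k frames instead of once per frame."""
    buffer: Deque[str] = deque(maxlen=k)
    replies: List[str] = []
    for obs in observations:
        buffer.append(summarize_frame(obs))
        if len(buffer) == k:
            replies.append(llm(summarize_period(list(buffer))))
            buffer.clear()
    return replies
```

With 25 frames and `k=10`, the loop makes two LLM calls and leaves the last five frames buffered. A real agent would presumably also carry forward some state between periods, which this sketch omits.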

Key Findings

Closed-source LLMs using full CoS beat the level-5 built-in AI in many trials.

Numbers: GPT-4: 12/20 wins, GPT3.5: 11/20 (Table 1)

Practical Use: Use off-the-shelf high-capability LLMs with CoS for macro-level RTS decision-making; expect moderate win rates vs hard built-in bots without micro-control improvements.

Evidence Ref: Table 1; Section 5.1

Fine-tuning data quality strongly affects performance: training on high-APU wins boosted win rate.

Numbers: Top 25% APU wins → 54/100; full dataset → 28/100 (Table 2)

Practical Use: When fine-tuning LLMs for gameplay, filter logs by a performance metric (APU) to amplify successful strategies instead of using all data.

Evidence Ref: Table 2; Section 5.1
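The data-filtering idea behind this finding might look like the sketch below. It assumes that "top 25% APU wins" means keeping only winning games and then the highest-APU quarter of them; the `GameLog` record and the `top_apu_wins` helper are hypothetical names introduced for illustration, not part of the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GameLog:
    game_id: str
    won: bool
    apu: float  # Average Population Utilization for the whole game


def top_apu_wins(logs: List[GameLog], keep_fraction: float = 0.25) -> List[GameLog]:
    """Keep only wins, then keep the highest-APU fraction of those wins."""
    wins = sorted((g for g in logs if g.won), key=lambda g: g.apu, reverse=True)
    if not wins:
        return []
    keep = max(1, int(len(wins) * keep_fraction))  # always keep at least one game
    return wins[:keep]
```

Losses are discarded before ranking, so a lost game with high APU never enters the training subset.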

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Win Rate (GPT-4 vs lv5 built-in AI)	12/20	—	—	Table 1 (full CoS)	GPT-4 achieved 12 wins in 20 games using full CoS	Table 1
Win Rate (GPT3.5-turbo-16k vs lv5 built-in AI)	11/20	—	—	Table 1 (full CoS)	GPT3.5-turbo-16k achieved 11 wins in 20 games using full CoS	Table 1

What To Try In 7 Days

Run TextStarCraft II locally and replay a few games to inspect raw text observations and L1/L2 summaries.

Implement single-frame summarization rules to compress observations into concise inputs for an LLM.

Fine-tune a small open LLM on a top-APU subset of good-game logs and evaluate win rate vs built-in AI.

Agent Features

Memory

Short-term raw observation queue (K frames)
Summarization cache (period summaries)

Planning

Multi-frame planning (period summaries every K frames)
Action queue scheduling

Tool Use

python-sc2
regex-based action extractor
rule-based micro-action scripts
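A regex-based action extractor of the kind listed here could be sketched as follows. The angle-bracket reply convention, the `ACTION_PATTERN` expression, and the three-entry `KNOWN_ACTIONS` vocabulary are assumptions made for the example; the real environment defines its own action set and reply format.

```python
import re
from typing import List, Set

# Hypothetical action vocabulary; the real environment defines its own set.
KNOWN_ACTIONS: Set[str] = {"BUILD PYLON", "TRAIN PROBE", "BUILD GATEWAY"}

# Assumed reply convention: actions appear in angle brackets, e.g. <BUILD PYLON>.
ACTION_PATTERN = re.compile(r"<([A-Z][A-Z ]*[A-Z])>")


def extract_actions(reply: str, known: Set[str] = KNOWN_ACTIONS) -> List[str]:
    """Return bracketed actions in order, dropping any not in the vocabulary."""
    # Collapse repeated spaces so "<BUILD  PYLON>" still matches the vocabulary.
    candidates = [" ".join(match.split()) for match in ACTION_PATTERN.findall(reply)]
    return [action for action in candidates if action in known]
```

Checking each match against a fixed vocabulary is one simple way to discard hallucinated or infeasible proposals before they reach the game.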

Frameworks

Chain of Summarization (CoS)
Chain-of-Thought (CoT) used inside CoS

Is Agentic

Yes

Architectures

LLM-driven macro policy + rule-based micro-scripts
CoS summarization pipeline

Collaboration

Human-AI matches
Expert evaluation (double-blind)

Optimization Features

Token Efficiency

Single-frame compression to reduce input size
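One plausible compression rule, shown only as a sketch, is to drop fields that carry no information before rendering the observation as text. The `compress_observation` helper and the field names in `raw` are hypothetical; the source does not spell out the actual rules.

```python
from typing import Dict


def compress_observation(obs: Dict[str, int]) -> str:
    """Drop zero-valued fields and render the rest as one sorted line."""
    kept = {name: value for name, value in obs.items() if value != 0}
    return "; ".join(f"{name}: {value}" for name, value in sorted(kept.items()))


# Hypothetical raw observation with several empty fields.
raw = {"minerals": 320, "gas": 0, "zealot": 4, "stalker": 0, "carrier": 0}
compact = compress_observation(raw)  # "minerals: 320; zealot: 4"
```

Here five fields shrink to two, and the saving grows with the number of unit and building types that are tracked but absent.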

System Optimization

Action queue to batch K actions between LLM calls
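The batching idea can be illustrated with a small queue that hands out one action per game step and calls the planner again only when it runs dry. The `ActionQueue` class, the `plan` callable that stands in for an LLM call, and the `"NO_OP"` fallback are all assumptions made for this sketch.

```python
from collections import deque
from typing import Callable, Deque, List


class ActionQueue:
    """Serve one action per step and refill from the planner only when empty."""

    def __init__(self, plan: Callable[[], List[str]]) -> None:
        self._plan = plan  # stands in for an LLM call that returns a batch
        self._pending: Deque[str] = deque()
        self.planner_calls = 0

    def next_action(self) -> str:
        if not self._pending:
            self._pending.extend(self._plan())
            self.planner_calls += 1
        # Hypothetical fallback when the planner returns an empty batch.
        return self._pending.popleft() if self._pending else "NO_OP"
```

If the planner returns three actions per call, six game steps cost two planner calls rather than six.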

Training Optimization

Fine-tuning on filtered high-APU wins

Inference Optimization

Multi-frame summarization to reduce LLM call frequency

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://anonymous.4open.science/r/Large-Language-Models-play-StarCraftII-8C45/readme.md

Data URLs

https://anonymous.4open.science/r/Large-Language-Models-play-StarCraftII-8C45/readme.md

Risks & Boundaries

Limitations

Relies on rule-based micro-scripts for unit-level control; it is not an end-to-end visual/micro solution.

Text-only observations omit pixel/vision data, reducing fidelity vs full SC2 agents.

When Not To Use

When you need end-to-end visual micro-management (use full RL or visual agents).

When micro-level reaction time or precise unit control is required.

Failure Modes

Hallucinated or infeasible action proposals from the LLM that the action extractor cannot map.

Overfit to dataset artifacts (fine-tuned models adopted a single repetitive strategy).

Core Entities

Models

GPT-4, GPT3.5-turbo-16k, Gemini-Pro, Claude2.1, GLM4, Llama2-70B, ChatGLM3-6B, Qwen-1.8B, Qwen-7B, Llama2-7B

Metrics

Win Rate, Population Block Ratio (PBR), Resource Utilization Ratio (RUR), Average Population Utilization (APU), Technology Rate (TR)

Datasets

TextStarCraft II interaction logs, Wins subset (APU filtered), Full dataset (all games)

Benchmarks

TextStarCraft II

