WorldCoder: have an LLM write Python world models, plan with them, and learn much faster than deep RL

February 19, 2024 | 8 min

Overview

Production Readiness

0.45

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

2

Authors

Hao Tang, Darren Key, Kevin Ellis

Why It Matters For Business

If environment interactions are expensive, having an LLM generate an inspectable Python simulator can cut trial costs by orders of magnitude and centralize expensive LLM calls into a one-time synthesis step.

Summary TLDR

WorldCoder uses an LLM (GPT‑4) to synthesize Python programs that implement a transition function and a reward function. The agent plans with that program (MCTS or value iteration) and enforces an "optimism under uncertainty" constraint: among models consistent with the data, it prefers those under which reward is reachable, which induces goal-directed exploration. Evaluated on Sokoban, MiniGrid, and AlfWorld, WorldCoder learns useful, human-readable world-model code (250+ lines in some tasks), solves basic puzzles with tens of interactions (e.g., ~50 actions in Sokoban), and needs far fewer environment interactions than deep RL and far fewer LLM calls than LLM agents that act at every step.
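
To make "a Python program that implements a transition function and a reward function" concrete, here is a minimal hand-written sketch of what such a world model could look like for a one-box, Sokoban-like grid. The `State` fields, the action names, and the reward convention are assumptions chosen for illustration; they are not taken from the paper's synthesized code.

```python
from typing import FrozenSet, NamedTuple, Tuple

Pos = Tuple[int, int]

class State(NamedTuple):
    """Hypothetical symbolic state: positions are (row, col) pairs."""
    agent: Pos
    box: Pos
    target: Pos
    walls: FrozenSet[Pos]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def transition(state: State, action: str) -> State:
    """Deterministic dynamics: the agent moves and pushes the box if it is in the way."""
    dr, dc = MOVES[action]
    nxt = (state.agent[0] + dr, state.agent[1] + dc)
    if nxt in state.walls:
        return state  # blocked by a wall, nothing changes
    if nxt == state.box:
        pushed = (nxt[0] + dr, nxt[1] + dc)
        if pushed in state.walls:
            return state  # the box cannot be pushed into a wall
        return state._replace(agent=nxt, box=pushed)
    return state._replace(agent=nxt)

def reward(state: State, action: str, next_state: State) -> Tuple[float, bool]:
    """Return (reward, done): 1.0 and done once the box rests on the target."""
    solved = next_state.box == next_state.target
    return (1.0 if solved else 0.0), solved
```

A planner only needs these two functions: it calls `transition` to imagine the next state and `reward` to score it, with no LLM call per step.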

Problem Statement

Can an LLM build a reusable, human-readable world model from a small number of interactions so a planner can achieve many goals? Challenges: searching the space of programs is combinatorial; exploration is hard with sparse rewards; and models should transfer across environments and new natural-language goals.

Main Contribution

WorldCoder architecture: LLM-guided program synthesis of Python world models (transition + reward) used by planners.

An optimism-under-uncertainty objective formulated as a logical constraint that drives goal-directed exploration and yields polynomial sample complexity (Theorem 2.4).

Practical transfer and refinement workflow: LLM refinement (REx prioritization) allows fast reuse and editing of code across grid and robot planning domains.
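
The optimism constraint can be read as an extra acceptance test on candidate programs: a candidate must reproduce the observed data, and it must also admit some action sequence from the initial state that reaches reward. The sketch below is one way to express that check, assuming a deterministic model, hashable states, and a bounded search horizon; the function names and the breadth-first search are illustrative choices, not the paper's implementation.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Sequence, Tuple

Transition = Callable[[Hashable, str], Hashable]
Reward = Callable[[Hashable, str, Hashable], Tuple[float, bool]]
# One observed step: (state, action, reward, next_state, done)
Step = Tuple[Hashable, str, float, Hashable, bool]

def fits_data(T: Transition, R: Reward, data: Iterable[Step]) -> bool:
    """True if the candidate model reproduces every observed step exactly."""
    return all(T(s, a) == s2 and R(s, a, s2) == (r, d) for (s, a, r, s2, d) in data)

def is_optimistic(T: Transition, R: Reward, s0: Hashable,
                  actions: Sequence[str], horizon: int) -> bool:
    """True if, according to the model, some plan of length <= horizon earns reward."""
    frontier = deque([(s0, 0)])
    seen = {s0}
    while frontier:
        s, depth = frontier.popleft()
        if depth == horizon:
            continue
        for a in actions:
            s2 = T(s, a)
            r, done = R(s, a, s2)
            if r > 0:
                return True
            if not done and s2 not in seen:
                seen.add(s2)
                frontier.append((s2, depth + 1))
    return False

def acceptable(T, R, data, s0, actions, horizon) -> bool:
    """Accept a candidate only if it fits the data and is optimistic."""
    return fits_data(T, R, data) and is_optimistic(T, R, s0, actions, horizon)

# Toy example: integer chain, two candidate reward functions.
def chain_T(s, a):
    return s + 1 if a == "inc" else max(0, s - 1)

def reward_at_3(s, a, s2):
    return (1.0, True) if s2 == 3 else (0.0, False)

def never_rewards(s, a, s2):
    return (0.0, False)

observed = [(0, "inc", 0.0, 1, False)]
print(acceptable(chain_T, reward_at_3, observed, 0, ["inc", "dec"], 5))
print(acceptable(chain_T, never_rewards, observed, 0, ["inc", "dec"], 5))
```

Both candidates explain the single observed step, but only the first one claims reward is reachable, so only it passes. Planning with that candidate then sends the agent toward the claimed reward, which either succeeds or produces data that refutes the model.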

Key Findings

WorldCoder builds a Sokoban world model from very few interactions.

Numbers: ≈50 environment actions to build initial model (Sokoban)

A ReAct-style LLM agent with the same pretraining struggles to play Sokoban.

Numbers: 15% ± 8% success on basic Sokoban levels

Deep RL needs orders of magnitude more interactions on Sokoban.

Numbers: >1e6 environment steps for basic 2-box Sokoban (deep RL baselines)

WorldCoder can produce large, actionable models for complex domains.

Numbers: Synthesized world model >250 lines (AlfWorld)

Optimism objective gives formal sample-efficiency guarantees.

Numbers: Max actions ≤ D_{S,A,T} × (K_{T×R} + 1) (Theorem 2.4)
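
As a worked reading of the bound, the block below treats both quantities as opaque constants from Theorem 2.4. The numeric values are hypothetical and only show the arithmetic.

```latex
\[
N_{\text{actions}} \;\le\; D_{S,A,T}\,\bigl(K_{T\times R} + 1\bigr),
\qquad \text{e.g. } D_{S,A,T} = 20,\; K_{T\times R} = 4
\;\Rightarrow\; N_{\text{actions}} \le 20 \times 5 = 100 .
\]
```

Because the bound is a product, it grows linearly in each quantity when the other is held fixed.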

Results

Success rate (ReAct baseline, basic Sokoban)

Value: 15% ± 8%

Environment steps to learn (deep RL)

Value: >1e6 steps

Environment steps to build initial model (WorldCoder)

Value: ≈50 steps

Episodes/steps to first reward (AlfWorld tasks)

Value: ≈1 episode, with ~20 exploratory steps to first reward

Synthesized model size

Value: 250+ lines of Python

Amortized LLM calls per action after learning

Value: ≤1 LLM call per task (amortized), compared to O(T) for ReAct

Baseline: ReAct-style agents

Who Should Care

What To Try In 7 Days

Run a small prototype: collect 50–200 symbolic interactions in a toy grid and prompt GPT‑4 to synthesize transition+reward code.

Add an optimism constraint: when multiple reward models fit data, prefer models that imply achievable intermediate rewards and test whether exploration becomes goal-directed.

Measure amortized LLM cost: compare LLM-token spend for per-action prompting (ReAct) versus one-time model synthesis + planning.
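
For the cost comparison in the last step, a back-of-the-envelope calculator is usually enough to start. The sketch below assumes one LLM call per action for per-action prompting and a fixed synthesis budget plus occasional refinements for the world-model approach. Every number in the example (task count, call counts, tokens per call, price) is a hypothetical placeholder to be replaced with your own measurements.

```python
def react_llm_calls(num_tasks: int, actions_per_task: int) -> int:
    """Per-action prompting: one LLM call for every action taken."""
    return num_tasks * actions_per_task

def synthesis_llm_calls(initial_synthesis_calls: int,
                        refinements_per_task: float,
                        num_tasks: int) -> float:
    """One-time synthesis plus (possibly fractional) average refinements per task."""
    return initial_synthesis_calls + refinements_per_task * num_tasks

def token_cost(calls: float, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Dollar cost for a number of calls at a flat per-token price."""
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens

# Hypothetical workload: 100 tasks, 40 actions each.
per_action = react_llm_calls(num_tasks=100, actions_per_task=40)
one_time = synthesis_llm_calls(initial_synthesis_calls=30,
                               refinements_per_task=0.5,
                               num_tasks=100)

print("per-action calls:", per_action)
print("synthesis calls:", one_time)
print("per-action cost (USD):", token_cost(per_action, 1500, 0.03))
print("synthesis cost (USD):", token_cost(one_time, 4000, 0.03))
```

Synthesis prompts tend to be longer than per-action prompts, so compare tokens rather than call counts alone; the example gives the two approaches different `tokens_per_call` values for that reason.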

Agent Features

Memory

  • Replay buffer D of (s,a,r,s',c,d)
  • D_sc: stored initial states and contexts used for optimism constraint
  • Program code retained and refined across tasks
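
The three memory components above map naturally onto a small container type. The sketch below is one possible layout, assuming hashable symbolic states and a natural-language context string; the class and field names are hypothetical and only mirror the tuple (s,a,r,s',c,d) and the D_sc set described in the list.

```python
from dataclasses import dataclass, field
from typing import Hashable, List, Tuple

@dataclass(frozen=True)
class Experience:
    """One replay entry, mirroring (s, a, r, s', c, d)."""
    state: Hashable
    action: str
    reward: float
    next_state: Hashable
    context: str   # e.g. the natural-language goal
    done: bool

@dataclass
class AgentMemory:
    replay: List[Experience] = field(default_factory=list)                      # D
    initial_contexts: List[Tuple[Hashable, str]] = field(default_factory=list)  # D_sc
    program_source: str = ""   # current world-model code, kept across tasks

    def start_episode(self, initial_state: Hashable, context: str) -> None:
        """Remember each distinct (initial state, context) pair for the optimism check."""
        pair = (initial_state, context)
        if pair not in self.initial_contexts:
            self.initial_contexts.append(pair)

    def record(self, exp: Experience) -> None:
        """Append one observed step to the replay buffer."""
        self.replay.append(exp)

memory = AgentMemory()
memory.start_episode(initial_state=0, context="reach cell 3")
memory.record(Experience(0, "inc", 0.0, 1, "reach cell 3", False))
print(len(memory.replay), len(memory.initial_contexts))
```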

Planning

  • Monte Carlo Tree Search (deterministic, heuristic-backed)
  • depth-limited value iteration
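
To illustrate the second planner, here is a minimal sketch of depth-limited value iteration over a deterministic model, written as memoized lookahead. It assumes hashable states, a reward function that returns `(reward, done)`, and a discount factor; those interface choices and the toy chain environment are assumptions made for the example.

```python
from functools import lru_cache
from typing import Callable, Hashable, Sequence, Tuple

def plan(T: Callable[[Hashable, str], Hashable],
         R: Callable[[Hashable, str, Hashable], Tuple[float, bool]],
         state: Hashable,
         actions: Sequence[str],
         depth: int,
         gamma: float = 0.95) -> str:
    """Return the action with the best depth-limited discounted return."""

    @lru_cache(maxsize=None)
    def value(s: Hashable, d: int) -> float:
        # Best return obtainable from s with d steps of lookahead left.
        if d == 0:
            return 0.0
        return max(q_value(s, a, d) for a in actions)

    def q_value(s: Hashable, a: str, d: int) -> float:
        s2 = T(s, a)
        r, done = R(s, a, s2)
        return r if done else r + gamma * value(s2, d - 1)

    return max(actions, key=lambda a: q_value(state, a, depth))

# Toy deterministic chain: reward on reaching cell 3.
def chain_T(s, a):
    return s + 1 if a == "inc" else max(0, s - 1)

def chain_R(s, a, s2):
    return (1.0, True) if s2 == 3 else (0.0, False)

print(plan(chain_T, chain_R, state=0, actions=["inc", "dec"], depth=5))
```

Memoizing on `(state, remaining depth)` keeps the cost proportional to the number of distinct reachable states times the depth, rather than to the number of action sequences. When every action has the same value, `max` returns the first action in `actions`.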

Tool Use

  • Python world model (transition and reward subroutines)
  • LLM self-debugging / refinement (GPT-4)
  • BM25 heuristic for text-to-trajectory matching

Frameworks

  • REx refinement prioritization
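
Refinement prioritization can be framed as a bandit problem: each candidate program is an arm, and the question is which one to hand back to the LLM for another repair attempt. The sketch below uses Thompson sampling with Beta distributions in that spirit. The `Candidate` class, the heuristic score (fraction of replayed transitions reproduced), the constant `C`, and the exact prior shape are assumptions for illustration and are not confirmed details of REx.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    name: str
    heuristic: float      # assumed score in [0, 1], e.g. fraction of data explained
    refinements: int = 0  # how many times this program was already refined

def pick_to_refine(candidates: List[Candidate],
                   rng: random.Random,
                   C: float = 5.0) -> Candidate:
    """Thompson sampling: draw one sample per candidate and refine the argmax."""
    def draw(c: Candidate) -> float:
        alpha = 1.0 + C * c.heuristic
        # Each past refinement counts as a failure, so stale programs fade out.
        beta = 1.0 + C * (1.0 - c.heuristic) + c.refinements
        return rng.betavariate(alpha, beta)

    chosen = max(candidates, key=draw)
    chosen.refinements += 1
    return chosen

rng = random.Random(0)
pool = [Candidate("nearly_right", 0.9), Candidate("mostly_wrong", 0.1)]
for _ in range(5):
    print(pick_to_refine(pool, rng).name)
```

Sampling rather than always taking the highest heuristic keeps some exploration: a low-scoring program is still refined occasionally, and a high-scoring one that has been refined many times without success loses priority.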

Is Agentic

true

Architectures

  • model-based agent
  • LLM-guided program synthesis

Optimization Features

Token Efficiency

  • Front-load LLM tokens to synthesize model; then zero LLM tokens per action

Training Optimization

  • Program refinement instead of full search reduces LLM calls

Inference Optimization

  • Amortizes LLM cost by using synthesized simulator for many actions

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Assumes deterministic environment dynamics; does not model stochastic transitions.
  • Requires symbolic, discrete state representations (not raw pixels) or reliable object detectors.
  • Relies on strong code-capable LLM (GPT‑4); poorer LLMs may fail to synthesize correct models.
  • Planning becomes the bottleneck on very hard instances; world model alone does not solve intractable planning tasks.

When Not To Use

  • Tasks without symbolic state extraction or where object detectors are unavailable.
  • Highly stochastic environments where deterministic program models are inappropriate.
  • Problems requiring continuous control at high frequency without a symbolic abstraction.
  • Settings without access to a powerful code-writing LLM or where LLM costs prohibit synthesis.

Failure Modes

  • LLM synthesizes incorrect transition or reward logic, leading to misleading optimism and wasted exploration.
  • Optimistic models that are too wrong cause repeated failed plans before being corrected.
  • Complex planning horizons (e.g., hard Sokoban levels) can make solver search fail even with a correct model.
  • Symbol grounding errors: bad object detection or wrong symbolic inputs break the synthesized program.

Core Entities

Models

  • GPT-4

Metrics

  • environment steps (sample complexity)
  • task success rate
  • LLM calls / tokens (compute cost)
  • episodes-to-first-reward

Datasets

  • Sokoban (gym-sokoban)
  • MiniGrid
  • ALFWorld

Benchmarks

  • Sokoban
  • MiniGrid
  • AlfWorld

Context Entities

Models

  • ReAct-style LLM agents
  • PPO
  • DreamerV3