C-MCTS: prune unsafe MCTS branches using an offline-trained safety critic to plan closer to safety limits

May 25, 20237 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

Dinesh Parthasarathy, Georgios Kontes, Axel Plinge, Christopher Mutschler

Links

Abstract / PDF

Why It Matters For Business

Pretraining a safety critic in a realistic simulator lets planners run faster and closer to safety limits with fewer violations, which reduces costly failures and improves mission rewards in safety-critical decision systems.

Summary TLDR

C-MCTS augments Monte Carlo Tree Search with a safety critic trained offline in a high-fidelity simulator. During planning the critic predicts expected future cost and prunes tree branches likely to violate a cost constraint. This reduces variance in cost estimates, speeds up planning (fewer iterations to reach deep search), and lets the agent operate closer to the safety boundary with fewer constraint violations under model mismatch. Evaluated on Rocksample and a Safe Gridworld, C-MCTS outperforms a prior constrained MCTS baseline (CC-MCP) on reward and safety in the tested scenarios.

Problem Statement

Standard MCTS optimizes reward only and cannot enforce safety constraints. Prior constrained MCTS (CC-MCP) tunes a Lagrange multiplier online and relies on Monte Carlo cost estimates that have high variance and are sensitive to model mismatch. The paper asks: can we pretrain a safety critic to predict costs and prune unsafe branches during MCTS deployment, so planning is faster, less conservative, and more robust to simulator mismatch?

Main Contribution

C-MCTS algorithm that uses an ensemble safety critic trained offline to predict state-action costs and prune unsafe MCTS branches at expansion time.

A guided data-collection loop that varies Lagrange multipliers offline to gather cost-critical state-action samples for critic training.

Empirical evaluation on Rocksample and Safe Gridworld showing higher rewards, fewer cost violations, and deeper search trees per planning budget compared to CC-MCP.

Key Findings

C-MCTS achieves higher average rewards than CC-MCP on evaluated Rocksample instances.

NumbersRocksample(7,8): reward 11.0 vs 9.83; Rocksample(11,11): 7.14 vs 5.26 (Table 3)

C-MCTS incurs fewer cost violations while operating closer to the cost limit.

NumbersEvaluated over 100 episodes: lower total cost violations and cost stays below constraint (Fig.2 middle/bottom rows)

C-MCTS reaches deeper search-tree depths with the same number of planning iterations.

NumbersPeak tree depth highest for C-MCTS in Rocksample experiments (Fig.5 averaged over 100 episodes)

Results

discounted reward

Value11.0

BaselineCC-MCP 9.83

discounted reward

Value7.14

BaselineCC-MCP 5.26

cost violations

Valuefewer violations than CC-MCP

BaselineCC-MCP

search tree depth

Valuehigher peak depth

BaselineCC-MCP and MCTS

Who Should Care

What To Try In 7 Days

Implement a small ensemble safety critic (TD-trained) for a simulator you already have and plug it into MCTS expansion to prune high-cost branches.

Collect offline trajectories under varied safety penalties (vary Lagrange λ) to cover cost-critical states before training the critic.

Run ablations with ensemble uncertainty threshold to balance safety vs conservatism (raise σ_max if overly conservative).

Agent Features

Planning

  • safety-guided pruning at expansion
  • offline critic + online low-fidelity planner

Tool Use

  • high-fidelity simulator for offline data
  • low-fidelity simulator for online planning

Frameworks

  • SARSA(0) for critic training
  • ensemble uncertainty (mean, std) for OOD detection

Is Agentic

true

Architectures

  • Monte Carlo Tree Search
  • ensemble safety critic (neural nets)

Optimization Features

Infra Optimization

  • CPU-only implementation reported; expansion predictions parallelizable

System Optimization

  • use low-fidelity model for fast online planning and high-fidelity model offline for safety
  • ensemble predictions parallelizable for GPU

Training Optimization

  • offline pretraining of safety critic using TD targets
  • guided data collection by varying Lagrange λ to hit cost-critical states

Inference Optimization

  • LoRA
  • fewer planning iterations to reach deeper tree

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on access to a higher-fidelity simulator for offline critic training.
  • Critic can under- or over-estimate costs; under-estimates cause unsafe rollouts that require retraining.
  • Performance depends on ensemble thresholding; too tight thresholds make the agent overly conservative.

When Not To Use

  • When you lack a realistic offline simulator to collect cost-critical samples.
  • When you cannot afford retraining after unsafe deployments or collecting more data.
  • When online compute allows using a high-fidelity model directly for planning.

Failure Modes

  • Under-estimated costs by critic lead to unsafe real-world executions until retrained.
  • Over-estimation leads to overly conservative behavior and lower rewards.
  • Large simulation-to-reality mismatch causes the critic to trust inaccurate sensors and move away from constraints.

Core Entities

Models

  • C-MCTS
  • MCTS (vanilla)
  • CC-MCP (Cost-Constrained Monte Carlo Planning)
  • ensemble safety critic (neural networks)

Metrics

  • discounted reward
  • discounted cost
  • number of cost violations
  • planning iterations (simulations)
  • search tree depth

Datasets

  • Rocksample (grid planning environment)
  • Safe Gridworld (proposed grid environment)