Overview
ACT is a practical RL training stage that integrates with standard pipelines; it requires RL infrastructure and extra data collection, but yields measurable gains in success rate and better out-of-distribution (OOD) robustness.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 45%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
ACT yields consistent, repeatable gains in agent success and robustness by teaching models to judge actions instead of imitating reflections; that improves task completion, reduces failure loops, and helps reuse data across model sizes.
Summary TLDR
The paper introduces Agentic Critical Training (ACT): an RL stage that trains an LLM agent to pick the better action from an expert-vs-model candidate pair. Because the only supervision is whether the selection is correct, ACT forces the model to develop its own internal reasoning (chain-of-thought). Across three agent benchmarks (ALFWorld, WebShop, ScienceWorld), ACT used as a precursor to standard RL or IL improves success rates (avg +5.07pp vs IL, +4.62pp vs RL) and yields better out-of-distribution generalization. ACT data can be reused across model sizes, and ACT even improves unrelated reasoning benchmarks (MATH-500, GPQA-Diamond) without task-specific reasoning data.
Problem Statement
Imitation learning teaches agents what action to take but not why. Prior fixes add pre-generated 'self-reflection' text and train the model to mimic it, which produces imitation of reflection instead of an internal ability to reason. The paper asks: can we train agents to autonomously learn to judge action quality so that reasoning emerges and improves downstream action generation and generalization?
Main Contribution
Agentic Critical Training (ACT): an RL objective that trains the model to choose the better action in expert-vs-model candidate pairs.
A three-stage pipeline: collect alternatives from an initial policy, train judgement via GRPO, then fine-tune action generation with RL/IL using the improved model.
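The first two stages can be illustrated with a minimal sketch of pair construction and the selection reward. It assumes the expert action is always labeled as the better one and that the reward is binary; the names `ContrastPair`, `build_pair`, and `selection_reward` are hypothetical, and the random ordering of candidates is an added precaution rather than something the summary specifies.

```python
import random
from dataclasses import dataclass


@dataclass
class ContrastPair:
    state: str
    candidates: tuple[str, str]  # shown to the model in this order
    expert_index: int            # 0 or 1: position of the expert action


def build_pair(state, expert_action, model_action, rng):
    """Pair an expert action with a sampled policy action.

    Assumption: the order is randomized so that position alone
    carries no information about which candidate is the expert's.
    """
    if rng.random() < 0.5:
        return ContrastPair(state, (expert_action, model_action), 0)
    return ContrastPair(state, (model_action, expert_action), 1)


def selection_reward(pair, chosen_index):
    """Assumed binary reward: 1.0 for picking the expert action, else 0.0."""
    return 1.0 if chosen_index == pair.expert_index else 0.0


rng = random.Random(0)
pair = build_pair("You are in a kitchen.", "open drawer 1", "go to desk 1", rng)
reward_if_correct = selection_reward(pair, pair.expert_index)
reward_if_wrong = selection_reward(pair, 1 - pair.expert_index)
```

In practice, states where the sampled action matches the expert action would presumably be skipped, since such a pair offers nothing to judge.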
Key Findings
ACT as a pre-stage improves action generation when combined with IL and RL.
ACT outperforms the 'Early Experience' reflection-distillation baseline.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ALFWorld success rate (Qwen3-8B, ID) | 92.86% | Imitation Learning 85.71% | +7.15pp | ALFWorld ID | Table 1: RL w/ ACT = 92.86% vs IL = 85.71% | Table 1 |
| ALFWorld success rate (Qwen3-8B, OOD) | 88.06% | Imitation Learning 82.84% | +5.22pp | ALFWorld OOD | Table 1: RL w/ ACT = 88.06% vs IL = 82.84% | Table 1 |
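The Delta column is a plain percentage-point difference between the ACT result and the baseline. A short check, using only the values from the two rows above (the helper name `pp_delta` is illustrative):

```python
def pp_delta(value_pct, baseline_pct):
    """Percentage-point difference, rounded to two decimals."""
    return round(value_pct - baseline_pct, 2)


# (RL w/ ACT, Imitation Learning) success rates from Table 1, in percent
rows = {
    "ALFWorld ID": (92.86, 85.71),
    "ALFWorld OOD": (88.06, 82.84),
}

deltas = {name: pp_delta(value, base) for name, (value, base) in rows.items()}
# deltas == {"ALFWorld ID": 7.15, "ALFWorld OOD": 5.22}
```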
What To Try In 7 Days
Collect alternative actions from your current policy for a subset of states and build contrast pairs (expert vs model).
Run a short ACT stage with GRPO or group-based RL on those pairs to train judgement before your usual RL/finetuning.
Evaluate on a held-out distribution to check OOD gains and inspect traces for self-critique behavior.
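For the GRPO step, the core idea is that each sampled response is scored relative to the other responses for the same prompt. The sketch below shows only that group-relative advantage computation, assuming binary selection rewards and population standard deviation; implementations differ on such details, and `group_relative_advantages` is an illustrative name, not an API from any library.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    If every sample gets the same reward, all advantages are 0.0,
    so that group contributes no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgements for one contrast pair: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Groups where the model is always right or always wrong yield zero advantages, so pairs of intermediate difficulty are likely to be the most useful for a short trial run.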
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Requires collecting alternative actions from an initial policy; collection cost can be nontrivial for large environments.
ACT alone does not generate actions; it must be followed by RL or IL for production action generation.
When Not To Use
You lack RL infrastructure or cannot sample alternative actions from any initial policy.
Your action space is highly open-ended and alternatives cannot be reliably enumerated (e.g., some WebShop settings).
Failure Modes
If sampled alternatives are often better than expert actions, ACT may teach the model misleading comparisons.
Rewards that depend on output formatting (for example, penalizing missing <action> tags) can punish correct reasoning when the model fails to follow output constraints.
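One way to monitor the formatting failure mode is to record format validity separately from the reward, so that format failures can be told apart from wrong judgements. This is a hedged sketch: the closing `</action>` tag, the zero reward for malformed output, and the function names are assumptions, not details confirmed by the summary.

```python
import re

# Assumed output convention: the chosen action is wrapped in <action>...</action>.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def parse_choice(output):
    """Return the tagged action, or None unless exactly one tag pair is present."""
    matches = ACTION_RE.findall(output)
    if len(matches) != 1:
        return None
    return matches[0].strip()


def score(output, expert_action):
    """Return the reward plus a flag saying whether the format was valid."""
    choice = parse_choice(output)
    if choice is None:
        # Assumed policy: malformed output earns zero reward.
        return {"reward": 0.0, "format_ok": False}
    reward = 1.0 if choice == expert_action else 0.0
    return {"reward": reward, "format_ok": True}


good = score("The drawer is closed. <action>open drawer 1</action>", "open drawer 1")
untagged = score("open drawer 1", "open drawer 1")
```

Tracking the share of samples with `format_ok` set to `False` during training shows how much of the zero-reward mass comes from formatting rather than from poor action judgement.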

