Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
0
Why It Matters For Business
ACT yields consistent, repeatable gains in agent success and robustness by teaching models to judge actions instead of imitating reflections. This improves task completion, reduces failure loops, and lets the same ACT data be reused across model sizes.
Summary TLDR
The paper introduces Agentic Critical Training (ACT): an RL stage that trains an LLM agent to pick the better action from a candidate pair made up of an expert action and one of the model's own actions. Because the only supervision is whether the selection is correct, the model has to develop its own internal reasoning (chain-of-thought) to judge well. Across three agent benchmarks (ALFWorld, WebShop, ScienceWorld), ACT used as a precursor to standard IL or RL improves success rates (on average +5.07pp vs IL and +4.62pp vs RL) and yields better out-of-distribution generalization. ACT data can be reused across model sizes, and ACT even improves unrelated reasoning benchmarks (MATH-500, GPQA-Diamond) without task-specific reasoning data.
Problem Statement
Imitation learning teaches agents what action to take but not why. Prior fixes add pre-generated 'self-reflection' text and train the model to mimic it, which produces imitation of reflection instead of an internal ability to reason. The paper asks: can we train agents to autonomously learn to judge action quality so that reasoning emerges and improves downstream action generation and generalization?
Main Contribution
Agentic Critical Training (ACT): an RL objective that trains the model to choose the better action in expert-vs-model candidate pairs.
A three-stage pipeline: collect alternative actions from an initial policy, train judgement via GRPO on expert-vs-model pairs, then fine-tune action generation with RL/IL starting from the ACT-trained model.
Empirical results across three agentic benchmarks showing consistent gains, better OOD transfer, cross-size data transferability, and improvements on general reasoning benchmarks without explicit reasoning data.
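The pair construction and the selection reward behind the first two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `ContrastPair`, `build_pair`, and `selection_reward` are hypothetical names, the example state and actions are made up, and shuffling which slot holds the expert action is an assumption of this sketch (it keeps the model from scoring well by always picking one position).

```python
import random
from dataclasses import dataclass


@dataclass
class ContrastPair:
    state: str          # observation / task context shown to the model
    candidates: tuple   # (action_a, action_b) in presentation order
    expert_index: int   # 0 or 1: which slot holds the expert action


def build_pair(state, expert_action, model_action, rng):
    """Pair an expert action with an action sampled from the initial policy.

    Slot order is shuffled (an assumption of this sketch) so that position
    alone carries no information about which action is the expert's.
    """
    if rng.random() < 0.5:
        return ContrastPair(state, (expert_action, model_action), 0)
    return ContrastPair(state, (model_action, expert_action), 1)


def selection_reward(pair, chosen_index):
    """Binary reward: 1.0 if the model picked the expert action, else 0.0."""
    return 1.0 if chosen_index == pair.expert_index else 0.0


# Made-up example state and actions, for illustration only.
rng = random.Random(0)
pair = build_pair(
    "You are in the kitchen. Task: heat the mug.",
    expert_action="go to microwave 1",
    model_action="go to fridge 1",
    rng=rng,
)
print(pair.candidates, selection_reward(pair, pair.expert_index))
```

In practice, pairs where the sampled action is identical to the expert action would likely need to be filtered out before training, since they carry no preference signal.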
Key Findings
ACT as a pre-stage improves action generation when combined with IL and RL.
ACT outperforms the 'Early Experience' reflection-distillation baseline.
ACT gives stronger out-of-distribution gains than in-distribution gains on ALFWorld.
ACT transfers to general reasoning benchmarks despite training only on agentic data.
ACT produces observable self-critique and failure-recovery behavior that IL lacks.
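Comparisons such as "stronger OOD gains than ID gains" are differences in success rate measured in percentage points. A minimal sketch of that arithmetic is shown below. The function names are hypothetical and the episode outcomes are made-up placeholders, not results from the paper.

```python
def success_rate(outcomes):
    """Percentage of episodes that succeeded; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("no episodes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def gain_pp(baseline_outcomes, treated_outcomes):
    """Success-rate difference (treated minus baseline) in percentage points."""
    return success_rate(treated_outcomes) - success_rate(baseline_outcomes)


# Placeholder outcomes only (NOT from the paper): one entry per episode.
id_baseline = [True, True, True, False]
id_with_act = [True, True, True, False]
ood_baseline = [True, False, False, False]
ood_with_act = [True, True, False, False]

print("ID gain (pp):", gain_pp(id_baseline, id_with_act))
print("OOD gain (pp):", gain_pp(ood_baseline, ood_with_act))
```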
Results
ALFWorld success rate (Qwen3-8B, ID)
ALFWorld success rate (Qwen3-8B, OOD)
WebShop success rate (Qwen3-8B)
Average improvement vs IL (all benchmarks): +5.07pp
Who Should Care
What To Try In 7 Days
Collect alternative actions from your current policy for a subset of states and build contrast pairs (expert vs model).
Run a short ACT stage with GRPO or group-based RL on those pairs to train judgement before your usual RL/finetuning.
Evaluate on a held-out distribution to check OOD gains and inspect traces for self-critique behavior.
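For the GRPO step, the core idea is that each response's reward is compared against the other responses sampled for the same prompt. A minimal sketch of that group-relative advantage follows. The function name and the example rewards are hypothetical. Population standard deviation is used here as an assumption, and real implementations differ on this detail and on the value of `eps`.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Responses that beat the group average get a positive advantage,
    the rest a negative one. If every reward in the group is equal,
    all advantages are zero and the group gives no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled judgements for one contrast pair,
# with reward 1.0 whenever the expert action was selected.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

One practical consequence: contrast pairs that the current model always gets right (or always gets wrong) yield zero advantages, so it may be worth checking how many pairs actually produce mixed outcomes within a group.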
Agent Features
Planning
- Planning with LLMs
Frameworks
- GRPO
Is Agentic
true
Architectures
- LLM agent (Qwen3 family)
Optimization Features
Infra Optimization
- DeepSpeed ZeRO-3
Training Optimization
- GRPO
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires collecting alternative actions from an initial policy; collection cost can be nontrivial for large environments.
- ACT alone does not generate actions; it must be followed by RL or IL for production action generation.
- Assumes expert actions are generally superior to sampled alternatives; noisy or low-quality expert data will reduce benefit.
When Not To Use
- You lack RL infrastructure or cannot sample alternative actions from any initial policy.
- Your action space is highly open-ended and alternatives cannot be reliably enumerated (as in some WebShop settings).
- Expert trajectories are scarce or low quality and cannot support reliable contrast pairs.
Failure Modes
- If sampled alternatives are often better than expert actions, ACT may teach the model misleading comparisons.
- Rewards that depend on output formatting (e.g., missing <action> tags) can penalize correct reasoning when the model fails to follow the output constraints.
- Over-reliance on ACT without sufficient action-generation fine-tuning may leave the agent unable to act despite good judgement.
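To see how often the formatting failure mode occurs, the reward function can report why a response scored zero. The sketch below assumes the chosen action is wrapped as `<action>...</action>`. The closing tag, the function names, and the reason labels are assumptions of this example, not details confirmed by the paper.

```python
import re

# Assumed output convention: the selected action appears inside <action>...</action>.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def parse_action(response):
    """Return the text of the last <action>...</action> block, or None if absent."""
    matches = ACTION_RE.findall(response)
    return matches[-1].strip() if matches else None


def format_aware_reward(response, expert_action):
    """Return (reward, reason) so malformed outputs can be counted separately."""
    chosen = parse_action(response)
    if chosen is None:
        return 0.0, "missing_action_tag"
    if chosen == expert_action:
        return 1.0, "correct"
    return 0.0, "wrong_choice"


# Hypothetical responses for one contrast pair.
expert = "go to microwave 1"
good = "The mug must be heated, so <action>go to microwave 1</action>"
untagged = "The mug must be heated, so I choose go to microwave 1"
print(format_aware_reward(good, expert))
print(format_aware_reward(untagged, expert))
```

Tracking the `missing_action_tag` share over training separates judgement errors from formatting errors, which otherwise look identical in the reward curve.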
Core Entities
Models
- Qwen3-8B
- Qwen3-4B
Metrics
- success rate (%)
- Accuracy
Datasets
- ALFWorld
- WebShop
- ScienceWorld
- MATH-500
- GPQA-Diamond
Benchmarks
- ALFWorld
- WebShop
- ScienceWorld
- MATH-500
- GPQA-Diamond
Context Entities
Models
- initial policy πθ0
Metrics
- in-distribution vs out-of-distribution success rates
Datasets
- expert trajectories (collected per benchmark)
Benchmarks
- ALFWorld seen/unseen splits (ID/OOD)

