Train agents to judge actions via RL so they learn true self-reflection, not imitation

March 9, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

ACT is a practical RL training stage that slots in before standard IL or RL pipelines. It requires RL infrastructure and extra data collection, but yields roughly 4 to 5pp higher average success rates and better OOD robustness.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang

Why It Matters For Business

ACT yields consistent, repeatable gains in agent success and robustness by teaching models to judge actions instead of imitating reflections; that improves task completion, reduces failure loops, and helps reuse data across model sizes.

Who Should Care

Summary TLDR

The paper introduces Agentic Critical Training (ACT): an RL stage that trains an LLM agent to pick the better action from a candidate pair made of one expert action and one action sampled from the model itself. Because the only supervision is whether the selection is correct, the model has to develop its own internal reasoning (chain-of-thought) to succeed. Across three agent benchmarks (ALFWorld, WebShop, ScienceWorld), ACT used as a precursor to standard RL or IL improves success rates (average +5.07pp vs IL, +4.62pp vs RL) and yields better out-of-distribution generalization. ACT data can be reused across model sizes, and ACT even improves unrelated reasoning benchmarks (MATH-500, GPQA-Diamond) without task-specific reasoning data.

Problem Statement

Imitation learning teaches agents what action to take but not why. Prior fixes add pre-generated 'self-reflection' text and train the model to mimic it, which produces imitation of reflection instead of an internal ability to reason. The paper asks: can we train agents to autonomously learn to judge action quality so that reasoning emerges and improves downstream action generation and generalization?

Main Contribution

Agentic Critical Training (ACT): an RL objective that trains the model to choose the better action in expert-vs-model candidate pairs.

A three-stage pipeline: collect alternatives from an initial policy, train judgement via GRPO, then fine-tune action generation with RL/IL using the improved model.
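The pair-selection objective above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `ContrastPair` type, the `build_pair` and `judgement_reward` helpers, the A/B labelling, the random slot assignment, and the example state and actions are all assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class ContrastPair:
    """One judgement example: a state plus two candidate actions (hypothetical type)."""
    state: str
    option_a: str
    option_b: str
    expert_label: str  # "A" or "B": which slot holds the expert action


def build_pair(state: str, expert_action: str, model_action: str,
               rng: random.Random) -> ContrastPair:
    """Put the expert action in a random slot (an assumption made here)
    so that position alone carries no signal about which option is better."""
    if rng.random() < 0.5:
        return ContrastPair(state, expert_action, model_action, "A")
    return ContrastPair(state, model_action, expert_action, "B")


def judgement_reward(pair: ContrastPair, chosen_label: str) -> float:
    """Outcome-only reward: 1.0 if the model picked the expert action, else 0.0."""
    return 1.0 if chosen_label.strip().upper() == pair.expert_label else 0.0


rng = random.Random(0)
pair = build_pair(
    state="Task: heat the mug. You are in the kitchen.",  # illustrative state
    expert_action="take mug 1 from countertop 1",
    model_action="go to fridge 1",
    rng=rng,
)
print(pair.expert_label, judgement_reward(pair, pair.expert_label))
```

The reward never looks at the model's reasoning text, only at the final choice, which is the property the summary credits for letting reasoning emerge rather than be imitated.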

Key Findings

ACT as a pre-stage improves action generation when combined with IL and RL.

Numbers: Average +5.07 percentage points vs IL; +4.62pp vs RL

Practical Use: Add an ACT phase before standard RL or IL to get ~4–5pp higher success rates on agent tasks.

Evidence Ref: Table 1

ACT outperforms the 'Early Experience' reflection-distillation baseline.

Numbers: Average +2.42 percentage points over Early Experience

Practical Use: Prefer ACT (outcome-based judgement training) over training on distilled reflection text to improve agent performance.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ALFWorld success rate (Qwen3-8B, ID) | 92.86% | Imitation Learning 85.71% | +7.15pp | ALFWorld ID | Table 1: RL w/ ACT = 92.86% vs IL = 85.71% | Table 1
ALFWorld success rate (Qwen3-8B, OOD) | 88.06% | Imitation Learning 82.84% | +5.22pp | ALFWorld OOD | Table 1: RL w/ ACT = 88.06% vs IL = 82.84% | Table 1
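
The deltas in these rows are percentage-point differences (value minus baseline), not relative improvements. A short check using only the figures quoted above; the `pp_delta` helper name is introduced here for illustration:

```python
def pp_delta(value_pct: float, baseline_pct: float) -> float:
    """Percentage-point difference between two success rates given in percent."""
    return round(value_pct - baseline_pct, 2)


# (RL w/ ACT, IL baseline) success rates from the rows above, in percent.
rows = {
    "ALFWorld ID": (92.86, 85.71),
    "ALFWorld OOD": (88.06, 82.84),
}

deltas = {split: pp_delta(value, base) for split, (value, base) in rows.items()}
print(deltas)
```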

What To Try In 7 Days

Collect alternative actions from your current policy for a subset of states and build contrast pairs (expert vs model).

Run a short ACT stage with GRPO or group-based RL on those pairs to train judgement before your usual RL/finetuning.

Evaluate on a held-out distribution to check OOD gains and inspect traces for self-critique behavior.
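
For the ACT stage in the second step, the core of a GRPO-style update is a group-relative advantage: several judgements are sampled for the same contrast pair, each is scored, and every reward is normalized against its own group. The sketch below shows only that normalization, under assumptions made here (binary rewards, population standard deviation, a small `eps` for stability); real GRPO implementations differ in details and add the policy-gradient and KL terms.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgements for one contrast pair: three picked the expert action.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group where every sample agrees produces zero advantage, so no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

One practical consequence: pairs the model already judges correctly every time contribute nothing to the update, so a pilot run gets more signal from pairs where the initial policy's judgement is mixed.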

Agent Features

Planning: Planning with LLMs

Frameworks: GRPO

Is Agentic: Yes

Architectures: LLM agent (Qwen3 family)

Optimization Features

Infra Optimization: DeepSpeed ZeRO-3

Training Optimization: GRPO

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Requires collecting alternative actions from an initial policy; collection cost can be nontrivial for large environments.

ACT alone does not generate actions; it must be followed by RL or IL for production action generation.

When Not To Use

You lack RL infrastructure or cannot sample alternative actions from any initial policy.

Your action space is highly open-ended and alternatives cannot be reliably enumerated (for example, some WebShop settings).

Failure Modes

If sampled alternatives are often better than expert actions, ACT may teach the model misleading comparisons.

Formatting-dependent rewards can penalize correct reasoning: if the model reaches the right judgement but omits the required <action> tags, it is scored as a failure for breaking the output constraints.
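
The formatting failure mode is easier to spot if format compliance and correctness are logged separately. The sketch below is one hypothetical way to do that; the `score_output` helper, its return fields, and the zero reward on a missing tag are assumptions made here, not the paper's reward definition.

```python
import re

# Matches the first <action>...</action> span in the model output.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def score_output(text: str, expected: str) -> dict:
    """Score one model output, keeping format and correctness as separate signals."""
    match = ACTION_RE.search(text)
    if match is None:
        # No parsable answer: reward is 0 regardless of how good the reasoning was.
        return {"format_ok": False, "correct": False, "reward": 0.0}
    answer = match.group(1).strip()
    correct = answer == expected
    return {"format_ok": True, "correct": correct, "reward": 1.0 if correct else 0.0}


tagged = score_output("Option B keeps the mug in hand. <action>B</action>", "B")
untagged = score_output("Option B keeps the mug in hand, so the answer is B.", "B")
print(tagged)
print(untagged)
```

Tracking the share of outputs with `format_ok` set to false during training shows whether low reward reflects poor judgement or just missing tags.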

Core Entities

Models

Qwen3-8B, Qwen3-4B

Metrics

success rate (%), Accuracy

Datasets

ALFWorld, WebShop, ScienceWorld, MATH-500, GPQA-Diamond

Benchmarks

ALFWorld, WebShop, ScienceWorld, MATH-500, GPQA-Diamond

Context Entities

Models

initial policy πθ0

Metrics

in-distribution vs out-of-distribution success rates

Datasets

expert trajectories (collected per benchmark)

Benchmarks

ALFWorld seen/unseen splits (ID/OOD)