Train agents to judge actions via RL so they learn true self-reflection, not imitation

March 9, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

0

Authors

Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang

Why It Matters For Business

ACT yields consistent, repeatable gains in agent success and robustness by teaching models to judge actions instead of imitating reflections; that improves task completion, reduces failure loops, and helps reuse data across model sizes.

Summary TLDR

The paper introduces Agentic Critical Training (ACT), an RL stage that trains an LLM agent to pick the better action from a pair containing one expert action and one action sampled from the model itself. Because the only supervision is whether the selection is correct, the model has to develop its own chain-of-thought reasoning rather than copy reflection text. Across three agent benchmarks (ALFWorld, WebShop, ScienceWorld), running ACT before standard IL or RL raises success rates (average +5.07pp over IL, +4.62pp over RL) and improves out-of-distribution generalization. ACT data can be reused across model sizes, and ACT also improves unrelated reasoning benchmarks (MATH-500, GPQA-Diamond) without any task-specific reasoning data.

Problem Statement

Imitation learning teaches agents what action to take but not why. Prior fixes add pre-generated 'self-reflection' text and train the model to mimic it, which produces imitation of reflection instead of an internal ability to reason. The paper asks: can we train agents to autonomously learn to judge action quality so that reasoning emerges and improves downstream action generation and generalization?

Main Contribution

Agentic Critical Training (ACT): an RL objective that trains the model to choose the better action in expert-vs-model candidate pairs.

A three-stage pipeline: collect alternatives from an initial policy, train judgement via GRPO, then fine-tune action generation with RL/IL using the improved model.

Empirical results across three agentic benchmarks showing consistent gains, better OOD transfer, cross-size data transferability, and improvements on general reasoning benchmarks without explicit reasoning data.
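The first two stages of the pipeline above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the names `ContrastPair`, `build_pair`, and `judgement_reward` are hypothetical, and both the random slot order and the dropping of samples identical to the expert action are assumptions made here for a clean example.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContrastPair:
    state: str         # task context / observation shown to the model
    option_a: str      # first candidate action
    option_b: str      # second candidate action
    expert_label: str  # "A" or "B": the slot that holds the expert action


def build_pair(state: str, expert_action: str, sampled_action: str,
               rng: random.Random) -> Optional[ContrastPair]:
    """Stage 1 (sketch): pair an expert action with one sampled from the initial policy.

    Assumption: identical candidates are skipped, and slot order is
    randomized so that position alone carries no signal.
    """
    if sampled_action.strip() == expert_action.strip():
        return None
    if rng.random() < 0.5:
        return ContrastPair(state, expert_action, sampled_action, "A")
    return ContrastPair(state, sampled_action, expert_action, "B")


def judgement_reward(pair: ContrastPair, chosen_label: str) -> float:
    """Stage 2 (sketch): reward 1.0 only when the model selects the expert's slot."""
    return 1.0 if chosen_label == pair.expert_label else 0.0


rng = random.Random(0)
pair = build_pair(
    state="You are in the kitchen. Task: heat the mug.",
    expert_action="go to microwave 1",
    sampled_action="go to fridge 1",
    rng=rng,
)
```

The reward never looks at the model's reasoning text, only at the final choice, which is the property the summary credits for making reasoning emerge rather than be imitated.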

Key Findings

ACT as a pre-stage improves action generation when combined with IL and RL.

Numbers: Avg +5.07 percentage points vs IL; +4.62pp vs RL

ACT outperforms the 'Early Experience' reflection-distillation baseline.

Numbers: Avg +2.42 percentage points over Early Experience

ACT gives stronger out-of-distribution gains than in-distribution gains on ALFWorld.

Numbers: RL w/ ACT gain: OOD +3.73pp vs ID +2.15pp

ACT transfers to general reasoning benchmarks despite training only on agentic data.

Numbers: GPQA +1.85pp; MATH +0.8pp vs CoT prompt baseline

ACT produces observable self-critique and failure-recovery behavior that IL lacks.

Numbers: Qualitative case studies (Figure 3): IL loops on failures; ACT diagnoses and recovers

Results

ALFWorld success rate (Qwen3-8B, ID)

Value: 92.86%

Baseline: Imitation Learning 85.71%

ALFWorld success rate (Qwen3-8B, OOD)

Value: 88.06%

Baseline: Imitation Learning 82.84%

WebShop success rate (Qwen3-8B)

Value: 33.8%

Baseline: Imitation Learning 28.0%

Accuracy

Value: 50.34%

Baseline: Imitation Learning 42.8%

Average improvement vs IL (all benchmarks)

Value: 5.07pp

Baseline: Imitation Learning

GPQA-Diamond accuracy

Value: 53.37% ± 0.63

Baseline: Prompt w/ CoT 51.52% ± 1.89
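The gains quoted on this page are percentage-point (pp) differences between a result and its baseline, not relative improvements. A small sketch that recomputes them from the rows above; the row labels are copied as listed here, and the helper name `pp_gain` is hypothetical.

```python
# (label, value %, baseline %) taken from the Results rows above
results = [
    ("ALFWorld ID", 92.86, 85.71),
    ("ALFWorld OOD", 88.06, 82.84),
    ("WebShop", 33.8, 28.0),
    ("Accuracy vs Imitation Learning", 50.34, 42.8),
    ("Accuracy vs Prompt w/ CoT", 53.37, 51.52),
]


def pp_gain(value: float, baseline: float) -> float:
    """Percentage-point gain: plain difference of two percentages."""
    return round(value - baseline, 2)


gains = {label: pp_gain(value, baseline) for label, value, baseline in results}

for label, gain in gains.items():
    print(f"{label}: +{gain}pp")
```

The last row's difference matches the GPQA figure listed under Key Findings, which suggests that row reports the GPQA-Diamond result.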

What To Try In 7 Days

Collect alternative actions from your current policy for a subset of states and build contrast pairs (expert vs model).

Run a short ACT stage with GRPO or group-based RL on those pairs to train judgement before your usual RL/finetuning.

Evaluate on a held-out distribution to check OOD gains and inspect traces for self-critique behavior.
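The ACT stage in the second step relies on GRPO's group-relative advantage: several judgements are sampled for the same contrast pair, and each reward is normalized against that group's mean and standard deviation. A minimal sketch, assuming binary correct/incorrect rewards. The function name is hypothetical, and the use of the population standard deviation and the `eps` value are assumptions; real GRPO implementations differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every sample in the group got the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgements for one pair: two picked the expert action, two did not.
mixed_group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(mixed_group)

# A group where every judgement was correct carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

One practical consequence: pairs that the model always gets right (or always gets wrong) produce zero advantages, so a short ACT run benefits from pairs where the alternative is plausible enough to split the group.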

Agent Features

Planning

  • Planning with LLMs

Frameworks

  • GRPO

Is Agentic

true

Architectures

  • LLM agent (Qwen3 family)

Optimization Features

Infra Optimization

  • DeepSpeed ZeRO-3

Training Optimization

  • GRPO

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires collecting alternative actions from an initial policy; collection cost can be nontrivial for large environments.
  • ACT alone does not generate actions; it must be followed by RL or IL for production action generation.
  • Assumes expert actions are generally superior to sampled alternatives; noisy or low-quality expert data will reduce benefit.

When Not To Use

  • You lack RL infrastructure or cannot sample alternative actions from any initial policy.
  • Your action space is highly open-ended and alternatives cannot be reliably enumerated (e.g., some WebShop settings).
  • Expert trajectories are scarce or low quality and cannot support reliable contrast pairs.

Failure Modes

  • If sampled alternatives are often better than expert actions, ACT may teach the model misleading comparisons.
  • Format-dependent rewards can penalize correct reasoning: if the model omits the required <action> tags, a correct judgement can still go unrewarded.
  • Over-reliance on ACT without sufficient action-generation fine-tuning may leave the agent unable to act despite good judgement.
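The format-dependent failure mode above is easy to reproduce and to monitor. The sketch below assumes the model must wrap its choice in `<action>` tags, as the failure mode describes; the exact output format, the A/B labels, and the function names are hypothetical.

```python
import re
from typing import List, Optional

# Assumed format: exactly one <action>...</action> span holding the choice.
ACTION_RE = re.compile(r"<action>\s*(.*?)\s*</action>", re.DOTALL)


def parse_choice(output: str) -> Optional[str]:
    """Return the tagged choice, or None if the tags are missing or repeated."""
    matches = ACTION_RE.findall(output)
    if len(matches) != 1:
        return None
    return matches[0]


def format_dependent_reward(output: str, expert_label: str) -> float:
    """Reward 1.0 only for a well-formed output that selects the expert slot."""
    choice = parse_choice(output)
    if choice is None:
        return 0.0  # correct reasoning without tags still earns nothing
    return 1.0 if choice == expert_label else 0.0


def format_failure_rate(outputs: List[str]) -> float:
    """Share of outputs that could not be parsed; worth tracking during training."""
    if not outputs:
        return 0.0
    failures = sum(1 for o in outputs if parse_choice(o) is None)
    return failures / len(outputs)


well_formed = "The microwave heats the mug, so A is better. <action>A</action>"
untagged = "The microwave heats the mug, so A is better. I choose A."
```

Logging the parse-failure rate separately from the reward makes it possible to tell a judgement problem from an output-format problem.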

Core Entities

Models

  • Qwen3-8B
  • Qwen3-4B

Metrics

  • success rate (%)
  • Accuracy

Datasets

  • ALFWorld
  • WebShop
  • ScienceWorld
  • MATH-500
  • GPQA-Diamond

Benchmarks

  • ALFWorld
  • WebShop
  • ScienceWorld
  • MATH-500
  • GPQA-Diamond

Context Entities

Models

  • initial policy πθ0

Metrics

  • in-distribution vs out-of-distribution success rates

Datasets

  • expert trajectories (collected per benchmark)

Benchmarks

  • ALFWorld seen/unseen splits (ID/OOD)