Train agents to judge actions via RL so they learn true self-reflection, not imitation

Overview

Decision Snapshot: Ready For Pilot

ACT is a practical RL training stage that integrates with standard pipelines; it requires RL infrastructure and extra data collection but gives measurable gains and better OOD robustness.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang

Links

Abstract / PDF / Code

Why It Matters For Business

ACT yields consistent, repeatable gains in agent success and robustness by teaching models to judge actions instead of imitating reflections; that improves task completion, reduces failure loops, and helps reuse data across model sizes.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

The paper introduces Agentic Critical Training (ACT): an RL stage that trains an LLM agent to pick the better action from an expert-vs-self candidate pair. ACT forces the model to develop internal reasoning (chain-of-thought) because the only supervision is whether its selection is correct. Across three agent benchmarks (ALFWorld, WebShop, ScienceWorld) ACT used as a precursor to standard RL or IL improves success rates (avg +5.07pp vs IL, +4.62pp vs RL) and yields better out-of-distribution generalization. ACT data can be reused across model sizes and even improves unrelated reasoning benchmarks (MATH-500, GPQA-Diamond) without task-specific reasoning data.

Problem Statement

Imitation learning teaches agents what action to take but not why. Prior fixes add pre-generated 'self-reflection' text and train the model to mimic it, which produces imitation of reflection instead of an internal ability to reason. The paper asks: can we train agents to autonomously learn to judge action quality so that reasoning emerges and improves downstream action generation and generalization?

Main Contribution

Agentic Critical Training (ACT): an RL objective that trains the model to choose the better action in expert-vs-model candidate pairs.

A three-stage pipeline: collect alternatives from an initial policy, train judgement via GRPO, then fine-tune action generation with RL/IL using the improved model.
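The pair-based judgement task described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ContrastPair` type, the `build_pair` and `selection_reward` helpers, the "A"/"B" labelling, and the random slot assignment are all assumptions introduced here. The only thing taken from the summary is that the model is rewarded solely for whether it selects the expert action.

```python
import random
from dataclasses import dataclass


@dataclass
class ContrastPair:
    """One judgement example: a state plus two candidate actions (hypothetical type)."""
    state: str
    candidates: tuple  # (option A, option B) as shown to the model
    expert_label: str  # "A" or "B": which slot holds the expert action


def build_pair(state, expert_action, model_action, rng):
    """Place the expert action in a random slot so position carries no signal.

    Returns None when the policy's sample matches the expert action,
    since such a pair has nothing to contrast.
    """
    if expert_action.strip() == model_action.strip():
        return None
    if rng.random() < 0.5:
        return ContrastPair(state, (expert_action, model_action), "A")
    return ContrastPair(state, (model_action, expert_action), "B")


def selection_reward(pair, chosen_label):
    """Outcome-only reward: 1.0 if the expert action was selected, else 0.0."""
    return 1.0 if chosen_label == pair.expert_label else 0.0


rng = random.Random(0)
pair = build_pair("You are in a kitchen.", "open fridge", "go to bed", rng)
print(pair.expert_label, selection_reward(pair, pair.expert_label))
```

Randomizing the slot is a design choice made for this sketch; without it, a model could earn reward by always choosing the same position rather than by judging the actions.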

Key Findings

ACT as a pre-stage improves action generation when combined with IL and RL.

Numbers: Avg +5.07 percentage points vs IL; +4.62pp vs RL

Practical Use: Add an ACT phase before standard RL or IL to get ~4–5pp higher success rates on agent tasks.

Evidence Ref: Table 1

ACT outperforms the 'Early Experience' reflection-distillation baseline.

Numbers: Avg +2.42 percentage points over Early Experience

Practical Use: Prefer ACT (outcome-based judgement training) over training on distilled reflection text to improve agent performance.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ALFWorld success rate (Qwen3-8B, ID)	92.86%	Imitation Learning 85.71%	+7.15pp	ALFWorld ID	Table 1: RL w/ ACT = 92.86% vs IL = 85.71%	Table 1
ALFWorld success rate (Qwen3-8B, OOD)	88.06%	Imitation Learning 82.84%	+5.22pp	ALFWorld OOD	Table 1: RL w/ ACT = 88.06% vs IL = 82.84%	Table 1
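The Delta column is an absolute difference in percentage points (pp), not a relative change. A small sketch of the arithmetic, using only the two rows above; the helper names `delta_pp` and `relative_gain_pct` are hypothetical:

```python
def delta_pp(value, baseline):
    """Absolute difference between two success rates, in percentage points."""
    return round(value - baseline, 2)


def relative_gain_pct(value, baseline):
    """Relative improvement over the baseline, in percent (a different quantity)."""
    return round(100.0 * (value - baseline) / baseline, 2)


# (split, RL w/ ACT success rate %, IL baseline success rate %) from the table above
rows = [
    ("ALFWorld ID", 92.86, 85.71),
    ("ALFWorld OOD", 88.06, 82.84),
]

for split, value, baseline in rows:
    print(split, delta_pp(value, baseline), "pp")
```

Keeping the two quantities separate matters when comparing against reports that quote relative gains, which are larger numbers for the same result.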

What To Try In 7 Days

Collect alternative actions from your current policy for a subset of states and build contrast pairs (expert vs model).

Run a short ACT stage with GRPO or group-based RL on those pairs to train judgement before your usual RL/finetuning.

Evaluate on a held-out distribution to check OOD gains and inspect traces for self-critique behavior.
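For the second step, the core of a group-based method such as GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that normalization under stated assumptions: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are choices made here, not details from the paper. A real training run would rely on an RL library for sampling, clipping, and the policy update.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    above-average samples get a positive signal and below-average
    samples a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgements for one contrast pair: 1.0 = picked the expert action.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
print(advantages)
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient, so pairs that the initial policy always gets right (or always gets wrong) add little to a short pilot run.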

Agent Features

Planning

Planning with LLMs

Frameworks

GRPO

Is Agentic

Yes

Architectures

LLM agent (Qwen3 family)

Optimization Features

Infra Optimization

DeepSpeed ZeRO-3

Training Optimization

GRPO

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Code URLs

https://attention-is-all-i-need.github.io/ACT/

Risks & Boundaries

Limitations

Requires collecting alternative actions from an initial policy; collection cost can be nontrivial for large environments.

ACT alone does not generate actions; it must be followed by RL or IL for production action generation.

When Not To Use

You lack RL infrastructure or cannot sample alternative actions from any initial policy.

Your action space is highly open-ended and alternatives cannot be reliably enumerated (some WebShop settings).

Failure Modes

If sampled alternatives are often better than expert actions, ACT may teach the model misleading comparisons.

Formatting-dependent rewards (missing <action> tags) can penalize correct reasoning if the model fails to follow output constraints.
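One way to keep the formatting failure mode visible is to record format compliance separately from judgement correctness, so that a drop in reward can be attributed to one or the other. The sketch below assumes the model is asked to put its choice inside `<action>...</action>` tags and that the choice is a label such as "A" or "B"; the `score_response` helper and its return fields are hypothetical.

```python
import re

# Non-greedy match so only the first <action>...</action> span is used.
ACTION_RE = re.compile(r"<action>\s*(.*?)\s*</action>", re.DOTALL)


def score_response(text, correct_label):
    """Score one model response and report why it scored that way.

    A missing tag still yields zero reward (as in a format-dependent
    reward), but 'format_ok' lets you count these cases separately
    from genuinely wrong selections.
    """
    match = ACTION_RE.search(text)
    if match is None:
        return {"reward": 0.0, "format_ok": False, "correct": False}
    chosen = match.group(1).strip().upper()
    correct = chosen == correct_label.upper()
    return {"reward": 1.0 if correct else 0.0, "format_ok": True, "correct": correct}


print(score_response("The first option makes progress. <action>A</action>", "A"))
print(score_response("The first option makes progress. My answer is A.", "A"))
```

Logging the share of responses with `format_ok` set to False during training gives an early warning that the reward is penalizing output formatting rather than judgement quality.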

Core Entities

Models

Qwen3-8B, Qwen3-4B

Metrics

success rate (%), accuracy

Datasets

ALFWorld, WebShop, ScienceWorld, MATH-500, GPQA-Diamond

Benchmarks

ALFWorld, WebShop, ScienceWorld, MATH-500, GPQA-Diamond

Context Entities

Models

initial policy πθ0

Metrics

in-distribution vs out-of-distribution success rates

Datasets

expert trajectories (collected per benchmark)

Benchmarks

ALFWorld seen/unseen splits (ID/OOD)
