CLEANER: replace failed in-rollout code with model self-corrections to purify trajectories and speed agentic RL

Overview

Decision Snapshot: Ready For Pilot

CLEANER offers a practical, low-cost way to remove execution noise from RL rollouts. The evidence shows gains on multiple benchmarks and faster convergence, but success depends on the model producing self-corrections during rollouts.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Tianshi Xu, Yuteng Chen, Meng Li

Why It Matters For Business

CLEANER reduces rollout noise so small, cheaper LLMs learn tool use faster. That lowers compute cost and shortens training cycles while keeping competitive performance.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO

Summary TLDR

Small LLM agents that call tools (e.g., a Python interpreter) produce many execution failures during RL exploration. CLEANER fixes this by building "self-purified" trajectories: when the model later self-corrects, CLEANER rolls the history back and replaces the failed step with the corrected code. The Similarity-Aware Adaptive Rollback (SAAR) decides whether to graft the fix onto the original reasoning or replace the reasoning based on code similarity. Results: on evaluated benchmarks CLEANER improves accuracy (AIME avg ≈ +6% Pass@1, GPQA +3%, LiveCodeBench +5%), suppresses tool errors, and matches SOTA while using about one-third of RL steps. Implementation uses GRPO group updates, SGLang/R

Problem Statement

Parameter-constrained LLM agents (4B–7B) generate many failed tool executions during RL exploration. Under sparse outcome-only rewards, a trajectory that eventually succeeds still reinforces its earlier failed steps, which pollutes learning. Supersampling to filter rollouts is too costly, and dense intermediate rewards invite reward hacking. What is needed is a low-cost way to remove execution noise from trajectories so the policy learns correct reasoning.
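The credit-assignment problem above can be illustrated with a small sketch. It assumes a GRPO-style group-normalized advantage (reward minus group mean, divided by group standard deviation) and a hypothetical group of four rollouts; the reward values and step labels are made up for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: one scalar per trajectory (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 rollouts for one prompt; 1.0 means the final answer was correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Rollout 0 reached the right answer but contains a failed tool call.
rollout_0_steps = ["reason", "tool_call (error)", "tool_call (fixed)", "answer"]

# With an outcome-only reward, every step of the rollout shares the same advantage,
# so the failed call is reinforced as strongly as the fix.
per_step_credit = {step: advantages[0] for step in rollout_0_steps}
for step, credit in per_step_credit.items():
    print(f"{step:20s} {credit:+.3f}")
```

The failed call receives the same positive credit as the corrected one, which is the noise that purification is meant to remove.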

Main Contribution

CLEANER: a data-level trajectory purification method that replaces failed tool calls with later self-corrections before optimization.

SAAR (Similarity-Aware Adaptive Rollback): an adaptive rollback that chooses shallow or deep replacement based on code similarity.
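A minimal sketch of data-level purification, under simplifying assumptions: a trajectory is a list of hypothetical `Step` records, and any successful tool call that directly follows one or more failed calls is treated as their self-correction. The `Step` type and `purify` function are illustrative names, not the authors' code, and this version ignores the shallow/deep choice that SAAR makes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    reasoning: str  # model text preceding the tool call
    code: str       # code sent to the interpreter
    ok: bool        # True if execution succeeded

def purify(trajectory: List[Step]) -> List[Step]:
    """Drop each run of failed steps that is followed by a successful step."""
    cleaned: List[Step] = []
    pending_failures: List[Step] = []
    for step in trajectory:
        if step.ok:
            # Assumption: a success right after failures is the self-correction,
            # so the failed attempts are discarded and only the fix is kept.
            pending_failures = []
            cleaned.append(step)
        else:
            pending_failures.append(step)
    # Failures that were never corrected remain in the trajectory unchanged.
    cleaned.extend(pending_failures)
    return cleaned

trajectory = [
    Step("compute the mean", "print(sum(xs)/len(x))", False),
    Step("fix the typo", "print(sum(xs)/len(xs))", True),
]
print([s.code for s in purify(trajectory)])
```

Only the purified list would be passed to the optimizer; the raw rollout can still be logged for analysis.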

Key Findings

Purified trajectories raise AIME accuracy for 4B model

Numbers: AIME24 Pass@1: 66.7 -> 72.7 (+6.0)

Practical Use: If you purify rollouts with SAAR, small models (~4B) gain about 6 points absolute on hard math benchmarks without moving to bigger models.

Evidence Ref: Table 1, Table 2

CLEANER reduces AIME25 and LiveCodeBench errors and improves final scores

Numbers: AIME25 Pass@1: 59.4 -> 67.1 (+7.7); LiveCodeBench whole: 49.5 -> 54.9 (+5.4)

Practical Use: Expect gains of several points absolute on coding and math benchmarks by reducing execution noise in training data.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
AIME24 Pass@1 (Qwen3-4B)	72.7 (CLEANER-4B)	66.7 (DAPO-baseline)	+6.0	AIME24	Table 1; Table 2	Table 1
AIME25 Pass@1 (Qwen3-4B)	67.1 (CLEANER-4B)	59.4 (DAPO-baseline)	+7.7	AIME25	Table 1; Table 2	Table 1

What To Try In 7 Days

Implement a simple rollback: when a tool error is followed by a successful self-correction, replace the failed step in logged trajectories.

Use a string-similarity heuristic (difflib.SequenceMatcher) with γ≈0.5 to choose shallow vs deep replacement.

Set retry limit K=3 for correction attempts to balance recovery and cost.
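The three steps above could be prototyped roughly as follows. This is a sketch under stated assumptions: high similarity is mapped to shallow replacement (keep the original reasoning, graft the fixed code) and low similarity to deep replacement (take the corrected reasoning as well), with `>=` as the comparison. The function names and tuple layout are hypothetical.

```python
from difflib import SequenceMatcher

GAMMA = 0.5  # similarity threshold suggested in the text
K = 3        # retry limit suggested in the text

def code_similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()

def choose_replacement(failed_reasoning, failed_code,
                       fixed_reasoning, fixed_code, gamma=GAMMA):
    """Return (mode, reasoning, code) for the purified step."""
    if code_similarity(failed_code, fixed_code) >= gamma:
        # Assumed mapping: a small edit means the original reasoning is still valid.
        return ("shallow", failed_reasoning, fixed_code)
    # Assumed mapping: a large rewrite means the reasoning is replaced too.
    return ("deep", fixed_reasoning, fixed_code)

def first_fix(attempts, k=K):
    """Return the code of the first successful attempt among the first k, else None."""
    for code, ok in attempts[:k]:
        if ok:
            return code
    return None

mode, reasoning, code = choose_replacement(
    "compute the mean", "print(sum(xs)/len(x))",
    "fix the variable name", "print(sum(xs)/len(xs))",
)
print(mode, "|", reasoning, "|", code)
```

Treat γ and K as starting points to tune on your own rollouts rather than fixed constants.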

Agent Features

Memory

short-term trajectory history

Planning

tool planning; in-context self-correction (lookahead)

Tool Use

Python code execution; external execution environment (code judge)

Frameworks

GRPO, DAPO, SGLang, VeRL, RadixAttention

Is Agentic

Yes

Architectures

single LLM agent + tool (Python interpreter)

Optimization Features

Token Efficiency

reduces wasted tokens in failed tool calls; reallocates tokens to reasoning

Infra Optimization

PyTorch FSDP training; FP16 rollouts; 4× H100/H200 GPUs

System Optimization

logit recomputation using RadixAttention to reuse KV cache

Training Optimization

trajectory purification via rollback; curriculum mixing (apply SAAR stochastically, e.g., 70%); avoid supersampling to reduce rollout compute
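Stochastic application of purification could look like the sketch below. It assumes the 70% figure is a per-trajectory probability, which the summary does not state explicitly; `maybe_purify` and `purify_fn` are hypothetical names.

```python
import random

def maybe_purify(trajectory, purify_fn, p=0.7, rng=random):
    """Apply purify_fn with probability p; otherwise return the raw trajectory."""
    if rng.random() < p:
        return purify_fn(trajectory)
    return trajectory

# Usage with a stand-in purifier that drops steps marked as failed.
rng = random.Random(0)
raw = [("bad code", False), ("good code", True)]
drop_failed = lambda traj: [step for step in traj if step[1]]
batch = [maybe_purify(raw, drop_failed, p=0.7, rng=rng) for _ in range(10_000)]
purified_fraction = sum(len(t) == 1 for t in batch) / len(batch)
print(f"purified fraction: {purified_fraction:.2f}")
```

Leaving some raw trajectories in the mix keeps the policy exposed to real error feedback while most updates use clean data.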

Inference Optimization

optional SAAR at test time for robustness (+8.8% latency)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

GitHub (authors report code and models available)

Data URLs

GitHub (authors report processed datasets and env configs available)

Risks & Boundaries

Limitations

Relies on the model producing successful self-corrections; if the agent rarely recovers, purification yields little benefit.

Similarity measured by difflib.SequenceMatcher is surface-level and may miss semantic code differences.
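The surface-level limitation is easy to reproduce. In the sketch below, the snippets are made-up examples: a one-character operator change alters the result but scores as highly similar, while two ways of computing the same mean score as much less similar.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# Semantically different, textually close: one operator changed.
operator_swap = ratio("result = a + b", "result = a - b")

# Semantically equivalent, textually distant: two ways to compute a mean.
same_meaning = ratio("sum(xs) / len(xs)", "statistics.mean(xs)")

print(f"operator swap: {operator_swap:.2f}")
print(f"same meaning:  {same_meaning:.2f}")
```

With a threshold near 0.5, the first pair would be treated as a small edit even though the behavior changes, so it is worth spot-checking threshold decisions on real rollouts.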

When Not To Use

When agents cannot produce corrective attempts within K retries.

If you can already afford heavy supersampling or large-scale ensemble filtering.

Failure Modes

Incorrect similarity threshold can either preserve bad reasoning or over-delete useful context.

Using negative-sample penalties (online-DPO) destabilized training in development and can collapse optimization.

Core Entities

Models

Qwen3-4B-Instruct-2507, Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct

Metrics

Pass@1, Pass@16, Average tool failures per trajectory, Latency (min / % increase when SAAR enabled)

Datasets

DAPO-Math, Skywork-or1, MegaScienceSFT

Benchmarks

AIME24, AIME25, GPQA, LiveCodeBench
