CLEANER: replace failed in-rollout code with model self-corrections to purify trajectories and speed agentic RL

January 21, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

CLEANER offers a practical, low-cost way to remove execution-noise from RL rollouts; evidence shows multi-benchmark gains and faster convergence, but success relies on the model producing self-corrections during rollouts.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Tianshi Xu, Yuteng Chen, Meng Li


Why It Matters For Business

CLEANER reduces rollout noise so small, cheaper LLMs learn tool use faster. That lowers compute cost and shortens training cycles while keeping competitive performance.

Summary TLDR

Small LLM agents that call tools (e.g., a Python interpreter) produce many execution failures during RL exploration. CLEANER addresses this by building "self-purified" trajectories: when the model later corrects its own failed call, CLEANER rolls the history back and replaces the failed step with the corrected code. Similarity-Aware Adaptive Rollback (SAAR) uses code similarity to decide whether to graft the fix onto the original reasoning or to replace the reasoning as well. Results: on the evaluated benchmarks, CLEANER improves accuracy (AIME avg ≈ +6% Pass@1, GPQA +3%, LiveCodeBench +5%), suppresses tool errors, and matches SOTA while using about one-third of the RL steps. Implementation uses GRPO group updates, SGLang/R

Problem Statement

Parameter-constrained LLM agents (4B–7B) generate many failed tool executions during RL exploration. Under sparse, outcome-only rewards, a trajectory that eventually succeeds is reinforced as a whole, including its earlier failed steps, which pollutes learning. Supersampling to filter rollouts is too costly, and dense intermediate rewards invite reward hacking. What is needed is a low-cost way to remove execution noise from trajectories so that the policy learns correct reasoning.
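To make the credit-assignment problem concrete, here is a minimal sketch assuming the commonly described GRPO-style normalization (reward minus group mean, divided by group standard deviation). The rollout contents, step labels, and function names are hypothetical and only illustrate how a failed tool call inside a successful trajectory inherits positive credit.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 rollouts with outcome-only rewards (1.0 = correct final answer).
rollouts = [
    {"steps": ["reason", "code:FAILED", "reason", "code:ok", "answer"], "reward": 1.0},
    {"steps": ["reason", "code:ok", "answer"], "reward": 1.0},
    {"steps": ["reason", "code:FAILED", "answer"], "reward": 0.0},
    {"steps": ["reason", "code:FAILED", "answer"], "reward": 0.0},
]

advs = group_advantages([r["reward"] for r in rollouts])

def per_step_credit(rollout, advantage):
    # With outcome-only rewards, every step shares the trajectory-level advantage.
    return [(step, advantage) for step in rollout["steps"]]

credit = per_step_credit(rollouts[0], advs[0])
failed_credit = [a for step, a in credit if step == "code:FAILED"]
print(failed_credit)  # the failed call receives the same positive advantage as the fix
```

In this toy group, the failed call in the first rollout is pushed up exactly as much as the successful call that follows it, which is the pollution the section describes.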

Main Contribution

CLEANER: a data-level trajectory purification method that replaces failed tool calls with later self-corrections before optimization.

SAAR (Similarity-Aware Adaptive Rollback): an adaptive rollback that chooses shallow or deep replacement based on code similarity.
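The purification idea can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it assumes a trajectory is a flat list of steps, treats the next successful tool call after a failure as the self-correction, and covers only the shallow case in which the reasoning before the failure is kept. `Step` and `purify` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    kind: str                  # "reasoning" or "tool_call"
    content: str
    ok: Optional[bool] = None  # execution result; only meaningful for tool calls

def purify(trajectory: List[Step]) -> List[Step]:
    """Replace each failed tool call with the next successful one, if any."""
    cleaned: List[Step] = []
    i = 0
    while i < len(trajectory):
        step = trajectory[i]
        if step.kind == "tool_call" and step.ok is False:
            # Look ahead for the self-correction (assumed: next successful call).
            j = next(
                (k for k in range(i + 1, len(trajectory))
                 if trajectory[k].kind == "tool_call" and trajectory[k].ok),
                None,
            )
            if j is not None:
                cleaned.append(trajectory[j])  # graft the fix where the failure was
                i = j + 1                      # skip the failure and the retry chatter
                continue
        cleaned.append(step)  # unrecovered failures are left untouched
        i += 1
    return cleaned

raw = [
    Step("reasoning", "Plan: compute the sum with Python."),
    Step("tool_call", "print(totl)", ok=False),
    Step("reasoning", "NameError: I misspelled the variable."),
    Step("tool_call", "total = 3\nprint(total)", ok=True),
    Step("reasoning", "The answer is 3."),
]
purified = purify(raw)
```

After purification the trajectory reads as if the model had written the working code on the first attempt; a failure with no later success is kept as-is.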

Key Findings

Purified trajectories raise AIME accuracy for 4B model

Numbers: AIME24 Pass@1: 66.7 -> 72.7 (+6.0)

Practical Use: If you purify rollouts with SAAR, small models (~4B) gain about 6 absolute points of Pass@1 on hard math benchmarks without scaling up the model.

Evidence Ref: Table 1, Table 2

CLEANER reduces errors on AIME25 and LiveCodeBench and improves final scores

Numbers: AIME25 Pass@1: 59.4 -> 67.1 (+7.7); LiveCodeBench whole: 49.5 -> 54.9 (+5.4)

Practical Use: Expect gains of several absolute points on coding and math benchmarks from reducing execution noise in training data.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
AIME24 Pass@1 (Qwen3-4B) | 72.7 (CLEANER-4B) | 66.7 (DAPO-baseline) | +6.0 | AIME24 | Table 1; Table 2 | Table 1
AIME25 Pass@1 (Qwen3-4B) | 67.1 (CLEANER-4B) | 59.4 (DAPO-baseline) | +7.7 | AIME25 | Table 1; Table 2 | Table 1

What To Try In 7 Days

Implement a simple rollback: when a tool error is followed by a successful self-correction, replace the failed step in logged trajectories.

Use a string-similarity heuristic (difflib.SequenceMatcher) with γ≈0.5 to choose shallow vs deep replacement.

Set retry limit K=3 for correction attempts to balance recovery and cost.
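The three steps above can be prototyped in a few lines. This is a hedged sketch: `difflib.SequenceMatcher`, γ≈0.5, and K=3 come from the text, while the function names and the mapping "similar code means shallow replacement, dissimilar code means deep replacement" are assumptions made for illustration.

```python
import difflib

GAMMA = 0.5  # similarity threshold suggested in the text
K = 3        # retry limit suggested in the text

def code_similarity(failed_code: str, fixed_code: str) -> float:
    """Surface-level similarity in [0, 1] between two code strings."""
    return difflib.SequenceMatcher(None, failed_code, fixed_code).ratio()

def choose_rollback(failed_code: str, fixed_code: str, gamma: float = GAMMA) -> str:
    # Assumption: a near-identical fix means the reasoning was sound, so keep it
    # ("shallow"); a very different fix means the reasoning changed ("deep").
    return "shallow" if code_similarity(failed_code, fixed_code) >= gamma else "deep"

def first_correction(attempts, k: int = K):
    """Return the first successful code among the first k retries, else None.

    attempts: list of (code, succeeded) pairs produced after the failed call.
    """
    for code, succeeded in attempts[:k]:
        if succeeded:
            return code
    return None

typo_failed = "import math\nprint(math.sqr(16))"
typo_fixed = "import math\nprint(math.sqrt(16))"
rewrite_failed = "print(undefined_var)"
rewrite_fixed = "from itertools import permutations\nprint(len(list(permutations(range(4)))))"

print(choose_rollback(typo_failed, typo_fixed))
print(choose_rollback(rewrite_failed, rewrite_fixed))
```

A one-character typo fix lands well above the threshold and keeps the original reasoning, while a complete rewrite falls below it. Trajectories with no success within K retries return `None` and would be left unpurified.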

Agent Features

Memory
short-term trajectory history
Planning
tool planning; in-context self-correction (lookahead)
Tool Use
Python code execution; external execution environment (code judge)
Frameworks
GRPO, DAPO, SGLang, VeRL, RadixAttention
Is Agentic

Yes

Architectures
single LLM agent + tool (Python interpreter)

Optimization Features

Token Efficiency
reduces wasted tokens in failed tool calls; reallocates tokens to reasoning
Infra Optimization
PyTorch FSDP training; FP16 rollouts; 4× H100/H200 GPUs
System Optimization
logit recomputation using RadixAttention to reuse KV cache
Training Optimization
trajectory purification via rollback; curriculum mixing (apply SAAR stochastically, e.g., 70%); avoid supersampling to reduce rollout compute
Inference Optimization
optional SAAR at test time for robustness (+8.8% latency)
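The curriculum-mixing item above (applying SAAR stochastically) could look roughly like this. It is a sketch under assumptions: the 70% figure is read as a per-trajectory probability, and `maybe_purify` and `drop_failed` are hypothetical stand-ins rather than functions from the released code.

```python
import random

def maybe_purify(trajectory, purify_fn, p=0.7, rng=random):
    """Apply purify_fn with probability p; otherwise keep the raw trajectory."""
    if rng.random() < p:
        return purify_fn(trajectory), True
    return trajectory, False

# Stand-in purifier for illustration only: drops steps tagged as failed.
def drop_failed(trajectory):
    return [step for step in trajectory if not step.endswith(":FAILED")]

rng = random.Random(0)  # seeded so the experiment is repeatable
raw = ["reason", "code:FAILED", "code:ok", "answer"]
batch = [maybe_purify(raw, drop_failed, p=0.7, rng=rng) for _ in range(10_000)]
purified_fraction = sum(was_purified for _, was_purified in batch) / len(batch)
print(f"purified fraction: {purified_fraction:.3f}")
```

Leaving a share of trajectories unpurified keeps some raw failure-and-recovery examples in the training mix, which is presumably the point of mixing rather than purifying everything.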

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

GitHub (authors report code and models available)

Data URLs

GitHub (authors report processed datasets and env configs available)

Risks & Boundaries

Limitations

Relies on the model producing successful self-corrections; if the agent rarely recovers, purification yields little benefit.

Similarity measured by difflib.SequenceMatcher is surface-level and may miss semantic code differences.
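A small illustration of the surface-level limitation, using made-up snippets: character-based similarity can rate a behavior-changing edit as nearly identical and a behavior-preserving rewrite as dissimilar.

```python
import difflib

def ratio(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

# Nearly identical text, different behavior (addition vs subtraction).
sign_flip = ratio("x = a + b", "x = a - b")

# Same behavior, very different text (explicit loop vs built-in sum).
loop_version = "total = 0\nfor v in values:\n    total += v"
builtin_version = "total = sum(values)"
rewrite = ratio(loop_version, builtin_version)

print(f"sign flip: {sign_flip:.2f}, equivalent rewrite: {rewrite:.2f}")
```

With a threshold around 0.5, the sign flip would be treated as a minor fix and the equivalent rewrite as a change of approach, which is the opposite of what a semantic comparison would conclude.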

When Not To Use

When agents cannot produce corrective attempts within K retries.

If you can already afford heavy supersampling or large-scale ensemble filtering.

Failure Modes

Incorrect similarity threshold can either preserve bad reasoning or over-delete useful context.

Using negative-sample penalties (online-DPO) destabilized training in development and can collapse optimization.

Core Entities

Models

Qwen3-4B-Instruct-2507, Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct

Metrics

Pass@1, Pass@16, Average tool failures per trajectory, Latency (min / % increase when SAAR enabled)
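For readers unfamiliar with Pass@k, one widely used way to estimate it from n samples with c correct is the unbiased estimator 1 - C(n-c, k) / C(n, k). The summary does not state how the paper computed Pass@1 or Pass@16, so the sketch below is an assumption about the metric's usual definition, with hypothetical sample counts.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 16 samples, 4 of them correct.
p1 = pass_at_k(n=16, c=4, k=1)
p16 = pass_at_k(n=16, c=4, k=16)
print(p1, p16)
```

Pass@1 rewards reliability on a single attempt, while Pass@16 only asks whether any of 16 attempts succeeds, so the two can move differently after training.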

Datasets

DAPO-Math, Skywork-or1, MegaScienceSFT

Benchmarks

AIME24, AIME25, GPQA, LiveCodeBench