CLEANER: replace failed in-rollout code with model self-corrections to purify trajectories and speed agentic RL

January 21, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

0

Authors

Tianshi Xu, Yuteng Chen, Meng Li

Links

Abstract / PDF

Why It Matters For Business

CLEANER reduces rollout noise so small, cheaper LLMs learn tool use faster. That lowers compute cost and shortens training cycles while keeping competitive performance.

Summary TLDR

Small LLM agents that call tools (e.g., a Python interpreter) produce many execution failures during RL exploration. CLEANER fixes this by building "self-purified" trajectories: when the model later self-corrects, CLEANER rolls the history back and replaces the failed step with the corrected code. The Similarity-Aware Adaptive Rollback (SAAR) decides whether to graft the fix onto the original reasoning or replace the reasoning based on code similarity. Results: on evaluated benchmarks CLEANER improves accuracy (AIME avg ≈ +6% Pass@1, GPQA +3%, LiveCodeBench +5%), suppresses tool errors, and matches SOTA while using about one-third of RL steps. Implementation uses GRPO group updates, SGLang/R

Problem Statement

Parameter-constrained LLM agents (4B–7B) generate many failed tool executions during RL exploration. Under sparse, outcome-only rewards, a trajectory that eventually succeeds still reinforces its earlier failed steps, which pollutes learning. Supersampling to filter rollouts is too costly, and dense intermediate rewards invite reward hacking. What is needed is a low-cost way to remove execution noise from trajectories so the policy learns correct reasoning.
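To make the credit-assignment problem concrete, here is a minimal sketch of a group-normalized, outcome-only advantage in the style of GRPO. The rewards, step labels, and the `group_advantages` helper are hypothetical illustrations, not the authors' code; the point is only that every step of a successful rollout, including a failed tool call, inherits the same positive advantage.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - mean) / (std + eps), one value per rollout."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts for one prompt; 1.0 means the final answer was correct.
rewards = [1.0, 0.0, 1.0, 0.0]
trajectories = [
    ["reason", "tool_fail", "tool_ok", "answer"],  # succeeded after a self-correction
    ["reason", "tool_fail", "answer"],
    ["reason", "tool_ok", "answer"],
    ["reason", "tool_fail", "tool_fail", "answer"],
]

advantages = group_advantages(rewards)

# With an outcome-only reward, every step in a rollout shares that rollout's advantage.
step_credit = [
    [(step, adv) for step in traj]
    for traj, adv in zip(trajectories, advantages)
]

# The failed tool call in the first (successful) rollout receives positive credit.
failed_step, failed_credit = step_credit[0][1]
```

Here `failed_step` is `"tool_fail"` and `failed_credit` is positive, which is the kind of noise that purification aims to remove before the update.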

Main Contribution

CLEANER: a data-level trajectory purification method that replaces failed tool calls with later self-corrections before optimization.

SAAR (Similarity-Aware Adaptive Rollback): an adaptive rollback that chooses shallow or deep replacement based on code similarity.

Empirical demonstration that purified trajectories reduce tool errors, improve accuracy across AIME, GPQA, LiveCodeBench, and match SOTA using one-third of RL steps.

Practical implementation details and reproducible pipeline (SGLang, RadixAttention, GRPO/DAPO) provided.

Key Findings

Purified trajectories raise AIME accuracy for 4B model

Numbers: AIME24 Pass@1: 66.7 -> 72.7 (+6.0)

CLEANER reduces AIME25 and LiveCodeBench errors and improves final scores

Numbers: AIME25 Pass@1: 59.4 -> 67.1 (+7.7); LiveCodeBench whole: 49.5 -> 54.9 (+5.4)

Training efficiency: matches SOTA with far fewer RL steps

Numbers: Matches DemyAgent SOTA while using ~1/3 of the training steps

SAAR stabilizes and recovers suboptimal policies

Numbers: Post-introduction recovery: AIME24 +5.2%, AIME25 +1.0%

SAAR adds small inference overhead when enabled at test time

Numbers: Average latency increase ≈ 8.8%

Results

AIME24 Pass@1 (Qwen3-4B)

Value: 72.7 (CLEANER-4B)

Baseline: 66.7 (DAPO-baseline)

AIME25 Pass@1 (Qwen3-4B)

Value: 67.1 (CLEANER-4B)

Baseline: 59.4 (DAPO-baseline)

GPQA Accuracy

Value: 60.2 (CLEANER-4B)

Baseline: 56.9 (DAPO-baseline)

LiveCodeBench whole (Qwen3-4B)

Value: 54.9 (CLEANER-4B)

Baseline: 49.5 (DAPO-baseline)

Training steps efficiency

Value: Matches SOTA while using ~1/3 of the RL steps

Baseline: DemyAgent-4B using 750 steps

Who Should Care

What To Try In 7 Days

Implement a simple rollback: when a tool error is followed by a successful self-correction, replace the failed step in logged trajectories.

Use a string-similarity heuristic (difflib.SequenceMatcher) with γ≈0.5 to choose shallow vs deep replacement.

Set retry limit K=3 for correction attempts to balance recovery and cost.
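The three steps above can be sketched as a small post-processing pass over logged trajectories. This is an illustrative sketch, not the authors' implementation: the `Step` record and `purify` function are hypothetical names, the next successful tool call within K steps is assumed to be the self-correction, and mapping high similarity to shallow replacement (keep the original reasoning, graft the fixed code) is an assumption about the direction of the threshold.

```python
from dataclasses import dataclass, replace
from difflib import SequenceMatcher

GAMMA = 0.5  # similarity threshold suggested in the text
K = 3        # retry limit suggested in the text

@dataclass(frozen=True)
class Step:
    """Hypothetical logged step: reasoning text, tool code, and whether it ran cleanly."""
    reasoning: str
    code: str
    ok: bool

def purify(steps, gamma=GAMMA, k=K):
    """Replace each failed step with its later self-correction, if one exists within k steps."""
    out = []
    i = 0
    while i < len(steps):
        step = steps[i]
        if step.ok:
            out.append(step)
            i += 1
            continue
        # Look ahead at most k steps for the first successful tool call.
        window = range(i + 1, min(i + 1 + k, len(steps)))
        fix_idx = next((j for j in window if steps[j].ok), None)
        if fix_idx is None:
            out.append(step)  # no recovery found: keep the failure as logged
            i += 1
            continue
        fix = steps[fix_idx]
        similarity = SequenceMatcher(None, step.code, fix.code).ratio()
        if similarity >= gamma:
            # Shallow replacement (assumed): keep the original reasoning, graft the fixed code.
            out.append(replace(step, code=fix.code, ok=True))
        else:
            # Deep replacement (assumed): take the corrected reasoning and code together.
            out.append(fix)
        i = fix_idx + 1  # skip the failed attempts in between
    return out
```

A missing parenthesis followed by a near-identical fix would take the shallow branch, while a rewrite that shares almost no text with the failed code would take the deep branch. Failures with no successful call within K steps are left untouched.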

Agent Features

Memory

  • short-term trajectory history

Planning

  • tool planning
  • in-context self-correction (lookahead)

Tool Use

  • Python code execution
  • external execution environment (code judge)

Frameworks

  • GRPO
  • DAPO
  • SGLang
  • VeRL
  • RadixAttention

Is Agentic

true

Architectures

  • single LLM agent + tool (Python interpreter)

Optimization Features

Token Efficiency

  • reduces wasted tokens in failed tool calls; reallocates tokens to reasoning

Infra Optimization

  • PyTorch FSDP training; FP16 rollouts; 4× H100/H200 GPUs

System Optimization

  • logit recomputation using RadixAttention to reuse KV cache
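As a rough illustration of why prefix caching helps here: after purification, the edited trajectory shares its tokens with the original rollout up to the rollback point, so only the suffix needs a fresh forward pass when log-probabilities are recomputed. The sketch below uses hypothetical token IDs and a hypothetical `shared_prefix_len` helper; it does not call SGLang and only counts the tokens that a prefix cache could, in principle, reuse.

```python
def shared_prefix_len(original, purified):
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(original, purified):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs: the two sequences diverge where the failed tool call was replaced.
original_tokens = [101, 7, 8, 9, 55, 56, 57, 90, 91]
purified_tokens = [101, 7, 8, 9, 70, 71, 90, 91]

reusable = shared_prefix_len(original_tokens, purified_tokens)
to_recompute = len(purified_tokens) - reusable
```

In this toy case `reusable` is 4 and `to_recompute` is 4, so half of the purified sequence would need new computation under the stated assumptions.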

Training Optimization

  • trajectory purification via rollback
  • curriculum mixing (apply SAAR stochastically, e.g., 70%)
  • avoid supersampling to reduce rollout compute
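The stochastic mixing in the list above could look like the following sketch. Whether the choice is made per trajectory or per batch is not stated in the source; this version assumes a per-trajectory coin flip, and `choose_trajectory` and `build_batch` are hypothetical names.

```python
import random

P_PURIFY = 0.7  # example mixing rate quoted in the text

def choose_trajectory(raw, purified, p_purify=P_PURIFY, rng=random):
    """Return the purified trajectory with probability p_purify, else the raw one."""
    return purified if rng.random() < p_purify else raw

def build_batch(pairs, p_purify=P_PURIFY, seed=None):
    """Build a training batch from (raw, purified) pairs using a seeded generator."""
    rng = random.Random(seed)
    return [choose_trajectory(raw, pur, p_purify, rng) for raw, pur in pairs]
```

Passing a fixed `seed` keeps the mix reproducible across runs, which makes it easier to compare different mixing rates.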

Inference Optimization

  • optional SAAR at test time for robustness (+8.8% latency)

Reproducibility

Code Urls

  • GitHub (authors report code and models available)

Data Urls

  • GitHub (authors report processed datasets and env configs available)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on the model producing successful self-corrections; if the agent rarely recovers, purification yields little benefit.
  • Similarity measured by difflib.SequenceMatcher is surface-level and may miss semantic code differences.
  • Method focuses on Python interpreter actions; transfer to other tool types needs validation.
  • SAAR introduces extra rollout logic and small latency when enabled at inference.
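The surface-level similarity limitation is easy to reproduce. In the sketch below (the snippets are made-up examples), a one-character change that alters the semantics scores far above a γ≈0.5 threshold, while an equivalent computation written differently scores lower than the buggy near-duplicate.

```python
from difflib import SequenceMatcher

def code_similarity(a, b):
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()

# Semantically different, textually almost identical.
divide = "result = total / count"
multiply = "result = total * count"
near_duplicate = code_similarity(divide, multiply)

# Semantically equivalent (both compute 1 + 2 + ... + 10), textually different.
loop_sum = "sum(range(1, 11))"
closed_form = "n = 10\nprint(n * (n + 1) // 2)"
equivalent = code_similarity(loop_sum, closed_form)
```

Because `near_duplicate` is above 0.9, a threshold-based rule would treat the two operations as the same approach even though they compute different things.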

When Not To Use

  • When agents cannot produce corrective attempts within K retries.
  • If you can already afford heavy supersampling or large-scale ensemble filtering.
  • When tool actions are non-code or involve opaque external APIs that can’t be meaningfully compared by text similarity.

Failure Modes

  • Incorrect similarity threshold can either preserve bad reasoning or over-delete useful context.
  • Using negative-sample penalties (online-DPO) destabilized training in development and can collapse optimization.
  • Post-hoc insertion of SAAR recovers some stability but may not match models trained with SAAR from scratch.

Core Entities

Models

  • Qwen3-4B-Instruct-2507
  • Qwen2.5-7B-Instruct
  • Qwen2.5-72B-Instruct

Metrics

  • Pass@1
  • Pass@16
  • Average tool failures per trajectory
  • Latency (min / % increase when SAAR enabled)
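For readers who want to compute these metrics on their own logs, here is a minimal sketch. It uses the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k); whether the authors use this exact estimator is an assumption, and the per-trajectory list of success flags is a hypothetical log format.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_tool_failures(trajectories):
    """Average number of failed tool calls per trajectory.

    Each trajectory is a list of booleans, True meaning the tool call succeeded.
    """
    failures = sum(sum(1 for ok in traj if not ok) for traj in trajectories)
    return failures / len(trajectories)
```

With 16 samples per problem, `pass_at_k(16, c, 1)` reduces to the fraction of correct samples, and `pass_at_k(16, c, 16)` is 1.0 whenever at least one sample is correct.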

Datasets

  • DAPO-Math
  • Skywork-or1
  • MegaScience
  • SFT

Benchmarks

  • AIME24
  • AIME25
  • GPQA
  • LiveCodeBench