Overview
CLEANER offers a practical, low-cost way to remove execution noise from RL rollouts. The reported evidence shows gains on multiple benchmarks and faster convergence, but the method depends on the model producing self-corrections during rollouts.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
CLEANER reduces rollout noise so small, cheaper LLMs learn tool use faster. That lowers compute cost and shortens training cycles while keeping competitive performance.
Summary TLDR
Small LLM agents that call tools (e.g., a Python interpreter) produce many execution failures during RL exploration. CLEANER fixes this by building "self-purified" trajectories: when the model later self-corrects, CLEANER rolls the history back and replaces the failed step with the corrected code. The Similarity-Aware Adaptive Rollback (SAAR) decides whether to graft the fix onto the original reasoning or replace the reasoning based on code similarity. Results: on evaluated benchmarks CLEANER improves accuracy (AIME avg ≈ +6% Pass@1, GPQA +3%, LiveCodeBench +5%), suppresses tool errors, and matches SOTA while using about one-third of RL steps. Implementation uses GRPO group updates, SGLang/R
Problem Statement
Parameter-constrained LLM agents (4B–7B) generate many failed tool executions during RL exploration. Under sparse, outcome-only rewards, a trajectory that eventually succeeds still reinforces its earlier failed steps, which pollutes learning. Supersampling to filter rollouts is too costly, and dense intermediate rewards invite reward hacking. The goal is a low-cost way to remove execution noise from trajectories so the policy learns correct reasoning.
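To make the credit-assignment problem concrete, here is a minimal sketch of a GRPO-style group-normalized advantage under an outcome-only reward. The rollout group, the per-step tool outcomes, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalization (simplified): (r - mean) / std over one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 rollouts for one prompt; 1.0 = final answer correct.
rewards = [1.0, 0.0, 0.0, 1.0]

# Hypothetical per-step tool outcomes (True = tool call executed cleanly).
# Rollout 0 fails once, self-corrects, and still reaches the right answer.
step_ok = [
    [False, True, True],
    [False, False],
    [True, False],
    [True, True],
]

advantages = group_advantages(rewards)

# With an outcome-only reward, every step inherits its trajectory's advantage,
# so failed tool calls inside a successful rollout are reinforced too.
credited_failures = [
    (i, j)
    for i, steps in enumerate(step_ok)
    for j, ok in enumerate(steps)
    if not ok and advantages[i] > 0
]
print(credited_failures)
```

In this toy group, the failed first step of rollout 0 receives the same positive advantage as the steps that actually produced the answer, which is the pollution that purification is meant to remove.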
Main Contribution
CLEANER: a data-level trajectory purification method that replaces failed tool calls with later self-corrections before optimization.
SAAR (Similarity-Aware Adaptive Rollback): an adaptive rollback that chooses shallow or deep replacement based on code similarity.
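The two contributions above can be sketched as a single pass over a logged trajectory. This is a simplified illustration, not the authors' implementation: the `Step` record, the `purify` function, and the `use_shallow` predicate are hypothetical names, and the sketch assumes that the first successful step after a failure is the self-correction.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass(frozen=True)
class Step:
    reasoning: str  # model text preceding the tool call
    code: str       # code sent to the tool
    ok: bool        # True if the tool executed without error

def purify(traj: List[Step], use_shallow: Callable[[str, str], bool]) -> List[Step]:
    """Replace each run of failed steps with the self-correction that follows it."""
    out: List[Step] = []
    i = 0
    while i < len(traj):
        step = traj[i]
        if step.ok:
            out.append(step)
            i += 1
            continue
        # Look ahead for the first successful step after this failure.
        j = i + 1
        while j < len(traj) and not traj[j].ok:
            j += 1
        if j == len(traj):
            # No recovery found: nothing to graft, keep the rest unchanged.
            out.extend(traj[i:])
            break
        fix = traj[j]
        if use_shallow(step.code, fix.code):
            # Shallow: keep the original reasoning, graft the corrected code.
            out.append(replace(step, code=fix.code, ok=True))
        else:
            # Deep: replace both reasoning and code with the corrected step.
            out.append(fix)
        i = j + 1
    return out

traj = [
    Step("Compute the ratio.", "print(1/0)", False),
    Step("Division by zero; use 1/1.", "print(1/1)", True),
    Step("Report the result.", "print('done')", True),
]
purified = purify(traj, use_shallow=lambda failed, fixed: True)
print([s.code for s in purified])
```

Passing the shallow/deep decision in as a predicate keeps the rollback logic separate from the similarity measure, so a different similarity function can be swapped in without touching the trajectory code.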
Key Findings
Purified trajectories raise AIME accuracy for the 4B model
CLEANER reduces tool errors on AIME25 and LiveCodeBench and improves final scores
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| AIME24 Pass@1 (Qwen3-4B) | 72.7 (CLEANER-4B) | 66.7 (DAPO-baseline) | +6.0 | AIME24 | Table 1; Table 2 | Table 1 |
| AIME25 Pass@1 (Qwen3-4B) | 67.1 (CLEANER-4B) | 59.4 (DAPO-baseline) | +7.7 | AIME25 | Table 1; Table 2 | Table 1 |
What To Try In 7 Days
Implement a simple rollback: when a tool error is followed by a successful self-correction, replace the failed step in logged trajectories.
Use a string-similarity heuristic (difflib.SequenceMatcher) with γ≈0.5 to choose shallow vs deep replacement.
Set retry limit K=3 for correction attempts to balance recovery and cost.
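A minimal sketch of the similarity heuristic and the retry limit from the steps above might look like this. The function names are hypothetical. The direction of the rule (similar code keeps the original reasoning, dissimilar code replaces it) and the use of `>=` at the threshold are assumptions, since the summary states only that the choice depends on code similarity.

```python
from difflib import SequenceMatcher

GAMMA = 0.5  # similarity threshold suggested in the text
K = 3        # retry limit suggested in the text

def rollback_mode(failed_code: str, fixed_code: str, gamma: float = GAMMA) -> str:
    """Pick 'shallow' (graft the fix) or 'deep' (replace reasoning) by text similarity."""
    ratio = SequenceMatcher(None, failed_code, fixed_code).ratio()
    # Assumption: a small textual edit means the original reasoning still applies.
    return "shallow" if ratio >= gamma else "deep"

def first_fix_within_k(retry_outcomes, k: int = K):
    """Return the 0-based index of the first successful retry within k attempts, else None."""
    for idx, ok in enumerate(retry_outcomes[:k]):
        if ok:
            return idx
    return None

# A small edit to the same call: most characters are shared.
print(rollback_mode("print(int('3.5'))", "print(float('3.5'))"))
# A rewrite with a different approach: little text in common.
print(rollback_mode("x = 1/0", "import math\nprint(math.sqrt(16))"))
# A recovery on the third retry is used; one on the fourth is ignored.
print(first_fix_within_k([False, False, True]))
print(first_fix_within_k([False, False, False, True]))
```

Logging the ratio alongside each decision makes it easier to tune γ later on your own rollouts.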
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Relies on the model producing successful self-corrections; if the agent rarely recovers, purification yields little benefit.
Similarity measured by difflib.SequenceMatcher is surface-level and may miss semantic code differences.
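The second limitation can be illustrated with two hand-picked pairs of snippets. The snippets are illustrative assumptions, not examples from the paper: the first pair differs by one character but computes different things, while the second pair is equivalent for non-negative integers `n` yet shares little text.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] as reported by difflib."""
    return SequenceMatcher(None, a, b).ratio()

# One character apart, but addition versus subtraction.
different_semantics = ratio("result = a + b", "result = a - b")

# Same value for non-negative integer n, but very different text.
same_semantics = ratio("sum(range(1, n + 1))", "n * (n + 1) // 2")

print(f"different semantics: {different_semantics:.2f}")
print(f"same semantics:      {same_semantics:.2f}")
```

The semantically different pair scores higher than the equivalent pair, so a fixed threshold on this ratio can route a real change in logic to the shallow path and a harmless refactor to the deep path.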
When Not To Use
When agents cannot produce corrective attempts within K retries.
If you can already afford heavy supersampling or large-scale ensemble filtering.
Failure Modes
An incorrect similarity threshold can either preserve bad reasoning or over-delete useful context.
Negative-sample penalties (online DPO) destabilized training during development and can collapse optimization.

