Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
CLEANER reduces rollout noise so small, cheaper LLMs learn tool use faster. That lowers compute cost and shortens training cycles while keeping competitive performance.
Summary TLDR
Small LLM agents that call tools (e.g., a Python interpreter) produce many execution failures during RL exploration. CLEANER fixes this by building "self-purified" trajectories: when the model later self-corrects, CLEANER rolls the history back and replaces the failed step with the corrected code. The Similarity-Aware Adaptive Rollback (SAAR) decides whether to graft the fix onto the original reasoning or replace the reasoning based on code similarity. Results: on evaluated benchmarks CLEANER improves accuracy (AIME avg ≈ +6% Pass@1, GPQA +3%, LiveCodeBench +5%), suppresses tool errors, and matches SOTA while using about one-third of RL steps. Implementation uses GRPO group updates, SGLang/R
Problem Statement
Parameter-constrained LLM agents (4B–7B) generate many failed tool executions during RL exploration. Under sparse outcome-only rewards, a trajectory that eventually succeeds still reinforces its earlier failed steps, which pollutes learning. Supersampling to filter rollouts is too costly, and dense intermediate rewards invite reward hacking. What is needed is a low-cost way to remove execution noise from trajectories so the policy learns correct reasoning.
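The credit-assignment problem described above can be illustrated with a small sketch. It assumes the commonly used GRPO-style normalization (reward minus group mean, divided by group standard deviation) and applies one advantage to every step of a trajectory. The rollouts, step labels, and rewards are made up for illustration, and the exact normalization used in the paper may differ.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 rollouts for one prompt; reward 1.0 = final answer correct.
rollouts = [
    {"steps": ["reason", "tool_call:FAIL", "tool_call:OK", "answer"], "reward": 1.0},
    {"steps": ["reason", "tool_call:OK", "answer"], "reward": 1.0},
    {"steps": ["reason", "tool_call:FAIL", "answer"], "reward": 0.0},
    {"steps": ["reason", "answer"], "reward": 0.0},
]

advantages = group_advantages([r["reward"] for r in rollouts])
for rollout, adv in zip(rollouts, advantages):
    # Outcome-only reward: every step, including a failed tool call,
    # inherits the same trajectory-level advantage.
    rollout["step_advantages"] = [adv] * len(rollout["steps"])
```

In the first rollout the failed tool call receives the same positive advantage as the successful one, which is the pollution that purification is meant to remove.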
Main Contribution
CLEANER: a data-level trajectory purification method that replaces failed tool calls with later self-corrections before optimization.
SAAR (Similarity-Aware Adaptive Rollback): an adaptive rollback that chooses shallow or deep replacement based on code similarity.
Empirical demonstration that purified trajectories reduce tool errors, improve accuracy on AIME, GPQA, and LiveCodeBench, and match SOTA using about one-third of the RL steps.
Practical implementation details and a reproducible pipeline (SGLang, RadixAttention, GRPO/DAPO).
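A minimal sketch of the data-level purification idea, under simplifying assumptions: a trajectory is a flat list of steps, and any successful step that directly follows one or more failed steps is treated as their correction. The `Step` type and `purify` function are hypothetical names for illustration, not the authors' implementation, and the sketch ignores the shallow/deep distinction that SAAR adds.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    reasoning: str
    code: str
    ok: bool  # True if the tool call executed without error

def purify(trajectory: List[Step]) -> List[Step]:
    """Drop failed steps that are later superseded by a successful step."""
    cleaned: List[Step] = []
    pending_failures: List[Step] = []
    for step in trajectory:
        if step.ok:
            # Assumption: this success is the self-correction of the pending failures.
            pending_failures.clear()
            cleaned.append(step)
        else:
            pending_failures.append(step)
    # Failures that were never corrected stay in the trajectory unchanged.
    cleaned.extend(pending_failures)
    return cleaned

trajectory = [
    Step("compute the sum", "print(sum(xs)", ok=False),
    Step("fix the syntax error", "print(sum(xs))", ok=True),
    Step("divide by the count", "print(total / 0)", ok=False),
]
purified = purify(trajectory)
```

Here the first failed call is removed because a correction follows it, while the trailing uncorrected failure is kept.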
Key Findings
Purified trajectories raise AIME accuracy for the 4B model
CLEANER reduces AIME25 and LiveCodeBench errors and improves final scores
Training efficiency: matches SOTA with far fewer RL steps
SAAR stabilizes and recovers suboptimal policies
SAAR adds small inference overhead when enabled at test time
Results
AIME24 Pass@1 (Qwen3-4B)
AIME25 Pass@1 (Qwen3-4B)
Accuracy
LiveCodeBench whole (Qwen3-4B)
Training steps efficiency
Who Should Care
What To Try In 7 Days
Implement a simple rollback: when a tool error is followed by a successful self-correction, replace the failed step in logged trajectories.
Use a string-similarity heuristic (difflib.SequenceMatcher) with γ≈0.5 to choose shallow vs deep replacement.
Set retry limit K=3 for correction attempts to balance recovery and cost.
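The last two steps can be sketched as follows. The threshold γ≈0.5 and the retry limit K=3 come from the list above; everything else is an assumption for illustration: the function names, the reading that high similarity means a shallow replacement (keep the original reasoning, swap the code) and low similarity a deep one (replace the reasoning too), the `>=` comparison, and the `propose_fix` and `run_tool` callbacks.

```python
from difflib import SequenceMatcher

GAMMA = 0.5  # similarity threshold suggested in the text
K = 3        # retry limit suggested in the text

def choose_replacement(failed_code: str, fixed_code: str, gamma: float = GAMMA) -> str:
    """Return 'shallow' for similar code, 'deep' otherwise (assumed mapping)."""
    similarity = SequenceMatcher(None, failed_code, fixed_code).ratio()
    return "shallow" if similarity >= gamma else "deep"

def correct_with_retries(failed_code, propose_fix, run_tool, k: int = K):
    """Try up to k corrections; return the first one that executes, else None."""
    code = failed_code
    for _ in range(k):
        code = propose_fix(code)   # hypothetical: ask the model for a corrected call
        if run_tool(code):         # hypothetical: True when execution succeeds
            return code
    return None

# Toy stand-ins for the model and the interpreter: the "fix" appends a
# closing parenthesis, and "execution" succeeds once parentheses balance.
append_paren = lambda code: code + ")"
balanced = lambda code: code.count("(") == code.count(")")

fixed = correct_with_retries("f(g(x", append_paren, balanced)
gave_up = correct_with_retries("f(g(h(i(x", append_paren, balanced)
mode_small_edit = choose_replacement("print(x", "print(x)")
mode_rewrite = choose_replacement("print(x", "import math\nresult = math.sqrt(16)")
```

A one-character fix is classified as shallow, a full rewrite as deep, and a call that needs more than K corrections is left unpurified.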
Agent Features
Memory
- short-term trajectory history
Planning
- tool planning
- in-context self-correction (lookahead)
Tool Use
- Python code execution
- external execution environment (code judge)
Frameworks
- GRPO
- DAPO
- SGLang
- VeRL
- RadixAttention
Is Agentic
true
Architectures
- single LLM agent + tool (Python interpreter)
Optimization Features
Token Efficiency
- reduces wasted tokens in failed tool calls; reallocates tokens to reasoning
Infra Optimization
- PyTorch FSDP training; FP16 rollouts; 4× H100/H200 GPUs
System Optimization
- logit recomputation using RadixAttention to reuse KV cache
Training Optimization
- trajectory purification via rollback
- curriculum mixing (apply SAAR stochastically, e.g., 70%)
- avoid supersampling to reduce rollout compute
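The curriculum-mixing item can be sketched as a per-trajectory coin flip. This assumes the 70% rate is applied independently to each trajectory; the source does not say whether mixing happens per trajectory or per batch. `maybe_purify` and `purify_fn` are hypothetical names.

```python
import random

def maybe_purify(trajectory, purify_fn, p=0.7, rng=random):
    """Apply purify_fn with probability p; otherwise keep the raw trajectory."""
    if rng.random() < p:
        return purify_fn(trajectory)
    return trajectory

# Toy check of the mixing rate with a seeded generator and a marker "purifier".
rng = random.Random(0)
results = [maybe_purify("raw", lambda t: "purified", p=0.7, rng=rng) for _ in range(10_000)]
purified_fraction = results.count("purified") / len(results)
```

Keeping some raw trajectories in the mix means the policy still sees unpurified histories during training.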
Inference Optimization
- optional SAAR at test time for robustness (+8.8% latency)
Reproducibility
Code Urls
- GitHub (authors report code and models available)
Data Urls
- GitHub (authors report processed datasets and env configs available)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on the model producing successful self-corrections; if the agent rarely recovers, purification yields little benefit.
- Similarity measured by difflib.SequenceMatcher is surface-level and may miss semantic code differences.
- Method focuses on Python interpreter actions; transfer to other tool types needs validation.
- SAAR introduces extra rollout logic and small latency when enabled at inference.
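The surface-level nature of the similarity measure is easy to see in a small experiment. The snippets below are made up for illustration: one pair differs by a single character but computes something different, and the other pair computes the same value (for non-negative integer n) with very different text.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# Nearly identical text, different behavior.
near_identical = similarity("total = price + tax", "total = price - tax")

# Same result, very different text.
equivalent = similarity("sum(range(1, n + 1))", "n * (n + 1) // 2")

print(f"near-identical text: {near_identical:.2f}")
print(f"equivalent semantics: {equivalent:.2f}")
```

The semantically different pair scores higher than the semantically equivalent pair, so a fixed threshold on this ratio can misjudge how much of the original reasoning still applies.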
When Not To Use
- When agents cannot produce corrective attempts within K retries.
- If you can already afford heavy supersampling or large-scale ensemble filtering.
- When tool actions are non-code or involve opaque external APIs that can’t be meaningfully compared by text similarity.
Failure Modes
- Incorrect similarity threshold can either preserve bad reasoning or over-delete useful context.
- Using negative-sample penalties (online-DPO) destabilized training in development and can collapse optimization.
- Post-hoc insertion of SAAR recovers some stability but may not match the performance of models trained with SAAR from scratch.
Core Entities
Models
- Qwen3-4B-Instruct-2507
- Qwen2.5-7B-Instruct
- Qwen2.5-72B-Instruct
Metrics
- Pass@1
- Pass@16
- Average tool failures per trajectory
- Latency (min / % increase when SAAR enabled)
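For readers reproducing the Pass@1 and Pass@16 numbers, one widely used way to compute Pass@k from n samples with c correct is the unbiased estimator 1 - C(n-c, k) / C(n, k). The source does not state which estimator the authors used, so the sketch below is an assumption, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples for one problem, 4 of them correct.
p1 = pass_at_k(n=16, c=4, k=1)    # equals the fraction of correct samples
p16 = pass_at_k(n=16, c=4, k=16)  # at least one correct sample exists
```

Benchmark-level scores are then the mean of the per-problem values.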
Datasets
- DAPO-Math
- Skywork-or1
- MegaScience
- SFT
Benchmarks
- AIME24
- AIME25
- GPQA
- LiveCodeBench

