Overview
The method is practical: cold-start SFT plus an RL phase yields measurable token and accuracy gains on five benchmarks, but it relies on synthetic data and environment-specific RL tuning.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Omitting unnecessary internal reasoning and old tool outputs reduces API token costs and latency while keeping or improving task success, giving a better cost-performance trade-off for production agents.
Summary TLDR
The paper studies which agent turns (thoughts and observations) actually matter, and trains an LLM agent (Agent-Omit) that learns to omit unnecessary internal reasoning or prior tool responses. The authors synthesize cold-start omission data, then apply an omit-aware RL loop (dual sampling + omission reward + KL penalty). On five agent benchmarks, Agent-Omit-8B (RL) keeps or improves Pass@1 accuracy vs strong baselines while lowering average token use. Trained agents omit about 3–4 turns, mostly in the middle of a trajectory. Code and data are provided.
Problem Statement
Multi-turn LLM agents spend most tokens on internal thoughts and stacked observations. Not all turns matter equally. The problem is to learn when to omit thoughts or past observations to reduce tokens while keeping task accuracy across diverse agent environments.
Main Contribution
Turn-level analysis showing that thought and observation tokens dominate agent context and that their utility varies by turn.
Agent-Omit framework: cold-start synthesis of single- and multi-turn omission samples plus omit-aware agentic RL (dual sampling, omission reward, KL penalty).
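To make the idea of an omission sample concrete, here is a minimal sketch of how a trajectory might be serialized with some thoughts blanked and some tool responses replaced by a placeholder. The `Turn` type, the `render_with_omissions` helper, and the line layout are hypothetical and chosen for illustration; the summary mentions empty <think> blocks and an <omit tool response> token but does not specify the exact format the authors use.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical placeholder string; the exact serialized form is an assumption.
OMITTED_OBSERVATION = "<omit tool response>"


@dataclass
class Turn:
    thought: str
    action: str
    observation: str


def render_with_omissions(
    turns: List[Turn],
    omit_thought: Set[int],
    omit_observation: Set[int],
) -> str:
    """Serialize a trajectory, blanking thoughts and replacing observations
    at the given zero-based turn indices."""
    lines = []
    for i, turn in enumerate(turns):
        # An omitted thought becomes an empty <think></think> block.
        thought = "" if i in omit_thought else turn.thought
        # An omitted observation is replaced by the placeholder token.
        observation = (
            OMITTED_OBSERVATION if i in omit_observation else turn.observation
        )
        lines.append(f"<think>{thought}</think>")
        lines.append(f"action: {turn.action}")
        lines.append(f"observation: {observation}")
    return "\n".join(lines)
```

Keeping the omission sets explicit makes it easy to restrict omissions to middle turns, which is where the summary reports that trained agents omit most often.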
Key Findings
Thought and observation tokens dominate agent context.
Selective omission can cut tokens without hurting accuracy on many turns.
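A rough way to check the first finding on your own agent logs is to measure what share of context tokens falls into each category. The sketch below assumes the log can be flattened into (kind, text) pairs and uses whitespace splitting as a stand-in for a real tokenizer; `count_tokens` and `token_shares` are illustrative names, not part of the paper's code.

```python
from typing import Dict, List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting approximates the tokenizer.
    # Swap in the model's own tokenizer for real measurements.
    return len(text.split())


def token_shares(segments: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the percentage of tokens per segment kind
    (e.g. 'thought', 'observation', 'action')."""
    totals: Dict[str, int] = {}
    for kind, text in segments:
        totals[kind] = totals.get(kind, 0) + count_tokens(text)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {kind: 0.0 for kind in totals}
    return {kind: 100.0 * n / grand_total for kind, n in totals.items()}
```

If thoughts and observations account for nearly all tokens, as the WebShop breakdown in the Results table suggests, omission has far more room to save tokens than shortening actions would.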
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Token composition (WebShop, Qwen3-8B) | Thought 45.1%; Observation 52.2%; Actions 2.7% | — | — | WebShop | Fig.2(a) | Sec.3.1 |
| Agent-Omit-8B-RL Pass@1 | WebShop 23.57; TextCraft 87.00; BabyAI 84.36; SciWorld 18.45; DeepSearch 26.56 | Various frontier LLM agents (Table 2) | Improves vs many frontier models on WebShop/TextCraft/BabyAI/SciWorld | Multiple (Table 2) | Table 2 main results | Sec.6.2 |
What To Try In 7 Days
Add a simple omit token/flag to agent outputs and target mid-turn omissions in a dev benchmark.
Synthesize 2–4K cold-start omission examples and fine-tune the agent to accept empty <think> or <omit tool response> tokens.
Implement a lightweight omit-aware reward (saved_tokens ratio with task-correctness gating) and run a few RL rollouts to tune the omission weight.
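The last step can be sketched as a small reward function. This is an assumption-laden illustration, not the authors' implementation: the saved-token ratio is measured against a hypothetical no-omission baseline (`tokens_baseline`), the result is clamped to [0, 1], and the additive combination with `omit_weight` is one plausible way to mix it with the task reward. The only detail taken from the source is the gating rule: the omission reward is zero whenever the task reward is zero.

```python
def omission_reward(
    tokens_used: int, tokens_baseline: int, task_reward: float
) -> float:
    """Saved-token ratio, gated by task correctness."""
    # Gating: no credit for saving tokens on a failed task.
    if task_reward <= 0:
        return 0.0
    if tokens_baseline <= 0:
        return 0.0
    saved_ratio = (tokens_baseline - tokens_used) / tokens_baseline
    # Clamp so that using more tokens than the baseline is not rewarded.
    return max(0.0, min(1.0, saved_ratio))


def total_reward(
    task_reward: float,
    tokens_used: int,
    tokens_baseline: int,
    omit_weight: float = 0.1,  # hypothetical default; tune per environment
) -> float:
    """Task reward plus a weighted, correctness-gated omission bonus."""
    bonus = omission_reward(tokens_used, tokens_baseline, task_reward)
    return task_reward + omit_weight * bonus
```

With the gate in place, an agent that omits everything and fails the task earns nothing, so raising `omit_weight` trades accuracy for savings only among successful rollouts.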
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Omitting content in the initial and final turns hurts performance, so omission must be selective rather than applied globally.
Relies on synthetic cold-start data and environment-specific RL; generalization to unseen environments is untested.
When Not To Use
Tasks where every past observation is safety-critical or legally required.
Single-turn tasks where omission brings no benefit.
Failure Modes
Removing a needed thought/observation and forcing the agent to generate extra recovery reasoning.
Reward hacking if omission reward is not gated by correctness (authors set R_omit=0 when R_task=0 to avoid this).

