Train agents to skip redundant thoughts and past observations to cut token cost while keeping accuracy

February 4, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yansong Ning, Jun Fang, Naiqiang Tan, Hao Liu

Links

Abstract / PDF

Why It Matters For Business

Omitting unnecessary internal reasoning and old tool outputs reduces API token costs and latency while keeping or improving task success, giving a better cost-performance trade-off for production agents.

Summary TLDR

The paper studies which agent turns (thoughts and observations) actually matter, and trains an LLM agent (Agent-Omit) to omit unnecessary internal reasoning and prior tool responses. The authors synthesize cold-start omission data, then apply an omit-aware RL loop (dual sampling + omission reward + KL penalty). On five agent benchmarks, Agent-Omit-8B (RL) keeps or improves Pass@1 accuracy versus strong baselines while lowering average token use. Trained agents omit about 3–4 turns per trajectory, mostly in the middle turns. Code and data are provided.

Problem Statement

Multi-turn LLM agents spend most tokens on internal thoughts and stacked observations. Not all turns matter equally. The problem is to learn when to omit thoughts or past observations to reduce tokens while keeping task accuracy across diverse agent environments.
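
To check whether thoughts and observations dominate the context in your own agent logs, a minimal sketch like the following can tally token shares per segment type. The `(kind, text)` trajectory format, the segment labels, and the whitespace-based token counter are all assumptions made for illustration; swap in the tokenizer your model actually uses.

```python
from collections import Counter


def token_shares(trajectory, count_tokens=lambda text: len(text.split())):
    """Return each segment kind's share of the total token count.

    `trajectory` is a list of (kind, text) pairs; the kinds used here
    ("thought", "action", "observation") are hypothetical labels.
    `count_tokens` defaults to whitespace splitting as a rough proxy.
    """
    totals = Counter()
    for kind, text in trajectory:
        totals[kind] += count_tokens(text)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: n / grand_total for kind, n in totals.items()}


# Toy trajectory, invented for illustration only.
trajectory = [
    ("thought", "I should search for a red cotton shirt under 30 dollars"),
    ("action", "search[red cotton shirt]"),
    ("observation", "Results: item A red shirt 25 dollars; item B blue shirt 20 dollars"),
    ("thought", "Item A matches the color and price"),
    ("action", "click[item A]"),
]
shares = token_shares(trajectory)
for kind, share in sorted(shares.items()):
    print(f"{kind}: {share:.1%}")
```

Running this over real trajectories, per turn index as well as per kind, gives a first picture of where omission could save the most.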

Main Contribution

Turn-level analysis showing that thought and observation tokens dominate agent context cost and that their utility varies by turn.

Agent-Omit framework: cold-start synthesis of single- and multi-turn omission samples plus omit-aware agentic RL (dual sampling, omission reward, KL penalty).

Theoretical bound: omission policy deviation is upper-bounded by KL divergence between learned and optimal policies.

Empirical results on five benchmarks showing that Agent-Omit-8B-RL matches or outperforms strong agents on accuracy while reducing token use; the agent typically omits 3–4 turns per trajectory.
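
The formal statement of the theoretical bound listed above is not reproduced in this summary. As an illustration only, and not as the paper's theorem, a standard inequality with the same shape is Pinsker's inequality, which bounds the total-variation distance between two distributions by their KL divergence. The symbols $\pi_\theta$ (learned policy), $\pi^{*}$ (optimal policy), and $s$ (agent state) are assumed notation:

```latex
\mathrm{TV}\big(\pi_\theta(\cdot \mid s),\, \pi^{*}(\cdot \mid s)\big)
\;\le\;
\sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s) \,\|\, \pi^{*}(\cdot \mid s)\big)}
```

Read this way, driving the KL term down limits how far the learned policy's omission decisions can drift from the optimal ones.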

Key Findings

Thought and observation tokens dominate agent context.

Numbers: Thought 45.1% of tokens; Observation 52.2%; Actions 2.7% (WebShop, Qwen3-8B)

Selective omission can cut tokens without hurting accuracy on many turns.

Numbers: Grey regions in Fig. 3 show token drop with no accuracy loss on intermediate turns (WebShop, Qwen3-8B)

Agent-Omit-8B-RL improves accuracy and lowers token cost vs strong baselines.

Numbers: WebShop Pass@1 23.57 and Avg Tok 8,764 vs DeepSeek-R1 Pass@1 19.37 and Avg Tok 11,308 (Table 2)

Trained agents omit multiple turns, mainly in the middle of trajectories.

Numbers: Agents omit on average 3–4 turns; omission frequency peaks in intermediate turns (turns 3–10)

Results

Token composition (WebShop, Qwen3-8B)

Value: Thought 45.1%; Observation 52.2%; Actions 2.7%

Agent-Omit-8B-RL Pass@1

Value: WebShop 23.57; TextCraft 87.00; BabyAI 84.36; SciWorld 18.45; DeepSearch 26.56

Baseline: Various frontier LLM agents (Table 2)

Agent-Omit-8B-RL Avg Tokens

Value: WebShop 8,764; TextCraft 7,328; BabyAI 6,643; SciWorld 9,643; DeepSearch 4,356

Baseline: DeepSeek-R1-0528 and others (see Table 2)

Average omission volume per trajectory

Value: 3–4 turns omitted on average

Who Should Care

What To Try In 7 Days

Add a simple omit token/flag to agent outputs and target omissions in the middle turns of trajectories on a dev benchmark.

Synthesize 2–4K cold-start omission examples and fine-tune the agent to produce empty <think> blocks or <omit tool response> tokens where a thought or past observation is not needed.

Implement a lightweight omit-aware reward (saved_tokens ratio with task-correctness gating) and run a few RL rollouts to tune the omission weight.
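
The third step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the correctness gate (omission reward is zero when the task reward is zero) comes from the summary, while the function names, the choice of `tokens_full` as the no-omission token count of the same trajectory, and the default `omit_weight` are assumptions.

```python
def omission_reward(task_reward, tokens_full, tokens_used):
    """Saved-token ratio, gated by task correctness.

    tokens_full: token count the trajectory would have without omission
                 (assumed baseline for this sketch).
    tokens_used: token count actually consumed with omission.
    Returns 0.0 whenever the task failed, so the agent cannot earn
    reward just by dropping context.
    """
    if task_reward == 0 or tokens_full <= 0:
        return 0.0
    saved = max(tokens_full - tokens_used, 0)
    return saved / tokens_full


def total_reward(task_reward, tokens_full, tokens_used, omit_weight=0.2):
    """Task reward plus a weighted omission bonus (omit_weight is hypothetical)."""
    bonus = omission_reward(task_reward, tokens_full, tokens_used)
    return task_reward + omit_weight * bonus


# A successful rollout that saved 25% of its tokens earns a small bonus;
# a failed rollout earns nothing regardless of how many tokens it saved.
print(total_reward(1.0, tokens_full=1000, tokens_used=750))
print(total_reward(0.0, tokens_full=1000, tokens_used=100))
```

Sweeping `omit_weight` over a few values on a dev benchmark is one way to find the point where token savings start to cost accuracy.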

Agent Features

Memory

  • explicit omission of prior tool responses
  • hierarchical single- and multi-turn omission handling

Planning

  • adaptive omission policy
  • turn-level planning with omit decisions
  • dual sampling (full and partial trajectories)

Tool Use

  • search engine calls
  • web navigation actions
  • game/environment actions

Frameworks

  • AgentGym-RL
  • GRPO

Is Agentic

true

Architectures

  • Qwen3-8B
  • Qwen3-4B
  • Agent-Omit-8B

Optimization Features

Token Efficiency

  • explicit omission reward proportional to saved tokens
  • agents omit 3–4 turns on average

Model Optimization

  • full-parameter fine-tuning for omission behavior

System Optimization

  • SFT

Training Optimization

  • cold-start synthetic omission dataset (2–4K samples)
  • omit-aware RL with dual sampling
  • KL penalty to constrain policy shift
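
To make the RL ingredients above concrete, here is a rough sketch of two pieces commonly used in GRPO-style training: group-relative advantage normalization and a per-token KL estimate against a reference policy. Whether Agent-Omit uses exactly this normalization or this KL estimator is an assumption; the function names and the `eps` constant are illustrative.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_estimate(logp_policy, logp_ref):
    """Average a non-negative per-token KL estimator over a sequence.

    For each token, with d = logp_ref - logp_policy, the term is
    exp(d) - d - 1, which is zero when the two policies agree.
    """
    terms = []
    for lp, lr in zip(logp_policy, logp_ref):
        d = lr - lp
        terms.append(math.exp(d) - d - 1.0)
    if not terms:
        return 0.0
    return sum(terms) / len(terms)


# Four rollouts of one task: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Toy per-token log-probabilities under the policy and the reference model.
penalty = kl_estimate([-1.0, -0.5], [-1.2, -0.5])
print(advantages, penalty)
```

In a training loop, the penalty would typically be scaled by a coefficient and subtracted from the objective so the policy stays close to its cold-start initialization.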

Inference Optimization

  • omit tool responses and empty thoughts to reduce context length
  • omit-aware action formatting (<omit tool response> tokens)
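
A minimal sketch of how omitted tool responses might be dropped when rebuilding the prompt each turn. The `<omit tool response>` marker string appears in the summary, but how the agent signals which turns to omit is not described here, so the `omitted_turns` index set, the turn dictionary layout, and the `Action:`/`Observation:` line format are all hypothetical.

```python
# Marker string from the summary; its exact role in the paper's prompt
# format is an assumption of this sketch.
OMIT_PLACEHOLDER = "<omit tool response>"


def build_context(turns, omitted_turns=frozenset()):
    """Render past turns into a prompt, replacing omitted observations.

    turns: list of dicts with "thought", "action", "observation" keys.
    omitted_turns: indices of turns whose observation should be replaced
                   by the short placeholder instead of the full text.
    An empty thought renders as an empty <think></think> block.
    """
    lines = []
    for i, turn in enumerate(turns):
        lines.append(f"<think>{turn.get('thought', '')}</think>")
        lines.append(f"Action: {turn['action']}")
        obs = OMIT_PLACEHOLDER if i in omitted_turns else turn["observation"]
        lines.append(f"Observation: {obs}")
    return "\n".join(lines)


# Toy history, invented for illustration.
turns = [
    {"thought": "Search first.", "action": "search[red shirt]",
     "observation": "Page 1 of 50 results: item A, item B, item C, item D, item E"},
    {"thought": "", "action": "click[item A]",
     "observation": "Item A: red cotton shirt, 25 dollars, sizes S to XL"},
]
full = build_context(turns)
trimmed = build_context(turns, omitted_turns={0})
print(len(full), len(trimmed))
```

Keeping a placeholder rather than deleting the observation line entirely preserves the turn structure, which may help the model keep track of which step it is on.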

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Omission harms initial and final turns, so it must be selective, not global.
  • Relies on synthetic cold-start data and environment-specific RL; generalization to unseen environments is untested.
  • RL stage limited to short training (often one epoch); scaling trade-offs are unclear.
  • Token statistics for closed-source baselines are unavailable, limiting some comparisons.

When Not To Use

  • Tasks where every past observation is safety-critical or legally required.
  • Single-turn tasks where omission brings no benefit.
  • Environments where omissions could remove rare but crucial evidence.

Failure Modes

  • Removing a needed thought/observation and forcing the agent to generate extra recovery reasoning.
  • Reward hacking if omission reward is not gated by correctness (authors set R_omit=0 when R_task=0 to avoid this).
  • Overfitting to synthetic omission patterns causing poor generalization.

Core Entities

Models

  • Agent-Omit-8B-RL
  • SFT
  • Agent-Omit-4B-RL
  • Qwen3-8B
  • Qwen3-4B
  • DeepSeek-R1-0528
  • DeepSeek-V3.2
  • OpenAI o3
  • Qwen3-235B-A22B

Metrics

  • Pass@1
  • Pass@8
  • Average Tokens (Avg Tok)
  • Token Reduction Ratio
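
For reference, the metrics above can be computed roughly as follows. The pass@k function uses the widely used unbiased estimator over `n` samples with `c` successes; the token reduction ratio is written here as one minus the ratio of agent tokens to baseline tokens, which is an assumed definition since the summary does not spell it out.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts with c successes.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k attempts drawn without replacement is a success.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures, so any k-subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def token_reduction_ratio(baseline_tokens, agent_tokens):
    """Fraction of tokens saved relative to a baseline (assumed definition)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - agent_tokens / baseline_tokens


# WebShop average tokens quoted in this summary:
# Agent-Omit-8B-RL 8,764 vs DeepSeek-R1 11,308, a reduction of about 0.225.
print(round(token_reduction_ratio(11308, 8764), 3))
print(pass_at_k(n=8, c=2, k=1))
```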

Datasets

  • DeepSearch
  • WebShop
  • TextCraft
  • BabyAI
  • SciWorld

Benchmarks

  • WebShop
  • DeepSearch
  • TextCraft
  • BabyAI
  • SciWorld

Context Entities

Models

  • DeepSeek-R1-0528
  • DeepSeek-V3.2
  • OpenAI o4-mini
  • Qwen3-32B
  • Qwen3-Next-80B-A3B