Focus: agent-controlled context compression that cuts token use 22.7% without losing accuracy

January 12, 20267 min

Overview

Production Readiness

0.55

Novelty Score

0.65

Cost Impact Score

0.67

Citation Count

0

Authors

Nikhil Verma

Links

Abstract / PDF

Why It Matters For Business

Active, agent-driven compression can cut API/token costs significantly on exploration-heavy automation tasks without reducing success, helping teams scale agent workflows at lower expense.

Summary TLDR

This paper introduces Focus, a simple agent-level memory manager that lets an LLM decide when to summarize and delete recent interaction logs. Focus adds two tools (start_focus / complete_focus) and a persistent "Knowledge" block. On 5 hard SWE-bench Lite tasks with Claude Haiku 4.5, aggressive prompting to compress every 10–15 tool calls cut total token use by 22.7% (14.9M → 11.5M) while matching baseline task success (3/5). Savings concentrated on exploration-heavy bugs (up to 57% per instance); one iterative task saw higher token use. Key takeaway: explicit, frequent compression can reduce token cost without hurting accuracy for explore-then-implement workflows.

Problem Statement

Long-running LLM agents accumulate large interaction histories. This raises compute cost, increases latency, and can confuse the model with noisy past failures. Existing compression is usually external and passive; agents lack an autonomous, intra-trajectory way to prune raw logs while keeping what they learned.

Main Contribution

Focus agent loop: two primitives (start_focus, complete_focus) that let the model checkpoint, summarize learnings, append a persistent Knowledge block, and delete raw logs.

A practical scaffold (persistent bash + string-replace editor) and prompting recipe that enforces frequent compressions.

An empirical A/B test on 5 hard SWE-bench Lite instances showing 22.7% token reduction with equal task success when aggressive prompting is used.

Key Findings

Agent-controlled compression reduced total token use by 22.7% without lowering task success on the evaluated set.

NumbersTotal tokens 14,920,555 → 11,526,418 (−22.7%); task success 3/5 → 3/5

Compression yields much larger savings on exploration-heavy tasks but can add overhead on iterative-refinement tasks.

NumbersPer-instance savings ranged 18%–57% on 4/5 instances; one instance rose +110%

Aggressive, structured prompting was required for large savings; passive prompts gave only ~6% savings.

NumbersCompressions per task: passive ≈2.0 (≈6% savings) vs aggressive ≈6.0 (22.7% savings)

Results

Task Success (tests passed)

Value3/5 (60%) vs 3/5 (60%)

BaselineBaseline 3/5

Total Tokens

Value11,526,418 (Focus) vs 14,920,555 (Baseline)

Baseline14,920,555

Average Compressions per Task

Value6.0 (Focus) vs 0 (Baseline)

Baseline0

Average Messages Dropped per Task

Value70.2 (Focus) vs 0 (Baseline)

Baseline0

Per-instance Token Savings (examples)

Valuematplotlib-26020 −57%; seaborn-2848 −52%; sympy-21171 −57%; pylint-7080 +110%

BaselinePer-instance Baselines in Table II

Who Should Care

What To Try In 7 Days

Add start_focus / complete_focus primitives to your agent loop and a top-of-context Knowledge block.

Use a persistent shell and string-replace editor scaffold to match developer workflows.

Experiment with system prompts that require compression every 10–15 tool calls and inject reminders after long stretches without compression.

Agent Features

Memory

  • Autonomous intra-trajectory compression (prune recent logs)
  • Persistent Knowledge block for consolidated facts

Planning

  • Agent decides when to checkpoint and consolidate
  • Structured phases: explore → consolidate → implement → verify

Tool Use

  • Persistent bash shell
  • String-replace editor
  • Tool-heavy workflow (encouraged 100+ tool calls)

Frameworks

  • Focus loop primitives (start_focus, complete_focus)

Is Agentic

true

Architectures

  • ReAct-style loop with Focus extensions (start_focus / complete_focus)
  • Sawtooth context pattern (explore then collapse)

Optimization Features

Token Efficiency

  • 22.7% total token reduction on evaluated tasks
  • Frequent small compressions (every 10–15 calls) preferred over infrequent large ones
  • Token amortization: compression cost (hundreds tokens) saves thousands on long tasks

System Optimization

  • Sawtooth context management to avoid quadratic growth

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Small evaluation (N=5); results may not generalize to the full SWE-bench (N=300).
  • Only Claude Haiku 4.5 was tested; other LLMs may need different prompts.
  • Results depend on the two-tool scaffold (persistent bash + string-replace editor).
  • Aggressive compression can discard useful recent context and increase tokens for iterative refinement tasks.

When Not To Use

  • Tasks that require continuous accumulation of fine-grained state (iterative refinement).
  • Short tasks where compression overhead won't amortize.
  • Environments without a persistent tool scaffold similar to a shell and targeted editor.

Failure Modes

  • Over-aggressive pruning removes needed context and forces re-exploration, increasing tokens.
  • Model may follow compression prompts blindly and compress critical intermediate artifacts.
  • Prompting strategy may need tuning per model; a wrong prompt can harm accuracy.

Core Entities

Models

  • claude-haiku-4-5-20251001

Metrics

  • total_tokens
  • task_success
  • avg_compressions
  • messages_dropped
  • per-instance_token_savings

Datasets

  • SWE-bench Lite

Benchmarks

  • SWE-bench Lite