Overview
The idea is simple and practical, but it was tested on only 5 instances with one commercial model; realizing the benefits in practice requires specific scaffolding and prompting.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 67%
Production readiness: 55%
Novelty: 65%
Why It Matters For Business
Active, agent-driven compression can cut API/token costs significantly on exploration-heavy automation tasks without reducing success, helping teams scale agent workflows at lower expense.
Summary TLDR
This paper introduces Focus, a simple agent-level memory manager that lets an LLM decide when to summarize and delete recent interaction logs. Focus adds two tools (start_focus / complete_focus) and a persistent "Knowledge" block. On 5 hard SWE-bench Lite tasks with Claude Haiku 4.5, aggressive prompting to compress every 10–15 tool calls cut total token use by 22.7% (14.9M → 11.5M) while matching baseline task success (3/5). Savings concentrated on exploration-heavy bugs (up to 57% per instance); one iterative task saw higher token use. Key takeaway: explicit, frequent compression can reduce token cost without hurting accuracy for explore-then-implement workflows.
Problem Statement
Long-running LLM agents accumulate large interaction histories. This raises compute cost, increases latency, and can confuse the model with noisy past failures. Existing compression is usually external and passive; agents lack an autonomous, intra-trajectory way to prune raw logs while keeping what they learned.
Main Contribution
Focus agent loop: two primitives (start_focus, complete_focus) that let the model checkpoint, summarize learnings, append a persistent Knowledge block, and delete raw logs.
A practical scaffold (persistent bash + string-replace editor) and prompting recipe that enforces frequent compressions.
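The two primitives can be pictured as operations on a message history. The following is a minimal sketch under stated assumptions, not the paper's implementation: the class name `FocusContext`, its fields, the `log` and `render` helpers, and the list-of-strings message format are all hypothetical. In the described design the model itself writes the summary passed to complete_focus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FocusContext:
    """Hypothetical context holder; names and layout are assumptions."""
    knowledge: List[str] = field(default_factory=list)  # persistent Knowledge block
    messages: List[str] = field(default_factory=list)   # raw interaction log
    checkpoint: Optional[int] = None                     # where the current focus began

    def start_focus(self) -> None:
        # Mark the point after which raw logs may later be pruned.
        self.checkpoint = len(self.messages)

    def log(self, message: str) -> None:
        # Record one raw interaction (tool call, tool output, model turn).
        self.messages.append(message)

    def complete_focus(self, summary: str) -> int:
        # Append the model-written summary to Knowledge, then delete
        # every raw log entry recorded since the checkpoint.
        if self.checkpoint is None:
            raise RuntimeError("complete_focus called without start_focus")
        pruned = len(self.messages) - self.checkpoint
        self.knowledge.append(summary)
        del self.messages[self.checkpoint:]
        self.checkpoint = None
        return pruned

    def render(self) -> str:
        # Knowledge sits at the top of the context, followed by the surviving log.
        return "\n".join(["# Knowledge", *self.knowledge, "# Log", *self.messages])
```

In this sketch, messages logged before start_focus survive compression, so the original task statement stays in context while exploratory output is replaced by its summary.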
Key Findings
Agent-controlled compression reduced total token use by 22.7% without lowering task success on the evaluated set.
Compression yields much larger savings on exploration-heavy tasks but can add overhead on iterative-refinement tasks.
Results
| Metric | Value (Focus) | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Task Success (tests passed) | 3/5 (60%) | 3/5 (60%) | 0 (same) | SWE-bench Lite, N=5 | Both agents passed 3/5 hard instances | Table I |
| Total Tokens | 11,526,418 | 14,920,555 | -22.7% | SWE-bench Lite, N=5 | Total token counts | Table I |
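The -22.7% delta in the Total Tokens row follows directly from the two reported token counts. A short check, where the helper name `relative_change` is only illustrative:

```python
def relative_change(new: int, baseline: int) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


focus_tokens = 11_526_418     # Table I, Focus
baseline_tokens = 14_920_555  # Table I, Baseline

saved = baseline_tokens - focus_tokens
delta = relative_change(focus_tokens, baseline_tokens)

print(f"tokens saved: {saved:,}")  # tokens saved: 3,394,137
print(f"delta: {delta:.1f}%")      # delta: -22.7%
```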
What To Try In 7 Days
Add start_focus / complete_focus primitives to your agent loop and a top-of-context Knowledge block.
Use a persistent shell and string-replace editor scaffold to match developer workflows.
Experiment with system prompts that require compression every 10–15 tool calls and inject reminders after long stretches without compression.
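The reminder-injection step could be prototyped with a small counter around the agent loop. This is a sketch under assumptions: the class `CompressionNudger`, the default threshold of 15 (the upper end of the 10–15 range above), and the reminder wording are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompressionNudger:
    """Hypothetical helper: tracks tool calls since the last compression."""
    max_calls: int = 15            # assumed threshold, upper end of the 10-15 cadence
    calls_since_compress: int = 0

    def on_tool_call(self, tool_name: str) -> Optional[str]:
        # A complete_focus call counts as a compression and resets the counter.
        if tool_name == "complete_focus":
            self.calls_since_compress = 0
            return None
        self.calls_since_compress += 1
        if self.calls_since_compress >= self.max_calls:
            # Reminder wording is illustrative only.
            return (
                f"Reminder: {self.calls_since_compress} tool calls since the last "
                "compression. Call complete_focus to summarize and prune."
            )
        return None
```

A caller would invoke `on_tool_call` after each tool invocation and, when it returns a string, append that string to the next message sent to the model. The reminder repeats on every call past the threshold until a compression happens.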
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
System Optimization
Reproducibility
Risks & Boundaries
Limitations
Small evaluation (N=5); results may not generalize to the full SWE-bench Lite set (N=300).
Only Claude Haiku 4.5 was tested; other LLMs may need different prompts.
When Not To Use
Tasks that require continuous accumulation of fine-grained state (iterative refinement).
Short tasks where compression overhead won't amortize.
Failure Modes
Over-aggressive pruning removes needed context and forces re-exploration, increasing tokens.
Model may follow compression prompts blindly and compress critical intermediate artifacts.

