Overview
Production Readiness
0.55
Novelty Score
0.65
Cost Impact Score
0.67
Citation Count
0
Why It Matters For Business
Agent-driven context compression can cut API token costs on exploration-heavy automation tasks (22.7% in total on the paper's five-task evaluation) without reducing task success, helping teams scale agent workflows at lower cost.
Summary TLDR
This paper introduces Focus, a simple agent-level memory manager that lets an LLM decide when to summarize and delete recent interaction logs. Focus adds two tools (start_focus / complete_focus) and a persistent "Knowledge" block. On 5 hard SWE-bench Lite tasks with Claude Haiku 4.5, aggressive prompting to compress every 10–15 tool calls cut total token use by 22.7% (14.9M → 11.5M) while matching baseline task success (3/5). Savings concentrated on exploration-heavy bugs (up to 57% per instance); one iterative task saw higher token use. Key takeaway: explicit, frequent compression can reduce token cost without hurting accuracy for explore-then-implement workflows.
Problem Statement
Long-running LLM agents accumulate large interaction histories. This raises compute cost, increases latency, and can confuse the model with noisy past failures. Existing compression is usually external and passive; agents lack an autonomous, intra-trajectory way to prune raw logs while keeping what they learned.
Main Contribution
Focus agent loop: two primitives (start_focus, complete_focus) that let the model checkpoint, summarize learnings, append a persistent Knowledge block, and delete raw logs.
A practical scaffold (persistent bash + string-replace editor) and prompting recipe that enforces frequent compressions.
An empirical A/B test on 5 hard SWE-bench Lite instances showing 22.7% token reduction with equal task success when aggressive prompting is used.
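The two primitives can be pictured with a small context store. This is a minimal sketch, not the paper's implementation: the `FocusContext` class, its fields, and the `render` layout are hypothetical names chosen for illustration. Only the `start_focus` / `complete_focus` names and the persistent Knowledge block come from the summary above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FocusContext:
    """Hypothetical message store with a persistent Knowledge block."""

    knowledge: List[str] = field(default_factory=list)  # survives compression
    messages: List[str] = field(default_factory=list)   # raw interaction log
    checkpoint: Optional[int] = None                     # set by start_focus

    def add(self, message: str) -> None:
        # Record one raw message (tool call, tool output, model turn).
        self.messages.append(message)

    def start_focus(self) -> None:
        # Mark where the current exploration phase begins.
        self.checkpoint = len(self.messages)

    def complete_focus(self, summary: str) -> int:
        # Append the model-written summary to Knowledge, then delete the
        # raw messages recorded since the checkpoint. Returns the count dropped.
        if self.checkpoint is None:
            raise RuntimeError("complete_focus called without start_focus")
        dropped = len(self.messages) - self.checkpoint
        self.knowledge.append(summary)
        del self.messages[self.checkpoint:]
        self.checkpoint = None
        return dropped

    def render(self) -> List[str]:
        # Assumed layout: Knowledge at the top, remaining messages after it.
        return ["Knowledge:"] + self.knowledge + self.messages


ctx = FocusContext()
ctx.add("task: fix failing test")
ctx.start_focus()
ctx.add("bash: grep -r 'parse' src/")
ctx.add("bash: cat src/parser.py")
dropped = ctx.complete_focus("Bug is in src/parser.py, parse() mishandles empty input.")
print(dropped, ctx.render())
```

In a real agent the summary string would be written by the model itself as the argument to `complete_focus`; here it is hard-coded to keep the sketch self-contained.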
Key Findings
Agent-controlled compression reduced total token use by 22.7% without lowering task success on the evaluated set.
Compression yields much larger savings on exploration-heavy tasks but can add overhead on iterative-refinement tasks.
Aggressive, structured prompting was required for large savings; passive prompts gave only ~6% savings.
Results
Task Success (tests passed)
3/5 for both baseline and Focus
Total Tokens
14.9M (baseline) → 11.5M (Focus), a 22.7% reduction
Average Compressions per Task
Average Messages Dropped per Task
Per-instance Token Savings (examples)
Who Should Care
What To Try In 7 Days
Add start_focus / complete_focus primitives to your agent loop and a top-of-context Knowledge block.
Use a persistent shell and string-replace editor scaffold to match developer workflows.
Experiment with system prompts that require compression every 10–15 tool calls and inject reminders after long stretches without compression.
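The reminder-injection step in the last item could be sketched as a simple counter. This is an assumption-laden illustration: the `CompressionReminder` class, the reminder wording, and the choice to count every non-`complete_focus` tool call are hypothetical. Only the 10–15 call window and the idea of injecting reminders come from the text.

```python
from typing import Optional

COMPRESS_EVERY = 15  # upper end of the 10-15 tool-call window described above


class CompressionReminder:
    """Hypothetical helper that tracks tool calls since the last compression."""

    def __init__(self, limit: int = COMPRESS_EVERY) -> None:
        self.limit = limit
        self.calls_since_compression = 0

    def on_tool_call(self, tool_name: str) -> Optional[str]:
        # A completed compression resets the counter and needs no reminder.
        if tool_name == "complete_focus":
            self.calls_since_compression = 0
            return None
        self.calls_since_compression += 1
        # Once the limit is reached, return text to inject into the context.
        if self.calls_since_compression >= self.limit:
            return (
                f"Reminder: {self.calls_since_compression} tool calls since the "
                "last compression. Call complete_focus to summarize and prune."
            )
        return None


reminder = CompressionReminder(limit=3)
for name in ["bash", "bash", "edit", "complete_focus", "bash"]:
    note = reminder.on_tool_call(name)
    if note is not None:
        print(note)
```

The limit is a tuning knob. The summary reports that prompting strategy may need adjusting per model, so the value here should be treated as a starting point.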
Agent Features
Memory
- Autonomous intra-trajectory compression (prune recent logs)
- Persistent Knowledge block for consolidated facts
Planning
- Agent decides when to checkpoint and consolidate
- Structured phases: explore → consolidate → implement → verify
Tool Use
- Persistent bash shell
- String-replace editor
- Tool-heavy workflow (100+ tool calls encouraged)
Frameworks
- Focus loop primitives (start_focus, complete_focus)
Is Agentic
true
Architectures
- ReAct-style loop with Focus extensions (start_focus / complete_focus)
- Sawtooth context pattern (explore then collapse)
Optimization Features
Token Efficiency
- 22.7% total token reduction on evaluated tasks
- Frequent small compressions (every 10–15 calls) preferred over infrequent large ones
- Token amortization: a compression costs hundreds of tokens but saves thousands on long tasks
System Optimization
- Sawtooth context management to avoid quadratic growth
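To see why a sawtooth context avoids quadratic growth, note that each turn resends the whole context, so a context that grows linearly makes cumulative input tokens grow roughly with the square of the turn count. The toy model below illustrates this under stated assumptions: a fixed number of tokens added per turn, a fixed summary size, and compression on a fixed schedule. The function name and all numbers are hypothetical and are not taken from the paper's measurements.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int,
    tokens_per_turn: int,
    compress_every: Optional[int] = None,
    summary_tokens: int = 0,
) -> int:
    """Sum the context size sent to the model across all turns."""
    context = 0    # raw log tokens currently in context
    knowledge = 0  # tokens held in the persistent Knowledge block
    since = 0      # turns since the last compression
    total = 0
    for _ in range(turns):
        context += tokens_per_turn
        since += 1
        total += knowledge + context  # whole context is resent this turn
        if compress_every is not None and since == compress_every:
            knowledge += summary_tokens  # the summary is kept
            context = 0                  # the raw logs are dropped
            since = 0
    return total


# Illustrative values only: 100 turns, 1,000 tokens added per turn,
# compression every 12 turns into a 300-token summary.
baseline = cumulative_input_tokens(100, 1000)
sawtooth = cumulative_input_tokens(100, 1000, compress_every=12, summary_tokens=300)
print(baseline, sawtooth, f"{1 - sawtooth / baseline:.1%} fewer input tokens")
```

The model ignores the tokens spent generating each summary and any re-exploration caused by pruning, both of which the summary lists as real costs, so it overstates the savings relative to practice.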
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Small evaluation (N=5); results may not generalize to the full SWE-bench Lite set (N=300).
- Only Claude Haiku 4.5 was tested; other LLMs may need different prompts.
- Results depend on the two-tool scaffold (persistent bash + string-replace editor).
- Aggressive compression can discard useful recent context and increase tokens for iterative refinement tasks.
When Not To Use
- Tasks that require continuous accumulation of fine-grained state (iterative refinement).
- Short tasks where compression overhead won't amortize.
- Environments without a persistent tool scaffold similar to a shell and targeted editor.
Failure Modes
- Over-aggressive pruning removes needed context and forces re-exploration, increasing tokens.
- Model may follow compression prompts blindly and compress critical intermediate artifacts.
- Prompting strategy may need tuning per model; a wrong prompt can harm accuracy.
Core Entities
Models
- claude-haiku-4-5-20251001
Metrics
- total_tokens
- task_success
- avg_compressions
- messages_dropped
- per_instance_token_savings
Datasets
- SWE-bench Lite
Benchmarks
- SWE-bench Lite

