Focus: agent-controlled context compression that cuts token use 22.7% without losing accuracy

January 12, 20267 min

Overview

Decision SnapshotNeeds Validation

The idea is simple and practical but tested on only 5 instances and one commercial model; scaffolding and prompting are required to realize benefits in practice.

Citations0

Evidence Strength0.60

Confidence0.80

Risk Signals10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 67%

Production readiness: 55%

Novelty: 65%

Authors

Nikhil Verma

Links

Abstract / PDF

Why It Matters For Business

Active, agent-driven compression can cut API/token costs significantly on exploration-heavy automation tasks without reducing success, helping teams scale agent workflows at lower expense.

Who Should Care

Summary TLDR

This paper introduces Focus, a simple agent-level memory manager that lets an LLM decide when to summarize and delete recent interaction logs. Focus adds two tools (start_focus / complete_focus) and a persistent "Knowledge" block. On 5 hard SWE-bench Lite tasks with Claude Haiku 4.5, aggressive prompting to compress every 10–15 tool calls cut total token use by 22.7% (14.9M → 11.5M) while matching baseline task success (3/5). Savings concentrated on exploration-heavy bugs (up to 57% per instance); one iterative task saw higher token use. Key takeaway: explicit, frequent compression can reduce token cost without hurting accuracy for explore-then-implement workflows.

Problem Statement

Long-running LLM agents accumulate large interaction histories. This raises compute cost, increases latency, and can confuse the model with noisy past failures. Existing compression is usually external and passive; agents lack an autonomous, intra-trajectory way to prune raw logs while keeping what they learned.

Main Contribution

Focus agent loop: two primitives (start_focus, complete_focus) that let the model checkpoint, summarize learnings, append a persistent Knowledge block, and delete raw logs.

A practical scaffold (persistent bash + string-replace editor) and prompting recipe that enforces frequent compressions.

Key Findings

Agent-controlled compression reduced total token use by 22.7% without lowering task success on the evaluated set.

NumbersTotal tokens 14,920,55511,526,418 (−22.7%); task success 3/53/5

Practical UseIf you add focus primitives and aggressively prompt compression, you can cut token bills substantially on similar code-exploration tasks without losing correctness.

Evidence RefTable I; Results section

Compression yields much larger savings on exploration-heavy tasks but can add overhead on iterative-refinement tasks.

NumbersPer-instance savings ranged 18%–57% on 4/5 instances; one instance rose +110%

Practical UseApply Focus to explore-then-implement jobs (code navigation, large searches). Avoid aggressive pruning on tasks that need continuous, cumulative context.

Evidence RefTable II; Case studies (matplotlib-26020, pylint-7080)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Task Success (tests passed)3/5 (60%) vs 3/5 (60%)Baseline 3/5SameSWE-bench Lite, N=5Table I: Both agents passed 3/5 hard instancesTable I
Total Tokens11,526,418 (Focus) vs 14,920,555 (Baseline)14,920,555-22.7%SWE-bench Lite, N=5Table I total token countsTable I

What To Try In 7 Days

Add start_focus / complete_focus primitives to your agent loop and a top-of-context Knowledge block.

Use a persistent shell and string-replace editor scaffold to match developer workflows.

Experiment with system prompts that require compression every 10–15 tool calls and inject reminders after long stretches without compression.

Agent Features

Memory
Autonomous intra-trajectory compression (prune recent logs)Persistent Knowledge block for consolidated facts
Planning
Agent decides when to checkpoint and consolidateStructured phases: explore → consolidate → implement → verify
Tool Use
Persistent bash shellString-replace editorTool-heavy workflow (encouraged 100+ tool calls)
Frameworks
Focus loop primitives (start_focus, complete_focus)
Is Agentic

Yes

Architectures
ReAct-style loop with Focus extensions (start_focus / complete_focus)Sawtooth context pattern (explore then collapse)

Optimization Features

Token Efficiency
22.7% total token reduction on evaluated tasksFrequent small compressions (every 10–15 calls) preferred over infrequent large onesToken amortization: compression cost (hundreds tokens) saves thousands on long tasks
System Optimization
Sawtooth context management to avoid quadratic growth

Reproducibility

Code AvailableNo
Data AvailableNo
Open Source StatusUnknown
LicenseUnknown

Risks & Boundaries

Limitations

Small evaluation (N=5); results may not generalize to the full SWE-bench (N=300).

Only Claude Haiku 4.5 was tested; other LLMs may need different prompts.

When Not To Use

Tasks that require continuous accumulation of fine-grained state (iterative refinement).

Short tasks where compression overhead won't amortize.

Failure Modes

Over-aggressive pruning removes needed context and forces re-exploration, increasing tokens.

Model may follow compression prompts blindly and compress critical intermediate artifacts.

Core Entities

Models

claude-haiku-4-5-20251001

Metrics

total_tokenstask_successavg_compressionsmessages_droppedper-instance_token_savings

Datasets

SWE-bench Lite

Benchmarks

SWE-bench Lite