Focus: agent-controlled context compression that cuts token use 22.7% without losing accuracy

Overview

Decision Snapshot: Needs Validation

The idea is simple and practical but tested on only 5 instances and one commercial model; scaffolding and prompting are required to realize benefits in practice.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 67%

Production readiness: 55%

Novelty: 65%

Authors

Nikhil Verma

Why It Matters For Business

Active, agent-driven compression can cut API/token costs significantly on exploration-heavy automation tasks without reducing success, helping teams scale agent workflows at lower expense.

Who Should Care

CTO, Engineering Lead, ML Engineer, Product Manager, Founder

Summary TLDR

This paper introduces Focus, a simple agent-level memory manager that lets an LLM decide when to summarize and delete recent interaction logs. Focus adds two tools (start_focus / complete_focus) and a persistent "Knowledge" block. On 5 hard SWE-bench Lite tasks with Claude Haiku 4.5, aggressive prompting to compress every 10–15 tool calls cut total token use by 22.7% (14.9M → 11.5M) while matching baseline task success (3/5). Savings concentrated on exploration-heavy bugs (up to 57% per instance); one iterative task saw higher token use. Key takeaway: explicit, frequent compression can reduce token cost without hurting accuracy for explore-then-implement workflows.

Problem Statement

Long-running LLM agents accumulate large interaction histories. This raises compute cost, increases latency, and can confuse the model with noisy past failures. Existing compression is usually external and passive; agents lack an autonomous, intra-trajectory way to prune raw logs while keeping what they learned.

Main Contribution

Focus agent loop: two primitives (start_focus, complete_focus) that let the model checkpoint, summarize learnings, append a persistent Knowledge block, and delete raw logs.

A practical scaffold (persistent bash + string-replace editor) and prompting recipe that enforces frequent compressions.
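
The two primitives described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `FocusContext` class, its field names, and the `render` helper are hypothetical, and only the tool names `start_focus` / `complete_focus` and the idea of a persistent Knowledge block come from the summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FocusContext:
    """Hypothetical container for an agent's context (illustrative only)."""
    knowledge: List[str] = field(default_factory=list)  # persistent Knowledge block
    messages: List[str] = field(default_factory=list)   # raw interaction log
    checkpoint: Optional[int] = None                     # index set by start_focus

    def start_focus(self) -> None:
        # Mark where the current exploration phase begins.
        self.checkpoint = len(self.messages)

    def complete_focus(self, summary: str) -> int:
        # Append the learnings to the Knowledge block, then delete the raw
        # messages logged since the checkpoint. Returns how many were dropped.
        if self.checkpoint is None:
            raise RuntimeError("complete_focus called without start_focus")
        dropped = len(self.messages) - self.checkpoint
        self.knowledge.append(summary)
        del self.messages[self.checkpoint:]
        self.checkpoint = None
        return dropped

    def render(self) -> List[str]:
        # Knowledge block goes at the top of the context, then surviving messages.
        return ["Knowledge:\n" + "\n".join(self.knowledge)] + self.messages


ctx = FocusContext(messages=["task: fix the failing test"])
ctx.start_focus()
ctx.messages += ["grep output", "file listing", "traceback"]
dropped = ctx.complete_focus("The bug is in the axis-label code path.")
```

In this sketch the three exploration messages are replaced by a single summary line, while the task message recorded before the checkpoint is kept.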

Key Findings

Agent-controlled compression reduced total token use by 22.7% without lowering task success on the evaluated set.

Numbers: Total tokens 14,920,555 → 11,526,418 (−22.7%); task success 3/5 → 3/5

Practical Use: If you add focus primitives and aggressively prompt compression, you can cut token bills substantially on similar code-exploration tasks without losing correctness.

Evidence Ref: Table I; Results section

Compression yields much larger savings on exploration-heavy tasks but can add overhead on iterative-refinement tasks.

Numbers: Per-instance savings ranged 18%–57% on 4/5 instances; one instance rose +110%

Practical Use: Apply Focus to explore-then-implement jobs (code navigation, large searches). Avoid aggressive pruning on tasks that need continuous, cumulative context.

Evidence Ref: Table II; Case studies (matplotlib-26020, pylint-7080)

Results

Metric	Value (Focus)	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Task Success (tests passed)	3/5 (60%)	3/5 (60%)	Same	SWE-bench Lite, N=5	Both agents passed 3/5 hard instances	Table I
Total Tokens	11,526,418	14,920,555	-22.7%	SWE-bench Lite, N=5	Total token counts	Table I
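
The headline delta can be checked directly from the two totals in the table. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
baseline_tokens = 14_920_555  # Baseline total from the table
focus_tokens = 11_526_418     # Focus total from the table

saved = baseline_tokens - focus_tokens
reduction_pct = 100 * saved / baseline_tokens

print(f"saved={saved:,} tokens, reduction={reduction_pct:.1f}%")
# prints: saved=3,394,137 tokens, reduction=22.7%
```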

What To Try In 7 Days

Add start_focus / complete_focus primitives to your agent loop and a top-of-context Knowledge block.

Use a persistent shell and string-replace editor scaffold to match developer workflows.

Experiment with system prompts that require compression every 10–15 tool calls and inject reminders after long stretches without compression.
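
One way to implement the reminder step is a small helper that the agent loop calls after every tool call. This is a sketch under stated assumptions: the function name, the two thresholds as parameters, and the reminder wording are all hypothetical; only the 10–15 tool-call window comes from the summary.

```python
from typing import Optional


def compression_reminder(calls_since_compression: int,
                         soft_limit: int = 10,
                         hard_limit: int = 15) -> Optional[str]:
    """Return a reminder to inject into the context, or None if none is due."""
    if calls_since_compression >= hard_limit:
        # Past the upper end of the window: ask for compression explicitly.
        return ("Reminder: you have made "
                f"{calls_since_compression} tool calls without compressing. "
                "Call complete_focus now and record what you learned.")
    if calls_since_compression >= soft_limit:
        # Inside the window: a lighter nudge.
        return "Consider calling complete_focus to consolidate your findings."
    return None


# Example: the agent loop resets its counter whenever complete_focus runs.
reminders = [compression_reminder(n) for n in (3, 12, 15)]
```

Keeping the thresholds as parameters makes it easy to test less aggressive schedules on tasks where frequent pruning hurts.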

Agent Features

Memory

Autonomous intra-trajectory compression (prune recent logs)

Persistent Knowledge block for consolidated facts

Planning

Agent decides when to checkpoint and consolidate

Structured phases: explore → consolidate → implement → verify

Tool Use

Persistent bash shell

String-replace editor

Tool-heavy workflow (encouraged 100+ tool calls)

Frameworks

Focus loop primitives (start_focus, complete_focus)

Is Agentic

Yes

Architectures

ReAct-style loop with Focus extensions (start_focus / complete_focus)

Sawtooth context pattern (explore then collapse)

Optimization Features

Token Efficiency

22.7% total token reduction on evaluated tasks

Frequent small compressions (every 10–15 calls) preferred over infrequent large ones

Token amortization: a compression costs hundreds of tokens but saves thousands on long tasks

System Optimization

Sawtooth context management to avoid quadratic growth
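
A toy model shows why the sawtooth pattern helps. It assumes the model re-reads the whole context on every tool call, so cumulative input tokens grow quadratically with the number of calls unless the raw log is periodically collapsed. The function and all numeric values below (1,000 tokens per call, 200-token summaries, 100 calls) are illustrative assumptions, not figures from the paper, and the model ignores output tokens and the cost of writing each summary.

```python
from typing import Optional


def total_input_tokens(num_calls: int,
                       step_tokens: int = 1_000,
                       compress_every: Optional[int] = None,
                       summary_tokens: int = 200) -> int:
    """Cumulative input tokens when the full context is re-read on every call."""
    knowledge = 0  # tokens in the persistent Knowledge block
    raw = 0        # tokens of raw logs currently in context
    since = 0      # calls since the last compression
    total = 0
    for _ in range(num_calls):
        total += knowledge + raw  # cost of reading the context for this call
        raw += step_tokens        # the call's output is appended to the log
        since += 1
        if compress_every and since == compress_every:
            knowledge += summary_tokens  # keep a short summary
            raw = 0                      # drop the raw log (the sawtooth collapse)
            since = 0
    return total


no_compression = total_input_tokens(100)
sawtooth = total_input_tokens(100, compress_every=10)
```

Under these assumptions the uncompressed run reads 4,950,000 tokens in total and the sawtooth run reads 540,000. Real savings depend on how much context must survive each collapse, which is why the measured reduction in the evaluation is far smaller than in this idealized model.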

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Small evaluation (N=5); results may not generalize to the full SWE-bench (N=300).

Only Claude Haiku 4.5 was tested; other LLMs may need different prompts.

When Not To Use

Tasks that require continuous accumulation of fine-grained state (iterative refinement).

Short tasks where compression overhead won't amortize.

Failure Modes

Over-aggressive pruning removes needed context and forces re-exploration, increasing tokens.

Model may follow compression prompts blindly and compress critical intermediate artifacts.

Core Entities

Models

claude-haiku-4-5-20251001

Metrics

total_tokens, task_success, avg_compressions, messages_dropped, per-instance_token_savings

Datasets

SWE-bench Lite

Benchmarks

SWE-bench Lite
