Keep agent context small forever by storing task state as files, providing more stable long-run behavior for research workflows

January 6, 2026 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The design is clearly useful for long-running document workflows and is supported by benchmark and ablation data. The evidence comes from a focused set of research-oriented tasks; broader generalization is untested.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Chenglin Yu, Yuchen Wang, Songmiao Wang, Hongxia Yang, Ming Li

Why It Matters For Business

If your workflows involve long document processing or multi-step knowledge work, state management matters more than raw model size. A file-centric agent design can make smaller, cheaper models far more reliable over long runs and reduce costly re-runs.

Who Should Care

Summary TLDR

InfiAgent keeps an agent's working memory small by storing all persistent task state as files and reconstructing a fixed-size reasoning context from that workspace plus a short recent-action buffer. This file-centric design, a hierarchical agent stack, and an "external attention" tool pipeline improve reliability on long tasks (80-paper literature review) and let a 20B open model match larger systems on the DeepResearch benchmark without fine-tuning.

Problem Statement

Current LLM agents pack long-term state into the prompt. As tasks grow, prompts bloat and agents break: early errors accumulate, relevance is lost, and performance drops on long workflows.

Main Contribution

File-centric state abstraction: treat workspace files as the authoritative persistent state instead of embedding history in the prompt.

Bounded reasoning reconstruction: build each reasoning prompt from a workspace snapshot plus a fixed small window of recent actions (e.g., k=10) so context size stays constant.
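
The two contributions above can be sketched together. This is a minimal illustration, not the authors' implementation: the function names (`snapshot_workspace`, `build_context`, `run_demo`), the prompt layout, and the `notes.md` file are all assumptions made for the example. The only details taken from the summary are that state lives in workspace files and that the prompt is rebuilt from a workspace snapshot plus the last k actions (k=10).

```python
from collections import deque
from pathlib import Path
import tempfile


def snapshot_workspace(root: Path, max_chars_per_file: int = 200) -> str:
    """List workspace files with a truncated preview of each (hypothetical format)."""
    lines = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            preview = path.read_text(encoding="utf-8")[:max_chars_per_file]
            lines.append(f"[{path.relative_to(root)}] {preview}")
    return "\n".join(lines)


def build_context(task: str, root: Path, recent_actions: deque) -> str:
    """Rebuild the prompt from the files plus the bounded action buffer only."""
    return "\n".join([
        f"TASK: {task}",
        "WORKSPACE:",
        snapshot_workspace(root),
        "RECENT ACTIONS:",
        *recent_actions,
    ])


def run_demo(steps: int = 50, k: int = 10) -> list[int]:
    """Simulate a long run and return the prompt length at every step."""
    sizes = []
    recent = deque(maxlen=k)  # older actions fall out automatically
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        notes = root / "notes.md"
        for step in range(steps):
            # Persistent state goes to a file that is overwritten in place.
            notes.write_text(f"papers processed: {step + 1:03d}\n", encoding="utf-8")
            recent.append(f"step {step:03d}: processed a paper")
            sizes.append(len(build_context("review papers", root, recent)))
    return sizes
```

In this toy run the prompt length stops growing once the buffer holds k entries, because nothing from earlier steps is carried in the prompt. Note that the snapshot here still grows with the number of files; a real system would need to bound that too, and how InfiAgent does so is not described in this summary.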

Key Findings

InfiAgent (gpt-oss-20b) scores 41.45 on DeepResearch using no task-specific fine-tuning.

Numbers: DeepResearch overall = 41.45 (Table 2)

Practical Use: You can partially regain the performance lost by using a smaller model by redesigning agent state and flow; try file-based state before scaling model size.

Evidence Ref: Section 5.1; Table 2

On an 80-paper literature review, InfiAgent (gpt-oss-20b) averaged a coverage of 67.1 papers per run (max 80).

Numbers: Coverage avg = 67.1, max = 80, min = 15 (Table 1)

Practical Use: For long document-processing jobs, externalizing state into files keeps agents running farther and more stably than prompt-centric designs.

Evidence Ref: Section 5.2; Table 1

Results

Metric: DeepResearch overall
Value: 41.45
Baseline: larger proprietary agents (various)
Delta: (none reported)
Split / Dataset: DeepResearch benchmark
Evidence: InfiAgent with gpt-oss-20b achieved 41.45 overall (Table 2)
Evidence Ref: Table 2

Metric: Literature review coverage (avg)
Value: 67.1 papers
Baseline: No File State (gpt-oss-20b), avg 3.2
Delta: +63.9
Split / Dataset: 80-paper literature review
Evidence: InfiAgent avg coverage 67.1 vs ablation 3.2 (Table 1)
Evidence Ref: Table 1
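
The literature-review delta can be checked with a few lines of arithmetic. The sketch below only restates the Table 1 figures quoted above; expressing them as a share of the 80-paper task is an added convenience, not a number reported in the source.

```python
TOTAL_PAPERS = 80          # size of the literature-review task
INFIAGENT_AVG = 67.1       # average papers covered, file-centric design
NO_FILE_STATE_AVG = 3.2    # average papers covered, ablation without file state


def coverage_percent(avg_papers: float, total: int = TOTAL_PAPERS) -> float:
    """Average coverage as a percentage of the task, rounded to one decimal."""
    return round(100 * avg_papers / total, 1)


delta = round(INFIAGENT_AVG - NO_FILE_STATE_AVG, 1)         # 63.9 papers
infiagent_share = coverage_percent(INFIAGENT_AVG)           # 83.9 percent
no_file_state_share = coverage_percent(NO_FILE_STATE_AVG)   # 4.0 percent
```

Read this way, the ablation covers about 4% of the papers on average and the full design about 84%, with the caveat that the reported minimum of 15 shows individual runs can still fall well short.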

What To Try In 7 Days

Prototype a workspace-as-state: store intermediate outputs and plans in files rather than appending into prompts.

Implement a fixed-size recent-action buffer (e.g., last 10 actions) to rebuild context for each step.

Wrap heavy-document reads in an isolated extractor tool (answer_from_pdf) and return only extracted facts.
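
The third step can be prototyped without any model at all. The sketch below is an assumption-heavy stand-in for a tool like answer_from_pdf: `answer_from_text`, `record_facts`, `demo_extractor`, and the `facts.jsonl` file are hypothetical names, and simple keyword overlap replaces whatever extraction the real tool performs. What it does show is the contract: the full document is visible only inside the tool, and the caller receives a few short facts.

```python
import json
import re
import tempfile
from pathlib import Path


def answer_from_text(document: str, question: str,
                     max_facts: int = 3, max_chars: int = 200) -> list[str]:
    """Return at most max_facts short sentences that share words with the question."""
    keywords = {w for w in re.findall(r"[a-z0-9]+", question.lower()) if len(w) > 3}
    scored = []
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        overlap = len(keywords & words)
        if overlap:
            scored.append((overlap, sentence.strip()[:max_chars]))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps document order on ties
    return [sentence for _, sentence in scored[:max_facts]]


def record_facts(workspace: Path, source: str, question: str, facts: list[str]) -> Path:
    """Append the extracted facts to a JSON Lines file in the workspace."""
    out = workspace / "facts.jsonl"
    with out.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"source": source, "question": question, "facts": facts}) + "\n")
    return out


def demo_extractor() -> tuple[list[str], int]:
    """Extract from a long synthetic document; return saved facts and document length."""
    document = ("Filler sentence about nothing. " * 500
                + "The agent keeps a window of ten recent actions. ")
    question = "How many recent actions are kept in the window?"
    facts = answer_from_text(document, question)
    with tempfile.TemporaryDirectory() as tmp:
        out = record_facts(Path(tmp), "synthetic.txt", question, facts)
        saved = json.loads(out.read_text(encoding="utf-8").splitlines()[0])
    return saved["facts"], len(document)
```

Capping both the number of facts and their length is the design choice that matters here: the agent's prompt grows by a bounded amount per read regardless of how large the source document is.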

Agent Features

Memory
persistent file-centric state (workspace files); fixed-size recent-action buffer (bounded short-term context)
Planning
top-down hierarchical planning; recursive DAG-based task decomposition
Tool Use
Agent-as-a-Tool (call lower-level agents as tools); external attention tools for document querying; file I/O for persistent artifacts
Frameworks
InfiAgent
Is Agentic

Yes

Architectures
hierarchical stack (Alpha / Domain / Atomic); file-centric workspace hub
Collaboration
serial multi-agent delegation with strict parent-child control
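
A rough sketch of how Agent-as-a-Tool, DAG-based decomposition, and serial parent-child control might fit together is given below. It is not InfiAgent's API: `atomic_agent`, `make_parent`, and the example plans are hypothetical, and an "agent" is reduced to a plain function from a task string to a result string. The standard-library `graphlib.TopologicalSorter` supplies the dependency ordering.

```python
from graphlib import TopologicalSorter
from typing import Callable

# An "agent" here is just a callable from a task string to a result string.
Agent = Callable[[str], str]


def atomic_agent(name: str) -> Agent:
    """Leaf worker; in a real system this would call a model or a tool."""
    return lambda task: f"{name} finished '{task}'"


def make_parent(plan: dict[str, set[str]], children: dict[str, Agent],
                log: list[str]) -> Agent:
    """Build a parent agent that runs its children serially in dependency order.

    `plan` maps each subtask to the set of subtasks it depends on (a DAG).
    """
    def parent(task: str) -> str:
        results = {}
        for subtask in TopologicalSorter(plan).static_order():
            # One child at a time; the parent regains control after each call.
            results[subtask] = children[subtask](subtask)
            log.append(subtask)
        return f"{task}: {len(results)} subtasks done"
    return parent


log: list[str] = []

# Domain level: a linear chain of atomic workers.
domain_plan = {"search": set(), "read": {"search"}, "summarize": {"read"}}
domain = make_parent(domain_plan, {s: atomic_agent(s) for s in domain_plan}, log)

# Alpha level: the whole domain agent is used as one of its tools.
alpha_plan = {"literature_review": set(), "write_report": {"literature_review"}}
alpha = make_parent(
    alpha_plan,
    {"literature_review": domain, "write_report": atomic_agent("writer")},
    log,
)

result = alpha("research task")
```

Because a parent is itself an `Agent`, it can be passed down as a child of a higher level, which is one way to read the Alpha / Domain / Atomic stack. Running everything in one serial loop trades latency for a single, consistent view of state.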

Optimization Features

Token Efficiency
avoids re-including long histories in the prompt
Infra Optimization
serial execution for state consistency (trade-off: higher latency)
System Optimization
periodic state consolidation to refresh workspace snapshots
Inference Optimization
bounded context reduces per-step prompt size; external attention offloads heavy reading to isolated processes
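
The summary does not say how periodic state consolidation works, so the following is only one plausible reading: every few steps, fold an append-only log into a compact state file and clear the log, so the next workspace snapshot stays short. The file names (`state.json`, `pending.log`), the `every` interval, and all function names are assumptions made for this sketch.

```python
import json
import tempfile
from pathlib import Path


def consolidate(workspace: Path) -> dict:
    """Fold pending log entries into state.json and clear the log."""
    state_file = workspace / "state.json"
    log_file = workspace / "pending.log"
    if state_file.exists():
        state = json.loads(state_file.read_text(encoding="utf-8"))
    else:
        state = {"done": []}
    if log_file.exists():
        for line in log_file.read_text(encoding="utf-8").splitlines():
            if line and line not in state["done"]:
                state["done"].append(line)
        log_file.unlink()
    state_file.write_text(json.dumps(state), encoding="utf-8")
    return state


def run(items: list[str], workspace: Path, every: int = 5) -> dict:
    """Process items, consolidating the workspace every `every` steps."""
    log_file = workspace / "pending.log"
    for step, item in enumerate(items, start=1):
        with log_file.open("a", encoding="utf-8") as fh:
            fh.write(item + "\n")
        if step % every == 0:
            consolidate(workspace)
    return consolidate(workspace)  # final pass picks up any remainder


def demo_consolidation() -> tuple[dict, bool]:
    """Run 12 items with consolidation every 5; return state and whether a log remains."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        state = run([f"paper-{i:02d}" for i in range(12)], workspace, every=5)
        return state, (workspace / "pending.log").exists()
```

A consolidation step like this is also where a wrong artifact becomes durable, which connects to the hallucination-propagation risk listed under Failure Modes.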

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Does not fix model reasoning errors: wrong outputs can be written into persistent state and propagated (Section 6).

Introduces latency: serial hierarchical execution and file operations raise response time, making it less suitable for real-time use (Section 6).

When Not To Use

Real-time interactive applications where low latency matters.

Workloads that require heavy parallelism across independent subtasks without strict serialization.

Failure Modes

Hallucination propagation: an early incorrect artifact saved to files can mislead later steps.

Early termination or skipping items in long runs if higher-level orchestration fails (observed in baselines).

Core Entities

Models

gpt-oss-20b, Gemini-3-Flash, Claude-4.5-Sonnet, Gemini-2.5-Pro, GPT-4o, GPT-5 (various)

Metrics

coverage (long-horizon reliability), DeepResearch overall score, comprehensiveness, insight, instruction following, readability

Datasets

DeepResearch benchmark, 80-paper literature review (custom task)

Benchmarks

DeepResearch