Keep agent context small forever by storing task state as files, for more stable long-run behavior in research workflows

Overview

Decision Snapshot: Ready For Pilot

Design is clearly useful for long-running document workflows and is supported by benchmark and ablation data. Evidence is from a focused set of research-oriented tasks; broader generalization is untested.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Chenglin Yu, Yuchen Wang, Songmiao Wang, Hongxia Yang, Ming Li


Why It Matters For Business

If your workflows involve long document processing or multi-step knowledge work, state management matters more than raw model size. A file-centric agent design can make smaller, cheaper models far more reliable over long runs and reduce costly re-runs.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, Data Scientist, Founder

Summary TLDR

InfiAgent keeps an agent's working memory small by storing all persistent task state as files and reconstructing a fixed-size reasoning context from that workspace plus a short recent-action buffer. This file-centric design, a hierarchical agent stack, and an "external attention" tool pipeline improve reliability on long tasks (80-paper literature review) and let a 20B open model match larger systems on the DeepResearch benchmark without fine-tuning.

Problem Statement

Current LLM agents pack long-term state into the prompt. As tasks grow, prompts bloat and agents break: early errors accumulate, relevance is lost, and performance drops on long workflows.

Main Contribution

File-centric state abstraction: treat workspace files as the authoritative persistent state instead of embedding history in the prompt.

Bounded reasoning reconstruction: build each reasoning prompt from a workspace snapshot plus a fixed small window of recent actions (e.g., k=10) so context size stays constant.
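The bounded reconstruction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not InfiAgent's actual implementation: the function names `workspace_snapshot` and `build_context`, the prompt layout, and the choice of a plain file listing as the "snapshot" are all hypothetical. Only the idea (files as state, plus a window of the last k actions, with k=10 as the example value) comes from the text.

```python
from collections import deque
from pathlib import Path
import tempfile

K = 10  # window size; the text gives k=10 as an example value


def workspace_snapshot(root: Path) -> str:
    """List workspace files with sizes (a hypothetical, minimal snapshot)."""
    lines = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            lines.append(f"{rel} ({path.stat().st_size} bytes)")
    return "\n".join(lines)


def build_context(root: Path, recent: deque, task: str) -> str:
    """Rebuild the prompt from the files plus the bounded action buffer."""
    return (
        f"TASK:\n{task}\n\n"
        f"WORKSPACE:\n{workspace_snapshot(root)}\n\n"
        "RECENT ACTIONS:\n" + "\n".join(recent)
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    recent = deque(maxlen=K)  # older actions are dropped automatically
    for step in range(25):
        # Persistent state goes to a file, not into the prompt history.
        (root / "notes.md").write_text(f"progress through step {step}\n")
        recent.append(f"step {step}: updated notes.md")
    context = build_context(root, recent, "Review 80 papers")
    print(context)
```

After 25 steps the rebuilt context still holds only the last 10 actions. Note that in this sketch the file listing itself grows with the number of files, so a real system would presumably need to summarize or cap the snapshot as well.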

Key Findings

InfiAgent (gpt-oss-20b) scores 41.45 on DeepResearch using no task-specific fine-tuning.

Numbers: DeepResearch overall = 41.45 (Table 2)

Practical Use: You can recover part of the performance lost by using a smaller model by redesigning agent state and control flow. Try file-based state before scaling model size.

Evidence Ref: Section 5.1; Table 2

On an 80-paper literature review, InfiAgent (gpt-oss-20b) covered an average of 67.1 papers per run (max 80).

Numbers: Coverage avg = 67.1, max = 80, min = 15 (Table 1)

Practical Use: For long document-processing jobs, externalizing state into files keeps agents running farther and more stably than prompt-centric designs.

Evidence Ref: Section 5.2; Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
DeepResearch overall	41.45	larger proprietary agents (various)	—	DeepResearch benchmark	InfiAgent with gpt-oss-20b achieved 41.45 overall (Table 2)	Table 2
Literature review coverage (avg)	67.1 papers	No File State (gpt-oss-20b) avg 3.2	+63.9	80-paper literature review	InfiAgent avg coverage 67.1 vs ablation 3.2 (Table 1)	Table 1
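The Delta column for literature review coverage can be checked with simple arithmetic. The snippet below uses only the Table 1 averages quoted in the row above. The coverage rate as a percentage of the 80 papers is derived here for convenience and is not a figure reported in the source.

```python
infiagent_avg = 67.1  # Table 1: InfiAgent (gpt-oss-20b), average papers covered
ablation_avg = 3.2    # Table 1: No File State ablation, average papers covered
total_papers = 80     # size of the literature review task

# Absolute gain over the ablation, in papers per run.
delta = round(infiagent_avg - ablation_avg, 1)

# Derived: share of the 80 papers covered on average.
coverage_rate = round(infiagent_avg / total_papers * 100, 1)

print(f"delta = +{delta} papers, coverage = {coverage_rate}% of {total_papers}")
```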

What To Try In 7 Days

Prototype a workspace-as-state: store intermediate outputs and plans in files rather than appending into prompts.

Implement a fixed-size recent-action buffer (e.g., last 10 actions) to rebuild context for each step.

Wrap heavy-document reads in an isolated extractor tool (answer_from_pdf) and return only extracted facts.
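The third suggestion above, combined with file-based state, might look something like the sketch below. It is an assumption-laden illustration: `answer_from_text` is a hypothetical stand-in for the `answer_from_pdf` tool named above (PDF parsing is left out, and keyword matching replaces whatever the real tool does), `record_facts` and the `facts.md` file name are invented for the example, and the sample text is made up.

```python
import re
import tempfile
from pathlib import Path


def answer_from_text(document: str, keyword: str, max_facts: int = 3) -> list[str]:
    """Hypothetical stand-in for answer_from_pdf.

    Reads the whole document in isolation and returns only a few short
    matching sentences, so the caller never sees the full text.
    """
    sentences = re.split(r"(?<=[.!?])\s+", document)
    hits = [s.strip() for s in sentences if keyword.lower() in s.lower()]
    return hits[:max_facts]


def record_facts(workspace: Path, source: str, facts: list[str]) -> Path:
    """Append extracted facts to a notes file so state lives in the workspace."""
    notes = workspace / "facts.md"
    with notes.open("a", encoding="utf-8") as fh:
        for fact in facts:
            fh.write(f"- [{source}] {fact}\n")
    return notes


# Made-up sample text standing in for a long paper.
paper_text = (
    "We study long-horizon agents. "
    "Coverage is the number of papers processed per run. "
    "Latency increases with serial execution. "
    "Coverage drops sharply without file state."
)

with tempfile.TemporaryDirectory() as tmp:
    facts = answer_from_text(paper_text, "coverage")
    notes_path = record_facts(Path(tmp), "paper_01", facts)
    saved = notes_path.read_text(encoding="utf-8")

print(facts)
```

The point of the design is that only `facts` returns to the agent's context, while the full document stays inside the tool call and the durable record stays in the workspace file.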

Agent Features

Memory

persistent file-centric state (workspace files)

fixed-size recent-action buffer (bounded short-term context)

Planning

top-down hierarchical planning

recursive DAG-based task decomposition
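As a rough illustration of DAG-based decomposition with serial execution, the sketch below orders a small task graph so that every subtask runs after its dependencies. The task names and the `plan` dictionary are hypothetical, and using the standard library's `graphlib.TopologicalSorter` is a choice made for this example. The source does not say how InfiAgent schedules its DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition: each task maps to the subtasks it depends on.
plan = {
    "write_review": {"summarize_papers", "outline"},
    "summarize_papers": {"collect_papers"},
    "outline": {"collect_papers"},
    "collect_papers": set(),
}

# static_order() yields tasks so that dependencies always come first,
# which gives one valid serial execution order for the DAG.
order = list(TopologicalSorter(plan).static_order())

for task in order:
    print("run", task)
```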

Tool Use

Agent-as-a-Tool (call lower-level agents as tools)

external attention tools for document querying

file I/O for persistent artifacts

Frameworks

InfiAgent

Is Agentic

Yes

Architectures

hierarchical stack (Alpha / Domain / Atomic)

file-centric workspace hub

Collaboration

serial multi-agent delegation with strict parent-child control

Optimization Features

Token Efficiency

avoids re-including long histories in the prompt

Infra Optimization

serial execution for state consistency (trade-off: higher latency)

System Optimization

periodic state consolidation to refresh workspace snapshots

Inference Optimization

bounded context reduces per-step prompt size

external attention offloads heavy reading to isolated processes

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ChenglinPoly/infiAgent

Risks & Boundaries

Limitations

Does not fix model reasoning errors: wrong outputs can be written into persistent state and propagated (Section 6).

Introduces latency: serial hierarchical execution and file operations raise response time, making it less suitable for real-time use (Section 6).

When Not To Use

Real-time interactive applications where low latency matters.

Workloads that require heavy parallelism across independent subtasks without strict serialization.

Failure Modes

Hallucination propagation: an early incorrect artifact saved to files can mislead later steps.

Early termination or skipping items in long runs if higher-level orchestration fails (observed in baselines).

Core Entities

Models

gpt-oss-20b

Gemini-3-Flash

Claude-4.5-Sonnet

Gemini-2.5-Pro

GPT-4o

GPT-5 (various)

Metrics

coverage (long-horizon reliability)

DeepResearch overall score

comprehensiveness

insight

instruction following

readability

Datasets

DeepResearch benchmark

80-paper literature review (custom task)

Benchmarks

DeepResearch

