Keep agent context bounded by storing task state as files, for more stable long-run behavior in research workflows

January 6, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Chenglin Yu, Yuchen Wang, Songmiao Wang, Hongxia Yang, Ming Li

Why It Matters For Business

If your workflows involve long document processing or multi-step knowledge work, state management matters more than raw model size. A file-centric agent design can make smaller, cheaper models far more reliable over long runs and reduce costly re-runs.

Summary TLDR

InfiAgent keeps an agent's working memory small by storing all persistent task state as files and reconstructing a fixed-size reasoning context from that workspace plus a short recent-action buffer. Combined with a hierarchical agent stack and an "external attention" tool pipeline, this file-centric design improves reliability on long tasks (an 80-paper literature review) and lets a 20B open model score competitively with larger systems on the DeepResearch benchmark without fine-tuning.

Problem Statement

Current LLM agents pack long-term state into the prompt. As tasks grow, prompts bloat and agents break: early errors accumulate, relevance is lost, and performance drops on long workflows.

Main Contribution

File-centric state abstraction: treat workspace files as the authoritative persistent state instead of embedding history in the prompt.

Bounded reasoning reconstruction: build each reasoning prompt from a workspace snapshot plus a fixed small window of recent actions (e.g., k=10) so context size stays constant.

Hierarchical agent architecture: Alpha (high-level), Domain (specialist), and Atomic (tool) agents with strict parent-child control to reduce execution chaos.

External attention pipeline: offload document reading to isolated tool calls that return only extracted answers, avoiding context bloat.

Empirical demonstration: shows improved long-horizon coverage on an 80-paper literature review and competitive DeepResearch scores with a 20B open model.
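
The first two contributions can be illustrated with a minimal sketch. This is not the authors' implementation: the `Workspace` class, its method names, and the prompt layout are all assumptions made for illustration. It shows files acting as the authoritative state while the prompt is rebuilt at every step from a workspace snapshot plus the last k actions.

```python
import json
from collections import deque
from pathlib import Path


class Workspace:
    """Files on disk are the authoritative task state (hypothetical layout)."""

    def __init__(self, root, k=10):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        # Only the last k actions are kept for prompting; older ones fall off.
        self.recent = deque(maxlen=k)

    def write(self, name, text):
        """Persist an artifact and record the action in the bounded buffer."""
        (self.root / name).write_text(text, encoding="utf-8")
        self.recent.append({"action": "write", "file": name})

    def snapshot(self, preview_chars=80):
        """Summarize the workspace as file names plus a short preview of each."""
        lines = []
        for path in sorted(self.root.iterdir()):
            if path.is_file():
                head = path.read_text(encoding="utf-8")[:preview_chars]
                lines.append(f"{path.name}: {head!r}")
        return "\n".join(lines)

    def build_prompt(self, goal):
        """Rebuild the reasoning context from scratch; no history is appended."""
        actions = "\n".join(json.dumps(a) for a in self.recent)
        return (
            f"GOAL:\n{goal}\n\n"
            f"WORKSPACE:\n{self.snapshot()}\n\n"
            f"RECENT ACTIONS:\n{actions}"
        )
```

In this sketch the action buffer is strictly bounded, but the snapshot still grows with the number of files, so a real system would also need to cap or summarize the listing to keep the context size constant.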

Key Findings

InfiAgent (gpt-oss-20b) scores 41.45 on DeepResearch using no task-specific fine-tuning.

Numbers: DeepResearch overall = 41.45 (Table 2)

On an 80-paper literature review, InfiAgent (gpt-oss-20b) covered an average of 67.1 of the 80 papers per run (best run 80, worst run 15).

Numbers: Coverage avg = 67.1, max = 80, min = 15 (Table 1)

Removing file-centric state and using compressed long-context prompts drops coverage for the 20B model from avg 67.1 to avg 3.2.

Numbers: No File State (20B) coverage avg = 3.2 (Table 1 ablation)

Stronger backbones with the InfiAgent design (Gemini-3-Flash, Claude-4.5) achieved perfect coverage (avg 80) on the 80-paper task.

Numbers: Gemini-3-Flash avg = 80.0; Claude-4.5 avg = 80.0 (Table 1)

Results

DeepResearch overall

Value: 41.45

Baseline: larger proprietary agents (various)

Literature review coverage (avg)

Value: 67.1 papers

Baseline: No File State (gpt-oss-20b) avg 3.2

Literature review coverage (best-performing backbones)

Value: 80.0 papers (avg)

Baseline: InfiAgent (20B) avg 67.1

What To Try In 7 Days

Prototype a workspace-as-state: store intermediate outputs and plans in files rather than appending into prompts.

Implement a fixed-size recent-action buffer (e.g., last 10 actions) to rebuild context for each step.

Wrap heavy-document reads in an isolated extractor tool (answer_from_pdf) and return only extracted facts.
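
The isolated-extractor idea in the last step could be prototyped roughly as follows. This is a sketch under stated assumptions: it reads plain text rather than PDFs, `answer_from_document` is a hypothetical stand-in for a tool like `answer_from_pdf`, and `keyword_extract` is a toy placeholder for what would normally be a separate model call.

```python
from pathlib import Path


def answer_from_document(path, question, extract, max_chars=300):
    """Read a document in isolation and return only a short extracted answer.

    `extract` is a hypothetical callable (for example, a wrapper around a
    separate LLM call) that receives the full text and the question.
    """
    full_text = Path(path).read_text(encoding="utf-8")
    answer = extract(full_text, question)
    # Only the truncated answer crosses back into the agent's context;
    # the full document text never does.
    return answer.strip()[:max_chars]


def keyword_extract(text, question):
    """Toy extractor: first sentence that shares a longer word with the question."""
    words = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    for sentence in text.replace("\n", " ").split(". "):
        if words & {w.lower().strip("?.,") for w in sentence.split()}:
            return sentence.strip()
    return "not found"
```

The point of the design is the boundary: however long the document is, the caller's context grows by at most `max_chars` characters per query.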

Agent Features

Memory

  • persistent file-centric state (workspace files)
  • fixed-size recent-action buffer (bounded short-term context)

Planning

  • top-down hierarchical planning
  • recursive DAG-based task decomposition

Tool Use

  • Agent-as-a-Tool (call lower-level agents as tools)
  • external attention tools for document querying
  • file I/O for persistent artifacts

Frameworks

  • InfiAgent

Is Agentic

true

Architectures

  • hierarchical stack (Alpha / Domain / Atomic)
  • file-centric workspace hub

Collaboration

  • serial multi-agent delegation with strict parent-child control
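
Strict parent-child control with serial delegation might look like the sketch below. The `Agent` class and its methods are hypothetical and are not taken from the InfiAgent code; the sketch only illustrates that a parent may call its direct children as tools, one at a time, and nothing else.

```python
class Agent:
    """A node in a hypothetical Alpha / Domain / Atomic hierarchy."""

    def __init__(self, name, handler=None):
        self.name = name
        self.handler = handler  # set only on leaf (Atomic) agents
        self.children = {}

    def add_child(self, child):
        self.children[child.name] = child
        return child

    def call(self, child_name, task, trace):
        # Strict parent-child control: only direct children are callable.
        if child_name not in self.children:
            raise PermissionError(f"{self.name} cannot call {child_name}")
        return self.children[child_name].run(task, trace)

    def run(self, task, trace):
        trace.append(self.name)
        if self.handler is not None:
            return self.handler(task)
        # Serial delegation: children run one at a time, in insertion order.
        return [self.call(name, task, trace) for name in self.children]
```

Because every call goes through `call`, an Alpha-level agent cannot reach an Atomic tool directly; it has to delegate through the Domain agent that owns it.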

Optimization Features

Token Efficiency

  • avoids re-including long histories in the prompt

Infra Optimization

  • serial execution for state consistency (trade-off: higher latency)

System Optimization

  • periodic state consolidation to refresh workspace snapshots
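
The summary does not describe how consolidation works, so the following is only one plausible reading: older entries of an action log are folded into a summary file and the log is trimmed to its newest entries. The file names `actions.jsonl` and `state_summary.md`, and the `step` and `note` fields, are assumptions for illustration.

```python
import json
from pathlib import Path


def consolidate(workspace, log_name="actions.jsonl",
                summary_name="state_summary.md", keep_last=10):
    """Fold older log entries into a summary file and keep only the newest ones."""
    root = Path(workspace)
    log_path = root / log_name
    lines = log_path.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line]
    split = max(len(entries) - keep_last, 0)
    old, recent = entries[:split], entries[split:]
    if old:
        # Append, so earlier consolidation rounds are preserved.
        with (root / summary_name).open("a", encoding="utf-8") as f:
            for e in old:
                f.write(f"- step {e['step']}: {e['note']}\n")
    # Rewrite the log with only the most recent entries.
    log_path.write_text("".join(json.dumps(e) + "\n" for e in recent),
                        encoding="utf-8")
    return len(old)
```

Calling this every few steps keeps the log that feeds the recent-action buffer short, while the summary file remains available as part of the workspace snapshot.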

Inference Optimization

  • bounded context reduces per-step prompt size
  • external attention offloads heavy reading to isolated processes

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not fix model reasoning errors: wrong outputs can be written into persistent state and propagated (Section 6).
  • Introduces latency: serial hierarchical execution and file operations raise response time, making it less suitable for real-time use (Section 6).
  • Does not support parallel processing out of the box: current design enforces serial execution for state consistency (Section 7).
  • Evaluation is concentrated on research-style, document-heavy tasks; reactive or embodied tasks are not evaluated (Section 6).

When Not To Use

  • Real-time interactive applications where low latency matters.
  • Workloads that require heavy parallelism across independent subtasks without strict serialization.
  • High-stakes domains without robust validation, since persistent state can lock in hallucinations.

Failure Modes

  • Hallucination propagation: an early incorrect artifact saved to files can mislead later steps.
  • Early termination or skipping items in long runs if higher-level orchestration fails (observed in baselines).
  • Increased variance with compressed long-context substitutes: without file state, coverage for the 20B model fell from avg 67.1 to avg 3.2.

Core Entities

Models

  • gpt-oss-20b
  • Gemini-3-Flash
  • Claude-4.5-Sonnet
  • Gemini-2.5-Pro
  • GPT-4o
  • GPT-5 (various)

Metrics

  • coverage (long-horizon reliability)
  • DeepResearch overall score
  • comprehensiveness
  • insight
  • instruction following
  • readability

Datasets

  • DeepResearch benchmark
  • 80-paper literature review (custom task)

Benchmarks

  • DeepResearch