Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If your workflows involve long document processing or multi-step knowledge work, state management matters more than raw model size. A file-centric agent design can make smaller, cheaper models far more reliable over long runs and reduce costly re-runs.
Summary TLDR
InfiAgent keeps an agent's working memory small by storing all persistent task state as files and reconstructing a fixed-size reasoning context from that workspace plus a short recent-action buffer. This file-centric design, combined with a hierarchical agent stack and an "external attention" tool pipeline, improves reliability on long tasks (such as an 80-paper literature review) and lets a 20B open model match larger systems on the DeepResearch benchmark without fine-tuning.
Problem Statement
Current LLM agents pack long-term state into the prompt. As tasks grow, prompts bloat and agents break: early errors accumulate, relevance is lost, and performance drops on long workflows.
Main Contribution
File-centric state abstraction: treat workspace files as the authoritative persistent state instead of embedding history in the prompt.
Bounded reasoning reconstruction: build each reasoning prompt from a workspace snapshot plus a fixed small window of recent actions (e.g., k=10) so context size stays constant.
Hierarchical agent architecture: Alpha (high-level), Domain (specialist), and Atomic (tool) agents with strict parent-child control to reduce execution chaos.
External attention pipeline: offload document reading to isolated tool calls that return only extracted answers, avoiding context bloat.
Empirical demonstration: shows improved long-horizon coverage on an 80-paper literature review and competitive DeepResearch scores with a 20B open model.
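The first two contributions can be illustrated with a minimal sketch. It is not InfiAgent's actual code: the function names, the file layout, and the per-file preview truncation are assumptions made for illustration. It shows persistent state being written to a file while the prompt is rebuilt each step from a workspace snapshot plus only the last k actions.

```python
import tempfile
from collections import deque
from pathlib import Path


def workspace_snapshot(root: Path, max_chars_per_file: int = 200) -> str:
    """List each workspace file with a truncated preview (truncation is an assumption)."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        preview = path.read_text(encoding="utf-8")[:max_chars_per_file]
        lines.append(f"[{path.relative_to(root)}] {preview}")
    return "\n".join(lines)


def build_context(task: str, root: Path, recent_actions: deque) -> str:
    """Rebuild the prompt from files plus the recent-action window only."""
    return "\n".join([
        f"TASK: {task}",
        "WORKSPACE:",
        workspace_snapshot(root),
        "RECENT ACTIONS:",
        *recent_actions,
    ])


K = 10  # window size; the summary gives k=10 as an example
recent_actions = deque(maxlen=K)  # older actions fall out automatically

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in range(50):
        # Persistent state goes to a file, not into the prompt.
        (root / "notes.md").write_text(
            f"papers summarized so far: {step + 1}\n", encoding="utf-8"
        )
        recent_actions.append(f"step {step}: summarized paper {step + 1}")
    context = build_context("review 80 papers", root, recent_actions)
```

After 50 steps the context still holds only the ten most recent actions, while the running total survives in `notes.md`. In this sketch the snapshot grows with the number of files, so a real system would also need to bound the snapshot itself.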
Key Findings
InfiAgent (gpt-oss-20b) scores 41.45 on DeepResearch using no task-specific fine-tuning.
On an 80-paper literature review, InfiAgent (gpt-oss-20b) covered an average of 67.1 papers per run (maximum 80).
Removing file-centric state and using compressed long-context prompts instead drops average coverage for the 20B model from 67.1 papers to 3.2.
Stronger backbones with the InfiAgent design (Gemini-3-Flash, Claude-4.5-Sonnet) achieved perfect coverage (average 80) on the 80-paper task.
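To put the coverage figures on a common scale, the short calculation below converts the reported averages for the 20B model into fractions of the 80-paper task. Only the three numbers come from the findings above; the variable names are illustrative.

```python
TOTAL_PAPERS = 80
with_file_state = 67.1     # average papers covered, file-centric state
without_file_state = 3.2   # average papers covered, compressed long-context prompts

coverage_with = with_file_state / TOTAL_PAPERS
coverage_without = without_file_state / TOTAL_PAPERS
ratio = with_file_state / without_file_state

print(f"{coverage_with:.1%} vs {coverage_without:.1%}, ratio {ratio:.1f}x")
```

That is roughly 84% coverage with file-centric state against 4% without it, a difference of about 21 times.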
Results
DeepResearch overall
Literature review coverage (avg)
Literature review coverage (best-performing backbones)
Who Should Care
What To Try In 7 Days
Prototype a workspace-as-state: store intermediate outputs and plans in files rather than appending into prompts.
Implement a fixed-size recent-action buffer (e.g., last 10 actions) to rebuild context for each step.
Wrap heavy-document reads in an isolated extractor tool (answer_from_pdf) and return only extracted facts.
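The third suggestion might be prototyped along these lines. This is a sketch under stated assumptions: `answer_from_text` is a hypothetical stand-in for an extractor such as answer_from_pdf, it reads plain text rather than PDF, and simple keyword matching replaces the model call a real extractor would make in its own separate context.

```python
import tempfile
from pathlib import Path


def answer_from_text(path: Path, question_keywords: list[str], max_chars: int = 300) -> str:
    """Read a document in isolation and return only the matching sentences.

    Hypothetical stand-in: keyword matching replaces the model call a real
    extractor would make in its own separate context.
    """
    text = path.read_text(encoding="utf-8")
    sentences = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
    hits = [
        s for s in sentences
        if any(k.lower() in s.lower() for k in question_keywords)
    ]
    # Only this short string is returned to the calling agent's context.
    return ". ".join(hits)[:max_chars]


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "paper_01.txt"
    doc.write_text(
        "Background filler sentence. " * 500
        + "The method uses a buffer of ten actions. "
        + "More filler follows. " * 500,
        encoding="utf-8",
    )
    document_chars = len(doc.read_text(encoding="utf-8"))
    answer = answer_from_text(doc, ["buffer"])
```

The full document never enters the caller's prompt. The agent sees only `answer`, which here is a small fraction of `document_chars`.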
Agent Features
Memory
- persistent file-centric state (workspace files)
- fixed-size recent-action buffer (bounded short-term context)
Planning
- top-down hierarchical planning
- recursive DAG-based task decomposition
Tool Use
- Agent-as-a-Tool (call lower-level agents as tools)
- external attention tools for document querying
- file I/O for persistent artifacts
Frameworks
- InfiAgent
Is Agentic
true
Architectures
- hierarchical stack (Alpha / Domain / Atomic)
- file-centric workspace hub
Collaboration
- serial multi-agent delegation with strict parent-child control
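Strict parent-child control with serial delegation could look roughly like the sketch below. The `Agent` class, its method names, and the example agent names are illustrative assumptions rather than InfiAgent's API. A parent may call only its own direct children, and each call blocks until the child returns.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    children: dict = field(default_factory=dict)  # child name -> Agent
    log: list = field(default_factory=list)

    def add_child(self, child: "Agent") -> "Agent":
        self.children[child.name] = child
        return child

    def delegate(self, child_name: str, task: str) -> str:
        """Call a direct child as a tool; anything else is rejected."""
        if child_name not in self.children:
            raise PermissionError(f"{self.name} may only call its own children")
        result = self.children[child_name].run(task)  # blocks until the child returns
        self.log.append(f"{child_name} -> {result}")
        return result

    def run(self, task: str) -> str:
        if not self.children:  # leaf: stands in for an Atomic (tool) agent
            return f"{self.name} did '{task}'"
        # Serial delegation: one child at a time, in insertion order.
        results = [self.delegate(name, task) for name in self.children]
        return "; ".join(results)


alpha = Agent("alpha")  # high-level agent
domain = alpha.add_child(Agent("literature_domain"))  # specialist
domain.add_child(Agent("search_tool"))
domain.add_child(Agent("summarize_tool"))

output = alpha.run("review paper 1")
```

Here `alpha` cannot reach `search_tool` directly; calling `alpha.delegate("search_tool", ...)` raises `PermissionError`, so every tool call passes through the Domain agent that owns it.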
Optimization Features
Token Efficiency
- avoids re-including long histories in the prompt
Infra Optimization
- serial execution for state consistency (trade-off: higher latency)
System Optimization
- periodic state consolidation to refresh workspace snapshots
Inference Optimization
- bounded context reduces per-step prompt size
- external attention offloads heavy reading to isolated processes
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Does not fix model reasoning errors: wrong outputs can be written into persistent state and propagated (Section 6).
- Introduces latency: serial hierarchical execution and file operations raise response time, making it less suitable for real-time use (Section 6).
- Does not support parallel processing out of the box: current design enforces serial execution for state consistency (Section 7).
- Evaluation is concentrated on research-style, document-heavy tasks; reactive or embodied tasks are not evaluated (Section 6).
When Not To Use
- Real-time interactive applications where low latency matters.
- Workloads that require heavy parallelism across independent subtasks without strict serialization.
- High-stakes domains without robust validation, since persistent state can lock in hallucinations.
Failure Modes
- Hallucination propagation: an early incorrect artifact saved to files can mislead later steps.
- Early termination or skipping items in long runs if higher-level orchestration fails (observed in baselines).
- Increased variance with compressed long-context substitutes: results show a severe drop in coverage when file state is removed.
Core Entities
Models
- gpt-oss-20b
- Gemini-3-Flash
- Claude-4.5-Sonnet
- Gemini-2.5-Pro
- GPT-4o
- GPT-5 (various)
Metrics
- coverage (long-horizon reliability)
- DeepResearch overall score
- comprehensiveness
- insight
- instruction following
- readability
Datasets
- DeepResearch benchmark
- 80-paper literature review (custom task)
Benchmarks
- DeepResearch

