Overview
Production Readiness
0.6
Novelty Score
0.45
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Agent context files control what AI coding agents do in your codebase. If they lack security or performance rules, agents will likely produce code that works but is vulnerable or inefficient. Treat these files like configuration and governance documents so agents follow team standards.
Summary TLDR
The authors analyze 2,303 agent context files (e.g., CLAUDE.md, AGENTS.md) from 1,925 repos to show these files are long, hard to read, actively maintained, and biased toward functional instructions (build, testing, implementation). Non-functional concerns like security and performance are rare. Automated labeling of these files is feasible (micro F1 0.79) for concrete topics but struggles with abstract guidance.
Problem Statement
AI coding agents rely on persistent, project-level instruction files to act correctly. We lack evidence about what those files contain, how they evolve, and whether we can automatically monitor them. Without that evidence, agents can be well‑informed about how to run code but poorly constrained on safety, performance, or quality.
Main Contribution
A large empirical corpus: 2,303 agent context files from 1,925 open-source repositories across Claude Code, OpenAI Codex, and GitHub Copilot.
A 16‑label taxonomy of agent instructions (e.g., Build & Run, Testing, Architecture, Security) and prevalence counts.
A maintenance analysis showing that agent context files (manifests) are actively edited in short bursts and evolve via small additions.
A demonstration of automatic multi-label classification with GPT-5, achieving micro F1 = 0.79 on these categories, with higher accuracy on concrete topics.
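To make the micro F1 figure concrete, here is a minimal sketch of how micro-averaged F1 is computed for multi-label predictions. The label names come from the taxonomy examples above, but the gold and predicted sets are invented for illustration and are not data from the paper; the function name `micro_f1` is likewise hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1: pool true/false positives and false negatives
    across all files and all labels, then compute a single F1."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # labels predicted but not in gold
        fn += len(gold_labels - pred_labels)   # gold labels that were missed
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Illustrative data only: one label set per context file.
gold = [{"Build & Run", "Testing"}, {"Security"}, {"Architecture", "Testing"}]
predicted = [{"Build & Run"}, {"Security", "Testing"}, {"Architecture", "Testing"}]

print(round(micro_f1(gold, predicted), 2))  # 0.8
```

Because counts are pooled before averaging, frequent labels dominate the micro score, which is one reason a single micro F1 can hide weak performance on rare, abstract categories.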
Key Findings
Collected 2,303 agent context files across 1,925 repositories.
Files are long and differ by tool: Copilot and Claude files are longer than Codex files.
Files are hard to read, especially those written for Claude.
Instruction content skews heavily to functional operations.
Non-functional requirements are rarely specified.
Manifests are actively maintained and updated in short bursts.
Evolution is driven by small additions rather than deletions.
Automatic multi-label classification is feasible for many concrete categories.
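The readability finding is reported as Flesch Reading Ease (FRE), where lower scores mean harder text. The sketch below applies the standard FRE formula with a deliberately naive syllable counter (runs of vowels). That counter is an assumption made for brevity, so scores will differ from those of established readability tools and from whatever tooling the authors used.

```python
import re


def count_syllables(word):
    # Naive heuristic (assumption): one syllable per run of vowels, minimum 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Standard FRE: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple = "The cat sat. The dog ran."
dense = "Comprehensive architectural documentation necessitates considerable interpretation."

# Short sentences with short words score higher (easier) than one dense sentence.
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))  # True
```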
Results
Corpus size
Median words per file
Readability (FRE median)
Prevalence of instruction types (top items)
Low-prevalence NFRs
Maintenance activity (multi-commit rate)
Update intervals (median)
Automatic classification performance
Who Should Care
What To Try In 7 Days
Scan your repo for agent context files (CLAUDE.md, AGENTS.md, copilot-instructions.md).
Add a short 'Non-functional requirements' section that lists mandatory security and performance rules.
Include context-file checks in PR templates: 'Did you update the agent manifest if build or API changed?' and require a CODEOWNER approval for manifest edits.
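The first two steps above can be sketched as a small script. This is a minimal illustration, not a vetted audit tool: the file names are the ones listed above, while the function names and the keyword check for non-functional topics are hypothetical simplifications (a keyword appearing anywhere in a file does not prove the rule is adequate).

```python
from pathlib import Path
import tempfile

# File names taken from the list above; matched case-insensitively.
CONTEXT_FILE_NAMES = {"claude.md", "agents.md", "copilot-instructions.md"}
# Assumed heuristic: a topic counts as covered if its keyword appears anywhere.
NFR_KEYWORDS = ("security", "performance")


def find_context_files(root):
    """Return agent context files under root, skipping the .git directory."""
    return sorted(
        p for p in Path(root).rglob("*.md")
        if p.name.lower() in CONTEXT_FILE_NAMES and ".git" not in p.parts
    )


def missing_nfr_topics(text):
    """Return the non-functional keywords that never appear in the text."""
    lowered = text.lower()
    return [kw for kw in NFR_KEYWORDS if kw not in lowered]


if __name__ == "__main__":
    # Demo on a throwaway directory; point find_context_files at a real repo instead.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "CLAUDE.md").write_text("## Build\nRun make.\n")
        for path in find_context_files(tmp):
            print(path.name, "missing:", missing_nfr_topics(path.read_text()))
```

A check like this could run in CI and fail the build when a context file mentions neither topic, which turns the "add a non-functional requirements section" advice into something enforceable.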
Agent Features
Memory
- persistent project-level context files (long-term memory)
Planning
- task decomposition
- multi-step planning
Tool Use
- IDE/tool invocation (run tests, execute scripts)
- CI/CD and build commands
Frameworks
- Claude Code
- OpenAI Codex
- GitHub Copilot
Is Agentic
true
Architectures
- LLM-based agents with tool use
- agents that combine memory, planning, and tool APIs
Collaboration
- human-in-the-loop review and code owners
Optimization Features
Token Efficiency
- compress context files or select only the relevant sections to reduce token cost
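One way to read the section-selection idea is sketched below: split a Markdown context file on its headings and keep only the sections that match the current task, up to a size budget. Everything here is an assumption made for illustration, including the function names, the keyword matching, and the use of word counts as a rough stand-in for tokens; it is not a method described in the paper.

```python
import re


def split_sections(markdown):
    """Split Markdown on ATX headings into (heading, body) pairs."""
    sections = []
    heading, lines = "(preamble)", []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            if lines or heading != "(preamble)":
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    sections.append((heading, "\n".join(lines).strip()))
    return sections


def select_sections(markdown, keywords, word_budget):
    """Keep headings of sections that mention a keyword and fit the word budget."""
    chosen, used = [], 0
    for heading, body in split_sections(markdown):
        text = f"{heading}\n{body}".lower()
        if not any(k.lower() in text for k in keywords):
            continue
        cost = len(body.split())  # word count as a crude proxy for tokens
        if used + cost > word_budget:
            continue
        chosen.append(heading)
        used += cost
    return chosen


doc = (
    "# Build\nRun make all.\n"
    "# Testing\nRun pytest before every commit.\n"
    "# Security\nNever commit secrets or tokens.\n"
)
print(select_sections(doc, ["test", "secur"], word_budget=100))  # ['Testing', 'Security']
```

A budget-based filter like this drops sections silently, so mandatory rules such as security constraints would need to be exempt from the budget in any real use.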
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Manual labels record only presence, not the depth of a topic (binary labeling may overstate importance).
- Dataset limited to public repos and three agent tools; private corpora may differ.
- Readability (FRE) and word counts are coarse proxies for human/agent comprehension.
- Automatic classification results reflect a single model (GPT-5) and a manually curated prompt; results may vary with other models or prompts.
When Not To Use
- Do not generalize prevalence numbers to private or enterprise-only repositories without further sampling.
- Avoid using the taxonomy as a strict checklist for highly domain-specific projects without tailoring.
Failure Modes
- Agents produce insecure or inefficient code if manifests omit NFRs like security and performance.
- Manifests may become append-only and contradictory if not versioned or reviewed.
- Automated classifiers can miss abstract or low-frequency instructions (maintenance, project management).
Core Entities
Models
- GPT-5
- Claude Opus 4.1
- Gemini 2.5 Pro
Metrics
- Flesch Reading Ease (FRE)
- Micro-average F1
- Precision/Recall/F1 per label
- Median word counts
- Mann-Whitney U, Cliff's delta
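Of the metrics listed, Cliff's delta is the least familiar: it is an effect size that reports how often values in one group exceed values in another, ranging from -1 to 1. A minimal sketch of the standard definition follows. The word counts are invented for illustration and are not figures from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical word counts for two groups of context files (not real data).
group_a = [500, 620, 710]
group_b = [300, 450, 640]

# 7 of 9 pairs have a > b and 2 have a < b, so delta = (7 - 2) / 9.
print(round(cliffs_delta(group_a, group_b), 3))  # 0.556
```

Paired with a Mann-Whitney U test, the delta indicates how large a difference is, while the test indicates whether it is likely to be due to chance.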
Datasets
- AIDev dataset (repos list used for selection)
- Replication package dataset (Agent-Context-File-Analysis)
Benchmarks
- SWE-bench (cited as example benchmark work)

