Open-source projects store agent instructions in special README-like files, but those files focus on how to run code and rarely specify non‑

November 17, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.6

Citation Count

0

Authors

Worawalan Chatlatanagulchai, Hao Li, Yutaro Kashiwa, Brittany Reid, Kundjanasith Thonglek, Pattara Leelaprute, Arnon Rungsawang, Bundit Manaskasemsak, Bram Adams, Ahmed E. Hassan, Hajimu Iida

Links

Abstract / PDF

Why It Matters For Business

Agent context files shape what AI coding agents do in your codebase. If they lack security or performance rules, agents will likely produce code that works but is vulnerable or inefficient. Treat these files like configuration and governance documents so agents follow team standards.

Summary TLDR

The authors analyze 2,303 agent context files (e.g., CLAUDE.md, AGENTS.md) from 1,925 repos to show these files are long, hard to read, actively maintained, and biased toward functional instructions (build, testing, implementation). Non-functional concerns like security and performance are rare. Automated labeling of these files is feasible (micro F1 0.79) for concrete topics but struggles with abstract guidance.

Problem Statement

AI coding agents rely on persistent, project-level instruction files to act correctly. We lack evidence about what those files contain, how they evolve, and whether we can automatically monitor them. Without that evidence, agents can be well‑informed about how to run code but poorly constrained on safety, performance, or quality.

Main Contribution

A large empirical corpus: 2,303 agent context files from 1,925 open-source repositories across Claude Code, OpenAI Codex, and GitHub Copilot.

A 16‑label taxonomy of agent instructions (e.g., Build & Run, Testing, Architecture, Security) and prevalence counts.

Maintenance analysis showing manifests are actively edited in short bursts and evolve via small additions.

A demonstration of automatic multi-label classification with GPT-5, achieving micro F1 = 0.79 across these categories, with higher scores on concrete topics.

Key Findings

Collected 2,303 agent context files across 1,925 repositories.

Numbers: 2,303 files; 1,925 repos

Files are long and differ by tool: Copilot and Claude files are longer than Codex.

Numbers: Median words: Copilot 535, Claude 485, Codex 335.5

Files are hard to read, especially for Claude.

Numbers: Flesch Reading Ease medians: Claude 16.6, Copilot 26.6, Codex 39.6
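For context on the scores above, Flesch Reading Ease is computed from average sentence length and average syllables per word, and lower values mean harder text. The following is a minimal sketch of the standard formula. The function name is hypothetical, the caller supplies the counts, and the tooling the authors used to count sentences and syllables is not stated here.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Illustrative counts only (not taken from the paper).
plain = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
dense = flesch_reading_ease(total_words=100, total_sentences=4, total_syllables=190)
print(round(plain, 1))  # 59.6
print(round(dense, 1))  # 20.7
```

Long sentences and polysyllabic technical terms both push the score down, which is one plausible reason instruction files full of identifiers and commands score low.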

Instruction content skews heavily to functional operations.

Numbers: Testing 75.0%, Implementation Details 69.9%, Architecture 67.7%, Build & Run 62.3%

Non-functional requirements are rarely specified.

Numbers: Security 14.5%, Performance 14.5%, UI/UX 8.7%
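Prevalence figures like these are shares of files that carry a label at least once. A minimal sketch of that computation follows, assuming each file has already been annotated with a set of labels. The function name and the four example files are hypothetical, and only the label names come from the taxonomy.

```python
from collections import Counter


def label_prevalence(file_labels):
    """Return {label: percentage of files that carry the label at least once}."""
    total_files = len(file_labels)
    if total_files == 0:
        return {}
    # set(labels) ensures a label counts once per file (presence, not depth).
    counts = Counter(label for labels in file_labels for label in set(labels))
    return {label: 100.0 * count / total_files for label, count in counts.items()}


# Hypothetical annotations for four files.
annotations = [
    {"Testing", "Build & Run"},
    {"Testing"},
    {"Testing", "Security"},
    {"Architecture"},
]
prevalence = label_prevalence(annotations)
print(prevalence["Testing"])   # 75.0
print(prevalence["Security"])  # 25.0
```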

Manifests are actively maintained and updated in short bursts.

Numbers: Multi-commit rate: Claude 67.4%; update intervals median: Claude 24.1h, Codex 22.0h, Copilot 70.7h
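An update-interval statistic of this kind can be derived from the commit timestamps of a single context file. The sketch below assumes the interval is the gap between consecutive commits touching the file. Whether the authors pooled gaps across files or aggregated per file is not stated here, and the function name is hypothetical.

```python
from datetime import datetime
from statistics import median


def median_update_interval_hours(commit_times):
    """Median gap, in hours, between consecutive commits to one file.

    Returns None when the file has fewer than two commits (no interval exists).
    """
    times = sorted(commit_times)
    if len(times) < 2:
        return None
    gaps = [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]
    return median(gaps)


# Hypothetical commit history for one file: gaps of 24h, 12h and 72h.
history = [
    datetime(2025, 1, 1, 0, 0),
    datetime(2025, 1, 2, 0, 0),
    datetime(2025, 1, 2, 12, 0),
    datetime(2025, 1, 5, 12, 0),
]
print(median_update_interval_hours(history))  # 24.0
```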

Evolution is driven by small additions rather than deletions.

Numbers: Claude median +57 words per update; deletions median <15 words

Automatic multi-label classification is feasible for many concrete categories.

Numbers: Micro-average F1 = 0.79; Testing F1 = 0.94; Architecture F1 = 0.93; Build & Run F1 = 0.92
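Micro-average F1 pools true positives, false positives and false negatives across all labels and files before computing a single score, so frequent labels weigh more than rare ones. A minimal sketch follows, assuming the gold and predicted labels for each file are given as sets. The function name and the three example files are hypothetical, and this is not the authors' evaluation script.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 for multi-label predictions given as lists of sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not in the gold set
        fn += len(truth - pred)   # gold labels the prediction missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical gold and predicted label sets for three files.
gold = [{"Testing", "Security"}, {"Build & Run"}, {"Architecture", "Testing"}]
pred = [{"Testing"}, {"Build & Run", "Performance"}, {"Architecture", "Testing"}]
print(micro_f1(gold, pred))  # 0.8 (tp=4, fp=1, fn=1)
```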

Results

Corpus size

Value: 2,303 agent context files from 1,925 repos

Median words per file

Value: Copilot 535, Claude 485, Codex 335.5

Readability (FRE median)

Value: Claude 16.6, Copilot 26.6, Codex 39.6

Prevalence of instruction types (top items)

Value: Testing 75.0%, Implementation Details 69.9%, Architecture 67.7%, Build & Run 62.3%

Low-prevalence NFRs

Value: Security 14.5%, Performance 14.5%, UI/UX 8.7%

Maintenance activity (multi-commit rate)

Value: Claude 67.4%, Copilot 59.7%, Codex 59.2%

Baseline: traditional READMEs, which are often write-once (cited prior work)

Update intervals (median)

Value: Claude 24.1h, Codex 22.0h, Copilot 70.7h

Automatic classification performance

Value: Micro-average F1 = 0.79

Who Should Care

What To Try In 7 Days

Scan your repo for agent context files (CLAUDE.md, AGENTS.md, copilot-instructions.md).

Add a short 'Non-functional requirements' section that lists mandatory security and performance rules.

Include context-file checks in PR templates: 'Did you update the agent manifest if build or API changed?' and require a CODEOWNER approval for manifest edits.
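The first two steps can be sketched as a small script that finds context files and flags those that never mention security or performance. This is a rough heuristic under stated assumptions: the file names are the three listed above, the keyword list and both function names are hypothetical, and a keyword match says nothing about how well a topic is covered.

```python
from pathlib import Path

# File names from the list above, compared case-insensitively.
CONTEXT_FILE_NAMES = {"claude.md", "agents.md", "copilot-instructions.md"}
# Assumption: a plain keyword match is a good-enough first signal.
NFR_KEYWORDS = ("security", "performance")


def find_context_files(repo_root):
    """Return agent context files under repo_root, skipping the .git directory."""
    root = Path(repo_root)
    return sorted(
        path for path in root.rglob("*.md")
        if path.name.lower() in CONTEXT_FILE_NAMES and ".git" not in path.parts
    )


def missing_nfr_topics(path):
    """Return the NFR keywords that never appear in the file."""
    text = Path(path).read_text(encoding="utf-8", errors="replace").lower()
    return [keyword for keyword in NFR_KEYWORDS if keyword not in text]


if __name__ == "__main__":
    for context_file in find_context_files("."):
        missing = missing_nfr_topics(context_file)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{context_file}: {status}")
```

Running it from the repository root prints one line per context file, which can serve as a starting checklist before adding the non-functional requirements section by hand.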

Agent Features

Memory

  • persistent project-level context files (long-term memory)

Planning

  • task decomposition
  • multi-step planning

Tool Use

  • IDE/tool invocation (run tests, execute scripts)
  • CI/CD and build commands

Frameworks

  • Claude Code
  • OpenAI Codex
  • GitHub Copilot

Is Agentic

true

Architectures

  • LLM-based agents with tool use
  • agents that combine memory, planning, and tool APIs

Collaboration

  • human-in-the-loop review and code owners

Optimization Features

Token Efficiency

  • recommend compressing or selecting sections to reduce token cost
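One way to read the section-selection idea is to pass an agent only the sections relevant to the current task. The sketch below is an illustration, not a method described in the paper. It assumes the context file uses Markdown headings, uses word count as a crude stand-in for token count, and both function names are hypothetical.

```python
import re


def split_sections(markdown_text):
    """Split text on Markdown headings into (heading, body) pairs."""
    sections = []
    heading, lines = "", []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            if heading or lines:
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if heading or lines:
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def select_sections(markdown_text, task_keywords, word_budget):
    """Return headings whose title matches a keyword and whose body fits the budget."""
    keywords = [keyword.lower() for keyword in task_keywords]
    chosen, used = [], 0
    for heading, body in split_sections(markdown_text):
        if not any(keyword in heading.lower() for keyword in keywords):
            continue
        cost = len(body.split())  # words as a rough proxy for tokens
        if used + cost > word_budget:
            continue
        chosen.append(heading)
        used += cost
    return chosen


# Hypothetical context file.
example = (
    "# Testing\nRun pytest before every commit.\n"
    "# Architecture\nServices talk over gRPC.\n"
    "# Security\nNever log secrets or tokens.\n"
)
print(select_sections(example, ["testing", "security"], word_budget=50))
# ['Testing', 'Security']
```

Dropping a section also drops its constraints, so any always-on rules (for example security requirements) would need to be included regardless of the task keywords.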

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Manual labels record only presence, not the depth of a topic (binary labeling may overstate importance).
  • Dataset limited to public repos and three agent tools; private corpora may differ.
  • Readability (FRE) and word counts are coarse proxies for human/agent comprehension.
  • Automatic classification results reflect a single model (GPT-5) and a manually curated prompt; results may vary with other models or prompts.

When Not To Use

  • Do not generalize prevalence numbers to private or enterprise-only repositories without further sampling.
  • Avoid using the taxonomy as a strict checklist for highly domain-specific projects without tailoring.

Failure Modes

  • Agents produce insecure or inefficient code if manifests omit NFRs like security and performance.
  • Manifests may become append-only and contradictory if not versioned or reviewed.
  • Automated classifiers can miss abstract or low-frequency instructions (maintenance, project management).

Core Entities

Models

  • GPT-5
  • Claude Opus 4.1
  • Gemini 2.5 Pro

Metrics

  • Flesch Reading Ease (FRE)
  • Micro-average F1
  • Precision/Recall/F1 per label
  • Median word counts
  • Mann-Whitney U, Cliff's delta
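Of the metrics listed, Cliff's delta is the least familiar: it is an effect size that reports how often values in one group exceed values in another, ranging from -1 to 1. A minimal sketch of the standard definition follows. The function name and the sample values are hypothetical and are not data from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical samples: 7 pairs favor xs, 1 favors ys, 1 tie.
delta = cliffs_delta([5, 6, 7], [1, 2, 6])
print(round(delta, 3))  # 0.667
```

A value near 0 means the two groups overlap heavily even when a significance test such as Mann-Whitney U reports a difference, which is why the two are commonly reported together.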

Datasets

  • AIDev dataset (repos list used for selection)
  • Replication package dataset (Agent-Context-File-Analysis)

Benchmarks

  • SWE-bench (cited as example benchmark work)