Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
Agentic systems are moving into products, and while developers commonly publish capability documentation, few publish safety disclosures. Verify a system's safety practices yourself before integrating it.
Summary TLDR
The authors build and publish the AI Agent Index: a curated dataset of 67 deployed "agentic" AI systems (agents that plan and act). For each system they record technical components, intended uses, and safety practices from public sources and developer correspondence. Key takeaways: most developers publish documentation (47/67, 70.1%) and about half release code (33/67, 49.3%), but few disclose formal safety policies (13/67, 19.4%) or report external safety evaluations (6/67, 9.0%). The index and raw data are available online; the index is a snapshot as of Dec 31, 2024.
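The percentages quoted above follow directly from the reported counts. A minimal sketch that recomputes them, assuming only the counts stated in the summary (the dictionary keys reuse the metric labels from the Results section; the helper name `share` is illustrative):

```python
TOTAL_SYSTEMS = 67  # number of indexed systems reported in the summary

# Counts as reported in the summary; nothing here is new data.
counts = {
    "public_documentation": 47,
    "code_released": 33,
    "formal_safety_policy": 13,
    "external_safety_evaluations": 6,
}


def share(count, total=TOTAL_SYSTEMS):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(n) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]}/{TOTAL_SYSTEMS} = {pct}%")
```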
Problem Statement
There is no structured, public framework documenting deployed agentic AI systems' technical design, uses, and safety practices. That gap makes it hard for users, auditors, and policymakers to compare systems, assess risks, or design governance.
Main Contribution
A structured template (33 fields) for recording technical, safety, and policy-relevant features of deployed agentic systems.
A public index of 67 deployed agentic systems (snapshot as of Dec 31, 2024) summarizing components, domains, openness, and safety practices.
An analysis that highlights transparency gaps, especially low public disclosure of safety testing and guardrails, and policy suggestions to improve oversight.
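To make the idea of a structured template concrete, here is a rough sketch of what one index entry could look like as a data structure. The field names below are hypothetical and cover only a handful of attributes; they are not the paper's actual 33-field template. `None` is used to mean "no public information found", which is the kind of gap the analysis counts.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class AgentCard:
    """Illustrative subset of an index entry (hypothetical field names)."""
    name: str
    developer: str
    domain: Optional[str] = None
    documentation_public: Optional[bool] = None
    code_released: Optional[bool] = None
    safety_policy_public: Optional[bool] = None
    external_safety_evaluation: Optional[bool] = None
    tools: list[str] = field(default_factory=list)


def undisclosed(card: AgentCard) -> list[str]:
    """Return the names of fields with no recorded public information."""
    return [f.name for f in fields(card) if getattr(card, f.name) is None]


# Made-up example entry, not a system from the index.
card = AgentCard(
    name="ExampleAgent",
    developer="Example Co",
    documentation_public=True,
    code_released=False,
)
print(undisclosed(card))
```

Keeping "unknown" distinct from "no" matters here: a card that records `False` for code release says something different from a card where the question could not be answered from public sources.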
Key Findings
The index catalogs 67 deployed agentic AI systems.
Most developers publish documentation and many release code.
Public disclosure of safety and testing is scarce.
Most agents focus on software engineering or computer use.
Results
indexed_systems: 67
public_documentation: 47/67 (70.1%)
code_released: 33/67 (49.3%)
formal_safety_policy: 13/67 (19.4%)
external_safety_evaluations: 6/67 (9.0%)
domain_focus_software_or_computer_use
Who Should Care
What To Try In 7 Days
Browse the index (aiagentindex.mit.edu) and spot agents similar to your use case.
If evaluating an external agent, request its safety policy and audit reports before production.
Run a short red-team or jailbreak test focused on your critical workflows and data flows.
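The second and third steps can be tracked as a simple pre-production gate. The sketch below assumes a checklist of three evidence items drawn from the steps above; the item names and function names are illustrative, not part of the index or any standard.

```python
# Evidence to collect before an external agent goes to production.
# Item names are hypothetical labels for the steps listed above.
REQUIRED_EVIDENCE = (
    "safety_policy",
    "external_audit_report",
    "red_team_results",
)


def outstanding_items(evidence: dict[str, bool]) -> list[str]:
    """Return required items that are missing or marked as not received."""
    return [item for item in REQUIRED_EVIDENCE if not evidence.get(item, False)]


def ready_for_production(evidence: dict[str, bool]) -> bool:
    """True only when every required item has been received."""
    return not outstanding_items(evidence)


# Example: the vendor shared a policy, but no audit report exists yet
# and the internal red-team run has not been done.
vendor_evidence = {"safety_policy": True, "external_audit_report": False}
print(outstanding_items(vendor_evidence))
print(ready_for_production(vendor_evidence))
```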
Agent Features
Memory
- internal model weights
- external storage modules for recall
Planning
- chain-of-thought style planning
- orchestrator-driven multi-step plans
Tool Use
- web browsing and posting
- filesystem access and code execution
- API calls to external services
Frameworks
- AutoGen
- Magentic-One (example multi-agent system)
Is Agentic
true
Architectures
- foundation model + scaffolding (reasoning, planning, memory, tools)
- multi-agent orchestration (orchestrator + subagents)
Collaboration
- multi-agent cooperation via orchestrator
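The orchestrator-plus-subagents pattern named above can be sketched in a few lines. This is a toy illustration in plain Python, not the API of AutoGen or Magentic-One: the subagents are stub functions, the plan is hard-coded where a real system would have a foundation model produce it, and all names (`Orchestrator`, `web_surfer`, `coder`) are hypothetical.

```python
from typing import Callable

# A subagent is modeled as any function that takes a task and returns a result.
Subagent = Callable[[str], str]


def web_surfer(task: str) -> str:
    """Stub standing in for a browsing subagent."""
    return f"[web] looked up: {task}"


def coder(task: str) -> str:
    """Stub standing in for a code-writing subagent."""
    return f"[code] wrote script for: {task}"


class Orchestrator:
    """Routes each step of a plan to a named subagent and keeps a log."""

    def __init__(self, subagents: dict[str, Subagent]):
        self.subagents = subagents
        self.log: list[str] = []  # simple external memory of step results

    def run(self, plan: list[tuple[str, str]]) -> list[str]:
        for agent_name, task in plan:
            if agent_name not in self.subagents:
                # Record the gap instead of failing the whole plan.
                self.log.append(
                    f"[orchestrator] no subagent named {agent_name!r}; skipped: {task}"
                )
                continue
            self.log.append(self.subagents[agent_name](task))
        return self.log


orchestrator = Orchestrator({"web": web_surfer, "code": coder})
results = orchestrator.run([
    ("web", "find the API docs"),
    ("code", "parse the docs"),
    ("db", "store the output"),  # no such subagent registered
])
for line in results:
    print(line)
```

The log doubles as the kind of audit trail that makes an agent's actions reviewable after the fact, which is relevant to the transparency gaps the index highlights.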
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Definition of 'agent' is loose and contested; inclusion choices can be subjective.
- Snapshot limited to systems available or announced by Dec 31, 2024; field moves fast.
- Index likely over-represents public and English-language agents and undercounts internal or non-English deployments.
- Developer feedback rate was 36%, so some card fields may miss internal safety practices.
When Not To Use
- When you need exhaustive or up-to-date coverage of every deployed agentic system.
- When assessing internal-only agents or non-English systems not included in the index.
Failure Modes
- Selective disclosure by developers can give a false sense of safety.
- Index snapshot can become outdated quickly as new agents and audits appear.
- Over-reliance on public documentation can miss private safety controls or incidents.
Core Entities
Models
- gpt-4o
- OpenAI o1
- ChatGPT-4o
- Llama-3.2-90B-Vision-Instruct
Metrics
- percent_with_documentation
- percent_code_release
- percent_with_safety_policy
- percent_with_external_audit
Benchmarks
- GAIA
- SWE-bench
- WebArena
- AssistantBench
- SWE-bench Verified
Context Entities
Models
- foundation models (general reference)
Benchmarks
- SWE-bench
- GAIA
- WebArena
- AssistantBench

