A public index of 67 deployed agentic AI systems, showing that capability documentation is common but safety disclosure is sparse.

February 3, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

3

Authors

Stephen Casper, Luke Bailey, Rosco Hunter, Carson Ezell, Emma Cabalé, Michael Gerovitch, Stewart Slocum, Kevin Wei, Nikola Jurkovic, Ariba Khan, Phillip J. K. Christoffersen, A. Pinar Ozisik, Rakshit Trivedi, Dylan Hadfield-Menell, Noam Kolt

Links

Abstract / PDF

Why It Matters For Business

Agentic systems are moving into products. Public capability docs are common but safety disclosures are rare, so verify a vendor's safety practices before you integrate its agent.

Summary TLDR

The authors build and publish the AI Agent Index: a curated dataset of 67 deployed "agentic" AI systems (agents that plan and act). For each system they record technical components, intended uses, and safety practices from public sources and developer correspondence. Key takeaways: most developers publish documentation (47/67, 70.1%) and many release code (33/67, 49.3%), but few disclose formal safety policies (13/67, 19.4%) or report external safety audits (6/67, 9%). The index and raw data are available online; the paper is a snapshot as of Dec 31, 2024.

Problem Statement

There is no structured, public framework documenting deployed agentic AI systems' technical design, uses, and safety practices. That gap makes it hard for users, auditors, and policymakers to compare systems, assess risks, or design governance.

Main Contribution

A structured template (33 fields) for recording technical, safety, and policy-relevant features of deployed agentic systems.

A public index of 67 deployed agentic systems (snapshot as of Dec 31, 2024) summarizing components, domains, openness, and safety practices.

An analysis that highlights transparency gaps, especially low public disclosure of safety testing and guardrails, and policy suggestions to improve oversight.
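To make the template idea concrete, here is a minimal sketch of how a single index entry might be represented. The field names are hypothetical placeholders, not the paper's actual 33 fields; they only mirror the categories named above (technical components, intended uses, openness, and safety practices).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentCard:
    """Illustrative record for one indexed system (field names are hypothetical)."""
    name: str
    developer: str
    # Technical components
    base_model: Optional[str] = None
    tools: list[str] = field(default_factory=list)
    # Intended uses
    domains: list[str] = field(default_factory=list)
    # Openness and safety practices, as found in public sources
    has_public_docs: bool = False
    code_released: bool = False
    has_safety_policy: bool = False
    has_external_audit: bool = False

    def disclosure_gaps(self) -> list[str]:
        """Return the openness/safety items this card does not disclose."""
        checks = {
            "public documentation": self.has_public_docs,
            "released code": self.code_released,
            "formal safety policy": self.has_safety_policy,
            "external safety audit": self.has_external_audit,
        }
        return [label for label, present in checks.items() if not present]


# A made-up entry: documented and open source, but no safety disclosures.
card = AgentCard(name="ExampleAgent", developer="Example Labs",
                 has_public_docs=True, code_released=True)
print(card.disclosure_gaps())
```

A structured record like this is what lets systems be compared field by field rather than through free-form descriptions.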

Key Findings

The index catalogs 67 deployed agentic AI systems.

Numbers: n = 67

Most developers publish documentation and many release code.

Numbers: 47/67 (70.1%) docs; 33/67 (49.3%) code

Public disclosure of safety and testing is scarce.

Numbers: 13/67 (19.4%) formal safety policies; 6/67 (9%) external audits

Most agents focus on software engineering or computer use.

Numbers: 50/67 (74.6%)

Results

indexed_systems

Value: 67 systems

public_documentation

Value: 70.1%

code_released

Value: 49.3%

formal_safety_policy

Value: 19.4%

external_safety_evaluations

Value: 9%

domain_focus_software_or_computer_use

Value: 74.6%
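The percentages in this section follow directly from the counts reported in the summary, each out of 67 systems. A short sketch of the arithmetic, using the section's metric names as dictionary keys (the helper name `share` is ours):

```python
TOTAL = 67  # systems in the index snapshot

# Counts as reported in the summary and key findings.
counts = {
    "public_documentation": 47,
    "code_released": 33,
    "formal_safety_policy": 13,
    "external_safety_evaluations": 6,
    "domain_focus_software_or_computer_use": 50,
}


def share(count: int, total: int = TOTAL) -> float:
    """Percentage of indexed systems, rounded to one decimal place."""
    return round(100 * count / total, 1)


percentages = {name: share(count) for name, count in counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {counts[name]}/{TOTAL} = {pct}%")
```

Note that 6/67 works out to roughly 8.96%, which the source rounds to 9%.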

Who Should Care

What To Try In 7 Days

Browse the index (aiagentindex.mit.edu) and spot agents similar to your use case.

If evaluating an external agent, request its safety policy and audit reports before production.

Run a short red-team or jailbreak test focused on your critical workflows and data flows.
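The steps above can be turned into a simple go/no-go gate before production. The following is only a sketch under our own assumptions: the class, its fields, and the blocking rules are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class VendorEvidence:
    """What you have collected about an external agent (fields are illustrative)."""
    safety_policy_received: bool
    audit_report_received: bool
    red_team_findings_open: int  # unresolved issues from your own short test


def blocking_items(ev: VendorEvidence) -> list[str]:
    """List what still stands between the agent and production use."""
    items = []
    if not ev.safety_policy_received:
        items.append("request the vendor's safety policy")
    if not ev.audit_report_received:
        items.append("request external audit reports")
    if ev.red_team_findings_open > 0:
        items.append(f"resolve {ev.red_team_findings_open} open red-team finding(s)")
    return items


evidence = VendorEvidence(safety_policy_received=True,
                          audit_report_received=False,
                          red_team_findings_open=2)
for item in blocking_items(evidence):
    print("BLOCKED:", item)
```

An empty list means none of the three checks is outstanding; whether that is sufficient for your deployment is a judgment the sketch does not make.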

Agent Features

Memory

  • internal model weights
  • external storage modules for recall

Planning

  • chain-of-thought style planning
  • orchestrator-driven multi-step plans

Tool Use

  • web browsing and posting
  • filesystem access and code execution
  • API calls to external services

Frameworks

  • AutoGen
  • Magentic One (example multi-agent system)

Is Agentic

true

Architectures

  • foundation model + scaffolding (reasoning, planning, memory, tools)
  • multi-agent orchestration (orchestrator + subagents)

Collaboration

  • multi-agent cooperation via orchestrator
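As a rough illustration of the orchestrator-plus-subagents pattern listed above, here is a toy sketch. It does not use the AutoGen or Magentic One APIs; the subagents are stub functions and the plan is hard-coded, whereas in a real system a foundation model would typically produce the plan and the subagents would call actual tools.

```python
from typing import Callable

# A subagent is modeled as a function from a task string to a result string.
Subagent = Callable[[str], str]


def web_agent(task: str) -> str:
    """Stub standing in for a browsing-capable subagent."""
    return f"[web] looked up: {task}"


def code_agent(task: str) -> str:
    """Stub standing in for a code-execution subagent."""
    return f"[code] executed: {task}"


class Orchestrator:
    """Routes each step of a multi-step plan to a named subagent (toy example)."""

    def __init__(self, subagents: dict[str, Subagent]):
        self.subagents = subagents

    def run(self, plan: list[tuple[str, str]]) -> list[str]:
        results = []
        for agent_name, task in plan:
            # Dispatch the step to the subagent registered under agent_name.
            results.append(self.subagents[agent_name](task))
        return results


orchestrator = Orchestrator({"web": web_agent, "code": code_agent})
# Hard-coded plan; an assumption made purely for illustration.
plan = [("web", "find the failing test"), ("code", "run the test suite")]
results = orchestrator.run(plan)
print(results)
```

Keeping the routing in one place is also where safety controls such as tool allow-lists or logging would naturally sit, which is part of why disclosure of these components matters.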

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Definition of 'agent' is loose and contested; inclusion choices can be subjective.
  • Snapshot limited to systems available or announced by Dec 31, 2024; field moves fast.
  • Index likely over-represents public and English-language agents and undercounts internal or non-English deployments.
  • Developer feedback rate was 36%, so some card fields may miss internal safety practices.

When Not To Use

  • When you need exhaustive or up-to-date coverage of every deployed agentic system.
  • When assessing internal-only agents or non-English systems not included in the index.

Failure Modes

  • Selective disclosure by developers can give a false sense of safety.
  • Index snapshot can become outdated quickly as new agents and audits appear.
  • Over-reliance on public documentation can miss private safety controls or incidents.

Core Entities

Models

  • gpt-4o
  • OpenAI o1
  • ChatGPT-4o
  • Llama-3.2-90B-Vision-Instruct

Metrics

  • percent_with_documentation
  • percent_code_release
  • percent_with_safety_policy
  • percent_with_external_audit

Benchmarks

  • GAIA
  • SWE-bench
  • WebArena
  • AssistantBench
  • SWE-bench Verified

Context Entities

Models

  • foundation models (general reference)

Benchmarks

  • SWE-bench
  • GAIA
  • WebArena
  • AssistantBench