Overview
The benchmark is mature enough to surface practical failures and guide fixes, but agents and defenses still need engineering work. Results are backed by 2,000 test cases and 4,000 labeled interaction records.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Yes
License: MIT (benchmark and code)
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 55%
Why It Matters For Business
Tool-using LLM agents can take costly or unsafe actions; organizations should not deploy them without runtime checks, human-in-the-loop gating, and evaluation tailored to agent behavior.
Summary TLDR
The paper introduces Agent-SafetyBench, a large interactive benchmark for agent safety with 349 environments, 2,000 test cases, 8 safety categories, and 10 failure modes. The authors evaluate 16 popular LLM agents with tool capabilities and find poor safety: no agent scores above 60%, and the average safety score is 38.5%. They release the dataset and code, provide a finetuned scorer (a Qwen-2.5-7B variant) that improves evaluation accuracy, and show that simple defense prompts yield only limited gains. The key practical gaps are agents' lack of robustness in tool use and lack of risk awareness when interacting with environments.
Problem Statement
Existing safety benchmarks focus on text-only risks. Agents that call tools and act in environments create new behavioral safety failures (wrong tool calls, unsafe parameter choices, failing to validate tool outputs). There is no large, systematic benchmark measuring these behavior-level agent risks.
Main Contribution
Agent-SafetyBench: an interactive benchmark with 349 environments, 2,000 test cases, 8 safety categories and 10 annotated failure modes.
A finetuned scorer (based on Qwen-2.5-7B) trained on 4,000 labeled interaction records; it yields much higher evaluation accuracy than using GPT-4o directly as the judge.
Key Findings
No tested LLM agent exceeds 60% safety on Agent-SafetyBench.
Average safety across the 16 tested agents is 38.5%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Best model safety (total) | 59.8% | — | — | Agent-SafetyBench (all cases) | Table 5 (Claude-3-Opus = 59.8%) | Table 5 |
| Average safety (total) | 38.5% | — | — | Agent-SafetyBench (all cases) | Table 5 (Average row) | Table 5 |
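The safety percentages above can be read as the share of test cases judged safe. The following is a minimal sketch of that aggregation, assuming each scored interaction is reduced to a hypothetical `(agent, is_safe)` pair. The record layout, the agent names, and the sample labels are illustrative and are not the benchmark's actual schema or data; treating the average as an unweighted mean over agents is also an assumption.

```python
from collections import defaultdict


def safety_scores(records):
    """Return {agent: percent of cases labeled safe}.

    `records` is an iterable of (agent_name, is_safe) pairs. This layout is
    an assumption for illustration, not the benchmark's real schema.
    """
    safe = defaultdict(int)
    total = defaultdict(int)
    for agent, is_safe in records:
        total[agent] += 1
        if is_safe:
            safe[agent] += 1
    return {agent: 100.0 * safe[agent] / total[agent] for agent in total}


def average_score(scores):
    """Unweighted mean of per-agent scores (one way an average row could be formed)."""
    return sum(scores.values()) / len(scores)


# Made-up labels for two hypothetical agents.
demo = [
    ("agent_a", True),
    ("agent_a", False),
    ("agent_b", False),
    ("agent_b", False),
]
scores = safety_scores(demo)
```

Calling `average_score(scores)` on the per-agent dictionary gives a single summary number; a per-category breakdown would follow the same pattern with the category as the grouping key.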
What To Try In 7 Days
Run a small subset of your agent workflows through Agent-SafetyBench to map weak failure modes.
Add pre-action validators: require explicit confirmations or parameter checks before each tool call.
Use the finetuned scorer or human review on high-risk flows rather than only relying on off-the-shelf judge models.
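A pre-action validator of the kind suggested above could look roughly like the sketch below. It is not part of Agent-SafetyBench; `ToolSpec`, `check_call`, and the `confirm` callback are hypothetical names introduced here, and the rule set (required parameters plus a confirmation gate for high-risk tools) is only one possible design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of a tool the agent may call."""
    name: str
    required: tuple          # parameter names that must be present and non-empty
    high_risk: bool = False  # if True, an explicit confirmation is needed


def check_call(spec, params, confirm=None):
    """Decide whether a proposed tool call may run.

    Returns (allowed, reason). `confirm` is an optional callback
    (tool_name, params) -> bool standing in for a human or policy check.
    """
    missing = [k for k in spec.required if params.get(k) in (None, "")]
    if missing:
        return False, "missing parameters: " + ", ".join(missing)
    if spec.high_risk:
        if confirm is None or not confirm(spec.name, params):
            return False, spec.name + " requires explicit confirmation"
    return True, "ok"


# Illustrative tool definitions (assumptions, not taken from the benchmark).
transfer = ToolSpec("transfer_funds", ("account", "amount"), high_risk=True)
search = ToolSpec("web_search", ("query",))
```

In use, the agent loop would call `check_call` before dispatching each tool call and either ask the user for the missing details or route the call to human review when it is blocked. Returning a reason string rather than raising keeps the gate easy to log and to surface back to the agent.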
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Risks & Boundaries
Limitations
Most cases rely on commonsense reasoning; advanced domain-specific scenarios are not covered.
A portion of augmented cases required manual revision; automated generation quality is mixed.
When Not To Use
As the sole safety check for high-stakes domain deployments (medical, legal, critical infrastructure).
To evaluate domain-expert knowledge or specialized procedures that require external certification.
Failure Modes
Generate harmful content without tools (M1)
Call tools with incomplete information or fabricated parameters (M2, M3)
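Fabricated parameters are hard to detect in general, but a crude grounding check can catch some of them. The sketch below flags parameter values that never appear in the text the user actually supplied. It is a heuristic written for illustration, not the benchmark's detection method; the function name `ungrounded_params` and the plain substring matching are assumptions, and values that are legitimately derived or reformatted would be flagged as well.

```python
def ungrounded_params(params, context):
    """Return the sorted names of parameters whose values never appear in `context`.

    `params` is the argument dict of a proposed tool call and `context` is the
    user-provided text. Matching is case-insensitive substring search, which
    is deliberately simple and will miss paraphrased values.
    """
    haystack = context.lower()
    return sorted(
        name
        for name, value in params.items()
        if str(value).lower() not in haystack
    )


# Illustrative request and proposed call (made-up data).
request = "Send 50 dollars to Alice"
proposed = {"recipient": "Alice", "amount": 50, "email": "bob@example.com"}
flagged = ungrounded_params(proposed, request)
```

A flagged parameter does not prove fabrication; it is a signal to ask the user for the value or to escalate the call for review.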

