Agent-SafetyBench: 2,000 agent tests across 349 environments — no tested agent exceeds 60% safety

December 19, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is mature enough to surface practical failures and guide fixes, but agents and defenses still need engineering work; results are backed by 2,000 cases and labeled interactions.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT (benchmark and code)

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 55%

Authors

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Tool-using LLM agents can take costly or unsafe actions; organizations should not deploy them without runtime checks, human-in-the-loop gating, and evaluation tailored to agent behavior.

Who Should Care

Summary TLDR

The paper introduces Agent-SafetyBench, a large interactive benchmark for agent safety: 349 environments, 2,000 test cases, 8 safety categories and 10 failure modes. The authors run 16 popular LLM agents with tool capabilities and find poor safety: no agent scores above 60%, and the average safety score is 38.5%. They release the dataset and code, provide a finetuned scorer (a Qwen-2.5-7B variant) that improves evaluation accuracy, and show that simple defense prompts give limited gains. The key practical gaps are agents' lack of robustness in tool use and lack of risk awareness when interacting with environments.

Problem Statement

Existing safety benchmarks focus on text-only risks. Agents that call tools and act in environments create new behavioral safety failures (wrong tool calls, unsafe parameter choices, failing to validate tool outputs). There is no large, systematic benchmark measuring these behavior-level agent risks.

Main Contribution

Agent-SafetyBench: an interactive benchmark with 349 environments, 2,000 test cases, 8 safety categories and 10 annotated failure modes.

A finetuned scorer (based on Qwen-2.5-7B) trained on 4,000 labeled interaction records; it yields much higher evaluation accuracy than directly using GPT-4o.
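To make the accuracy comparison concrete, here is a minimal sketch of how a judge's agreement with human labels could be measured. The function name `judge_accuracy` and the toy labels are hypothetical; they are not taken from the paper or its code release, and the paper's own evaluation protocol may differ.

```python
from typing import Sequence


def judge_accuracy(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of records where the judge's verdict matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("need at least one labeled record")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


# Toy labels (True = safe), invented for illustration only.
human_labels = [True, False, False, True, False]
judge_labels = [True, False, True, True, False]
accuracy = judge_accuracy(human_labels, judge_labels)
```

Running the same function once per candidate judge (for example, a finetuned scorer and an off-the-shelf model) on a shared human-labeled set gives directly comparable numbers.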

Key Findings

No tested LLM agent exceeds 60% safety on Agent-SafetyBench.

Numbers: Best model 59.8% (Claude-3-Opus); all <60%

Practical Use: Do not assume current tool-using agents are safe in realistic interactive tasks; plan engineering safeguards and human oversight for any deployment.

Evidence Ref: Table 5

Average safety across agents is low.

Numbers: Average total safety = 38.5%

Practical Use: The benchmark-level average suggests substantial engineering work (policy, validation, or retraining) is required before using agents in risky domains.

Evidence Ref: Table 5

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Best model safety (total) | 59.8% | | | Agent-SafetyBench (all cases) | Table 5 (Claude-3-Opus = 59.8%) | Table 5 |
| Average safety (total) | 38.5% | | | Agent-SafetyBench (all cases) | Table 5 (Average row) | Table 5 |
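The scores above can be read as the share of test cases whose interaction was judged safe, averaged across agents. The sketch below assumes that reading; `safety_score`, the agent names, and the verdicts are hypothetical and are not real benchmark output.

```python
from statistics import mean
from typing import Sequence


def safety_score(verdicts: Sequence[bool]) -> float:
    """Percentage of test cases whose interaction was judged safe."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical per-agent verdicts (True = safe), for illustration only.
verdicts_by_agent = {
    "agent_a": [True, False, True, False],
    "agent_b": [False, False, True, False],
}
scores = {name: safety_score(v) for name, v in verdicts_by_agent.items()}
average = mean(list(scores.values()))
```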

What To Try In 7 Days

Run your agent on a small subset of Agent-SafetyBench to see which failure modes it is weakest on.

Add pre-action validators: require explicit confirmations or parameter checks before each tool call.

Use the finetuned scorer or human review on high-risk flows rather than relying only on off-the-shelf judge models.
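One way to prototype the pre-action validator idea is a thin wrapper that every tool call must pass through. This is a minimal sketch under stated assumptions: `ToolGate`, the `confirm` callback, and the tool names are all hypothetical, and a real deployment would route `confirm` to a human reviewer or a policy engine.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class ToolGate:
    """Wraps tool execution so high-risk calls need explicit approval."""
    high_risk_tools: Set[str]
    confirm: Callable[[str, Dict[str, Any]], bool]
    blocked: List[str] = field(default_factory=list)

    def call(self, name: str, args: Dict[str, Any],
             tool: Callable[..., Any]) -> Dict[str, Any]:
        # Ask for confirmation only when the tool is flagged as high risk.
        if name in self.high_risk_tools and not self.confirm(name, args):
            self.blocked.append(name)
            return {"status": "blocked", "reason": "confirmation denied"}
        return {"status": "ok", "result": tool(**args)}


def deny_all(name: str, args: Dict[str, Any]) -> bool:
    """Stand-in reviewer that rejects every high-risk call."""
    return False


gate = ToolGate(high_risk_tools={"delete_file"}, confirm=deny_all)
blocked = gate.call("delete_file", {"path": "/tmp/report.txt"},
                    lambda path: f"deleted {path}")
allowed = gate.call("read_file", {"path": "/tmp/report.txt"},
                    lambda path: f"read {path}")
```

Keeping the gate outside the agent means the check still applies when the model itself fails to recognize a risky action.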

Agent Features

Memory: short-term interaction history used during planning
Planning: sequential tool-calling loop (analyze -> call -> observe -> repeat)
Tool Use: function-call style tools (JSON schema); simulated external APIs (Python environment)
Frameworks: JSON tool schema + Python class environments (OpenAI/Claude compatible)
Is Agentic: Yes
Architectures: LLM-based agent with tool calls
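The setup described above (a JSON tool schema, a Python class environment, and an analyze -> call -> observe loop) can be sketched roughly as follows. Everything here is illustrative: `EmailEnv`, `SEND_EMAIL_SCHEMA`, `run_agent`, and the scripted policy standing in for an LLM are assumptions, not the benchmark's actual classes or API.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical tool description in function-call style (JSON schema).
SEND_EMAIL_SCHEMA: Dict[str, Any] = {
    "name": "send_email",
    "description": "Send an email to a recipient.",
    "parameters": {
        "type": "object",
        "properties": {"to": {"type": "string"}, "body": {"type": "string"}},
        "required": ["to", "body"],
    },
}


class EmailEnv:
    """Toy simulated environment; each tool is a plain method."""

    def __init__(self) -> None:
        self.outbox: List[Dict[str, str]] = []

    def send_email(self, to: str, body: str) -> Dict[str, Any]:
        self.outbox.append({"to": to, "body": body})
        return {"success": True}


Action = Optional[Dict[str, Any]]
Policy = Callable[[List[Dict[str, Any]], List[Dict[str, Any]]], Action]


def run_agent(policy: Policy, env: Any, tools: List[Dict[str, Any]],
              max_steps: int = 5) -> List[Dict[str, Any]]:
    """Analyze -> call -> observe, repeated until the policy stops."""
    history: List[Dict[str, Any]] = []
    for _ in range(max_steps):
        action = policy(history, tools)                 # analyze
        if action is None:
            break
        tool = getattr(env, action["name"])
        observation = tool(**action["arguments"])       # call
        history.append({"action": action, "observation": observation})  # observe
    return history


def scripted_policy(history: List[Dict[str, Any]],
                    tools: List[Dict[str, Any]]) -> Action:
    """Stand-in for an LLM: calls the first tool once, then stops."""
    if history:
        return None
    return {"name": tools[0]["name"],
            "arguments": {"to": "a@example.com", "body": "hi"}}


env = EmailEnv()
trace = run_agent(scripted_policy, env, [SEND_EMAIL_SCHEMA])
```

The returned `trace` is the kind of interaction record a scorer or human reviewer would then label as safe or unsafe.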

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: MIT (benchmark and code)

Risks & Boundaries

Limitations

Most cases rely on commonsense reasoning; advanced domain-specific scenarios are not covered.

A portion of augmented cases required manual revision; automated generation quality is mixed.

When Not To Use

As the sole safety check for high-stakes domain deployments (medical, legal, critical infrastructure).

To evaluate domain-expert knowledge or specialized procedures that require external certification.

Failure Modes

Generate harmful content without tools (M1)

Call tools with incomplete information or fabricated parameters (M2, M3)
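Calls made with incomplete information or fabricated parameters can be partly caught by simple checks before execution. The sketch below is a crude heuristic and not the paper's method: `missing_required` and `ungrounded_values` are hypothetical helpers, and the substring test for grounding will miss paraphrased or derived values.

```python
from typing import Any, Dict, List


def missing_required(schema: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Required parameters that are absent or empty in the call."""
    required = schema.get("parameters", {}).get("required", [])
    return [p for p in required if args.get(p) in (None, "")]


def ungrounded_values(args: Dict[str, Any], context: str) -> List[str]:
    """Parameters whose string value never appears in the context.

    A rough proxy for fabricated parameters; expect false positives.
    """
    return [k for k, v in args.items()
            if isinstance(v, str) and v and v not in context]


# Hypothetical tool schema, user request, and agent call.
schema = {"name": "transfer",
          "parameters": {"required": ["account", "amount"]}}
context = "Please transfer 50 dollars to my savings."
args = {"account": "ACC-9921"}

missing = missing_required(schema, args)
ungrounded = ungrounded_values(args, context)
```

Here the call omits `amount` and supplies an account identifier the user never mentioned, so both checks flag it for review.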

Core Entities

Models

Claude-3-Opus, Claude-3.5-Sonnet, Claude-3.5-Haiku, GPT-4o, GPT-4-Turbo, Gemini-1.5-Flash, Gemini-1.5-Pro, Qwen2.5-72B-Instruct, GLM4-9B-Chat, Llama3.1-405B-Instruct, DeepSeek-V2.5, Qwen2.5-14B-Instruct, GPT-4o-mini, Llama3.1-70B-Instruct, Llama3.1-8B-Instruct, Qwen2.5-7B-Instruct

Metrics

Safety Score (%), Failure-mode specific safety (%), Accuracy

Datasets

Agent-SafetyBench (this work), R-Judge, AgentDojo, GuardAgent, ToolEmu, ToolSword, InjecAgent, AdvBench

Benchmarks

AGENT-SAFETYBENCH