Agent-SafetyBench: 2,000 agent tests across 349 environments — no tested agent exceeds 60% safety

Overview

Decision SnapshotNeeds Validation

The benchmark is mature enough to surface practical failures and guide fixes, but agents and defenses still need engineering work; results are backed by 2,000 cases and labeled interactions.

Citations2

Evidence Strength0.80

Confidence0.80

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT (benchmark and code)

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 55%

Authors

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Tool-using LLM agents can make costly or unsafe actions; organizations should not deploy them without runtime checks, human-in-loop gating, and improved evaluation tailored to agent behavior.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Founder

Summary TLDR

The paper introduces AGENT-SAFETYBENCH, a large interactive benchmark for agent safety: 349 environments, 2,000 test cases, 8 safety categories and 10 failure modes. The authors run 16 popular LLM agents with tool capabilities and find poor safety: no agent scores above 60% and the average safety is 38.5%. They release the dataset and code, provide a finetuned scorer (Qwen-2.5-7B variant) that improves evaluation accuracy, and show simple defense prompts give limited gains. Key practical gaps are agents' lack of robustness in tool use and lack of risk awareness when interacting with environments.

Problem Statement

Existing safety benchmarks focus on text-only risks. Agents that call tools and act in environments create new behavioral safety failures (wrong tool calls, unsafe parameter choices, failing to validate tool outputs). There is no large, systematic benchmark measuring these behavior-level agent risks.

Main Contribution

Agent-SafetyBench: an interactive benchmark with 349 environments, 2,000 test cases, 8 safety categories and 10 annotated failure modes.

A finetuned scorer (based on Qwen-2.5-7B) trained on 4,000 labeled interaction records; it yields much higher evaluation accuracy than directly using GPT-4o.

Key Findings

No tested LLM agent exceeds 60% safety on Agent-SafetyBench.

NumbersBest model 59.8% (Claude-3-Opus); all <60%

Practical UseDo not assume current tool-using agents are safe in realistic interactive tasks; plan engineering safeguards and human oversight for any deployment.

Evidence RefTable 5

Average safety across agents is low.

NumbersAverage total safety = 38.5%

Practical UseBenchmark-level average suggests substantial engineering work (policy, validation, or retraining) is required before using agents in risky domains.

Evidence RefTable 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Best model safety (total)	59.8%	—	—	Agent-SafetyBench (all cases)	Table 5 (Claude-3-Opus = 59.8%)	Table 5
Average safety (total)	38.5%	—	—	Agent-SafetyBench (all cases)	Table 5 (Average row)	Table 5

What To Try In 7 Days

Run a small subset of your agent workflows through Agent-SafetyBench to map weak failure modes.

Add pre-action validators: require explicit confirmations or parameter checks before each tool call.

Use the finetuned scorer or human review on high-risk flows rather than only relying on off-the-shelf judge models.

Agent Features

Memory

short-term interaction history used during planning

Planning

sequential tool-calling loop (analyze -> call -> observe -> repeat)

Tool Use

function-call style tools (JSON schema)simulated external APIs (Python environment)

Frameworks

JSON tool schema + Python class environments (OpenAI/Claude compatible)

Is Agentic

Yes

Architectures

LLM-based agent with tool calls

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseMIT (benchmark and code)

Code URLs

https://github.com/thu-coai/Agent-SafetyBench/

Data URLs

https://github.com/thu-coai/Agent-SafetyBench/

Risks & Boundaries

Limitations

Most cases rely on commonsense reasoning; advanced domain-specific scenarios are not covered.

A portion of augmented cases required manual revision; automated generation quality is mixed.

When Not To Use

As the sole safety check for high-stakes domain deployments (medical, legal, critical infrastructure).

To evaluate domain-expert knowledge or specialized procedures that require external certification.

Failure Modes

Generate harmful content without tools (M1)

Call tools with incomplete information or fabricated parameters (M2, M3)

Core Entities

Models

Claude-3-OpusClaude-3.5-SonnetClaude-3.5-HaikuGPT-4oGPT-4-TurboGemini-1.5-FlashGemini-1.5-ProQwen2.5-72B-InstructGLM4-9B-ChatLlama3.1-405B-InstructDeepSeek-V2.5Qwen2.5-14B-InstructGPT-4o-miniLlama3.1-70B-InstructLlama3.1-8B-InstructQwen2.5-7B-Instruct

Metrics

Safety Score (%)Failure-mode specific safety (%)Accuracy

Datasets

Agent-SafetyBench (this work)R-JudgeAgentDojoGuardAgentToolEmuToolSwordInjecAgentAdvBench

Benchmarks

AGENT-SAFETYBENCH

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

No tested LLM agent exceeds 60% safety on Agent-SafetyBench.

Average safety across agents is low.

Results

What To Try In 7 Days

Agent Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

Model judges reward ethics-based refusals; human users penalize them

Key finding

A 300k-case, 22-language benchmark that tests how jailbreak prompts make LLMs write fake news

Key finding

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

Key finding

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

Key finding