Agent-SafetyBench: 2,000 agent tests across 349 environments — no tested agent exceeds 60% safety

December 19, 20247 min

Overview

Production Readiness

0.4

Novelty Score

0.55

Cost Impact Score

0.6

Citation Count

2

Authors

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang

Links

Abstract / PDF

Why It Matters For Business

Tool-using LLM agents can make costly or unsafe actions; organizations should not deploy them without runtime checks, human-in-loop gating, and improved evaluation tailored to agent behavior.

Summary TLDR

The paper introduces AGENT-SAFETYBENCH, a large interactive benchmark for agent safety: 349 environments, 2,000 test cases, 8 safety categories and 10 failure modes. The authors run 16 popular LLM agents with tool capabilities and find poor safety: no agent scores above 60% and the average safety is 38.5%. They release the dataset and code, provide a finetuned scorer (Qwen-2.5-7B variant) that improves evaluation accuracy, and show simple defense prompts give limited gains. Key practical gaps are agents' lack of robustness in tool use and lack of risk awareness when interacting with environments.

Problem Statement

Existing safety benchmarks focus on text-only risks. Agents that call tools and act in environments create new behavioral safety failures (wrong tool calls, unsafe parameter choices, failing to validate tool outputs). There is no large, systematic benchmark measuring these behavior-level agent risks.

Main Contribution

Agent-SafetyBench: an interactive benchmark with 349 environments, 2,000 test cases, 8 safety categories and 10 annotated failure modes.

A finetuned scorer (based on Qwen-2.5-7B) trained on 4,000 labeled interaction records; it yields much higher evaluation accuracy than directly using GPT-4o.

Large-scale evaluation of 16 LLM agents with tools showing pervasive safety gaps and an analysis that isolates two root defects: lack of robustness and lack of risk awareness.

Key Findings

No tested LLM agent exceeds 60% safety on Agent-SafetyBench.

NumbersBest model 59.8% (Claude-3-Opus); all <60%

Average safety across agents is low.

NumbersAverage total safety = 38.5%

Some risk categories are especially weak (agents spread unsafe info via tools).

NumbersAverage score on 'Spread' = 15.6%

Two core failures explain many unsafe behaviors: lack of robustness and lack of risk awareness.

A finetuned scorer improves safety labeling accuracy over using GPT-4o directly.

NumbersFinetuned scorer 91.5% vs GPT-4o 75.5% accuracy on sampled records (~+16 pp)

Simple defense prompts give only limited safety improvements.

NumbersTop model remains <70% even with enhanced defense prompt (example: Claude-3.5-Sonnet)

Results

Best model safety (total)

Value59.8%

Average safety (total)

Value38.5%

Spread category average

Value15.6%

Accuracy

Value91.5%

BaselineGPT-4o 75.5%

Who Should Care

What To Try In 7 Days

Run a small subset of your agent workflows through Agent-SafetyBench to map weak failure modes.

Add pre-action validators: require explicit confirmations or parameter checks before each tool call.

Use the finetuned scorer or human review on high-risk flows rather than only relying on off-the-shelf judge models.

Agent Features

Memory

  • short-term interaction history used during planning

Planning

  • sequential tool-calling loop (analyze -> call -> observe -> repeat)

Tool Use

  • function-call style tools (JSON schema)
  • simulated external APIs (Python environment)

Frameworks

  • JSON tool schema + Python class environments (OpenAI/Claude compatible)

Is Agentic

true

Architectures

  • LLM-based agent with tool calls

Reproducibility

License

  • MIT (benchmark and code)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Most cases rely on commonsense reasoning; advanced domain-specific scenarios are not covered.
  • A portion of augmented cases required manual revision; automated generation quality is mixed.
  • Scorer and augmentation use models (GPT-4o, Qwen) which may introduce subtle bias despite validation.

When Not To Use

  • As the sole safety check for high-stakes domain deployments (medical, legal, critical infrastructure).
  • To evaluate domain-expert knowledge or specialized procedures that require external certification.

Failure Modes

  • Generate harmful content without tools (M1)
  • Call tools with incomplete information or fabricated parameters (M2, M3)
  • Ignore constraints or implicit risks and call tools (M4, M5)
  • Use wrong tool parameters or trust tool outputs without validation (M6, M9)
  • Call known risky tools or fail to call necessary ones (M7, M8)
  • Fail to filter multiple tool results (M10)

Core Entities

Models

  • Claude-3-Opus
  • Claude-3.5-Sonnet
  • Claude-3.5-Haiku
  • GPT-4o
  • GPT-4-Turbo
  • Gemini-1.5-Flash
  • Gemini-1.5-Pro
  • Qwen2.5-72B-Instruct
  • GLM4-9B-Chat
  • Llama3.1-405B-Instruct
  • DeepSeek-V2.5
  • Qwen2.5-14B-Instruct
  • GPT-4o-mini
  • Llama3.1-70B-Instruct
  • Llama3.1-8B-Instruct
  • Qwen2.5-7B-Instruct

Metrics

  • Safety Score (%)
  • Failure-mode specific safety (%)
  • Accuracy

Datasets

  • Agent-SafetyBench (this work)
  • R-Judge
  • AgentDojo
  • GuardAgent
  • ToolEmu
  • ToolSword
  • InjecAgent
  • AdvBench

Benchmarks

  • AGENT-SAFETYBENCH