Overview
Production Readiness
0.4
Novelty Score
0.55
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
Tool-using LLM agents can make costly or unsafe actions; organizations should not deploy them without runtime checks, human-in-loop gating, and improved evaluation tailored to agent behavior.
Summary TLDR
The paper introduces AGENT-SAFETYBENCH, a large interactive benchmark for agent safety: 349 environments, 2,000 test cases, 8 safety categories and 10 failure modes. The authors run 16 popular LLM agents with tool capabilities and find poor safety: no agent scores above 60% and the average safety is 38.5%. They release the dataset and code, provide a finetuned scorer (Qwen-2.5-7B variant) that improves evaluation accuracy, and show simple defense prompts give limited gains. Key practical gaps are agents' lack of robustness in tool use and lack of risk awareness when interacting with environments.
Problem Statement
Existing safety benchmarks focus on text-only risks. Agents that call tools and act in environments create new behavioral safety failures (wrong tool calls, unsafe parameter choices, failing to validate tool outputs). There is no large, systematic benchmark measuring these behavior-level agent risks.
Main Contribution
Agent-SafetyBench: an interactive benchmark with 349 environments, 2,000 test cases, 8 safety categories and 10 annotated failure modes.
A finetuned scorer (based on Qwen-2.5-7B) trained on 4,000 labeled interaction records; it yields much higher evaluation accuracy than directly using GPT-4o.
Large-scale evaluation of 16 LLM agents with tools showing pervasive safety gaps and an analysis that isolates two root defects: lack of robustness and lack of risk awareness.
Key Findings
No tested LLM agent exceeds 60% safety on Agent-SafetyBench.
Average safety across agents is low.
Some risk categories are especially weak (agents spread unsafe info via tools).
Two core failures explain many unsafe behaviors: lack of robustness and lack of risk awareness.
A finetuned scorer improves safety labeling accuracy over using GPT-4o directly.
Simple defense prompts give only limited safety improvements.
Results
Best model safety (total)
Average safety (total)
Spread category average
Accuracy
Who Should Care
What To Try In 7 Days
Run a small subset of your agent workflows through Agent-SafetyBench to map weak failure modes.
Add pre-action validators: require explicit confirmations or parameter checks before each tool call.
Use the finetuned scorer or human review on high-risk flows rather than only relying on off-the-shelf judge models.
Agent Features
Memory
- short-term interaction history used during planning
Planning
- sequential tool-calling loop (analyze -> call -> observe -> repeat)
Tool Use
- function-call style tools (JSON schema)
- simulated external APIs (Python environment)
Frameworks
- JSON tool schema + Python class environments (OpenAI/Claude compatible)
Is Agentic
true
Architectures
- LLM-based agent with tool calls
Reproducibility
License
- MIT (benchmark and code)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Most cases rely on commonsense reasoning; advanced domain-specific scenarios are not covered.
- A portion of augmented cases required manual revision; automated generation quality is mixed.
- Scorer and augmentation use models (GPT-4o, Qwen) which may introduce subtle bias despite validation.
When Not To Use
- As the sole safety check for high-stakes domain deployments (medical, legal, critical infrastructure).
- To evaluate domain-expert knowledge or specialized procedures that require external certification.
Failure Modes
- Generate harmful content without tools (M1)
- Call tools with incomplete information or fabricated parameters (M2, M3)
- Ignore constraints or implicit risks and call tools (M4, M5)
- Use wrong tool parameters or trust tool outputs without validation (M6, M9)
- Call known risky tools or fail to call necessary ones (M7, M8)
- Fail to filter multiple tool results (M10)
Core Entities
Models
- Claude-3-Opus
- Claude-3.5-Sonnet
- Claude-3.5-Haiku
- GPT-4o
- GPT-4-Turbo
- Gemini-1.5-Flash
- Gemini-1.5-Pro
- Qwen2.5-72B-Instruct
- GLM4-9B-Chat
- Llama3.1-405B-Instruct
- DeepSeek-V2.5
- Qwen2.5-14B-Instruct
- GPT-4o-mini
- Llama3.1-70B-Instruct
- Llama3.1-8B-Instruct
- Qwen2.5-7B-Instruct
Metrics
- Safety Score (%)
- Failure-mode specific safety (%)
- Accuracy
Datasets
- Agent-SafetyBench (this work)
- R-Judge
- AgentDojo
- GuardAgent
- ToolEmu
- ToolSword
- InjecAgent
- AdvBench
Benchmarks
- AGENT-SAFETYBENCH

