Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
AgentAuditor lets teams improve automated safety/security judgments quickly without expensive fine-tuning. It lowers annotation costs, speeds evaluations, and helps catch action-level and cumulative risks that simple heuristics miss.
Summary TLDR
AgentAuditor is a training-free framework that turns an LLM into a human-like evaluator of agent behaviors. It builds a structured feature memory (scenario, risk, behavior), clusters the memory to select representative cases, generates chain-of-thought (CoT) explanations for those cases, and uses multi-stage retrieval to augment judgments on new interactions. The authors also release ASSEBench: 2,293 annotated agent interactions covering 15 risk types and 29 scenarios, with strict/lenient variants. AgentAuditor improves evaluation accuracy across many LLMs and datasets (e.g., Gemini-2: F1 82.27 → 96.31 on R-Judge) and matches or approaches average single-human performance on several benchmarks. Code and data are openly released.
Problem Statement
Existing auto-evaluators miss risks that arise from actions, cumulative steps, subtle context, or ambiguous boundaries. Rule-based checks are brittle; vanilla LLM judges are inconsistent and biased. We need a practical, scalable system that reads agent interaction traces, reasons step-by-step, and aligns with human judgments without expensive retraining.
Main Contribution
AgentAuditor: a training-free, memory-augmented RAG + CoT pipeline that equips an LLM to emulate human expert evaluators.
ASSEBench: a 2,293‑record benchmark (4 subsets) for agent safety and security with per-case labels, metadata, and strict/lenient standards.
Empirical claim: AgentAuditor consistently raises evaluation performance across many LLMs and datasets and reaches or nears human-level accuracy in several tests.
Open release: code and dataset published to enable reproduction and follow-up work.
Key Findings
AgentAuditor gives large, consistent gains across models and datasets.
AgentAuditor reaches human-level judgment on several benchmarks with strong LLMs.
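A small reasoning memory and the absence of fine-tuning lower data and training costs.
Robustness holds under label noise and targeted attacks up to a threshold; beyond roughly 50% wrong reasoning-memory labels, performance can drop below baseline.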
Results
- F1: 82.27 → 96.31 (Gemini-2 on R-Judge)
- Accuracy: reported alongside F1 (scaled 0-100)
- Resource: 2,293 annotated records (ASSEBench)
Who Should Care
What To Try In 7 Days
Run AgentAuditor on a held-out subset of your agent logs to compare F1/recall vs current rule-based checks.
Build a 20–100‑shot reasoning memory from high-quality examples (paper shows ~24 shots sufficed) and measure gains on ambiguous cases.
Use ASSEBench’s strict/lenient splits to quantify how sensitive your current evaluator is to borderline cases and tune thresholds accordingly.
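The comparisons suggested above come down to a few counts over binary safe/unsafe labels. The following is a minimal sketch, assuming labels are plain strings and "unsafe" is the positive class. The names `binary_metrics` and `flip_rate` and the toy label lists are illustrative only and do not come from the AgentAuditor code or from ASSEBench.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    precision: float
    recall: float
    f1: float
    accuracy: float


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "unsafe") -> BinaryMetrics:
    """Score predictions against human labels, treating `positive` as the risky class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return BinaryMetrics(precision, recall, f1, accuracy)


def flip_rate(strict: Sequence[str], lenient: Sequence[str]) -> float:
    """Share of cases whose label differs between the strict and lenient standard."""
    if len(strict) != len(lenient):
        raise ValueError("label lists must have the same length")
    if not strict:
        return 0.0
    return sum(s != l for s, l in zip(strict, lenient)) / len(strict)


# Toy data only (hypothetical, not taken from the paper or the benchmark).
human = ["unsafe", "safe", "unsafe", "unsafe", "safe", "safe"]
rules = ["safe", "safe", "unsafe", "safe", "safe", "unsafe"]
judge = ["unsafe", "safe", "unsafe", "safe", "safe", "safe"]

print("rule-based:", binary_metrics(human, rules))
print("LLM judge: ", binary_metrics(human, judge))
```

Because recall is prioritized for safety, it is worth reading the recall column before F1: a missed unsafe case usually costs more than a false alarm.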
Agent Features
Memory
- structured feature memory (scenario, risk, behavior)
- vectorized embeddings
- reasoning memory of CoT traces
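One way to picture the three memory components listed above is a single record type that carries the structured features, an embedding, and an optional CoT trace. This is a minimal sketch under stated assumptions: the `MemoryRecord` fields, the `nearest` helper, and the two-dimensional toy vectors are hypothetical and are not the authors' schema. Real embeddings would come from an embedding model.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryRecord:
    interaction: str            # raw agent interaction trace
    scenario: str               # structured feature: where the agent operates
    risk: str                   # structured feature: risk type
    behavior: str               # structured feature: what the agent did
    label: str                  # "safe" or "unsafe"
    embedding: List[float] = field(default_factory=list)  # vectorized form
    cot: str = ""               # CoT explanation, filled only for representative cases


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest(query_vec: List[float], memory: List[MemoryRecord], n: int = 2) -> List[MemoryRecord]:
    """Return the n records whose embeddings are most similar to the query."""
    ranked = sorted(memory, key=lambda r: cosine(query_vec, r.embedding), reverse=True)
    return ranked[:n]


# Toy memory with hand-written vectors (assumption: illustration only).
memory = [
    MemoryRecord("agent emails a password", "email", "data leak", "sends secret", "unsafe", [1.0, 0.0]),
    MemoryRecord("agent lists files", "filesystem", "none", "read-only listing", "safe", [0.0, 1.0]),
    MemoryRecord("agent forwards a file", "email", "data leak", "shares document", "unsafe", [1.0, 1.0]),
]
print([r.interaction for r in nearest([1.0, 0.1], memory)])
```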
Planning
- few-shot CoT prompting
- multi-shot retrieval-augmented reasoning
Tool Use
- tool-call sequence analysis
- environment action reasoning
Frameworks
- FINCH clustering
- Nomic-Embed-Text-v1.5 embeddings
- multi-stage RAG pipeline
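FINCH groups points by linking each one to its first (nearest) neighbor and taking the connected components of the resulting graph. The sketch below covers only that first partition, in pure Python with cosine similarity, and then picks one representative per cluster as the member closest to the cluster mean. It is a simplified illustration, not the authors' pipeline: the full FINCH algorithm merges clusters recursively, and the representative-selection rule here is an assumption.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def first_neighbor_partition(vectors: List[List[float]]) -> List[int]:
    """Cluster labels from one round of first-neighbor linking."""
    n = len(vectors)
    if n < 2:
        return [0] * n
    # kappa[i] is the index of the most similar other point to i.
    kappa = [max((j for j in range(n) if j != i),
                 key=lambda j, i=i: cosine(vectors[i], vectors[j]))
             for i in range(n)]
    # Union-find: linking i with kappa[i] also connects points that share a
    # first neighbor, so the components match the first-neighbor graph.
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(kappa):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    ids: Dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


def representatives(vectors: List[List[float]], labels: List[int]) -> Dict[int, int]:
    """For each cluster, the index of the member closest to the cluster mean (assumed rule)."""
    reps: Dict[int, int] = {}
    for c in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == c]
        dim = len(vectors[members[0]])
        mean = [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
        reps[c] = max(members, key=lambda i: cosine(vectors[i], mean))
    return reps


vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = first_neighbor_partition(vecs)
print(labels, representatives(vecs, labels))
```

Only the representatives would then need CoT explanations, which is what keeps the reasoning memory small.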
Is Agentic
true
Architectures
- LLM
- RAG
- CoT
Collaboration
- human-in-the-loop annotation for dataset creation
Optimization Features
Token Efficiency
- reduces need for large fine-tuning datasets (24 vs 4,000 labels)
Infra Optimization
- helps smaller models approach larger-model performance via memory augmentation
System Optimization
- PCA for dimensionality reduction before clustering
Training Optimization
- training-free evaluation (no fine-tuning required)
Inference Optimization
- few-shot CoT prompts assembled from k retrieved examples
- multi-stage reranking to pick top-k CoT examples
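The two inference steps above can be sketched as a coarse retrieval followed by a rerank and a prompt build. This sketch assumes stage one ranks by similarity of the whole interaction and stage two reranks by a weighted sum of scenario, risk, and behavior similarities. The `Shot` type, the equal default weights, the parameters `n` and `k`, and the prompt wording are all hypothetical and may differ from the paper's implementation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Shot:
    text: str                             # stored interaction
    cot: str                              # stored chain-of-thought explanation
    label: str                            # "safe" or "unsafe"
    content_vec: List[float]              # embedding of the whole interaction
    feature_vecs: Dict[str, List[float]]  # embeddings keyed by feature name


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_shots(query_content: List[float], query_features: Dict[str, List[float]],
                   memory: List[Shot], n: int = 4, k: int = 2,
                   weights: Optional[Dict[str, float]] = None) -> List[Shot]:
    """Stage 1: top-n by content similarity. Stage 2: rerank by weighted feature similarity, keep k."""
    if weights is None:
        weights = {name: 1.0 / len(query_features) for name in query_features}
    candidates = sorted(memory, key=lambda s: cosine(query_content, s.content_vec),
                        reverse=True)[:n]

    def feature_score(s: Shot) -> float:
        return sum(w * cosine(query_features[name], s.feature_vecs[name])
                   for name, w in weights.items())

    return sorted(candidates, key=feature_score, reverse=True)[:k]


def build_prompt(shots: List[Shot], interaction: str) -> str:
    """Assemble a few-shot CoT prompt from the retrieved shots (wording is illustrative)."""
    parts = ["You are auditing an agent interaction for safety and security risks."]
    for i, s in enumerate(shots, 1):
        parts.append(f"Example {i}:\n{s.text}\nReasoning: {s.cot}\nLabel: {s.label}")
    parts.append(f"New interaction:\n{interaction}\n"
                 "Reason step by step, then answer safe or unsafe.")
    return "\n\n".join(parts)


feats_x = {"scenario": [1.0, 0.0], "risk": [1.0, 0.0], "behavior": [1.0, 0.0]}
feats_y = {"scenario": [0.0, 1.0], "risk": [0.0, 1.0], "behavior": [0.0, 1.0]}
memory = [
    Shot("trace A", "secret leaves the sandbox", "unsafe", [1.0, 0.0], feats_x),
    Shot("trace B", "read-only action", "safe", [0.9, 0.1], feats_y),
    Shot("trace C", "same risk, other wording", "unsafe", [0.0, 1.0], feats_x),
]
shots = retrieve_shots([1.0, 0.0], feats_x, memory, n=3, k=2)
print(build_prompt(shots, "trace under review"))
```

In this toy memory the rerank promotes trace C over trace B: C's wording is far from the query, but its scenario, risk, and behavior match, which is the kind of case a content-only retriever would miss.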
Reproducibility
Code Available
- yes
Data Available
- yes
Open Source Status
- yes
Risks & Boundaries
Limitations
- Performance depends on quality of memory shot labels and CoT correctness.
- CoT increases inference cost and latency relative to single-shot prompts.
- Several hyperparameters are heuristic and tuned empirically; universality not fully proven.
- ASSEBench and experiments are English-only and use binary safe/unsafe labels.
When Not To Use
- When you cannot afford higher inference latency or CoT cost.
- When you lack a small set of high-quality annotated examples to seed reasoning memory.
- When you need multilingual coverage not supported by current memory and datasets.
- When a multi-class or graded safety taxonomy (not binary) is required without further extensions.
Failure Modes
- Corrupted or poisoned reasoning shots can mislead final judgments (white-box attacks have a measurable impact).
- If more than roughly 50% of reasoning-memory labels are wrong, performance can drop below baseline.
- Over-analysis by CoT can increase false positives when used without representative examples.
- Poor embedding or retrieval quality yields weaker gains (random-shot experiments show lower improvement).
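The label-noise failure mode above can be probed on your own memory by flipping a controlled fraction of labels and re-running the evaluator at each noise level. The following is a minimal sketch of such a stress test. `flip_labels` and `noise_sweep` are hypothetical helpers, and `evaluate` stands in for whatever function scores your evaluator given a (possibly corrupted) set of memory labels; the stand-in used below only measures agreement with the clean labels.

```python
import random
from typing import Callable, Dict, List, Sequence


def flip_labels(labels: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Return a copy of `labels` with a given fraction of safe/unsafe labels inverted."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so runs are repeatable
    n_flip = int(round(fraction * len(labels)))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = "safe" if noisy[i] == "unsafe" else "unsafe"
    return noisy


def noise_sweep(labels: Sequence[str], evaluate: Callable[[List[str]], float],
                fractions: Sequence[float] = (0.0, 0.1, 0.3, 0.5, 0.7)) -> Dict[float, float]:
    """Score the evaluator at each noise level; `evaluate` is supplied by the caller."""
    return {f: evaluate(flip_labels(labels, f)) for f in fractions}


clean = ["unsafe", "safe"] * 5


def agreement(noisy: List[str]) -> float:
    # Stand-in score (assumption): share of memory labels still matching the clean set.
    return sum(a == b for a, b in zip(clean, noisy)) / len(clean)


print(noise_sweep(clean, agreement))
```

With a real evaluator plugged in as `evaluate`, the sweep shows where your own degradation threshold sits rather than relying on the reported ~50% figure.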
Core Entities
Models
- Gemini-2.0-Flash-thinking
- GPT-4o
- GPT-4.1
- Claude-3.5-sonnet
- Deepseek-v3
- QwQ-32B
- Qwen-2.5-32B
- Qwen-2.5-7B
- Llama-3.1-8B
- ShieldAgent (fine-tuned Qwen-2.5-7B)
- Llama-Guard-3
Metrics
- F1
- Accuracy
- Recall
- Precision
- ASR
- Refusal Rate
Datasets
- ASSEBench-Security
- ASSEBench-Safety
- ASSEBench-Strict
- ASSEBench-Lenient
- R-Judge
- AgentHarm
- AgentSafetyBench
- AgentSecurityBench
- AgentDojo
Benchmarks
- R-Judge
- AgentHarm
- AgentSafetyBench
- AgentSecurityBench
- AgentDojo
Context Entities
Models
- Gemini-2.0-Flash-thinking (reasoning-optimized)
- GPT-o3-mini
- QwQ-32B (open)
- Qwen family (open)
Metrics
- F1, Accuracy (scaled 0-100)
- Recall prioritized for safety
Datasets
- ToolEmu
- AgentDojo
- AgentHarm
- AgentSafetyBench
- AgentSecurityBench
Benchmarks
- R-Judge (existing evaluator benchmark)

