Overview
The design is practical: it is training-free, improves many off-the-shelf LLMs, and is backed by multi-benchmark experiments and human evaluation. The main caveats are the quality of the memory labels and the extra compute that CoT requires.
Citations: 0
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
AgentAuditor lets teams improve automated safety/security judgments quickly without expensive fine-tuning. It lowers annotation costs, speeds evaluations, and helps catch action-level and cumulative risks that simple heuristics miss.
Summary TLDR
AgentAuditor is a training-free framework that turns an LLM into a human-like evaluator of agent behaviors. It builds a structured feature memory (scenario, risk, behavior), clusters the memory to select representative cases, generates chain-of-thought (CoT) explanations for those cases, and uses multi-stage retrieval to augment judgments on new interactions. The authors also release ASSEBench: 2,293 annotated agent interactions covering 15 risk types and 29 scenarios, with strict and lenient label variants. AgentAuditor improves evaluation accuracy across many LLMs and datasets (e.g., Gemini-2: F1 82.27 → 96.31 on R-Judge) and matches or approaches average single-human performance on several benchmarks. Code and data are available.
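The retrieval step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Case` schema, the token-overlap (Jaccard) similarity, and the two-stage split (coarse filter on scenario and risk, then rerank on behavior) are assumptions standing in for whatever embeddings and clustering the paper actually uses. The call to the judging LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One reasoning-memory entry (hypothetical schema)."""
    scenario: str
    risk: str
    behavior: str
    cot: str      # chain-of-thought explanation written for this case
    label: str    # "safe" or "unsafe"


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a toy stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(memory, scenario, risk, behavior, k_coarse=4, k_final=2):
    """Two-stage retrieval: coarse filter on scenario + risk, then rerank on behavior."""
    coarse = sorted(
        memory,
        key=lambda c: jaccard(c.scenario, scenario) + jaccard(c.risk, risk),
        reverse=True,
    )[:k_coarse]
    return sorted(coarse, key=lambda c: jaccard(c.behavior, behavior), reverse=True)[:k_final]


def build_prompt(shots, trace):
    """Assemble a few-shot CoT prompt for the judging LLM."""
    parts = [
        f"Example behavior: {s.behavior}\nReasoning: {s.cot}\nVerdict: {s.label}"
        for s in shots
    ]
    parts.append(f"New interaction: {trace}\nReasoning:")
    return "\n\n".join(parts)


# Toy memory, made up for illustration only
memory = [
    Case("file management", "data loss", "agent deletes user files without confirmation",
         "Deletion is irreversible and was not confirmed.", "unsafe"),
    Case("file management", "data loss", "agent asks for confirmation before deleting files",
         "The agent sought consent before a destructive action.", "safe"),
    Case("web browsing", "privacy leak", "agent posts user email on a public forum",
         "Personal data was exposed publicly.", "unsafe"),
]

trace = "agent deletes backup files without confirmation"
shots = retrieve(memory, "file management", "data loss", trace, k_coarse=2, k_final=1)
prompt = build_prompt(shots, trace)
```

Splitting retrieval into a cheap coarse pass and a narrower rerank keeps the final prompt short, which matters because every retrieved shot carries its own CoT text.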
Problem Statement
Existing auto-evaluators miss risks that arise from actions, cumulative steps, subtle context, or ambiguous boundaries. Rule-based checks are brittle; vanilla LLM judges are inconsistent and biased. We need a practical, scalable system that reads agent interaction traces, reasons step-by-step, and aligns with human judgments without expensive retraining.
Main Contribution
AgentAuditor: a training-free, memory-augmented RAG + CoT pipeline that equips an LLM to emulate human expert evaluators.
ASSEBench: a 2,293‑record benchmark (4 subsets) for agent safety and security with per-case labels, metadata, and strict/lenient standards.
Key Findings
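AgentAuditor yields large, consistent gains across models and datasets; for example, Gemini-2 on R-Judge improves from 82.27 to 96.31 F1 (+14.04 points).
With strong LLMs, AgentAuditor matches or approaches average single-human performance on several benchmarks.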
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| F1 | 96.31 | 82.27 | ↑14.04 (absolute) | R-Judge (Gemini-2 + AgentAuditor) | Table 1, Section 5.2 | Table 1 |
| Accuracy | 96.10 | 81.21 | ↑14.89 (absolute) | R-Judge (Gemini-2 + AgentAuditor) | Table 1, Section 5.2 | Table 1 |
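The Delta column reports absolute differences in percentage points. A small check of that arithmetic is sketched below; the helper names are illustrative, and the relative gain is an added convenience that the source does not report.

```python
def absolute_delta(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


def relative_gain(value: float, baseline: float) -> float:
    """Improvement as a percentage of the baseline, rounded to one decimal."""
    return round(100 * (value - baseline) / baseline, 1)


f1_delta = absolute_delta(96.31, 82.27)    # R-Judge F1, Gemini-2
acc_delta = absolute_delta(96.10, 81.21)   # R-Judge accuracy, Gemini-2
f1_rel = relative_gain(96.31, 82.27)       # derived here, not taken from the paper
```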
What To Try In 7 Days
Run AgentAuditor on a held-out subset of your agent logs to compare F1/recall vs current rule-based checks.
Build a 20–100‑shot reasoning memory from high-quality examples (paper shows ~24 shots sufficed) and measure gains on ambiguous cases.
Use ASSEBench’s strict/lenient splits to quantify how sensitive your current evaluator is to borderline cases and tune thresholds accordingly.
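A minimal sketch of the comparisons suggested above, assuming binary `safe`/`unsafe` labels. The function names and all data are made up for illustration; in practice the gold labels would come from your own annotated logs or from ASSEBench.

```python
def prf(gold, pred, positive="unsafe"):
    """Precision, recall, and F1 for the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def disagreement_rate(strict, lenient):
    """Share of cases whose label changes between the strict and lenient standards."""
    return sum(s != l for s, l in zip(strict, lenient)) / len(strict)


# Toy held-out set comparing two evaluators against human labels
gold       = ["unsafe", "unsafe", "safe", "unsafe", "safe", "safe"]
rule_based = ["unsafe", "safe",   "safe", "safe",   "safe", "unsafe"]
llm_judge  = ["unsafe", "unsafe", "safe", "unsafe", "safe", "unsafe"]

rule_scores = prf(gold, rule_based)
judge_scores = prf(gold, llm_judge)

# Toy strict vs lenient labels for the same four cases
strict  = ["unsafe", "unsafe", "safe", "unsafe"]
lenient = ["unsafe", "safe",   "safe", "unsafe"]
borderline_share = disagreement_rate(strict, lenient)
```

Reporting recall alongside F1 is useful here because a missed unsafe case is usually costlier than a false alarm, and the disagreement rate gives a rough measure of how much of a dataset sits on the strict/lenient boundary.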
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Performance depends on quality of memory shot labels and CoT correctness.
CoT increases inference cost and latency relative to single-shot prompts.
When Not To Use
When you cannot afford higher inference latency or CoT cost.
When you lack a small set of high-quality annotated examples to seed reasoning memory.
Failure Modes
Corrupted or poisoned reasoning shots can mislead final judgments; white-box attacks do have an impact.
If more than roughly 50% of reasoning-memory labels are wrong, performance can drop below the baseline.
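One way to probe this failure mode on your own setup is to flip a controlled fraction of memory labels and re-run the evaluation at each noise rate. The helper below is a hypothetical stress-test utility, not part of AgentAuditor, and it assumes binary `safe`/`unsafe` labels.

```python
import random


def corrupt_labels(labels, rate, seed=0):
    """Flip a fixed fraction of binary labels to simulate noisy or poisoned memory."""
    rng = random.Random(seed)                      # seeded for repeatable experiments
    n_flip = round(rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    flipped = {"safe": "unsafe", "unsafe": "safe"}
    return [flipped[l] if i in flip_idx else l for i, l in enumerate(labels)]


clean = ["safe", "unsafe"] * 10                    # 20 toy memory labels
noisy = corrupt_labels(clean, rate=0.25)
n_changed = sum(a != b for a, b in zip(clean, noisy))
```

Sweeping `rate` over a range such as 0.0 to 0.6 and plotting evaluator F1 against it would show where your configuration starts to fall below the unaugmented baseline.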

