AgentAuditor: memory‑augmented RAG + CoT that makes LLM evaluators reach human-level accuracy on agent safety

May 31, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The design is practical: it improves many off-the-shelf LLMs, is training-free, and is backed by multi-benchmark experiments and human evaluation; main caveats are label quality and compute for CoT.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AgentAuditor lets teams improve automated safety/security judgments quickly without expensive fine-tuning. It lowers annotation costs, speeds evaluations, and helps catch action-level and cumulative risks that simple heuristics miss.

Who Should Care

Summary TLDR

AgentAuditor is a training-free framework that turns an LLM into a human-like evaluator of agent behaviors. It builds a structured feature memory (scenario, risk, behavior), clusters representative cases, generates chain-of-thought (CoT) explanations for those cases, and uses multi-stage retrieval to augment judgments on new interactions. The authors also release ASSEBench: 2,293 annotated agent interactions covering 15 risk types and 29 scenarios, with strict/lenient variants. AgentAuditor improves evaluation accuracy across many LLMs and datasets (e.g., Gemini-2: F1 82.27 → 96.31 on R-Judge) and matches or approaches average single-human performance on several benchmarks. Code and data are available.

Problem Statement

Existing auto-evaluators miss risks that arise from actions, cumulative steps, subtle context, or ambiguous boundaries. Rule-based checks are brittle; vanilla LLM judges are inconsistent and biased. We need a practical, scalable system that reads agent interaction traces, reasons step-by-step, and aligns with human judgments without expensive retraining.

Main Contribution

AgentAuditor: a training-free, memory-augmented RAG + CoT pipeline that equips an LLM to emulate human expert evaluators.

ASSEBench: a 2,293‑record benchmark (4 subsets) for agent safety and security with per-case labels, metadata, and strict/lenient standards.

Key Findings

AgentAuditor gives large, consistent gains across models and datasets.

Numbers: e.g., Gemini-2 F1 on R-Judge 82.27 → 96.31 (Table 1)

Practical Use: If you add AgentAuditor to an existing LLM evaluator, you can often raise binary safe/unsafe accuracy and F1 substantially without fine-tuning.

Evidence Ref: Table 1, Section 5.2

AgentAuditor reaches human-level judgment on several benchmarks with strong LLMs.

Numbers: Gemini-2 + AgentAuditor: Acc 96.1% on R-Judge vs. avg. human 95.7% (Table 22)

Practical Use: Using a strong base LLM plus AgentAuditor can replace or augment a human-in-the-loop for many judgment tasks, cutting annotation bottlenecks.

Evidence Ref: Section 5.2, Table 22

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
F1 | 96.31 | 82.27 | 14.04 (absolute) | R-Judge (Gemini-2 + AgentAuditor) | Table 1, Section 5.2 | Table 1
Accuracy | 96.10 | 81.21 | 14.89 (absolute) | R-Judge (Gemini-2 + AgentAuditor) | Table 1, Section 5.2 | Table 1

What To Try In 7 Days

Run AgentAuditor on a held-out subset of your agent logs to compare F1/recall vs current rule-based checks.

Build a 20–100‑shot reasoning memory from high-quality examples (paper shows ~24 shots sufficed) and measure gains on ambiguous cases.

Use ASSEBench’s strict/lenient splits to quantify how sensitive your current evaluator is to borderline cases and tune thresholds accordingly.
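The first item above, comparing F1/recall against your current checks, can be sketched in a few lines. This is only an illustration: the label lists are toy data, the names `BinaryScores` and `score` are hypothetical, and treating "unsafe" as the positive class (encoded as 1) is an assumption rather than something this summary specifies.

```python
from dataclasses import dataclass


@dataclass
class BinaryScores:
    precision: float
    recall: float
    f1: float
    accuracy: float


def score(human: list[int], predicted: list[int]) -> BinaryScores:
    """Score predictions against human labels; 1 = unsafe (positive), 0 = safe."""
    if len(human) != len(predicted):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, predicted))
    tp = sum(1 for h, p in pairs if h == 1 and p == 1)
    fp = sum(1 for h, p in pairs if h == 0 and p == 1)
    fn = sum(1 for h, p in pairs if h == 1 and p == 0)
    tn = sum(1 for h, p in pairs if h == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return BinaryScores(precision, recall, f1, accuracy)


# Toy held-out subset: human labels vs. two evaluators (made-up values).
human = [1, 1, 1, 0, 0, 1, 0, 0]
rule_based = [1, 0, 0, 0, 0, 1, 1, 0]
augmented = [1, 1, 1, 0, 0, 1, 1, 0]

baseline = score(human, rule_based)
candidate = score(human, augmented)
print(f"rule-based: F1={baseline.f1:.3f} recall={baseline.recall:.3f}")
print(f"augmented:  F1={candidate.f1:.3f} recall={candidate.recall:.3f}")
```

Reporting recall next to F1 matters here because a missed unsafe trace is usually the costlier error in a safety evaluation.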

Agent Features

Memory
structured feature memory (scenario, risk, behavior); vectorized embeddings; reasoning memory of CoT traces
Planning
few-shot CoT prompting; multi-shot retrieval-augmented reasoning
Tool Use
tool-call sequence analysis; environment action reasoning
Frameworks
FINCH clustering; Nomic-Embed-Text-v1.5 embeddings; multi-stage RAG pipeline
Is Agentic

Yes

Architectures
LLM; RAG; CoT
Collaboration
human-in-the-loop annotation for dataset creation
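To make the memory-building entries above more concrete, here is a minimal sketch of first-neighbor clustering, the linking idea behind FINCH's first partition, followed by picking one representative per cluster. It is not the authors' implementation: it covers only the first partition rather than the full hierarchy, uses Euclidean distance on toy 2-D points instead of PCA-reduced embeddings, and assumes the representative is the member closest to the cluster centroid. The function names are hypothetical.

```python
import math


def first_neighbor_clusters(points: list[tuple[float, ...]]) -> list[list[int]]:
    """Link every point to its nearest neighbor; connected components are clusters."""
    n = len(points)
    if n < 2:
        return [[i] for i in range(n)]
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        parent[find(i)] = find(nearest)  # merge i with its first neighbor

    clusters: dict[int, list[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def representatives(points: list[tuple[float, ...]], clusters: list[list[int]]) -> list[int]:
    """Pick, per cluster, the index of the member closest to the centroid (assumption)."""
    dim = len(points[0])
    reps = []
    for members in clusters:
        centroid = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
        reps.append(min(members, key=lambda i: math.dist(points[i], centroid)))
    return reps


# Toy "embeddings": two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
clusters = first_neighbor_clusters(points)
reps = representatives(points, clusters)
print(clusters, reps)
```

Because the number of clusters comes out of the neighbor links rather than a parameter, the size of the representative set does not have to be chosen in advance, which suits a training-free setup.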

Optimization Features

Token Efficiency
reduces need for large fine-tuning datasets (24 vs 4,000 labels)
Infra Optimization
supports smaller models to reach larger-model performance via memory augmentation
System Optimization
PCA for dimensionality reduction before clustering
Training Optimization
training-free evaluation (no fine-tuning required)
Inference Optimization
few-shot CoT prompts assembled from k retrieved examples; multi-stage reranking to pick top-k CoT examples
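The inference-time steps listed above (retrieve, rerank, assemble a few-shot CoT prompt) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `MemoryCase` fields, the `retrieve` and `build_prompt` helpers, the `n`/`k` parameters, the toy embeddings, and especially the reranking rule, which counts exact matches on scenario, risk, and behavior and may differ from how the paper scores features.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryCase:
    text: str
    scenario: str
    risk: str
    behavior: str
    embedding: list[float]
    cot: str = ""
    label: str = ""  # "safe" or "unsafe"; empty for a new query


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: MemoryCase, memory: list[MemoryCase], n: int = 3, k: int = 2) -> list[MemoryCase]:
    # Stage 1: shortlist the n cases whose content embedding is most similar.
    shortlist = sorted(memory, key=lambda c: cosine(query.embedding, c.embedding), reverse=True)[:n]

    # Stage 2: rerank the shortlist by how many structured features match.
    def overlap(c: MemoryCase) -> int:
        return sum([c.scenario == query.scenario, c.risk == query.risk, c.behavior == query.behavior])

    return sorted(shortlist, key=overlap, reverse=True)[:k]


def build_prompt(query: MemoryCase, shots: list[MemoryCase]) -> str:
    parts = [f"Interaction: {s.text}\nReasoning: {s.cot}\nVerdict: {s.label}" for s in shots]
    parts.append(f"Interaction: {query.text}\nReasoning:")
    return "\n\n".join(parts)


memory = [
    MemoryCase("case A", "email", "data leak", "sent summary", [0.9, 0.1], "Shares private data.", "unsafe"),
    MemoryCase("case B", "banking", "fraud", "made transfer", [1.0, 0.0], "Transfer was authorized.", "safe"),
    MemoryCase("case C", "email", "data leak", "forwarded file", [0.0, 1.0], "File left the org.", "unsafe"),
    MemoryCase("case D", "email", "phishing", "clicked link", [0.8, 0.2], "Link was verified.", "safe"),
]
query = MemoryCase("new trace", "email", "data leak", "forwarded file", [1.0, 0.0])

shots = retrieve(query, memory)
prompt = build_prompt(query, shots)
print(prompt)
```

In this toy run, case C matches every feature but never reaches stage 2 because its embedding is dissimilar, which shows why the shortlist size `n` is worth tuning separately from `k`.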

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Performance depends on quality of memory shot labels and CoT correctness.

CoT increases inference cost and latency relative to single-shot prompts.

When Not To Use

When you cannot afford higher inference latency or CoT cost.

When you lack a small set of high-quality annotated examples to seed reasoning memory.

Failure Modes

Corrupted or poisoned reasoning shots can mislead final judgments (white-box attacks have impact).

If >~50% of reasoning-memory labels are wrong, performance can drop below baseline.
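One way to probe this failure mode on your own data is to deliberately flip a fraction of the reasoning-memory labels and re-run the evaluator at each noise level. The sketch below covers only the label-flipping step; `flip_labels`, the fixed seed, and the 24-item toy memory are illustrative assumptions, and the re-evaluation itself is left to whatever harness you already have.

```python
import random


def flip_labels(labels: list[str], fraction: float, seed: int = 0) -> list[str]:
    """Return a copy of labels with the given fraction flipped between safe and unsafe."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so that noise levels are reproducible
    n_flip = round(fraction * len(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = "safe" if labels[i] == "unsafe" else "unsafe"
    return noisy


# Toy reasoning memory of 24 labels (made-up values).
clean = ["unsafe", "safe"] * 12

for fraction in (0.0, 0.25, 0.5):
    noisy = flip_labels(clean, fraction)
    changed = sum(1 for a, b in zip(clean, noisy) if a != b)
    print(f"noise={fraction:.2f}: {changed} of {len(clean)} labels flipped")
```

Plotting evaluator F1 against the noise fraction then shows where, for your base model, the memory stops helping and starts hurting.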

Core Entities

Models

Gemini-2.0-Flash-thinking; GPT-4o; GPT-4.1; Claude-3.5-sonnet; Deepseek-v3; QwQ-32B; Qwen-2.5-32B; Qwen-2.5-7B; Llama-3.1-8B; ShieldAgent (fine-tuned Qwen-2.5-7B); Llama-Guard-3

Metrics

F1; Accuracy; Recall; Precision; ASR; Refusal Rate

Datasets

ASSEBench-Security; ASSEBench-Safety; ASSEBench-Strict; ASSEBench-Lenient; R-Judge; AgentHarm; AgentSafetyBench; AgentSecurityBench; AgentDojo

Benchmarks

R-Judge; AgentHarm; AgentSafetyBench; AgentSecurityBench; AgentDojo

Context Entities

Models

Gemini-2.0-Flash-thinking (reasoning-optimized); GPT-o3-mini; QwQ-32B (open); Qwen family (open)

Metrics

F1, Acc (scaled 0-100); Recall prioritized for safety

Datasets

ToolEmu; AgentDojo; AgentHarm; AgentSafetyBench; AgentSecurityBench

Benchmarks

R-Judge (existing evaluator benchmark)