AgentAuditor: memory‑augmented RAG + CoT that makes LLM evaluators reach human-level accuracy on agent safety

Overview

Decision Snapshot: Needs Validation

The design is practical: it improves many off-the-shelf LLMs, is training-free, and is backed by multi-benchmark experiments and human evaluation; main caveats are label quality and compute for CoT.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AgentAuditor lets teams improve automated safety/security judgments quickly without expensive fine-tuning. It lowers annotation costs, speeds evaluations, and helps catch action-level and cumulative risks that simple heuristics miss.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

AgentAuditor is a training-free framework that turns an LLM into a human-like evaluator of agent behaviors. It builds a structured feature memory (scenario, risk, behavior), clusters the memory to select representative cases, generates chain-of-thought (CoT) explanations for those cases, and uses multi-stage retrieval to augment judgments on new interactions. The authors also release ASSEBench: 2,293 annotated agent interactions covering 15 risk types and 29 scenarios, with strict/lenient variants. AgentAuditor improves evaluation accuracy across many LLMs and datasets (e.g., Gemini-2: F1 82.27 → 96.31 on R-Judge) and matches or approaches average single-human performance on several benchmarks. Code and data are publicly available.
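The multi-stage retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes stage one filters the memory by similarity of whole-interaction embeddings and stage two reranks the survivors by a weighted sum of per-feature similarities. The `Shot` class, the `retrieve` function, the equal weights, and the values of `n` and `k` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Shot:
    """One reasoning-memory entry (field names are illustrative)."""
    shot_id: str
    content_vec: list[float]               # embedding of the full interaction
    feature_vecs: dict[str, list[float]]   # embeddings for scenario / risk / behavior
    cot: str                               # stored chain-of-thought text
    label: str                             # "safe" or "unsafe"


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_content, query_features, memory, n=3, k=2, weights=None):
    """Two-stage retrieval: coarse content filter, then feature-based rerank."""
    # Assumed equal weights; a real system would tune these.
    weights = weights or {"scenario": 1.0, "risk": 1.0, "behavior": 1.0}

    # Stage 1: keep the n shots whose content embedding is closest to the query.
    stage1 = sorted(
        memory,
        key=lambda s: cosine(query_content, s.content_vec),
        reverse=True,
    )[:n]

    # Stage 2: rerank survivors by weighted per-feature similarity, keep top k.
    def feature_score(shot):
        return sum(
            w * cosine(query_features[name], shot.feature_vecs[name])
            for name, w in weights.items()
        )

    return sorted(stage1, key=feature_score, reverse=True)[:k]
```

The `cot` and `label` fields of the returned shots are what would be placed into the few-shot prompt for the new interaction.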

Problem Statement

Existing auto-evaluators miss risks that arise from actions, cumulative steps, subtle context, or ambiguous boundaries. Rule-based checks are brittle; vanilla LLM judges are inconsistent and biased. We need a practical, scalable system that reads agent interaction traces, reasons step-by-step, and aligns with human judgments without expensive retraining.

Main Contribution

AgentAuditor: a training-free, memory-augmented RAG + CoT pipeline that equips an LLM to emulate human expert evaluators.

ASSEBench: a 2,293‑record benchmark (4 subsets) for agent safety and security with per-case labels, metadata, and strict/lenient standards.

Key Findings

AgentAuditor gives large, consistent gains across models and datasets.

Numbers: e.g., Gemini-2 F1 on R-Judge, 82.27 → 96.31 (Table 1)

Practical Use: If you add AgentAuditor to an existing LLM evaluator, you can often raise binary safe/unsafe accuracy and F1 substantially without fine-tuning.

Evidence Ref: Table 1, Section 5.2

AgentAuditor reaches human-level judgment on several benchmarks with strong LLMs.

Numbers: Gemini-2 + AgentAuditor: Acc 96.1% on R-Judge vs avg. human 95.7% (Table 22)

Practical Use: Using a strong base LLM plus AgentAuditor can replace or augment a human-in-the-loop for many judgment tasks, cutting annotation bottlenecks.

Evidence Ref: Section 5.2, Table 22

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
F1	96.31	82.27	↑14.04 (absolute)	R-Judge (Gemini-2 + AgentAuditor)	Table 1, Section 5.2	Table 1
Accuracy	96.10	81.21	↑14.89 (absolute)	R-Judge (Gemini-2 + AgentAuditor)	Table 1, Section 5.2	Table 1

What To Try In 7 Days

Run AgentAuditor on a held-out subset of your agent logs to compare F1/recall vs current rule-based checks.

Build a 20–100‑shot reasoning memory from high-quality examples (paper shows ~24 shots sufficed) and measure gains on ambiguous cases.

Use ASSEBench’s strict/lenient splits to quantify how sensitive your current evaluator is to borderline cases and tune thresholds accordingly.
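For the first step above, a comparison harness can be very small. The sketch below computes precision, recall, F1, and accuracy for two evaluators against the same gold labels, treating "unsafe" as the positive class. The function name `binary_metrics` and the toy label lists are illustrative assumptions, not data from the paper.

```python
def binary_metrics(gold, pred, positive="unsafe"):
    """Precision, recall, F1, and accuracy with `positive` as the target class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy held-out subset (made-up labels, for illustration only).
gold = ["unsafe", "unsafe", "unsafe", "safe", "safe", "safe"]
rule_based = ["unsafe", "safe", "safe", "safe", "safe", "unsafe"]
llm_judge = ["unsafe", "unsafe", "safe", "safe", "safe", "safe"]

rule_scores = binary_metrics(gold, rule_based)
judge_scores = binary_metrics(gold, llm_judge)
```

Because missed unsafe cases are usually the costly error in safety evaluation, recall deserves at least as much attention as F1 when reading the two score dictionaries side by side.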

Agent Features

Memory

structured feature memory (scenario, risk, behavior); vectorized embeddings; reasoning memory of CoT traces

Planning

few-shot CoT prompting; multi-shot retrieval-augmented reasoning

Tool Use

tool-call sequence analysis; environment action reasoning

Frameworks

FINCH clustering; Nomic-Embed-Text-v1.5 embeddings; multi-stage RAG pipeline
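To give a feel for how first-neighbor clustering can pick representative cases, here is a simplified sketch. It computes only the first partition in the style of FINCH (each point is linked to its nearest neighbor and connected components become clusters); the full algorithm builds a hierarchy by repeating this step, and the PCA step is omitted. The function names and the centroid-based choice of representative are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def first_neighbor_clusters(vectors):
    """Link every point to its nearest neighbor; components are clusters."""
    n = len(vectors)
    if n < 2:
        return [[i] for i in range(n)]

    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Nearest neighbor of point i, excluding the point itself.
        nn = max((j for j in range(n) if j != i),
                 key=lambda j: cosine(vectors[i], vectors[j]))
        parent[find(i)] = find(nn)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def representative(vectors, members):
    """Pick the member most similar to the cluster centroid (assumed rule)."""
    dim = len(vectors[0])
    centroid = [sum(vectors[i][d] for i in members) / len(members)
                for d in range(dim)]
    return max(members, key=lambda i: cosine(vectors[i], centroid))
```

One attraction of this style of clustering is that the number of clusters does not have to be chosen in advance, which suits a memory whose size and diversity are not known up front.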

Is Agentic

Yes

Architectures

LLM; RAG; CoT

Collaboration

human-in-the-loop annotation for dataset creation

Optimization Features

Token Efficiency

reduces need for large fine-tuning datasets (24 vs 4,000 labels)

Infra Optimization

supports smaller models to reach larger-model performance via memory augmentation

System Optimization

PCA for dimensionality reduction before clustering

Training Optimization

training-free evaluation (no fine-tuning required)

Inference Optimization

few-shot CoT prompts assembled from k retrieved examples; multi-stage reranking to pick top-k CoT examples
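Assembling a few-shot CoT prompt from the k retrieved examples might look like the sketch below. The template wording, the dictionary keys (`interaction`, `cot`, `label`), and the `build_prompt` function are hypothetical; the paper's actual prompt format may differ.

```python
DEFAULT_INSTRUCTION = (
    "You are evaluating an agent interaction for safety. "
    "Reason step by step, then answer 'safe' or 'unsafe'."
)


def build_prompt(query, shots, instruction=DEFAULT_INSTRUCTION):
    """Concatenate the instruction, k worked examples, and the new case."""
    parts = [instruction]
    for i, shot in enumerate(shots, start=1):
        parts.append(
            f"Example {i}\n"
            f"Interaction: {shot['interaction']}\n"
            f"Reasoning: {shot['cot']}\n"
            f"Verdict: {shot['label']}"
        )
    # The prompt ends with an open "Reasoning:" slot for the model to fill.
    parts.append(f"Now evaluate\nInteraction: {query}\nReasoning:")
    return "\n\n".join(parts)


# Made-up examples, for illustration only.
shots = [
    {"interaction": "Agent deletes files without confirmation.",
     "cot": "The action is irreversible and was not authorized.",
     "label": "unsafe"},
    {"interaction": "Agent asks the user before sending an email.",
     "cot": "The agent sought consent before acting.",
     "label": "safe"},
]
prompt = build_prompt("Agent transfers funds to an unknown account.", shots)
```

Prompt length grows linearly with k, which is the source of the extra inference cost and latency that CoT brings compared with a single-shot prompt.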

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Astarojth/AgentAuditor-ASSEBench

Data URLs

https://github.com/Astarojth/AgentAuditor-ASSEBench

Risks & Boundaries

Limitations

Performance depends on quality of memory shot labels and CoT correctness.

CoT increases inference cost and latency relative to single-shot prompts.

When Not To Use

When you cannot afford higher inference latency or CoT cost.

When you lack a small set of high-quality annotated examples to seed reasoning memory.

Failure Modes

Corrupted or poisoned reasoning shots can mislead final judgments; white-box attacks on the reasoning memory do affect results.

If >~50% of reasoning-memory labels are wrong, performance can drop below baseline.
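Sensitivity to label noise can be checked on your own memory before deployment. The sketch below flips a chosen fraction of reasoning-memory labels so the evaluator can be rerun at several noise levels. The `flip_labels` helper and the `label` key are illustrative assumptions; the evaluation step itself is left out because it depends on your setup.

```python
import random


def flip_labels(shots, fraction, seed=0):
    """Return a copy of `shots` with about `fraction` of the labels inverted."""
    rng = random.Random(seed)  # fixed seed so noise levels are reproducible
    n_flip = round(fraction * len(shots))
    flip_idx = set(rng.sample(range(len(shots)), n_flip))

    noisy = []
    for i, shot in enumerate(shots):
        label = shot["label"]
        if i in flip_idx:
            label = "safe" if label == "unsafe" else "unsafe"
        # Build a new dict so the original memory is left untouched.
        noisy.append({**shot, "label": label})
    return noisy


# Toy memory of ten shots with alternating labels (made-up data).
memory = [{"id": i, "label": "unsafe" if i % 2 else "safe"} for i in range(10)]
half_noisy = flip_labels(memory, 0.5)
```

Running the evaluator on memories produced with fractions such as 0.0, 0.1, 0.3, and 0.5 shows where accuracy on your own logs starts to fall toward the no-memory baseline.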

Core Entities

Models

Gemini-2.0-Flash-thinking; GPT-4o; GPT-4.1; Claude-3.5-sonnet; Deepseek-v3; QwQ-32B; Qwen-2.5-32B; Qwen-2.5-7B; Llama-3.1-8B; ShieldAgent (fine-tuned Qwen-2.5-7B); Llama-Guard-3

Metrics

F1; Accuracy; Recall; Precision; ASR; Refusal Rate

Datasets

ASSEBench-Security; ASSEBench-Safety; ASSEBench-Strict; ASSEBench-Lenient; R-Judge; AgentHarm; AgentSafetyBench; AgentSecurityBench; AgentDojo

Benchmarks

R-Judge; AgentHarm; AgentSafetyBench; AgentSecurityBench; AgentDojo

Context Entities

Models

Gemini-2.0-Flash-thinking (reasoning-optimized); GPT-o3-mini; QwQ-32B (open); Qwen family (open)

Metrics

F1, Acc (scaled 0-100); Recall prioritized for safety

Datasets

ToolEmu; AgentDojo; AgentHarm; AgentSafetyBench; AgentSecurityBench

Benchmarks

R-Judge (existing evaluator benchmark)
