AgentAuditor: memory‑augmented RAG + CoT that makes LLM evaluators reach human-level accuracy on agent safety

May 31, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam

Why It Matters For Business

AgentAuditor lets teams improve automated safety/security judgments quickly without expensive fine-tuning. It lowers annotation costs, speeds evaluations, and helps catch action-level and cumulative risks that simple heuristics miss.

Summary TLDR

AgentAuditor is a training-free framework that turns an LLM into a human-like evaluator of agent behaviors. It builds a structured feature memory (scenario, risk, behavior), clusters representative cases, generates chain-of-thought (CoT) explanations for those cases, and uses multi-stage retrieval to augment judgments on new interactions. The authors also release ASSEBench: 2,293 annotated agent interactions covering 15 risk types and 29 scenarios, with strict/lenient variants. AgentAuditor improves evaluation accuracy across many LLMs and datasets (e.g., Gemini-2: F1 82.27 → 96.31 on R-Judge) and matches or approaches average single-human performance on several benchmarks. Code and data are publicly released.
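
The memory-and-retrieval flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `MemoryEntry`, `bag_of_words`, `cosine`, and `retrieve` are hypothetical names, a bag-of-words vector stands in for a real embedding model, and the two stages shown (content similarity, then reranking by matching scenario/risk/behavior features) are one simplified reading of "multi-stage retrieval".

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class MemoryEntry:
    """One stored case: raw interaction text plus structured features."""
    text: str
    scenario: str
    risk: str
    behavior: str
    label: str     # "safe" or "unsafe"; empty for an unlabeled query
    cot: str = ""  # chain-of-thought explanation for representative cases


def bag_of_words(text: str) -> Counter:
    # Assumption: token counts stand in for a learned text embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: MemoryEntry, memory: list[MemoryEntry],
             n: int = 4, k: int = 2) -> list[MemoryEntry]:
    """Stage 1: top-n by content similarity. Stage 2: rerank by feature overlap, keep top-k."""
    q_vec = bag_of_words(query.text)
    stage1 = sorted(memory, key=lambda m: cosine(q_vec, bag_of_words(m.text)),
                    reverse=True)[:n]

    def feature_overlap(m: MemoryEntry) -> int:
        return ((m.scenario == query.scenario) + (m.risk == query.risk)
                + (m.behavior == query.behavior))

    # sorted() is stable, so ties keep their stage-1 similarity order.
    return sorted(stage1, key=feature_overlap, reverse=True)[:k]
```

The retrieved entries (with their stored CoT text) would then be passed to the LLM as few-shot examples; swapping `bag_of_words` for a real embedding call is the main change needed for practical use.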

Problem Statement

Existing auto-evaluators miss risks that arise from actions, cumulative steps, subtle context, or ambiguous boundaries. Rule-based checks are brittle; vanilla LLM judges are inconsistent and biased. We need a practical, scalable system that reads agent interaction traces, reasons step-by-step, and aligns with human judgments without expensive retraining.

Main Contribution

AgentAuditor: a training-free, memory-augmented RAG + CoT pipeline that equips an LLM to emulate human expert evaluators.

ASSEBench: a 2,293‑record benchmark (4 subsets) for agent safety and security with per-case labels, metadata, and strict/lenient standards.

Empirical claim: AgentAuditor consistently raises evaluation performance across many LLMs and datasets and reaches or nears human-level accuracy in several tests.

Open release: code and dataset published to enable reproduction and follow-up work.

Key Findings

AgentAuditor gives large, consistent gains across models and datasets.

Numbers: e.g., Gemini-2 F1 on R-Judge 82.27 → 96.31 (Table 1)

AgentAuditor reaches human-level judgment on several benchmarks with strong LLMs.

Numbers: Gemini-2 + AgentAuditor: Acc 96.1% on R-Judge vs avg. human 95.7% (Table 22)

A small reasoning memory and no fine-tuning lower data and training costs.

Numbers: Reasoning memory uses ~24–73 shots; AgentAuditor needs 24 annotated records vs ShieldAgent fine-tuned on 4,000 (Table 21)

Robustness holds under label noise and targeted attacks until a threshold.

Numbers: F1 stays above baseline with up to ~33% noisy/poisoned shots; it drops below baseline at ~50% noise (Section O.1–O.2)

Results

Metric                       Value    Baseline
F1                           96.31    82.27
Accuracy                     96.10    81.21
F1                           93.17    67.25
F1                           91.59    61.79
Resource: annotated records  24       4000
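
To put the value/baseline pairs in perspective, the short calculation below reports the absolute gain in points and the share of the remaining error that was removed. It is a worked-arithmetic sketch only: `gains` is a hypothetical helper, the "error reduction" framing is an added convenience rather than a metric from the paper, and scores are assumed to be on a 0-100 scale.

```python
def gains(baseline: float, value: float) -> dict:
    """Absolute gain in points and fraction of remaining error removed (0-100 scale)."""
    if not 0.0 <= baseline < 100.0:
        raise ValueError("baseline must be in [0, 100)")
    absolute = value - baseline
    error_reduction = absolute / (100.0 - baseline)
    return {
        "absolute_points": round(absolute, 2),
        "error_reduction": round(error_reduction, 3),
    }


# (baseline, with AgentAuditor) pairs copied from the Results figures above.
rows = {
    "F1 (first row)": (82.27, 96.31),
    "Accuracy": (81.21, 96.10),
}
for name, (base, val) in rows.items():
    print(name, gains(base, val))
```

For the first F1 row this is 96.31 − 82.27 = 14.04 points, out of 17.73 points of headroom.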

What To Try In 7 Days

Run AgentAuditor on a held-out subset of your agent logs to compare F1/recall vs current rule-based checks.

Build a 20–100‑shot reasoning memory from high-quality examples (paper shows ~24 shots sufficed) and measure gains on ambiguous cases.

Use ASSEBench’s strict/lenient splits to quantify how sensitive your current evaluator is to borderline cases and tune thresholds accordingly.
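
For the first step, a small scoring helper is enough to compare two evaluators against human labels. The sketch below assumes binary labels with 1 = unsafe as the positive class; `binary_metrics` is a hypothetical helper and the label lists are made-up toy data, not results from the paper.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Precision, recall, F1, and accuracy with 1 = unsafe (positive class)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy data (invented for illustration): human labels vs two evaluators.
human = [1, 1, 1, 0, 0, 0, 1, 0]
rule_based = [1, 0, 0, 0, 0, 0, 1, 0]     # misses two unsafe cases
llm_evaluator = [1, 1, 1, 0, 1, 0, 1, 0]  # one false alarm

for name, preds in [("rules", rule_based), ("llm", llm_evaluator)]:
    print(name, binary_metrics(human, preds))
```

Reporting recall alongside F1 matters here because a missed unsafe case is usually costlier than a false alarm.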

Agent Features

Memory

  • structured feature memory (scenario, risk, behavior)
  • vectorized embeddings
  • reasoning memory of CoT traces

Planning

  • few-shot CoT prompting
  • multi-shot retrieval-augmented reasoning

Tool Use

  • tool-call sequence analysis
  • environment action reasoning

Frameworks

  • FINCH clustering
  • Nomic-Embed-Text-v1.5 embeddings
  • multi-stage RAG pipeline
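
As a rough picture of how first-neighbor clustering can pick representative cases without a preset cluster count, the sketch below implements only the first partition of a FINCH-style scheme: each point is linked to its nearest neighbor and the connected components become clusters. It is a simplified illustration, not the authors' code; the full FINCH algorithm recurses on cluster means to build a hierarchy, and the function names, the Euclidean distance, and the "closest to centroid" representative rule are assumptions.

```python
import math


def first_neighbor_clusters(points: list[list[float]]) -> list[int]:
    """Link every point to its nearest neighbor; return connected-component ids."""
    n = len(points)
    if n < 2:
        return [0] * n
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        # Nearest other point by Euclidean distance.
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: math.dist(points[i], points[j]))
        parent[find(i)] = find(nearest)

    ids: dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


def representatives(points: list[list[float]], labels: list[int]) -> dict[int, int]:
    """For each cluster, return the index of the point closest to the centroid."""
    reps = {}
    dim = len(points[0])
    for c in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        centroid = [sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)]
        reps[c] = min(members, key=lambda i: math.dist(points[i], centroid))
    return reps


# Toy 2-D "embeddings": two well-separated groups.
points = [[0, 0], [0, 1], [0.5, 0.5], [10, 10], [10, 11]]
labels = first_neighbor_clusters(points)
print(labels, representatives(points, labels))
```

In a real pipeline the points would be embedding vectors (reduced in dimension first, if desired), and the representative indices would select the cases that receive CoT annotations.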

Is Agentic

true

Architectures

  • LLM
  • RAG
  • CoT

Collaboration

  • human-in-the-loop annotation for dataset creation

Optimization Features

Token Efficiency

  • reduces need for large fine-tuning datasets (24 vs 4,000 labels)

Infra Optimization

  • helps smaller models approach larger-model performance via memory augmentation

System Optimization

  • PCA for dimensionality reduction before clustering

Training Optimization

  • training-free evaluation (no fine-tuning required)

Inference Optimization

  • few-shot CoT prompts assembled from k retrieved examples
  • multi-stage reranking to pick top-k CoT examples
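
Assembling the few-shot CoT prompt from retrieved examples can look like the sketch below. The prompt wording, the dictionary keys (`interaction`, `cot`, `label`), and the `build_cot_prompt` name are all hypothetical; the paper's actual templates may differ. The shots are assumed to arrive already ranked by the retrieval stage.

```python
def build_cot_prompt(shots: list[dict], interaction: str, k: int = 3) -> str:
    """Assemble a few-shot CoT prompt from the top-k retrieved shots (pre-ranked)."""
    parts = [
        "You are auditing an agent interaction for safety. "
        "Reason step by step, then answer 'safe' or 'unsafe'."
    ]
    for idx, shot in enumerate(shots[:k], start=1):
        parts.append(
            f"Example {idx}\n"
            f"Interaction: {shot['interaction']}\n"
            f"Reasoning: {shot['cot']}\n"
            f"Judgment: {shot['label']}"
        )
    # Leave the reasoning open so the model continues with its own CoT.
    parts.append(f"Now evaluate\nInteraction: {interaction}\nReasoning:")
    return "\n\n".join(parts)


shots = [
    {"interaction": "Agent deleted files without confirmation.",
     "cot": "Deletion is irreversible and the user never approved it.",
     "label": "unsafe"},
    {"interaction": "Agent listed files in the home directory.",
     "cot": "Read-only action requested by the user.",
     "label": "safe"},
]
print(build_cot_prompt(shots, "Agent emailed the user's password to a third party.", k=2))
```

Keeping `k` small limits prompt length, which is the main lever on the extra inference cost that CoT introduces.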

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Performance depends on quality of memory shot labels and CoT correctness.
  • CoT increases inference cost and latency relative to single-shot prompts.
  • Several hyperparameters are heuristic and tuned empirically; universality not fully proven.
  • ASSEBench and experiments are English-only and use binary safe/unsafe labels.

When Not To Use

  • When you cannot afford higher inference latency or CoT cost.
  • When you lack a small set of high-quality annotated examples to seed reasoning memory.
  • When you need multilingual coverage not supported by current memory and datasets.
  • When a multi-class or graded safety taxonomy (not binary) is required without further extensions.

Failure Modes

  • Corrupted or poisoned reasoning shots can mislead final judgments (white-box attacks on the memory do affect results).
  • If more than roughly 50% of reasoning-memory labels are wrong, performance can drop below baseline.
  • Over-analysis by CoT can increase false positives when used without representative examples.
  • Poor embedding or retrieval quality yields weaker gains (random-shot experiments show lower improvement).
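
One way to probe these failure modes on your own reasoning memory is to flip a controlled fraction of shot labels and re-run the evaluator at each noise level. The helper below is a minimal sketch of that stress test, assuming binary "safe"/"unsafe" labels stored under a `label` key; `poison_labels` is a hypothetical name and this is not the paper's attack procedure.

```python
import random


def poison_labels(shots: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Return a copy of shots with round(rate * n) binary labels flipped."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # seeded for repeatable experiments
    n_flip = round(rate * len(shots))
    flip_idx = set(rng.sample(range(len(shots)), n_flip))
    poisoned = []
    for i, shot in enumerate(shots):
        new_shot = dict(shot)  # shallow copy so the original memory is untouched
        if i in flip_idx:
            new_shot["label"] = "safe" if shot["label"] == "unsafe" else "unsafe"
        poisoned.append(new_shot)
    return poisoned


# Toy memory of 24 shots with alternating labels (invented for illustration).
memory = [{"id": i, "label": "unsafe" if i % 2 else "safe"} for i in range(24)]
for rate in (0.0, 0.1, 0.33, 0.5):
    noisy = poison_labels(memory, rate)
    changed = sum(a["label"] != b["label"] for a, b in zip(memory, noisy))
    print(f"rate={rate:.2f} flipped={changed}")
```

Plotting F1 against the noise rate shows where your own setup crosses below the no-memory baseline.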

Core Entities

Models

  • Gemini-2.0-Flash-thinking
  • GPT-4o
  • GPT-4.1
  • Claude-3.5-sonnet
  • Deepseek-v3
  • QwQ-32B
  • Qwen-2.5-32B
  • Qwen-2.5-7B
  • Llama-3.1-8B
  • ShieldAgent (fine-tuned Qwen-2.5-7B)
  • Llama-Guard-3

Metrics

  • F1
  • Accuracy
  • Recall
  • Precision
  • ASR
  • Refusal Rate

Datasets

  • ASSEBench-Security
  • ASSEBench-Safety
  • ASSEBench-Strict
  • ASSEBench-Lenient
  • R-Judge
  • AgentHarm
  • AgentSafetyBench
  • AgentSecurityBench
  • AgentDojo

Benchmarks

  • R-Judge
  • AgentHarm
  • AgentSafetyBench
  • AgentSecurityBench
  • AgentDojo

Context Entities

Models

  • Gemini-2.0-Flash-thinking (reasoning-optimized)
  • GPT-o3-mini
  • QwQ-32B (open)
  • Qwen family (open)

Metrics

  • F1, Acc (scaled 0-100)
  • Recall prioritized for safety

Datasets

  • ToolEmu
  • AgentDojo
  • AgentHarm
  • AgentSafetyBench
  • AgentSecurityBench

Benchmarks

  • R-Judge (existing evaluator benchmark)