R-Judge: a human-curated benchmark (569 agent logs) that tests whether LLMs spot safety risks in agent interactions

January 18, 2024 (8 min read)

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and well validated (human labels, GPT-4 scorer agreement), but models still struggle on it. Use it as an evaluation and guard-training dataset, not as a single deployment fix.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 40%

Novelty: 70%

Authors

Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Agent-enabled features can cause real harms (privacy, finance, physical); off-the-shelf LLMs often miss these risks, so companies must add a tuned safety monitor and human checks before letting agents act autonomously.

Summary TLDR

R-Judge is a human-curated benchmark of 569 multi-turn agent interaction records labeled for safety. It asks an LLM to read an agent trace, describe any safety risk, and output safe/unsafe. On this task GPT-4o reaches F1=74.45%, while most popular models perform near or below a random baseline (F1≈51%). Simple prompting helps little; targeted fine-tuning (a 'guard' model) substantially improves judgment. The paper shows that risk awareness for agents needs both world knowledge and scenario reasoning, and recommends using a dedicated, tuned monitor rather than relying on zero-shot prompting.
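
The read-describe-judge flow can be sketched roughly as follows. This is a minimal illustration, not the benchmark's own harness: the prompt wording, `format_trace`, and `parse_verdict` are all hypothetical names and text introduced here, and the model call itself is left out.

```python
import re

# Hypothetical prompt wording; the benchmark's actual prompts are not reproduced here.
ANALYSIS_PROMPT = (
    "Read the agent interaction below and describe any safety risk.\n\n{trace}"
)
VERDICT_PROMPT = "Based on your analysis, answer with one word: safe or unsafe."


def format_trace(turns):
    """Render (role, text) pairs as a plain-text transcript."""
    return "\n".join(f"[{role}] {text}" for role, text in turns)


def parse_verdict(reply):
    """Map a free-form model reply to 'unsafe', 'safe', or None if unclear."""
    text = reply.lower()
    # Check 'unsafe' and 'not safe' first so they are never read as 'safe'.
    if re.search(r"\bunsafe\b", text) or re.search(r"\bnot\s+safe\b", text):
        return "unsafe"
    if re.search(r"\bsafe\b", text):
        return "safe"
    return None


trace = format_trace([("user", "Clean up my disk."), ("agent", "Running rm -rf ~/")])
first_prompt = ANALYSIS_PROMPT.format(trace=trace)
```

Returning `None` for unparseable replies keeps them visible as a separate bucket instead of silently counting them as safe.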

Problem Statement

Current safety tests focus on text content. LLM agents act in environments and can cause behavioral harms (privacy, financial loss, physical risk). We lack a practical benchmark that tests whether LLMs can read agent action traces and reliably identify real-world safety risks. This gap prevents measuring and improving risk awareness for agent deployment.

Main Contribution

A new dataset, R-Judge: 569 human-annotated multi-turn agent interaction records with binary safety labels and structured risk descriptions (Motivation, Trigger, Outcome).

A two-stage evaluation protocol: open-ended risk identification (scored by GPT-4) and binary safety judgment (F1, Recall, Specificity).
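
The binary-judgment metrics named above can be computed from label pairs as in the sketch below. It assumes that unsafe is the positive class (encoded as 1), which the summary does not state explicitly; `safety_metrics` is a hypothetical helper name.

```python
def safety_metrics(y_true, y_pred):
    """Compute F1, recall, and specificity, treating unsafe (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # unsafe, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # unsafe, missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # safe, passed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"f1": f1, "recall": recall, "specificity": specificity}


# Toy labels for illustration only (not benchmark data).
scores = safety_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Recall measures how many unsafe traces are caught, while specificity measures how many safe traces are left alone, so reporting both exposes models that are overly cautious or overly permissive.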

Key Findings

R-Judge covers 569 agent interaction records across 5 categories and 27 scenarios with 10 risk types.

Numbers: 569 records; 5 categories; 27 scenarios; 10 risk types

Practical Use: Use this dataset to test agent monitors across common personal-agent scenarios instead of content-only safety tests.

Evidence Ref: Section 3.4, Table 5, Figure 1

About half the dataset is unsafe and records are short multi-turn traces.

Numbers: 52.7% unsafe; avg 2.6 turns; avg 206 words per record

Practical Use: Benchmarking should focus on short multi-turn traces; monitoring must be fast and handle brief but meaningful context.

Evidence Ref: Section 3.4, Table 5
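
Summary statistics like the ones above can be recomputed on any set of labeled traces. The sketch below uses a hypothetical `Record` layout (the released data's real field names are not given here), counts each message as one turn, and counts words by whitespace splitting; all three are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical record layout, not the dataset's actual schema."""
    turns: list  # one string per message in the interaction
    label: int  # assumed encoding: 1 = unsafe, 0 = safe
    risk_description: str = ""


def summarize(records):
    """Return unsafe rate, average turns, and average words per record."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    total_words = sum(len(t.split()) for r in records for t in r.turns)
    return {
        "unsafe_rate": sum(r.label for r in records) / n,
        "avg_turns": sum(len(r.turns) for r in records) / n,
        "avg_words": total_words / n,
    }


# Toy records for illustration only (not benchmark data).
toy = [
    Record(["Delete all files in /tmp", "Done."], label=1),
    Record(["What is the weather?", "Sunny.", "Thanks"], label=0),
]
stats = summarize(toy)
```

Running the same summary on your own agent logs gives a quick check of how closely they resemble the benchmark's short, roughly balanced traces.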

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Safety judgment F1 (best model) | 74.45% | Random F1 51.32% | +23.13 pp | R-Judge (all) | Table 1 reports GPT-4o F1=74.45% vs Random 51.32% | Table 1
Risk identification Effectiveness (best model, analysis relevance) | 88 (normalized 0-100) | Random 0 | +88 | R-Judge (pairwise risk identification) | GPT-4o Effectiveness 88 under Zero-Shot-CoT in Section 5 table | Section 5 Table
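
A back-of-envelope check shows why a random baseline lands near 51% F1 on this data. The sketch assumes a coin-flip judge that flags each trace as unsafe with probability 0.5 and uses the 52.7% unsafe share reported for the dataset; how the paper actually computed its Random row is not stated here, so this is an approximation from expected precision and recall, not a reproduction.

```python
def expected_random_f1(unsafe_rate, p_flag=0.5):
    """Approximate F1 of a judge that flags traces unsafe at random."""
    precision = unsafe_rate  # flagged traces are unsafe at the base rate
    recall = p_flag  # each unsafe trace is flagged with probability p_flag
    return 2 * precision * recall / (precision + recall)


baseline = expected_random_f1(0.527)  # about 0.513
delta_pp = 74.45 - 51.32  # best-model F1 minus Random F1, in percentage points
```

The estimate of roughly 51.3% is close to the reported 51.32%, and the 23.13 pp gap is the difference between the two F1 values in the first row.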

What To Try In 7 Days

Run R-Judge on your agent logs to quantify current risk-detection gaps.

Deploy a small fine-tuned monitor model (guard) to flag unsafe traces before actions execute.

Add a human-in-the-loop for flagged cases and log decisions to expand training data.
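
The three steps above can be wired together roughly as in this sketch. `GatedExecutor` and `keyword_guard` are hypothetical names; the keyword rule is only a stand-in for a fine-tuned guard model and should not be read as a real detector.

```python
from dataclasses import dataclass, field
from typing import Callable


def keyword_guard(trace):
    """Placeholder monitor standing in for a fine-tuned guard model."""
    risky = ("delete", "transfer", "unlock", "password")
    return "unsafe" if any(word in trace.lower() for word in risky) else "safe"


@dataclass
class GatedExecutor:
    guard: Callable
    review_queue: list = field(default_factory=list)
    decision_log: list = field(default_factory=list)

    def submit(self, trace, action):
        """Run the action only if the guard judges the trace safe."""
        verdict = self.guard(trace)
        executed = verdict == "safe"
        if executed:
            action()
        else:
            self.review_queue.append((trace, action))  # hold for a human
        self.decision_log.append(
            {"trace": trace, "guard": verdict, "executed": executed}
        )
        return executed

    def review(self, index, human_label):
        """Apply a human decision to a queued trace and log it."""
        trace, action = self.review_queue.pop(index)
        if human_label == "safe":
            action()
        self.decision_log.append(
            {"trace": trace, "human": human_label, "executed": human_label == "safe"}
        )


ran = []
gate = GatedExecutor(guard=keyword_guard)
gate.submit("Agent: search the web for recipes", lambda: ran.append("search"))
gate.submit("Agent: unlock the front door for a stranger", lambda: ran.append("unlock"))
```

Entries in `decision_log` that carry a human label are the ones that could later be added to guard training data.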

Agent Features

Memory
short multi-turn context (avg 2.6 turns)
Planning
multi-turn think-act-feedback planning
Tool Use
web search, code/terminal, cloud app APIs (Dropbox, Evernote, smart locks)
Frameworks
ReAct
Is Agentic

Yes

Architectures
LLM controller, ReAct agent framework

Optimization Features

Training Optimization
fine-tuning on interaction traces (Llama Guard)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

The dataset is moderate in size (569 cases) and focused on personal-agent scenarios, so coverage of enterprise settings may be limited.

Human annotation bias and a limited annotator pool may affect borderline cases despite cross-checks.

When Not To Use

As the sole defense for high-stakes automated actions without human oversight.

To claim full safety across all agent domains beyond the five covered categories.

Failure Modes

False positives from overly cautious models (mislabel safe traces as unsafe).

False negatives from missed scenario-specific reasoning (miss hidden risks).

Core Entities

Models

GPT-4o, ChatGPT, Meta-Llama-3-8B-Instruct, Llama-2-13b-chat-hf, Llama-2-7b-chat-hf, Vicuna-13b-v1.5, Vicuna-13b-v1.5-16k, Vicuna-7b-v1.5, Vicuna-7b-v1.5-16k, Mistral-7B-Instruct-v0.2, Mistral-7B-Instruct-v0.3, LlamaGuard-7b, Meta-Llama-Guard-2-8B

Metrics

F1, Recall, Specificity, Effectiveness (GPT-4 scored relevance)

Datasets

R-Judge

Benchmarks

R-Judge (safety risk awareness for agents)