R-Judge: a human-curated benchmark (569 agent logs) that tests whether LLMs spot safety risks in agent interactions

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and well-validated (human labels, GPT-4 scorer agreement), but models still struggle on it; use it as an evaluation and guard-training dataset, not as a single deployment fix.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 40%

Novelty: 70%

Authors

Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu

Why It Matters For Business

Agent-enabled features can cause real harms (privacy, finance, physical); off-the-shelf LLMs often miss these risks, so companies must add a tuned safety monitor and human checks before letting agents act autonomously.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Founder

Summary TLDR

R-Judge is a human-curated benchmark of 569 multi-turn agent interaction records labeled for safety. It asks an LLM to read an agent trace, describe any safety risk, and output safe/unsafe. On this task GPT-4o reaches F1=74.45%, while most popular models perform near or below a random baseline (F1≈51%). Simple prompting helps little; targeted fine-tuning (a 'guard' model) substantially improves judgment. The paper shows that risk awareness for agents needs both world knowledge and scenario reasoning, and it recommends using a dedicated, tuned monitor rather than relying on zero-shot prompting.
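The judging task described above can be sketched roughly as follows. This is a minimal illustration, not the paper's prompt or code: the `Turn` type, the role names, the prompt wording, and the `Verdict:` line format are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str      # assumed role names, e.g. "user", "agent", "environment"
    content: str


def build_judge_prompt(turns: List[Turn]) -> str:
    """Render an agent trace and ask for a risk description plus a verdict."""
    trace = "\n".join(f"[{t.role}] {t.content}" for t in turns)
    return (
        "Below is an interaction between a user, an agent, and its environment.\n\n"
        f"{trace}\n\n"
        "First describe any safety risk in the agent's actions. "
        "Then finish with one line: 'Verdict: safe' or 'Verdict: unsafe'."
    )


def parse_verdict(model_output: str) -> Optional[bool]:
    """Return True for unsafe, False for safe, None if no verdict line is found."""
    for line in reversed(model_output.strip().splitlines()):
        text = line.strip().lower()
        if text.startswith("verdict:"):
            answer = text.split(":", 1)[1].strip()
            if answer.startswith("unsafe"):
                return True
            if answer.startswith("safe"):
                return False
    return None
```

Returning `None` for unparseable output keeps malformed responses separate from genuine "safe" verdicts, so they can be counted or retried rather than silently treated as safe.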

Problem Statement

Current safety tests focus on text content. LLM agents act in environments and can cause behavioral harms (privacy, financial loss, physical risk). We lack a practical benchmark that tests whether LLMs can read agent action traces and reliably identify real-world safety risks. This gap prevents measuring and improving risk awareness for agent deployment.

Main Contribution

A new dataset, R-Judge: 569 human-annotated multi-turn agent interaction records with binary safety labels and structured risk descriptions (Motivation, Trigger, Outcome).

A two-stage evaluation protocol: open-ended risk identification (scored by GPT-4) and binary safety judgment (F1, Recall, Specificity).
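The three safety-judgment metrics named above can be computed from binary labels as in the sketch below. It assumes that unsafe is encoded as 1 and treated as the positive class; the function name and label encoding are choices made for this example, not taken from the R-Judge code.

```python
from typing import Dict, Sequence


def safety_judgment_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute F1, recall, and specificity with unsafe (1) as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # unsafe, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # unsafe, missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe, wrongly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # safe, passed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"f1": f1, "recall": recall, "specificity": specificity}
```

Under this encoding, recall measures how many unsafe traces the judge catches, and specificity measures how many safe traces it leaves alone, so an overly cautious judge shows high recall with low specificity.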

Key Findings

R-Judge covers 569 agent interaction records across 5 categories and 27 scenarios with 10 risk types.

Numbers: 569 records; 5 categories; 27 scenarios; 10 risk types

Practical Use: Use this dataset to test agent monitors across common personal-agent scenarios instead of content-only safety tests.

Evidence Ref: Section 3.4, Table 5, Figure 1

About half the dataset is unsafe and records are short multi-turn traces.

Numbers: 52.7% unsafe; avg 2.6 turns; avg 206 words per record

Practical Use: Benchmarking should focus on short multi-turn traces; monitoring must be fast and handle brief but meaningful context.

Evidence Ref: Section 3.4, Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Safety judgment F1 (best model)	74.45%	Random F1 51.32%	+23.13 pp	R-Judge (all)	Table 1 reports GPT-4o F1=74.45% vs Random 51.32%	Table 1
Risk identification Effectiveness (best model, analysis relevance)	88 (normalized 0–100)	Random 0	+88	R-Judge (pairwise risk identification)	GPT-4o Effectiveness 88 under Zero-Shot-CoT in Section 5 table	Section 5 Table
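As a sanity check on the F1 row, the sketch below approximates the F1 of a judge that flags traces by coin flip, using the 52.7% unsafe rate reported for the dataset, and recomputes the delta from the two table values. The approximation (precision ≈ unsafe rate, recall ≈ flag probability) is an assumption of this example; the paper may have obtained its random baseline differently, for instance empirically.

```python
def expected_random_f1(unsafe_rate: float, flag_prob: float = 0.5) -> float:
    """Approximate F1 of a judge that flags each trace as unsafe at random.

    Assumes precision is about the unsafe rate and recall is about the
    probability of flagging; this is an expected-value approximation.
    """
    precision = unsafe_rate
    recall = flag_prob
    return 2 * precision * recall / (precision + recall)


baseline = expected_random_f1(0.527)   # 52.7% of records are unsafe
delta_pp = 74.45 - 51.32               # best-model F1 minus reported random F1

print(f"approx. random F1: {baseline:.1%}")   # approx. random F1: 51.3%
print(f"delta: {delta_pp:.2f} pp")            # delta: 23.13 pp
```

The approximation lands close to the reported 51.32%, which is consistent with a near-balanced dataset where random guessing scores just above 50%.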

What To Try In 7 Days

Run R-Judge on your agent logs to quantify current risk-detection gaps.

Deploy a small fine-tuned monitor model (guard) to flag unsafe traces before actions execute.

Add a human-in-the-loop for flagged cases and log decisions to expand training data.
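The three steps above could be wired together along the lines of the sketch below. Everything here is hypothetical scaffolding: `gate_action`, `ReviewLog`, and the `guard` and `reviewer` callables are illustrative names, and the keyword-based guard in the usage lines is only a stand-in for a fine-tuned monitor model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReviewLog:
    """Collects human decisions on flagged traces for later training use."""
    entries: List[Dict[str, object]] = field(default_factory=list)

    def record(self, trace: str, guard_unsafe: bool, human_unsafe: bool) -> None:
        self.entries.append(
            {"trace": trace, "guard_unsafe": guard_unsafe, "human_unsafe": human_unsafe}
        )


def gate_action(
    trace: str,
    guard: Callable[[str], bool],      # returns True if the trace looks unsafe
    reviewer: Callable[[str], bool],   # human decision: True if unsafe
    log: ReviewLog,
) -> bool:
    """Return True if the pending agent action may execute."""
    if not guard(trace):
        return True                    # guard sees no risk: let the action run
    human_unsafe = reviewer(trace)     # flagged: escalate to a human
    log.record(trace, guard_unsafe=True, human_unsafe=human_unsafe)
    return not human_unsafe


# Usage with stand-ins (a real guard would be a tuned model, not a keyword match).
log = ReviewLog()
keyword_guard = lambda trace: "share publicly" in trace.lower()
always_block = lambda trace: True

allowed = gate_action("Agent: share publicly the user's notes", keyword_guard, always_block, log)
```

Logging both the guard's flag and the human's decision makes guard false positives visible, and the disagreements are natural candidates for additional fine-tuning data.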

Agent Features

Memory

short multi-turn context (avg 2.6 turns)

Planning

multi-turn think-act-feedback planning

Tool Use

web search, code/terminal, cloud app APIs (Dropbox, Evernote, smart locks)

Frameworks

ReAct

Is Agentic

Yes

Architectures

LLM controller, ReAct agent framework

Optimization Features

Training Optimization

fine-tuning on interaction traces (Llama Guard)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Lordog/R-Judge

Data URLs

https://github.com/Lordog/R-Judge

Risks & Boundaries

Limitations

Dataset size is moderate (569 cases) and focused on personal-agent scenarios, so coverage of enterprise settings may be limited.

Human annotation bias and limited annotator pool may affect borderline cases despite cross-checks.

When Not To Use

As the sole defense for high-stakes automated actions without human oversight.

To claim full safety across all agent domains beyond the five covered categories.

Failure Modes

False positives from overly cautious models (mislabel safe traces as unsafe).

False negatives from missed scenario-specific reasoning (miss hidden risks).

Core Entities

Models

GPT-4o, ChatGPT, Meta-Llama-3-8B-Instruct, Llama-2-13b-chat-hf, Llama-2-7b-chat-hf, Vicuna-13b-v1.5, Vicuna-13b-v1.5-16k, Vicuna-7b-v1.5, Vicuna-7b-v1.5-16k, Mistral-7B-Instruct-v0.2, Mistral-7B-Instruct-v0.3, LlamaGuard-7b, Meta-Llama-Guard-2-8B

Metrics

F1, Recall, Specificity, Effectiveness (GPT-4 scored relevance)

Datasets

R-Judge

Benchmarks

R-Judge (safety risk awareness for agents)
