Overview
The benchmark is practical and well validated (human labels, GPT-4 scorer agreement), but most models still struggle on it; use it as an evaluation and guard-training dataset, not as a standalone deployment fix.
Citations: 2
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 45%
Production readiness: 40%
Novelty: 70%
Why It Matters For Business
Agent-enabled features can cause real harms (privacy, finance, physical); off-the-shelf LLMs often miss these risks, so companies must add a tuned safety monitor and human checks before letting agents act autonomously.
Who Should Care
Summary TLDR
R-Judge is a human-curated benchmark of 569 multi-turn agent interaction records labeled for safety. It asks an LLM to read an agent trace, describe any safety risk, and output a safe/unsafe judgment. On this task GPT-4o reaches F1 = 74.45%, while most popular models perform near or below a random baseline (F1 ≈ 51%). Simple prompting helps little; targeted fine-tuning (a 'guard' model) substantially improves judgment. The paper shows that risk awareness for agents needs both world knowledge and scenario reasoning, and it recommends using a dedicated, tuned monitor rather than relying on zero-shot prompting.
Problem Statement
Current safety tests focus on text content. LLM agents act in environments and can cause behavioral harms (privacy, financial loss, physical risk). We lack a practical benchmark that tests whether LLMs can read agent action traces and reliably identify real-world safety risks. This gap prevents measuring and improving risk awareness for agent deployment.
Main Contribution
A new dataset, R-Judge: 569 human-annotated multi-turn agent interaction records with binary safety labels and structured risk descriptions (Motivation, Trigger, Outcome).
A two-stage evaluation protocol: open-ended risk identification (scored by GPT-4) and binary safety judgment (F1, Recall, Specificity).
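The binary safety-judgment metrics named above can be computed from confusion-matrix counts. The following is a minimal sketch, assuming "unsafe" is treated as the positive class and encoded as 1; the function name and label encoding are illustrative and are not taken from the R-Judge code.

```python
from typing import Dict, Sequence


def safety_judgment_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute F1, recall, and specificity for binary safety labels.

    Assumption: 1 = unsafe (positive class), 0 = safe (negative class).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # share of unsafe traces caught
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # share of safe traces left alone
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"f1": f1, "recall": recall, "specificity": specificity}


if __name__ == "__main__":
    gold = [1, 1, 1, 0, 0, 0]
    pred = [1, 1, 0, 0, 0, 1]
    print(safety_judgment_metrics(gold, pred))
```

Reporting recall and specificity alongside F1 separates the two ways a judge can fail: missing unsafe traces lowers recall, while flagging safe traces lowers specificity.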
Key Findings
R-Judge covers 569 agent interaction records across 5 categories and 27 scenarios with 10 risk types.
About half of the records are labeled unsafe, and each record is a short multi-turn trace.
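To make the record structure concrete, one way to represent a labeled trace in memory is sketched below. All class and field names are hypothetical and do not reflect the dataset's actual schema; treating the risk description as optional is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RiskDescription:
    """Structured risk description (field names are illustrative)."""
    motivation: str
    trigger: str
    outcome: str


@dataclass
class AgentRecord:
    """One multi-turn agent interaction record (hypothetical layout)."""
    category: str                       # one of the 5 categories
    scenario: str                       # one of the 27 scenarios
    turns: List[Dict[str, str]]         # e.g. {"role": "agent", "content": "..."}
    unsafe: bool                        # binary safety label
    risk_types: List[str] = field(default_factory=list)
    risk: Optional[RiskDescription] = None  # assumed absent when no risk is described


def unsafe_fraction(records: List[AgentRecord]) -> float:
    """Share of records labeled unsafe; 0.0 for an empty list."""
    if not records:
        return 0.0
    return sum(r.unsafe for r in records) / len(records)
```

A helper such as `unsafe_fraction` is a quick way to check label balance on your own logs before comparing scores against a random baseline.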
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Safety judgment F1 (best model) | 74.45% | Random F1 51.32% | +23.13 pp | R-Judge (all) | Table 1 reports GPT-4o F1=74.45% vs Random 51.32% | Table 1 |
| Risk identification Effectiveness (best model, analysis relevance) | 88 (normalized 0–100) | Random 0 | +88 | R-Judge (pairwise risk identification) | GPT-4o Effectiveness 88 under Zero-Shot-CoT in Section 5 table | Section 5 Table |
What To Try In 7 Days
Run R-Judge on your agent logs to quantify current risk-detection gaps.
Deploy a small fine-tuned monitor model (guard) to flag unsafe traces before actions execute.
Add a human-in-the-loop for flagged cases and log decisions to expand training data.
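The monitor-plus-review steps above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: `judge` stands in for whatever tuned guard model you deploy, `human_review` for your escalation process, and the JSONL log format is hypothetical. None of these names come from the paper.

```python
import json
from typing import Callable, Dict, List

Trace = List[Dict[str, str]]  # e.g. [{"role": "agent", "content": "..."}]


def gate_action(
    trace: Trace,
    judge: Callable[[Trace], bool],         # hypothetical: returns True if the trace looks unsafe
    human_review: Callable[[Trace], bool],  # hypothetical: returns True if a reviewer approves
    log_path: str,
) -> bool:
    """Return True if the pending action may execute.

    Unflagged traces pass through. Flagged traces are escalated to a human,
    and the decision is appended to a JSONL log for later training use.
    """
    flagged = judge(trace)
    if not flagged:
        return True

    approved = human_review(trace)
    entry = {"trace": trace, "flagged": True, "human_approved": approved}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return approved
```

Logging only the flagged cases keeps the log small; logging a sample of unflagged traces as well would be needed to estimate how many unsafe traces the monitor misses.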
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Training Optimization
Risks & Boundaries
Limitations
Dataset size is moderate (569 cases) and focused on personal-agent scenarios, so coverage of enterprise settings may be limited.
Human annotation bias and limited annotator pool may affect borderline cases despite cross-checks.
When Not To Use
As the sole defense for high-stakes automated actions without human oversight.
To claim full safety across all agent domains beyond the five covered categories.
Failure Modes
False positives: overly cautious models mislabel safe traces as unsafe.
False negatives: models that lack scenario-specific reasoning miss hidden risks.

