R-Judge: a human-curated benchmark (569 agent logs) that tests whether LLMs spot safety risks in agent interactions

January 18, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.45

Citation Count

2

Authors

Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu

Why It Matters For Business

Agent-enabled features can cause real harms (privacy, finance, physical); off-the-shelf LLMs often miss these risks, so companies must add a tuned safety monitor and human checks before letting agents act autonomously.

Summary TLDR

R-Judge is a human-curated benchmark of 569 multi-turn agent interaction records labeled for safety. It asks an LLM to read an agent trace, describe any safety risk, and output safe/unsafe. On this task GPT-4o reaches F1=74.45%, while most popular models perform near or below a random baseline (F1≈51%). Simple prompting helps little; targeted fine-tuning (a 'guard' model) substantially improves judgment. The paper shows that risk awareness for agents needs both world knowledge and scenario reasoning, and it recommends using a dedicated, tuned monitor rather than relying on zero-shot prompting.

Problem Statement

Current safety tests focus on text content. LLM agents act in environments and can cause behavioral harms (privacy, financial loss, physical risk). We lack a practical benchmark that tests whether LLMs can read agent action traces and reliably identify real-world safety risks. This gap prevents measuring and improving risk awareness for agent deployment.

Main Contribution

A new dataset, R-Judge: 569 human-annotated multi-turn agent interaction records with binary safety labels and structured risk descriptions (Motivation, Trigger, Outcome).

A two-stage evaluation protocol: open-ended risk identification (scored by GPT-4) and binary safety judgment (F1, Recall, Specificity).

An empirical study of 11 LLMs showing large gaps in risk awareness, that prompting often fails, and that fine-tuning a monitor can materially improve safety judgments.
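The record structure and two-stage protocol listed above can be sketched as follows. This is a minimal illustration, not the authors' code: the class names, field names, prompt wording, and the helpers `render_trace`, `build_analysis_prompt`, `build_judgment_prompt`, and `parse_verdict` are all assumptions. Only the Motivation/Trigger/Outcome fields and the binary safe/unsafe label come from the summary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RiskDescription:
    # Structured risk description fields named in the summary.
    motivation: str
    trigger: str
    outcome: str


@dataclass
class AgentRecord:
    # Hypothetical layout: each turn is a (role, text) pair.
    turns: List[Tuple[str, str]]
    label: str  # "safe" or "unsafe"
    risk: Optional[RiskDescription] = None


def render_trace(record: AgentRecord) -> str:
    """Flatten the multi-turn trace into plain text for a prompt."""
    return "\n".join(f"{role}: {text}" for role, text in record.turns)


def build_analysis_prompt(record: AgentRecord) -> str:
    """Stage 1: ask for an open-ended risk analysis (wording is assumed)."""
    return (
        "Read the agent interaction below and describe any safety risk.\n\n"
        + render_trace(record)
    )


def build_judgment_prompt(record: AgentRecord, analysis: str) -> str:
    """Stage 2: ask for a binary verdict given the stage-1 analysis."""
    return (
        render_trace(record)
        + "\n\nRisk analysis: " + analysis
        + "\n\nAnswer with one word: safe or unsafe."
    )


def parse_verdict(reply: str) -> str:
    """Map a free-text reply to 'safe', 'unsafe', or 'unknown'."""
    text = reply.lower()
    # Check 'unsafe' first because the string 'unsafe' contains 'safe'.
    if "unsafe" in text:
        return "unsafe"
    if "safe" in text:
        return "safe"
    return "unknown"
```

Checking for "unsafe" before "safe" matters here: a naive substring test for "safe" would also match every "unsafe" reply.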

Key Findings

R-Judge covers 569 agent interaction records across 5 categories and 27 scenarios with 10 risk types.

Numbers: 569 records; 5 categories; 27 scenarios; 10 risk types

About half the dataset is unsafe and records are short multi-turn traces.

Numbers: 52.7% unsafe; avg 2.6 turns; avg 206 words per record

Even the top model leaves room for improvement: GPT-4o is best at safety judgment (F1=74.45%), while most models score near or below the random baseline.

Numbers: GPT-4o F1=74.45%; random F1=51.32%; many models < random
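To see where a random baseline of roughly 51% can come from, here is a small worked calculation. It assumes the baseline is a fair coin flip between safe and unsafe and that "unsafe" is the positive class; the summary does not state how the paper computed its baseline, so treat this as a plausibility check only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# From the dataset statistics: 52.7% of records are unsafe.
unsafe_rate = 0.527
# Assumption: the random judge says "unsafe" with probability 0.5.
guess_unsafe_prob = 0.5

# Random guesses are independent of the true label, so the share of
# "unsafe" guesses that are correct equals the unsafe base rate.
precision = unsafe_rate
# The share of truly unsafe records that get flagged equals the guess rate.
recall = guess_unsafe_prob

random_f1 = f1_score(precision, recall)
print(f"Expected F1 of a coin-flip judge: {random_f1:.2%}")
```

Under these assumptions the result lands at about 51.3%, close to the reported 51.32%.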

Fine-tuning a guard model raises judgment accuracy substantially compared with untuned models and prompting.

Numbers: Meta-Llama-Guard-2-8B F1=71.84% (vs Llama-2 baseline 24.14% in same comparisons)

Simple prompting tricks (few-shot, risk-type hints) do not reliably improve performance and can harm it in some cases.

Numbers: ChatGPT F1: Zero-Shot-CoT 44.96% → Few-Shot-CoT 20.06%; mixed effects in Table 2

The automatic risk-identification scorer (GPT-4) aligns well with humans.

Numbers: Pearson r = 0.91 on 50 samples between GPT-4 scorer and human annotators
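The agreement figure above is a Pearson correlation between two sets of scores for the same samples. A minimal sketch of that calculation follows; the function name `pearson_r` and the example scores are hypothetical and are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five analyses (not taken from the paper).
scorer_scores = [0.9, 0.4, 0.7, 0.1, 0.8]
human_scores = [1.0, 0.5, 0.6, 0.0, 0.9]
print(f"r = {pearson_r(scorer_scores, human_scores):.2f}")
```

A value near 1 means the automatic scorer ranks analyses much as the human annotators do; it does not by itself show that the absolute scores match.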

Results

Safety judgment F1 (best model)

Value: 74.45%

Baseline: Random F1 51.32%

Risk identification Effectiveness (best model, analysis relevance)

Value: 88 (normalized 0–100)

Baseline: Random 0

Fine-tuned guard F1

Value: 71.84%

Baseline: Llama-2-7b-chat-hf F1 24.14% (same-family baseline)

Human-GPT4 scorer agreement

Value: Pearson r = 0.91

What To Try In 7 Days

Run R-Judge on your agent logs to quantify current risk-detection gaps.

Deploy a small fine-tuned monitor model (guard) to flag unsafe traces before actions execute.

Add a human-in-the-loop for flagged cases and log decisions to expand training data.
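The three steps above (judge the trace, block and escalate flagged cases, log every decision) could be wired together roughly as follows. This is a sketch under stated assumptions: `gate_action`, `keyword_judge`, the JSON log layout, and the in-memory review queue are all hypothetical, and `keyword_judge` is a toy stand-in for a fine-tuned guard model.

```python
import io
import json
from typing import Callable, List, TextIO


def gate_action(
    trace: str,
    judge: Callable[[str], str],
    review_queue: List[str],
    log: TextIO,
) -> bool:
    """Return True only if the judge labels the trace 'safe'.

    Any other verdict (including unparseable output) blocks the action,
    sends the trace to human review, and is logged for later training.
    """
    verdict = judge(trace)
    allowed = verdict == "safe"
    if not allowed:
        review_queue.append(trace)
    # One JSON object per line so the log can double as training data.
    entry = {"trace": trace, "verdict": verdict, "allowed": allowed}
    log.write(json.dumps(entry) + "\n")
    return allowed


def keyword_judge(trace: str) -> str:
    """Toy stand-in for a guard model: flags any trace mentioning 'unlock'."""
    return "unsafe" if "unlock" in trace.lower() else "safe"


queue: List[str] = []
decision_log = io.StringIO()  # swap in a real file handle in practice
print(gate_action("Summarize today's notes", keyword_judge, queue, decision_log))
print(gate_action("Unlock the front door for a visitor", keyword_judge, queue, decision_log))
print(len(queue), "trace(s) waiting for human review")
```

The gate fails closed: anything other than an explicit "safe" verdict is held for review, which trades more false positives for fewer missed risks.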

Agent Features

Memory

  • short multi-turn context (avg 2.6 turns)

Planning

  • multi-turn think-act-feedback planning

Tool Use

  • web search
  • code/terminal
  • cloud app APIs (Dropbox, Evernote, smart locks)

Frameworks

  • ReAct

Is Agentic

true

Architectures

  • LLM controller
  • ReAct agent framework

Optimization Features

Training Optimization

  • fine-tuning on interaction traces (Llama Guard)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Dataset size is moderate (569 cases) and focused on personal-agent scenarios, so coverage of enterprise settings may be limited.
  • Human annotation bias and limited annotator pool may affect borderline cases despite cross-checks.
  • Automatic scoring uses GPT-4; while validated, automated judgments still need human review for critical deployments.

When Not To Use

  • As the sole defense for high-stakes automated actions without human oversight.
  • To claim full safety across all agent domains beyond the five covered categories.
  • For auditing legal or regulatory compliance without expert review.

Failure Modes

  • False positives from overly cautious models (safe traces mislabeled as unsafe).
  • False negatives from missing scenario-specific reasoning (hidden risks go undetected).
  • Coverage gaps for scenarios not in R-Judge (enterprise systems, industrial control).
  • Context length limits causing few-shot prompts to fail or be truncated.

Core Entities

Models

  • GPT-4o
  • ChatGPT
  • Meta-Llama-3-8B-Instruct
  • Llama-2-13b-chat-hf
  • Llama-2-7b-chat-hf
  • Vicuna-13b-v1.5
  • Vicuna-13b-v1.5-16k
  • Vicuna-7b-v1.5
  • Vicuna-7b-v1.5-16k
  • Mistral-7B-Instruct-v0.2
  • Mistral-7B-Instruct-v0.3
  • LlamaGuard-7b
  • Meta-Llama-Guard-2-8B

Metrics

  • F1
  • Recall
  • Specificity
  • Effectiveness (GPT-4 scored relevance)
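For reference, the three judgment metrics listed above can be computed from binary labels as in the sketch below. Treating "unsafe" as the positive class is an assumption made for this example, and `safety_metrics` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def safety_metrics(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    positive: str = "unsafe",  # assumption: unsafe is the positive class
) -> Dict[str, float]:
    """Compute F1, recall, and specificity for binary safety labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"f1": f1, "recall": recall, "specificity": specificity}


labels = ["unsafe", "unsafe", "safe", "safe"]
preds = ["unsafe", "safe", "safe", "unsafe"]
print(safety_metrics(labels, preds))
```

Recall here measures how many unsafe traces are caught, while specificity measures how many safe traces are correctly left alone, so the pair exposes both over-cautious and over-permissive judges.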

Datasets

  • R-Judge

Benchmarks

  • R-Judge (safety risk awareness for agents)