Overview
Production Readiness
0.4
Novelty Score
0.7
Cost Impact Score
0.45
Citation Count
2
Why It Matters For Business
Agent-enabled features can cause real harms (privacy, financial, physical). Off-the-shelf LLMs often miss these risks, so companies should add a tuned safety monitor and human checks before letting agents act autonomously.
Summary TLDR
R-Judge is a human-curated benchmark of 569 multi-turn agent interaction records labeled for safety. It asks an LLM to read an agent trace, describe any safety risk, and output safe/unsafe. On this task GPT-4o reaches F1=74.45%, while most popular models perform near or below a random baseline (F1≈51%). Simple prompting helps little; targeted fine-tuning (a 'guard' model) substantially improves judgment. The paper shows that risk awareness for agents needs both world knowledge and scenario reasoning, and recommends using a dedicated, tuned monitor rather than relying on zero-shot prompting.
Problem Statement
Current safety tests focus on text content. LLM agents act in environments and can cause behavioral harms (privacy, financial loss, physical risk). We lack a practical benchmark that tests whether LLMs can read agent action traces and reliably identify real-world safety risks. This gap prevents measuring and improving risk awareness for agent deployment.
Main Contribution
A new dataset, R-Judge: 569 human-annotated multi-turn agent interaction records with binary safety labels and structured risk descriptions (Motivation, Trigger, Outcome).
A two-stage evaluation protocol: open-ended risk identification (scored by GPT-4) and binary safety judgment (F1, Recall, Specificity).
An empirical study of 11 LLMs showing large gaps in risk awareness, that prompting often fails, and that fine-tuning a monitor can materially improve safety judgments.
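The record structure and two-stage protocol described above can be sketched as follows. This is a minimal illustration, not the dataset's actual schema or the paper's prompts: the class and field names, the role names, the 1 = unsafe encoding, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str     # e.g. "user", "agent", "environment" (hypothetical role names)
    content: str  # instruction, agent thought/action, or environment feedback


@dataclass
class RiskDescription:
    motivation: str
    trigger: str
    outcome: str


@dataclass
class Record:
    turns: List[Turn]
    label: int                              # assumed encoding: 1 = unsafe, 0 = safe
    risk: Optional[RiskDescription] = None  # assumed to be set only for unsafe records


def format_trace(record: Record) -> str:
    """Flatten the interaction into one line per turn."""
    return "\n".join(f"[{t.role}] {t.content}" for t in record.turns)


def stage1_prompt(record: Record) -> str:
    """Stage 1: open-ended risk identification (illustrative wording)."""
    return format_trace(record) + "\n\nDescribe any safety risk in the agent's behavior above."


def stage2_prompt(record: Record, analysis: str) -> str:
    """Stage 2: binary judgment conditioned on the stage-1 analysis (illustrative wording)."""
    return (format_trace(record) + "\n\nAnalysis: " + analysis +
            "\n\nIs the agent's behavior safe? Answer 'safe' or 'unsafe'.")


def parse_verdict(reply: str) -> Optional[int]:
    """Map a model reply to 1 (unsafe), 0 (safe), or None if neither word appears."""
    text = reply.lower()
    if "unsafe" in text:  # check "unsafe" first because it contains "safe"
        return 1
    if "safe" in text:
        return 0
    return None
```

The stage-1 reply would be compared against the human-written risk description, and the stage-2 verdict against the binary label.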
Key Findings
R-Judge covers 569 agent interaction records across 5 categories and 27 scenarios with 10 risk types.
About half of the records are labeled unsafe, and records are short multi-turn traces (2.6 turns on average).
Top model performance leaves room for improvement: GPT-4o is best at safety judgment (F1=74.45%), while most models are near or below the random baseline (F1≈51%).
Fine-tuning a guard model raises judgment accuracy substantially compared with untuned models and prompting.
Simple prompting tricks (few-shot, risk-type hints) do not reliably improve performance and can harm it in some cases.
The automatic risk-identification scorer (GPT-4) aligns well with humans.
Results
Safety judgment F1 (best model): 74.45% (GPT-4o)
Risk identification Effectiveness (best model, analysis relevance)
Fine-tuned guard F1
Human-GPT4 scorer agreement
Who Should Care
What To Try In 7 Days
Run R-Judge on your agent logs to quantify current risk-detection gaps.
Deploy a small fine-tuned monitor model (guard) to flag unsafe traces before actions execute.
Add a human-in-the-loop for flagged cases and log decisions to expand training data.
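The second and third steps above could be wired together roughly as in the sketch below. It is an assumption-laden illustration: `gate_action`, `keyword_guard`, and `deny_all` are hypothetical names, and the keyword check is a toy stand-in for a fine-tuned monitor, not a real safety model.

```python
from typing import Callable, Dict, List


def gate_action(
    trace: str,
    guard: Callable[[str], bool],      # returns True when the trace looks unsafe
    ask_human: Callable[[str], bool],  # returns True when a reviewer approves
    decision_log: List[Dict],
) -> bool:
    """Return True if the pending action may execute."""
    flagged = guard(trace)
    # Only flagged traces go to a human; unflagged traces pass through.
    approved = ask_human(trace) if flagged else True
    # Every decision is logged so it can later serve as training data.
    decision_log.append({"trace": trace, "flagged": flagged, "approved": approved})
    return approved


def keyword_guard(trace: str) -> bool:
    # Toy stand-in for a fine-tuned monitor, for demonstration only.
    return any(k in trace.lower() for k in ("rm -rf", "unlock the door", "transfer $"))


def deny_all(trace: str) -> bool:
    # Stand-in reviewer that rejects every flagged trace.
    return False


log: List[Dict] = []
print(gate_action("Action: ls ~/notes", keyword_guard, deny_all, log))  # True
print(gate_action("Action: rm -rf ~/", keyword_guard, deny_all, log))   # False
```

Keeping the guard and the reviewer as plain callables makes it easy to swap the toy guard for a tuned model without changing the gate itself.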
Agent Features
Memory
- short multi-turn context (avg 2.6 turns)
Planning
- multi-turn think-act-feedback planning
Tool Use
- web search
- code/terminal
- cloud app APIs (Dropbox, Evernote, smart locks)
Frameworks
- ReAct
Is Agentic
true
Architectures
- LLM controller
- ReAct agent framework
Optimization Features
Training Optimization
- fine-tuning on interaction traces (Llama Guard)
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Dataset size is moderate (569 cases) and focused on personal-agent scenarios, so coverage of enterprise settings may be limited.
- Human annotation bias and limited annotator pool may affect borderline cases despite cross-checks.
- Automatic scoring uses GPT-4; while validated, automated judgments still need human review for critical deployments.
When Not To Use
- As the sole defense for high-stakes automated actions without human oversight.
- To claim full safety across all agent domains beyond the five covered categories.
- For auditing legal or regulatory compliance without expert review.
Failure Modes
- False positives from overly cautious models (safe traces mislabeled as unsafe).
- False negatives from missed scenario-specific reasoning (hidden risks go undetected).
- Coverage gaps for scenarios not in R-Judge (enterprise systems, industrial control).
- Context length limits causing few-shot prompts to fail or be truncated.
Core Entities
Models
- GPT-4o
- ChatGPT
- Meta-Llama-3-8B-Instruct
- Llama-2-13b-chat-hf
- Llama-2-7b-chat-hf
- Vicuna-13b-v1.5
- Vicuna-13b-v1.5-16k
- Vicuna-7b-v1.5
- Vicuna-7b-v1.5-16k
- Mistral-7B-Instruct-v0.2
- Mistral-7B-Instruct-v0.3
- LlamaGuard-7b
- Meta-Llama-Guard-2-8B
Metrics
- F1
- Recall
- Specificity
- Effectiveness (GPT-4 scored relevance)
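For reference, the three binary-judgment metrics can be computed from predictions as in the sketch below. It assumes that "unsafe" is the positive class and is encoded as 1; the function name and the encoding are choices made for this example, not taken from the benchmark's code.

```python
from typing import Dict, Sequence


def judgment_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute F1, recall, and specificity with 1 = unsafe as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # share of unsafe traces caught
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # share of safe traces passed
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"f1": f1, "recall": recall, "specificity": specificity}
```

Recall and specificity separate the two error types: low recall means unsafe traces slip through, while low specificity means safe traces are flagged too often.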
Datasets
- R-Judge
Benchmarks
- R-Judge (safety risk awareness for agents)

