Overview
The benchmark is ready for integration as a diagnostic test; the automated pipeline and public assets enable reuse, but moderator bias and a small percent of debatable prompts mean expert review is advisable for final safety-critical decisions.
Citations7
Evidence Strength0.80
Confidence0.85
Risk Signals9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Over-refusal hurts user experience: safety tuning that increases toxic blocking can reduce helpfulness and raise support costs. Measure both safety and false refusals to avoid harming product usability.
Who Should Care
Summary TLDR
The paper introduces OR-Bench, a large automated benchmark to measure over-refusal — when safety-tuned LLMs refuse safe, answerable prompts. The authors generate 80,000 candidate safe-but-borderline prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts. They moderate generated prompts with an LLM ensemble and evaluate 32 models across 8 families. Key findings: safety and over-refusal strongly correlate (Spearman 0.89), newer models often reduce over-refusal but trade safety, and common defenses or system prompts can reduce toxic acceptance while increasing false refusals. The dataset and code are public.
Problem Statement
Safety tuning reduces harmful outputs but can make models refuse harmless, legitimate requests. There was no large, automated benchmark to measure this 'over-refusal' at scale, blocking systematic study and improvement of the safety-helpfulness trade-off.
Main Contribution
A fully automated pipeline to convert toxic seeds into safe but borderline prompts designed to trigger over-refusal.
OR-Bench dataset: 80,000 over-refusal prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts.
Key Findings
Safety and over-refusal are highly correlated.
OR-Bench scale and composition.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Spearman correlation (safety vs over-refusal) | 0.89 | — | — | OR-Bench-Hard-1K | Section 4.2, Figure 1 | Figure 1 |
| Dataset size | 80,000 over-refusal prompts; 1,000 hard; 600 toxic | — | — | OR-Bench | Abstract, Section 3.3 | Abstract |
What To Try In 7 Days
Run OR-Bench-Hard-1K on your deployed model to spot stubborn false refusals.
Compare changes from any safety tweak by plotting toxic rejection vs over-refusal (look for top-left improvement).
Test system prompts and common defenses on a small set to quantify how many benign queries become refused.
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Moderation relies on LLM ensemble; some toxic or debatable prompts may slip through.
Hard-1K amplifies differences and may not reflect everyday usage distributions.
When Not To Use
As a replacement for human review in high-risk deployments
To evaluate red-teaming or jailbreak strength (different goal)
Failure Modes
Judge bias: moderators drawn from LLM families may favor certain safety patterns.
Dataset contains borderline or ambiguous prompts that different cultures/legal systems may view differently.

