Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
7
Why It Matters For Business
Over-refusal hurts user experience: safety tuning that blocks more toxic prompts can also make a model refuse legitimate requests, which reduces helpfulness and raises support costs. Measure both toxic-prompt rejection and false refusals so that safety changes do not quietly degrade product usability.
Summary TLDR
The paper introduces OR-Bench, a large automated benchmark for measuring over-refusal: cases where safety-tuned LLMs refuse safe, answerable prompts. The authors generate 80,000 candidate safe-but-borderline prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts. They moderate generated prompts with an LLM ensemble and evaluate 32 models across 8 families. Key findings: safety and over-refusal strongly correlate (Spearman 0.89), newer models often reduce over-refusal at the cost of safety, and common defenses or system prompts can reduce toxic acceptance while increasing false refusals. The dataset and code are public.
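The ensemble moderation step mentioned above can be illustrated with a small sketch. The summary does not say how the ensemble combines its votes, so the majority rule, the conservative tie-break, and the three toy moderators below are all assumptions made for illustration; in practice each moderator would wrap a call to a different LLM.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A moderator is any callable that labels a prompt "safe" or "toxic".
# Hypothetical stand-ins: real moderators would each query a different LLM.
Moderator = Callable[[str], str]


def strict_moderator(prompt: str) -> str:
    return "toxic" if "kill" in prompt.lower() else "safe"


def lenient_moderator(prompt: str) -> str:
    return "toxic" if "kill a person" in prompt.lower() else "safe"


def target_moderator(prompt: str) -> str:
    return "toxic" if "person" in prompt.lower() else "safe"


def ensemble_label(prompt: str, moderators: Sequence[Moderator]) -> str:
    """Majority vote (assumed rule); ties fall back to 'toxic' to stay conservative."""
    votes = Counter(moderate(prompt) for moderate in moderators)
    return "safe" if votes["safe"] > votes["toxic"] else "toxic"


def keep_safe(prompts: Sequence[str], moderators: Sequence[Moderator]) -> List[str]:
    """Keep only the prompts that the ensemble labels as safe."""
    return [p for p in prompts if ensemble_label(p, moderators) == "safe"]


MODERATORS = [strict_moderator, lenient_moderator, target_moderator]
CANDIDATES = ["How do I kill a Python process?", "How do I kill a person?"]
SAFE_PROMPTS = keep_safe(CANDIDATES, MODERATORS)
```

With these toy moderators, the first candidate is kept because two of the three vote "safe", while the second is dropped unanimously. A single strict moderator would have discarded both, which is the kind of borderline case an ensemble is meant to handle.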
Problem Statement
Safety tuning reduces harmful outputs but can make models refuse harmless, legitimate requests. There was no large, automated benchmark to measure this 'over-refusal' at scale, blocking systematic study and improvement of the safety-helpfulness trade-off.
Main Contribution
A fully automated pipeline to convert toxic seeds into safe but borderline prompts designed to trigger over-refusal.
OR-Bench dataset: 80,000 over-refusal prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts.
Evaluation of 32 LLMs (8 families) showing a strong safety vs. over-refusal trade-off and analyses of defenses, system prompts, and temperature effects.
Open release of datasets and code (Hugging Face + GitHub).
Key Findings
Safety and over-refusal are highly correlated.
OR-Bench scale and composition.
Different model families trade safety and helpfulness differently.
Moderation and evaluation methods agree closely with human labels.
Keyword matching closely approximates LLM judging for large-scale evaluation.
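As a rough illustration of keyword-based refusal detection, the sketch below flags a response as a refusal when it contains a known refusal phrase, then computes a refusal rate over a batch. The marker list is a hypothetical example and is not the list used by the OR-Bench authors.

```python
from typing import Iterable

# Hypothetical refusal phrases, chosen for illustration only.
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "i am unable",
    "as an ai",
)


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    # Normalize case and curly apostrophes before matching.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals."""
    items = list(responses)
    if not items:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in items) / len(items)


SAMPLE_RESPONSES = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a step-by-step guide.",
    "I cannot assist with this request.",
    "Here you go.",
]
SAMPLE_RATE = refusal_rate(SAMPLE_RESPONSES)
```

Applied to responses for safe prompts, this rate is the over-refusal rate; applied to responses for toxic prompts, it is the toxic-prompt rejection rate. Substring matching is cheap but can misfire on answers that apologize and then comply, so spot-checking against a stronger judge is still worthwhile.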
Results
Spearman correlation (safety vs over-refusal)
Dataset size
Extreme over-refusal (example)
LLM-judge vs keyword mismatch
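For readers who want to reproduce the correlation on their own model set, here is a minimal pure-Python sketch of Spearman's rank correlation: rank both series (averaging ranks for ties), then take the Pearson correlation of the ranks. The per-model rates at the bottom are placeholder values for illustration, not figures from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder per-model rates (NOT from the paper).
toxic_rejection = [0.99, 0.95, 0.80, 0.60]
over_refusal = [0.70, 0.75, 0.20, 0.05]
rho = spearman(toxic_rejection, over_refusal)
```

A value near 1 means that models which reject more toxic prompts also tend to reject more safe ones, which is the trade-off the benchmark is designed to expose.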
Who Should Care
What To Try In 7 Days
Run OR-Bench-Hard-1K on your deployed model to spot stubborn false refusals.
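Compare the effect of any safety tweak by plotting over-refusal rate (x-axis) against toxic-prompt rejection rate (y-axis); a real improvement moves the model toward the top-left, meaning fewer false refusals at equal or higher safety.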
Test system prompts and common defenses on a small set to quantify how many benign queries become refused.
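The before/after comparison suggested above can also be done without a plot. The sketch below is one possible way to label a change, assuming you already have the two rates for each configuration; the `Rates` type, the `classify_change` helper, and the 0.01 tolerance are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    toxic_rejection: float  # fraction of toxic prompts refused (higher is safer)
    over_refusal: float     # fraction of safe prompts refused (lower is better)


def classify_change(before: Rates, after: Rates, tol: float = 0.01) -> str:
    """Label a safety tweak by how it moves the model on both axes."""
    d_safety = after.toxic_rejection - before.toxic_rejection
    d_refusal = after.over_refusal - before.over_refusal
    safer, less_safe = d_safety > tol, d_safety < -tol
    more_refusals, fewer_refusals = d_refusal > tol, d_refusal < -tol

    if not (safer or less_safe or more_refusals or fewer_refusals):
        return "no meaningful change"
    if not less_safe and not more_refusals:
        return "improvement"  # moved toward the top-left
    if not safer and not fewer_refusals:
        return "regression"   # moved toward the bottom-right
    return "trade-off"        # one axis improved, the other got worse


baseline = Rates(toxic_rejection=0.90, over_refusal=0.20)
with_system_prompt = Rates(toxic_rejection=0.97, over_refusal=0.45)
verdict = classify_change(baseline, with_system_prompt)
```

In this made-up example the system prompt raises toxic rejection but more than doubles false refusals, so the verdict is "trade-off" rather than "improvement". The tolerance should be set with your sample size in mind, since small prompt sets produce noisy rates.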
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Moderation relies on LLM ensemble; some toxic or debatable prompts may slip through.
- Hard-1K amplifies differences and may not reflect everyday usage distributions.
- Pipeline is one method for generating borderline prompts; others may expose different failure modes.
When Not To Use
- As a replacement for human review in high-risk deployments
- To evaluate red-teaming or jailbreak strength (different goal)
- To measure model factual quality or downstream task performance
Failure Modes
- Judge bias: moderators drawn from LLM families may favor certain safety patterns.
- Dataset contains borderline or ambiguous prompts that different cultures/legal systems may view differently.
- Optimizing only for OR-Bench metrics could push models to overfit to the benchmark style.
Core Entities
Models
- Claude-2.1
- Claude-3 (haiku/sonnet/opus)
- Claude-3.5
- Gemini-1.0-pro
- Gemini-1.5-flash
- Gemini-1.5-pro
- Gemma series
- GPT-3.5-turbo-0301
- GPT-3.5-turbo-0613
- GPT-3.5-turbo-0125
- GPT-4-0125-preview
- GPT-4-turbo-2024-04-09
- GPT-4o
- GPT-4o-08-06
- Llama-2 (7b/13b/70b)
- Llama-3 (8b/70b/3.1 variants)
- Mistral (small/medium/large)
- Qwen-1.5 (7B/32B/72B)
- Gemma-2
Metrics
- over-refusal rate (rejection of safe prompts)
- toxic-prompt rejection (safety)
- Spearman correlation between safety and over-refusal
- BERTScore / diversity measures
- keyword-matching discrepancy vs LLM judge
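The keyword-versus-judge discrepancy in the list above can be read as the share of responses on which the two labelers disagree. The sketch below assumes both labelers produced one boolean per response (True meaning "refusal"); the function name and the sample labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def discrepancy(
    keyword_labels: Sequence[bool], judge_labels: Sequence[bool]
) -> Tuple[float, List[int]]:
    """Return the disagreement rate and the indices where the labels differ."""
    if len(keyword_labels) != len(judge_labels) or not keyword_labels:
        raise ValueError("need two non-empty label lists of equal length")
    differing = [
        i for i, (k, j) in enumerate(zip(keyword_labels, judge_labels)) if k != j
    ]
    return len(differing) / len(keyword_labels), differing


# Made-up labels for four responses.
keyword_says_refusal = [True, False, True, True]
judge_says_refusal = [True, False, False, True]
rate, to_review = discrepancy(keyword_says_refusal, judge_says_refusal)
```

Returning the differing indices alongside the rate makes it easy to pull those responses for manual review, which is where keyword matching and an LLM judge are most likely to be wrong in different ways.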
Datasets
- OR-Bench-80K
- OR-Bench-Hard-1K
- OR-Bench-Toxic
- AdvBench
- XSTest
Benchmarks
- OR-Bench
- AdvBench
- XSTest

