OR-Bench: a large, automated dataset to measure when LLMs wrongly refuse safe prompts

Overview

Decision SnapshotReady For Pilot

The benchmark is ready for integration as a diagnostic test; the automated pipeline and public assets enable reuse, but moderator bias and a small percent of debatable prompts mean expert review is advisable for final safety-critical decisions.

Citations7

Evidence Strength0.80

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Over-refusal hurts user experience: safety tuning that increases toxic blocking can reduce helpfulness and raise support costs. Measure both safety and false refusals to avoid harming product usability.

Who Should Care

Product Manager ML Engineer CTO CEO Founder Data Scientist

Summary TLDR

The paper introduces OR-Bench, a large automated benchmark to measure over-refusal — when safety-tuned LLMs refuse safe, answerable prompts. The authors generate 80,000 candidate safe-but-borderline prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts. They moderate generated prompts with an LLM ensemble and evaluate 32 models across 8 families. Key findings: safety and over-refusal strongly correlate (Spearman 0.89), newer models often reduce over-refusal but trade safety, and common defenses or system prompts can reduce toxic acceptance while increasing false refusals. The dataset and code are public.

Problem Statement

Safety tuning reduces harmful outputs but can make models refuse harmless, legitimate requests. There was no large, automated benchmark to measure this 'over-refusal' at scale, blocking systematic study and improvement of the safety-helpfulness trade-off.

Main Contribution

A fully automated pipeline to convert toxic seeds into safe but borderline prompts designed to trigger over-refusal.

OR-Bench dataset: 80,000 over-refusal prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts.

Key Findings

Safety and over-refusal are highly correlated.

NumbersSpearman ρ = 0.89 (OR-Bench-Hard-1K)

Practical UseWhen you harden a model to block toxic prompts, expect more false refusals; measure both metrics together when tuning safety.

Evidence RefSection 4.2, Figure 1

OR-Bench scale and composition.

Numbers80,000 safe prompts; 1,000 hard prompts; 600 toxic prompts

Practical UseUse the 80K set for broad testing, the 1K hard set to profile stubborn over-refusal behavior, and the 600 toxic set to confirm safety.

Evidence RefAbstract, Section 3.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Spearman correlation (safety vs over-refusal)	0.89	—	—	OR-Bench-Hard-1K	Section 4.2, Figure 1	Figure 1
Dataset size	80,000 over-refusal prompts; 1,000 hard; 600 toxic	—	—	OR-Bench	Abstract, Section 3.3	Abstract

What To Try In 7 Days

Run OR-Bench-Hard-1K on your deployed model to spot stubborn false refusals.

Compare changes from any safety tweak by plotting toxic rejection vs over-refusal (look for top-left improvement).

Test system prompts and common defenses on a small set to quantify how many benign queries become refused.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseUnknown

Code URLs

https://github.com/justincui03/or-bench

Data URLs

https://huggingface.co/benchllms

Risks & Boundaries

Limitations

Moderation relies on LLM ensemble; some toxic or debatable prompts may slip through.

Hard-1K amplifies differences and may not reflect everyday usage distributions.

When Not To Use

As a replacement for human review in high-risk deployments

To evaluate red-teaming or jailbreak strength (different goal)

Failure Modes

Judge bias: moderators drawn from LLM families may favor certain safety patterns.

Dataset contains borderline or ambiguous prompts that different cultures/legal systems may view differently.

Core Entities

Models

Claude-2.1Claude-3 (haiku/sonnet/opus)Claude-3.5Gemini-1.0-proGemini-1.5-flashGemini-1.5-proGemma seriesGPT-3.5-turbo-0301GPT-3.5-turbo-0613GPT-3.5-turbo-0125GPT-4-0125-previewGPT-4-turbo-2024-04-09GPT-4oGPT-4o-08-06Llama-2 (7b/13b/70b)Llama-3 (8b/70b/3.1 variants)Mistral (small/medium/large)Qwen-1.5 (7B/32B/72B)Gemma-2

Metrics

over-refusal rate (rejection of safe prompts)toxic-prompt rejection (safety)Spearman correlation between safety and over-refusalBERTScore / diversity measureskeyword-matching discrepancy vs LLM judge

Datasets

OR-Bench-80KOR-Bench-Hard-1KOR-Bench-ToxicAdvBenchXSTest

Benchmarks

OR-BenchAdvBenchXSTest

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Safety and over-refusal are highly correlated.

OR-Bench scale and composition.

Results

What To Try In 7 Days

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Key finding

MEMERAG: a native multilingual benchmark to evaluate RAG outputs and LLM-based evaluators

Key finding