OR-Bench: a large, automated dataset to measure when LLMs wrongly refuse safe prompts

May 31, 20247 min

Overview

Decision SnapshotReady For Pilot

The benchmark is ready for integration as a diagnostic test; the automated pipeline and public assets enable reuse, but moderator bias and a small percent of debatable prompts mean expert review is advisable for final safety-critical decisions.

Citations7

Evidence Strength0.80

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Over-refusal hurts user experience: safety tuning that increases toxic blocking can reduce helpfulness and raise support costs. Measure both safety and false refusals to avoid harming product usability.

Who Should Care

Summary TLDR

The paper introduces OR-Bench, a large automated benchmark to measure over-refusal — when safety-tuned LLMs refuse safe, answerable prompts. The authors generate 80,000 candidate safe-but-borderline prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts. They moderate generated prompts with an LLM ensemble and evaluate 32 models across 8 families. Key findings: safety and over-refusal strongly correlate (Spearman 0.89), newer models often reduce over-refusal but trade safety, and common defenses or system prompts can reduce toxic acceptance while increasing false refusals. The dataset and code are public.

Problem Statement

Safety tuning reduces harmful outputs but can make models refuse harmless, legitimate requests. There was no large, automated benchmark to measure this 'over-refusal' at scale, blocking systematic study and improvement of the safety-helpfulness trade-off.

Main Contribution

A fully automated pipeline to convert toxic seeds into safe but borderline prompts designed to trigger over-refusal.

OR-Bench dataset: 80,000 over-refusal prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts.

Key Findings

Safety and over-refusal are highly correlated.

NumbersSpearman ρ = 0.89 (OR-Bench-Hard-1K)

Practical UseWhen you harden a model to block toxic prompts, expect more false refusals; measure both metrics together when tuning safety.

Evidence RefSection 4.2, Figure 1

OR-Bench scale and composition.

Numbers80,000 safe prompts; 1,000 hard prompts; 600 toxic prompts

Practical UseUse the 80K set for broad testing, the 1K hard set to profile stubborn over-refusal behavior, and the 600 toxic set to confirm safety.

Evidence RefAbstract, Section 3.3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Spearman correlation (safety vs over-refusal)0.89OR-Bench-Hard-1KSection 4.2, Figure 1Figure 1
Dataset size80,000 over-refusal prompts; 1,000 hard; 600 toxicOR-BenchAbstract, Section 3.3Abstract

What To Try In 7 Days

Run OR-Bench-Hard-1K on your deployed model to spot stubborn false refusals.

Compare changes from any safety tweak by plotting toxic rejection vs over-refusal (look for top-left improvement).

Test system prompts and common defenses on a small set to quantify how many benign queries become refused.

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusYes
LicenseUnknown

Risks & Boundaries

Limitations

Moderation relies on LLM ensemble; some toxic or debatable prompts may slip through.

Hard-1K amplifies differences and may not reflect everyday usage distributions.

When Not To Use

As a replacement for human review in high-risk deployments

To evaluate red-teaming or jailbreak strength (different goal)

Failure Modes

Judge bias: moderators drawn from LLM families may favor certain safety patterns.

Dataset contains borderline or ambiguous prompts that different cultures/legal systems may view differently.

Core Entities

Models

Claude-2.1Claude-3 (haiku/sonnet/opus)Claude-3.5Gemini-1.0-proGemini-1.5-flashGemini-1.5-proGemma seriesGPT-3.5-turbo-0301GPT-3.5-turbo-0613GPT-3.5-turbo-0125GPT-4-0125-previewGPT-4-turbo-2024-04-09GPT-4oGPT-4o-08-06Llama-2 (7b/13b/70b)Llama-3 (8b/70b/3.1 variants)Mistral (small/medium/large)Qwen-1.5 (7B/32B/72B)Gemma-2

Metrics

over-refusal rate (rejection of safe prompts)toxic-prompt rejection (safety)Spearman correlation between safety and over-refusalBERTScore / diversity measureskeyword-matching discrepancy vs LLM judge

Datasets

OR-Bench-80KOR-Bench-Hard-1KOR-Bench-ToxicAdvBenchXSTest

Benchmarks

OR-BenchAdvBenchXSTest