OR-Bench: a large, automated dataset to measure when LLMs wrongly refuse safe prompts

May 31, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

7

Authors

Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh

Why It Matters For Business

Over-refusal hurts user experience: safety tuning that blocks more toxic prompts can also make a model refuse legitimate requests, which reduces helpfulness and raises support costs. Measure both safety and false refusals so that safety changes do not quietly degrade product usability.

Summary TLDR

The paper introduces OR-Bench, a large automated benchmark for measuring over-refusal: cases where safety-tuned LLMs refuse safe, answerable prompts. The authors generate 80,000 safe-but-borderline prompts across 10 categories, plus a 1,000-item hard subset and 600 toxic prompts. They moderate the generated prompts with an LLM ensemble and evaluate 32 models across 8 families. Key findings: safety and over-refusal are strongly correlated (Spearman ρ = 0.89); newer models often reduce over-refusal at some cost to safety; and common defenses or system prompts can reduce toxic acceptance while increasing false refusals. The dataset and code are public.

Problem Statement

Safety tuning reduces harmful outputs but can make models refuse harmless, legitimate requests. Before this work there was no large, automated benchmark to measure this 'over-refusal' at scale, which hindered systematic study and improvement of the safety-helpfulness trade-off.

Main Contribution

A fully automated pipeline to convert toxic seeds into safe but borderline prompts designed to trigger over-refusal.

OR-Bench dataset: 80,000 over-refusal prompts across 10 categories, a 1,000-item hard subset, and 600 toxic prompts.

Evaluation of 32 LLMs (8 families) showing a strong safety vs. over-refusal trade-off and analyses of defenses, system prompts, and temperature effects.

Open release of datasets and code (Hugging Face + GitHub).
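The moderation step in the pipeline can be illustrated with a minimal sketch. It assumes each moderator is a function that returns "safe" or "toxic" for a prompt, and that the ensemble decides by majority vote with ties treated as toxic. The summary only says an LLM ensemble is used, so the voting rule, the tie-breaking choice, and the names `ensemble_label` and `keep_safe` are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, Iterable, List

# Assumption: a moderator maps a prompt to the label "safe" or "toxic".
# In practice each one would wrap a call to a different LLM.
Moderator = Callable[[str], str]


def ensemble_label(prompt: str, moderators: Iterable[Moderator]) -> str:
    """Return the majority label across moderators.

    Ties fall back to "toxic", a conservative choice made for this sketch.
    """
    votes = Counter(m(prompt) for m in moderators)
    if not votes:
        raise ValueError("at least one moderator is required")
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "toxic"
    return ranked[0][0]


def keep_safe(prompts: Iterable[str], moderators: Iterable[Moderator]) -> List[str]:
    """Keep only the prompts that the ensemble labels as safe."""
    moderators = list(moderators)  # allow reuse across prompts
    return [p for p in prompts if ensemble_label(p, moderators) == "safe"]
```

Using an odd number of moderators avoids ties altogether; the conservative tie-break only matters for even-sized ensembles.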

Key Findings

Safety and over-refusal are highly correlated.

Numbers: Spearman ρ = 0.89 (OR-Bench-Hard-1K)

OR-Bench scale and composition.

Numbers: 80,000 safe prompts; 1,000 hard prompts; 600 toxic prompts

Different model families trade safety and helpfulness differently.

Numbers: Claude-2.1 overall rejection 99.8% (Hard-1K); GPT-4 family average ~11.1% (Hard-1K)

Moderation and evaluation methods are accurate vs human labels.

Numbers: Ensemble moderator accuracy 93% vs human expert 94% (~98.9% relative)

Keyword matching closely approximates LLM judging for large-scale evaluation.

Numbers: Discrepancies ≤2.4% (GPT-3.5-0125) and 1.2% (Llama-3-70b)
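As a rough illustration of the keyword-matching finding, the sketch below flags a response as a refusal when a known refusal phrase appears near its start, then measures how often that flag disagrees with a judge's verdict. The phrase list, the 200-character window, and the function names are assumptions made for this example; the summary does not give the paper's actual keyword list.

```python
from typing import Sequence

# Assumption: a small, illustrative set of refusal phrases (lowercase).
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "i am unable",
    "i won't",
)


def is_refusal(response: str, markers: Sequence[str] = REFUSAL_MARKERS) -> bool:
    """Flag a response as a refusal if a marker appears in its first 200 characters."""
    head = response.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in markers)


def discrepancy(keyword_flags: Sequence[bool], judge_flags: Sequence[bool]) -> float:
    """Fraction of responses where keyword matching and the judge disagree."""
    if len(keyword_flags) != len(judge_flags) or not keyword_flags:
        raise ValueError("inputs must be non-empty and of equal length")
    mismatches = sum(k != j for k, j in zip(keyword_flags, judge_flags))
    return mismatches / len(keyword_flags)
```

Restricting the match to the start of the response is a design choice that reduces false positives from answers that merely quote a refusal phrase later on.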

Results

Spearman correlation (safety vs over-refusal)

Value: 0.89

Dataset size

Value: 80,000 over-refusal prompts; 1,000 hard; 600 toxic

Extreme over-refusal (example)

Value: Claude-2.1 rejects 99.8% of Hard-1K safe prompts

LLM-judge vs keyword mismatch

Value: ≤2.4% discrepancy
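For readers who want to reproduce a correlation like the Spearman value reported above on their own model set, here is a minimal dependency-free sketch. It ranks both series (averaging ranks for ties) and computes the Pearson correlation of the ranks, which is the standard definition of Spearman's ρ. The function names are hypothetical and any input numbers you feed it are your own; nothing here reproduces the paper's data.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Typical usage would pass one toxic-prompt rejection rate and one over-refusal rate per model, in the same model order.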

What To Try In 7 Days

Run OR-Bench-Hard-1K on your deployed model to spot stubborn false refusals.

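Compare the effect of any safety tweak by plotting toxic-prompt rejection against over-refusal. With over-refusal on the x-axis and toxic rejection on the y-axis, a real improvement moves the model toward the top-left (more toxic prompts rejected, fewer safe prompts refused).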

Test system prompts and common defenses on a small set to quantify how many benign queries become refused.
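The steps above can be sketched as a small before/after comparison. The sketch assumes you already have one boolean per prompt saying whether the model refused it (from keyword matching or an LLM judge), separately for safe and toxic prompts. `RefusalStats`, `summarize`, and `compare` are hypothetical helper names for this illustration, not part of the OR-Bench code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Union


@dataclass
class RefusalStats:
    over_refusal: float     # fraction of safe prompts the model refused
    toxic_rejection: float  # fraction of toxic prompts the model refused


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values in a non-empty collection of refusal flags."""
    flags = list(flags)
    if not flags:
        raise ValueError("no responses to score")
    return sum(flags) / len(flags)


def summarize(safe_refused: Iterable[bool], toxic_refused: Iterable[bool]) -> RefusalStats:
    """Compute both rates for one model configuration."""
    return RefusalStats(rate(safe_refused), rate(toxic_refused))


def compare(before: RefusalStats, after: RefusalStats) -> Dict[str, Union[float, bool]]:
    """Report how a safety tweak moved the model on both axes."""
    return {
        "over_refusal_delta": after.over_refusal - before.over_refusal,
        "toxic_rejection_delta": after.toxic_rejection - before.toxic_rejection,
        # "Top-left" move: no worse on either axis and different on at least one.
        "strict_improvement": (
            after.over_refusal <= before.over_refusal
            and after.toxic_rejection >= before.toxic_rejection
            and after != before
        ),
    }
```

A tweak that raises both deltas is the trade-off pattern the paper describes: more toxic prompts blocked, but more benign queries refused as well.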

Reproducibility

Code Available: yes

Data Available: yes

Open Source Status: yes

Risks & Boundaries

Limitations

  • Moderation relies on an LLM ensemble; some toxic or debatable prompts may slip through.
  • Hard-1K amplifies differences and may not reflect everyday usage distributions.
  • Pipeline is one method for generating borderline prompts; others may expose different failure modes.

When Not To Use

  • As a replacement for human review in high-risk deployments
  • To evaluate red-teaming or jailbreak strength (different goal)
  • To measure model factual quality or downstream task performance

Failure Modes

  • Judge bias: moderators drawn from LLM families may favor certain safety patterns.
  • Dataset contains borderline or ambiguous prompts that different cultures/legal systems may view differently.
  • Optimizing only for OR-Bench metrics could push models to overfit to the benchmark style.

Core Entities

Models

  • Claude-2.1
  • Claude-3 (haiku/sonnet/opus)
  • Claude-3.5
  • Gemini-1.0-pro
  • Gemini-1.5-flash
  • Gemini-1.5-pro
  • Gemma series
  • GPT-3.5-turbo-0301
  • GPT-3.5-turbo-0613
  • GPT-3.5-turbo-0125
  • GPT-4-0125-preview
  • GPT-4-turbo-2024-04-09
  • GPT-4o
  • GPT-4o-08-06
  • Llama-2 (7b/13b/70b)
  • Llama-3 (8b/70b/3.1 variants)
  • Mistral (small/medium/large)
  • Qwen-1.5 (7B/32B/72B)
  • Gemma-2

Metrics

  • over-refusal rate (rejection of safe prompts)
  • toxic-prompt rejection (safety)
  • Spearman correlation between safety and over-refusal
  • BERTScore / diversity measures
  • keyword-matching discrepancy vs LLM judge

Datasets

  • OR-Bench-80K
  • OR-Bench-Hard-1K
  • OR-Bench-Toxic
  • AdvBench
  • XSTest

Benchmarks

  • OR-Bench
  • AdvBench
  • XSTest