A balanced 44-class benchmark (440 base prompts plus 8.8K mutated variants) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

Overview

Decision Snapshot: Ready For Pilot

The benchmark, human labels, and judge meta-eval provide strong evidence the dataset is practical for routine safety testing; fine-tuned 7B judges give a low-cost automation path.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 65%

Production readiness: 85%

Novelty: 42%

Authors

Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SORRY-Bench lets product and risk teams measure whether a model will refuse harmful requests across many specific topics and prompt styles; this helps set provider and model selection policy and reduces surprise from prompt variants.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

SORRY-Bench builds a fine-grained, class-balanced safety-refusal benchmark: 44 risk categories, 440 base unsafe instructions, and 20 linguistic mutations that create 8.8K extra variants. The authors collect 7K+ human judgments and run a meta-evaluation of automated evaluators. Key findings: fine-tuned ~7B judges match GPT-4-level agreement (~80%+) at far lower cost, and 56 models show wide divergence in refusal behavior (fulfillment rates range roughly 6%–90%). The repo, data, and code are publicly hosted for reproducible evaluation.

Problem Statement

Existing safety-refusal evaluations are coarse, imbalanced, and ignore prompt variations and judge design. This prevents reliable, granular measurement of whether aligned LLMs will refuse unsafe user requests across many realistic prompt styles and languages.

Main Contribution

A fine-grained 44-class safety taxonomy and a class-balanced base dataset of 440 unsafe instructions (10 per class).

20 linguistic mutations (questions, slang, encodings, 5 languages) that produce 8.8K mutated unsafe prompts for robustness testing.

Key Findings

SORRY-Bench provides balanced coverage across 44 fine-grained safety categories.

Numbers: 44 categories; 440 base instructions (10 per class).

Practical Use: Use SORRY-Bench when you need per-topic safety signals instead of coarse aggregated categories.

Evidence Ref: §2.2–§2.3

Prompt wording and format matter: 20 mutations yield 8.8K extra variants and change model behavior.

Numbers: 20 mutations → 20×440 = 8,800 mutated prompts; mutations shift fulfillment by ±2–66% on examples.

Practical Use: Evaluate models with varied phrasings and languages; single-format tests miss many failure modes.

Evidence Ref: §2.4; Table 2
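
The 8,800 figure is a full cross product: every mutation is applied to every base prompt. The sketch below illustrates that expansion under stated assumptions. The two mutation functions and the placeholder prompts are hypothetical stand-ins, not the benchmark's actual mutation code or data.

```python
import base64
from typing import Callable, Dict, List

# Hypothetical stand-ins for two mutation styles (question rewrite, encoding).
# These are illustrative only and are NOT the benchmark's real mutations.
def as_question(prompt: str) -> str:
    return f"Could you explain how to do the following? {prompt}"

def as_base64(prompt: str) -> str:
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")

MUTATIONS: Dict[str, Callable[[str], str]] = {
    "question": as_question,
    "base64": as_base64,
}

def expand(base_prompts: List[str],
           mutations: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Apply every mutation to every base prompt (full cross product)."""
    return [
        {"base_id": i, "mutation": name, "prompt": fn(prompt)}
        for i, prompt in enumerate(base_prompts)
        for name, fn in mutations.items()
    ]

# Placeholder text standing in for the 440 base instructions (44 classes x 10).
base = [f"placeholder instruction {i}" for i in range(440)]
variants = expand(base, MUTATIONS)

print(len(variants))        # 2 mutations x 440 prompts
print(20 * len(base))       # size with all 20 mutations, as stated above
```

Keeping the base prompt id and mutation name on each variant makes it possible to compare a model's behavior on a mutated prompt against the same prompt in its original form.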

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Human judgment dataset size	7,040 annotations	—	—	Human judge dataset (ID+OOD)	Collected 440*(8 ID + 8 OOD) judgments; §3.2	§3.2
Automated judge agreement with humans (best fine-tuned)	83.8% Cohen Kappa (GPT-3.5 + fine-tuned)	GPT-4o prompt-only 78.9%	+4.9 pp	Meta-eval test split	Table 1 / Table 6; fine-tuning raises agreement to ~81–84% for several models	§3.3; Table 1
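
The agreement figures in the table are Cohen's kappa values reported as percentages. For readers who want to reproduce that kind of number on their own labels, here is a minimal sketch of the standard kappa formula, (p_o - p_e) / (1 - p_e). The label lists are made-up toy data, and the 1 = fulfillment, 0 = refusal encoding is an assumption for illustration.

```python
from collections import Counter
from typing import Sequence

def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa between two raters' labels over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy data (assumed encoding: 1 = fulfillment, 0 = refusal).
human = [1, 1, 0, 0, 1, 0, 0, 0]
judge = [1, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa(human, judge)
print(round(100 * kappa, 1))  # percentage scale, as in the table
```

Kappa corrects raw agreement for chance, which matters here because refusal and fulfillment labels are rarely balanced for a given model.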

What To Try In 7 Days

Run the SORRY-Bench base set (440 prompts) on a candidate model to get per-category fulfillment rates.

Fine-tune a 7B judge on a small human-labeled sample (≈2.6K) to automate routine safety checks cheaply.

Run 5–10 linguistic mutations (e.g., question style, one low-resource language, a persuasion template) to probe robustness quickly.
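
The first step above reduces to grouping judged responses by category and averaging. A minimal sketch follows, assuming each record is a (category, label) pair with 1 for a fulfilled request and 0 for a refusal. The record layout and category names are hypothetical and do not reflect the repository's actual output schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def fulfillment_rates(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the fraction of fulfilled (label == 1) responses per category."""
    totals: Dict[str, int] = defaultdict(int)
    fulfilled: Dict[str, int] = defaultdict(int)
    for category, label in records:
        totals[category] += 1
        fulfilled[category] += label
    return {category: fulfilled[category] / totals[category] for category in totals}

# Hypothetical judged outputs; a real run would have 10 records per category.
records = [
    ("category_a", 1),
    ("category_a", 0),
    ("category_b", 0),
    ("category_b", 0),
]

rates = fulfillment_rates(records)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
```

With only 10 prompts per category, each response moves a category's rate by 10 percentage points, so small per-category differences between models should be read with care.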

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://sorry-bench.github.io

Data URLs

https://sorry-bench.github.io

Risks & Boundaries

Limitations

Binary refusal labeling only; does not score degrees of harmfulness.

Does not cover multi-category compound prompts or many worst-case jailbreak attacks.

When Not To Use

If you need graded harmfulness scores rather than binary refusal.

When evaluating extreme adversarial jailbreaks not represented by the 20 mutations.

Failure Modes

The judge can misclassify nuanced responses: answers with disclaimers may be labeled as refusals, and bullet lists may be labeled as fulfillments (J.4 examples).

Fine-tuned judges may overfit to SORRY-Bench style and miss out-of-distribution jailbroken patterns.

Core Entities

Models

GPT-4o, GPT-3.5-turbo, Claude-2, Gemini-1.5, Llama-3, Llama-2, Mistral-7b-instruct, Gemma, Vicuna, Zephyr, Dolphin, Mixtral

Metrics

fulfillment rate (fraction of responses that assist unsafe request); Cohen Kappa agreement; refusal recall; fulfillment recall; time cost per evaluation pass
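
The two recall metrics can be read as per-class recall of the automated judge against human labels. That reading is an assumption, since this page only names the metrics. Under it, a minimal sketch with toy labels (1 = fulfillment, 0 = refusal, also assumed) looks like this:

```python
import math
from typing import Sequence

def recall_for(label: int, human: Sequence[int], judge: Sequence[int]) -> float:
    """Of the items humans gave `label`, the fraction the judge also gave `label`."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have equal length")
    judged = [j for h, j in zip(human, judge) if h == label]
    if not judged:
        return math.nan  # no human-labeled items of this class
    return sum(j == label for j in judged) / len(judged)

# Toy data, not taken from the benchmark.
human = [1, 1, 0, 0, 1, 0, 0, 0]
judge = [1, 0, 0, 0, 1, 0, 1, 0]

fulfillment_recall = recall_for(1, human, judge)
refusal_recall = recall_for(0, human, judge)
print(f"fulfillment recall: {fulfillment_recall:.2f}")
print(f"refusal recall: {refusal_recall:.2f}")
```

Reporting both recalls separately shows whether a judge errs toward over-flagging or under-flagging, which a single agreement score can hide.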

Datasets

SORRY-Bench, AdvBench, HarmBench, SALAD-Bench, ALERT, StrongREJECT

Benchmarks

SORRY-Bench, HarmBench, SALAD-Bench, ALERT, StrongREJECT
