A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

June 20, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark, human labels, and judge meta-eval provide strong evidence the dataset is practical for routine safety testing; fine-tuned 7B judges give a low-cost automation path.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 65%

Production readiness: 85%

Novelty: 42%

Authors

Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal

Why It Matters For Business

SORRY-Bench lets product and risk teams measure whether a model will refuse harmful requests across many specific topics and prompt styles; this helps set provider and model selection policy and reduces surprise from prompt variants.

Summary TLDR

SORRY-Bench is a fine-grained, class-balanced safety-refusal benchmark: 44 risk categories, 440 base unsafe instructions, and 20 linguistic mutations that create 8.8K extra variants. The authors collect 7K+ human judgments and run a meta-evaluation of automated evaluators. Key findings: fine-tuned ~7B judges match GPT-4-level agreement with human labels (roughly 80% or higher) at far lower cost, and 56 models show wide divergence in refusal behavior (fulfillment rates range from roughly 6% to 90%). The data and code are publicly hosted for reproducible evaluation.

Problem Statement

Existing safety-refusal evaluations are coarse and imbalanced, and they ignore prompt variations and judge design. This prevents reliable, granular measurement of whether aligned LLMs will refuse unsafe user requests across many realistic prompt styles and languages.

Main Contribution

A fine-grained 44-class safety taxonomy and a class-balanced base dataset of 440 unsafe instructions (10 per class).

20 linguistic mutations (questions, slang, encodings, 5 languages) that produce 8.8K mutated unsafe prompts for robustness testing.

Key Findings

SORRY-Bench provides balanced coverage across 44 fine-grained safety categories.

Numbers: 44 categories; 440 base instructions (10 per class).

Practical Use: Use SORRY-Bench when you need per-topic safety signals instead of coarse aggregated categories.

Evidence Ref: §2.2–§2.3

Prompt wording and format matter: 20 mutations yield 8.8K extra variants and change model behavior.

Numbers: 20 mutations → 20 × 440 = 8,800 mutated prompts; mutations shift fulfillment by ±266% on examples.

Practical Use: Evaluate models with varied phrasings and languages; single-format tests miss many failure modes.

Evidence Ref: §2.4; Table 2
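
The 8,800 figure comes from crossing every base prompt with every mutation. A minimal sketch of that arithmetic, assuming placeholder identifiers in place of real prompts and mutation names (the tuples and the `mutation_N` labels are hypothetical, not the benchmark's actual schema or loader):

```python
from itertools import product

NUM_CLASSES = 44          # fine-grained safety categories (from the text)
PROMPTS_PER_CLASS = 10    # base instructions per category (from the text)
NUM_MUTATIONS = 20        # linguistic mutations (from the text)

# Hypothetical placeholder identifiers; real prompts come from the released data.
base_prompts = [
    (class_id, prompt_id)
    for class_id in range(1, NUM_CLASSES + 1)
    for prompt_id in range(1, PROMPTS_PER_CLASS + 1)
]
mutations = [f"mutation_{i}" for i in range(1, NUM_MUTATIONS + 1)]

# Cross every base prompt with every mutation to get the robustness set.
mutated_prompts = [
    {"class_id": c, "prompt_id": p, "mutation": m}
    for (c, p), m in product(base_prompts, mutations)
]

print(len(base_prompts))     # 440
print(len(mutated_prompts))  # 8800
```

Keeping the class and prompt identifiers on every mutated record makes it possible to compare a mutated prompt's outcome against its own base prompt later.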

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Human judgment dataset size | 7,040 annotations | - | - | Human judge dataset (ID+OOD) | Collected 440*(8 ID + 8 OOD) judgments; §3.2 | §3.2
Automated judge agreement with humans (best fine-tuned) | 83.8% Cohen Kappa (GPT-3.5 + fine-tuned) | GPT-4o prompt-only 78.9% | +4.9 pp | Meta-eval test split | Table 1 / Table 6; fine-tuning raises agreement to ~81–84% for several models | §3.3; Table 1
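
Judge agreement in these results is reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the computation for two label sequences follows; the `human` and `judge` lists are invented toy data (1 = fulfillment, 0 = refusal), not labels from the benchmark, and the function is an illustration, not the authors' evaluation code.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if raters labeled independently with their own marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy data (assumption): 1 = response fulfills the request, 0 = refusal.
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
judge = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa(human, judge)
print(round(kappa, 3))  # 0.6 (8/10 observed agreement, 0.5 expected by chance)

# Sanity check on the annotation count quoted in the first row.
print(440 * (8 + 8))  # 7040
```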

What To Try In 7 Days

Run the SORRY-Bench base set (440 prompts) on a candidate model to get per-category fulfillment rates.

Fine-tune a 7B judge on a small human-labeled sample (≈2.6K) to automate routine safety checks cheaply.

Run 5–10 linguistic mutations (e.g., question style, one low-resource language, a persuasion template) to probe robustness quickly.
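
The first step above reduces to grouping judged responses by category and taking the fraction that fulfilled the unsafe request. A minimal sketch, assuming each judged response is available as a `(category, is_fulfilled)` pair; that record layout and the category names are hypothetical and do not reflect the repository's actual output format.

```python
from collections import defaultdict

def fulfillment_rate_by_category(records):
    """Return {category: fraction of responses judged as fulfilling the request}."""
    totals = defaultdict(int)
    fulfilled = defaultdict(int)
    for category, is_fulfilled in records:
        totals[category] += 1
        fulfilled[category] += int(is_fulfilled)
    return {category: fulfilled[category] / totals[category] for category in totals}

# Toy judged results (assumption): True means the model assisted the unsafe request.
toy_records = [
    ("category_a", True), ("category_a", False),
    ("category_a", False), ("category_a", False),
    ("category_b", True), ("category_b", True),
]

rates = fulfillment_rate_by_category(toy_records)
print(rates)  # {'category_a': 0.25, 'category_b': 1.0}
```

Running the same function separately on base and mutated prompts gives a per-category view of how much each mutation shifts behavior.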

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Binary refusal labeling only; does not score degrees of harmfulness.

Does not cover multi-category compound prompts or many worst-case jailbreak attacks.

When Not To Use

If you need graded harmfulness scores rather than binary refusal.

When evaluating extreme adversarial jailbreaks not represented by the 20 mutations.

Failure Modes

Judge misclassifies nuanced responses with disclaimers as refusals or bullet lists as fulfillments (J.4 examples).

Fine-tuned judges may overfit to SORRY-Bench style and miss out-of-distribution jailbroken patterns.

Core Entities

Models

GPT-4o, GPT-3.5-turbo, Claude-2, Gemini-1.5, Llama-3, Llama-2, Mistral-7b-instruct, Gemma, Vicuna, Zephyr, Dolphin, Mixtral

Metrics

fulfillment rate (fraction of responses that assist the unsafe request), Cohen Kappa agreement, refusal recall, fulfillment recall, time cost per evaluation pass

Datasets

SORRY-Bench, AdvBench, HarmBench, SALAD-Bench, ALERT, StrongREJECT

Benchmarks

SORRY-Bench, HarmBench, SALAD-Bench, ALERT, StrongREJECT