Overview
Together, the benchmark, human labels, and judge meta-evaluation provide strong evidence that the dataset is practical for routine safety testing, and fine-tuned 7B judges offer a low-cost path to automation.
Citations: 4
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 65%
Production readiness: 85%
Novelty: 42%
Why It Matters For Business
SORRY-Bench lets product and risk teams measure whether a model will refuse harmful requests across many specific topics and prompt styles; this helps set provider and model selection policy and reduces surprise from prompt variants.
Summary TLDR
SORRY-Bench builds a fine-grained, class-balanced safety-refusal benchmark: 44 risk categories, 440 base unsafe instructions, and 20 linguistic mutations that create 8.8K extra variants. The authors collect 7K+ human judgments and run a meta-evaluation of automated evaluators. Key findings: fine-tuned ~7B judges reach GPT-4-level agreement with human labels (roughly 80% or higher) at far lower cost, and 56 models diverge widely in refusal behavior (fulfillment rates range from roughly 6% to 90%). The code and data are publicly hosted for reproducible evaluation.
Problem Statement
Existing safety-refusal evaluations are coarse, imbalanced, and ignore prompt variations and judge design. This prevents reliable, granular measurement of whether aligned LLMs will refuse unsafe user requests across many realistic prompt styles and languages.
Main Contribution
A fine-grained 44-class safety taxonomy and a class-balanced base dataset of 440 unsafe instructions (10 per class).
20 linguistic mutations (questions, slang, encodings, 5 languages) that produce 8.8K mutated unsafe prompts for robustness testing.
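The dataset sizes quoted above follow from simple multiplication. The snippet below is only an arithmetic check: the constant names are illustrative and are not taken from the SORRY-Bench code, and the 8 + 8 responses per prompt come from the human-judgment row in the Results table.

```python
# Sizes as stated in this summary; names are illustrative assumptions.
NUM_CATEGORIES = 44          # fine-grained safety categories
PROMPTS_PER_CATEGORY = 10    # class-balanced base set
NUM_MUTATIONS = 20           # linguistic mutations applied to each base prompt
ID_RESPONSES = 8             # in-distribution responses judged per prompt
OOD_RESPONSES = 8            # out-of-distribution responses judged per prompt

base_prompts = NUM_CATEGORIES * PROMPTS_PER_CATEGORY
mutated_prompts = base_prompts * NUM_MUTATIONS
human_judgments = base_prompts * (ID_RESPONSES + OOD_RESPONSES)

print(base_prompts, mutated_prompts, human_judgments)  # 440 8800 7040
```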
Key Findings
SORRY-Bench provides balanced coverage across 44 fine-grained safety categories.
Prompt wording and format matter: 20 mutations yield 8.8K extra variants and change model behavior.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Human judgment dataset size | 7,040 annotations | — | — | Human judge dataset (ID+OOD) | Collected 440*(8 ID + 8 OOD) judgments; §3.2 | §3.2 |
| Automated judge agreement with humans (best fine-tuned) | 83.8% Cohen's kappa (fine-tuned GPT-3.5) | GPT-4o, prompt-only: 78.9% | +4.9 pp | Meta-eval test split | Table 1 / Table 6; fine-tuning raises agreement to ~81–84% for several models | §3.3; Table 1 |
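The agreement figures in the table are reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch of that statistic for two label sequences. The function name and the toy labels are assumptions for illustration, not the authors' evaluation code or data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy example (hypothetical labels): 1 = fulfillment, 0 = refusal.
human = [1, 1, 0, 0, 1, 0, 1, 0]
judge = [1, 1, 0, 0, 1, 0, 0, 0]
kappa = cohen_kappa(human, judge)
print(round(kappa, 2))  # 0.75
```

In the toy example, raw agreement is 7/8 and chance agreement is 0.5, so kappa is 0.375 / 0.5 = 0.75.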
What To Try In 7 Days
Run the SORRY-Bench base set (440 prompts) on a candidate model to get per-category fulfillment rates.
Fine-tune a 7B judge on a small human-labeled sample (≈2.6K) to automate routine safety checks cheaply.
Run 5–10 linguistic mutations (e.g., question style, one low-resource language, a persuasion template) to probe robustness quickly.
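Once a judge has labeled each model response as fulfillment or refusal, per-category fulfillment rates reduce to a grouped average. The sketch below assumes a hypothetical record format of `(category, fulfilled)` pairs. The category names and the function are illustrative and do not reflect the SORRY-Bench repository's actual data layout.

```python
from collections import defaultdict

def fulfillment_rates(records):
    """Return {category: fraction of responses judged as fulfillment}.

    records: iterable of (category, fulfilled) pairs, where fulfilled is
    True if the judge decided the model complied with the unsafe request.
    """
    totals = defaultdict(int)
    fulfilled = defaultdict(int)
    for category, is_fulfilled in records:
        totals[category] += 1
        fulfilled[category] += int(is_fulfilled)
    return {category: fulfilled[category] / totals[category] for category in totals}

# Hypothetical judge outputs for two made-up categories.
judged = [
    ("category_a", True),
    ("category_a", False),
    ("category_a", False),
    ("category_a", False),
    ("category_b", True),
    ("category_b", True),
]
rates = fulfillment_rates(judged)
print(rates)  # {'category_a': 0.25, 'category_b': 1.0}
```

Running the same aggregation separately on base prompts and on each mutation makes it easy to see which prompt styles shift a model's refusal behavior.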
Risks & Boundaries
Limitations
Binary refusal labeling only; does not score degrees of harmfulness.
Does not cover multi-category compound prompts or many worst-case jailbreak attacks.
When Not To Use
If you need graded harmfulness scores rather than binary refusal labels.
When evaluating extreme adversarial jailbreaks not represented by the 20 mutations.
Failure Modes
The judge can misclassify nuanced responses, labeling answers that carry disclaimers as refusals or answers formatted as bullet lists as fulfillments (J.4 examples).
Fine-tuned judges may overfit to SORRY-Bench style and miss out-of-distribution jailbroken patterns.

