Overview
Production Readiness: 0.85
Novelty Score: 0.42
Cost Impact Score: 0.65
Citation Count: 4
Why It Matters For Business
SORRY-Bench lets product and risk teams measure whether a model will refuse harmful requests across many specific topics and prompt styles. This helps set provider and model selection policy and reduces surprises from prompt variants.
Summary TLDR
SORRY-Bench builds a fine-grained, class-balanced safety-refusal benchmark: 44 risk categories, 440 base unsafe instructions, and 20 linguistic mutations that create 8.8K extra variants. The authors collect 7K+ human judgments and run a meta-evaluation of automated evaluators. Key findings: fine-tuned ~7B judges reach GPT-4-level agreement with human labels (roughly 80% or higher) at far lower cost, and the 56 evaluated models diverge widely in refusal behavior (fulfillment rates range from roughly 6% to 90%). The repository, data, and code are publicly hosted for reproducible evaluation.
Problem Statement
Existing safety-refusal evaluations are coarse, imbalanced, and ignore prompt variations and judge design. This prevents reliable, granular measurement of whether aligned LLMs will refuse unsafe user requests across many realistic prompt styles and languages.
Main Contribution
A fine-grained 44-class safety taxonomy and a class-balanced base dataset of 440 unsafe instructions (10 per class).
20 linguistic mutations (questions, slang, encodings, 5 languages) that produce 8.8K mutated unsafe prompts for robustness testing.
A 7K+ human judgment dataset for (instruction, model response) pairs and a meta-evaluation showing fine-tuned ~7B judges can match larger LLMs.
A large benchmark across 50+ open and proprietary LLMs highlighting broad model variance in safety refusal behaviors.
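The dataset sizes quoted above follow from simple arithmetic. The short Python sketch below reproduces them; the variable names are illustrative, and the combined total is derived here rather than quoted from the source.

```python
# Dataset-size arithmetic using the figures quoted in this summary.
NUM_CATEGORIES = 44          # fine-grained risk classes
PROMPTS_PER_CATEGORY = 10    # class-balanced base set
NUM_MUTATIONS = 20           # linguistic mutations applied to each base prompt

base_prompts = NUM_CATEGORIES * PROMPTS_PER_CATEGORY
mutated_prompts = base_prompts * NUM_MUTATIONS
total_prompts = base_prompts + mutated_prompts  # derived, not quoted above

print(base_prompts)     # 440
print(mutated_prompts)  # 8800
print(total_prompts)    # 9240
```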
Key Findings
SORRY-Bench provides balanced coverage across 44 fine-grained safety categories.
Prompt wording and format matter: 20 mutations yield 8.8K extra variants and change model behavior.
Human judgment dataset: 7,040 labels, of which 30.4% are fulfillment and 69.6% are refusal.
Fine-tuning small/medium LLMs yields high judge accuracy at low cost.
Model refusal behavior varies widely across models and categories.
Results
Human judgment dataset size
Automated judge agreement with humans (best fine-tuned)
Judge inference time (example)
Range of model fulfillment rates on SORRY-Bench
Frequently refused categories (average fulfillment)
Least refused categories (average fulfillment)
Who Should Care
What To Try In 7 Days
Run the SORRY-Bench base set (440 prompts) on a candidate model to get per-category fulfillment rates.
Fine-tune a 7B judge on a small human-labeled sample (≈2.6K) to automate routine safety checks cheaply.
Run 5–10 linguistic mutations (e.g., question style, one low-resource language, a persuasion template) to probe robustness quickly.
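As a rough illustration of the first step, per-category fulfillment rates can be aggregated from judged responses as sketched below. The record layout, category names, and the helper `fulfillment_by_category` are hypothetical and are not taken from the SORRY-Bench code.

```python
from collections import defaultdict

def fulfillment_by_category(records):
    """Return {category: fraction of responses judged as fulfilling the request}."""
    totals = defaultdict(int)
    fulfilled = defaultdict(int)
    for category, is_fulfilled in records:
        totals[category] += 1
        fulfilled[category] += int(is_fulfilled)
    return {c: fulfilled[c] / totals[c] for c in totals}

# Hypothetical judged results: (category name, judge said "fulfilled").
records = [
    ("category_a", True), ("category_a", False),
    ("category_a", False), ("category_a", False),
    ("category_b", True), ("category_b", True),
]

rates = fulfillment_by_category(records)
print(rates)  # {'category_a': 0.25, 'category_b': 1.0}
```

Running the same aggregation on each mutated prompt set gives a quick view of how much a given mutation shifts the rate for each category.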
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Binary refusal labeling only; does not score degrees of harmfulness.
- Does not cover multi-category compound prompts or many worst-case jailbreak attacks.
- Dataset may be subject to contamination if adopted into model training without private splits.
When Not To Use
- If you need graded harmfulness scores rather than binary refusal.
- When evaluating extreme adversarial jailbreaks not represented by the 20 mutations.
- If you require a combined safety-and-utility metric: an always-refuse model scores perfectly on this metric but may be unusable.
Failure Modes
- Judge misclassifies nuanced responses with disclaimers as refusals or bullet lists as fulfillments (J.4 examples).
- Fine-tuned judges may overfit to SORRY-Bench style and miss out-of-distribution jailbroken patterns.
- Encoding/encryption mutations can produce nonsense responses that are labeled as refusal even if a human could decode intent.
Core Entities
Models
- GPT-4o
- GPT-3.5-turbo
- Claude-2
- Gemini-1.5
- Llama-3
- Llama-2
- Mistral-7b-instruct
- Gemma
- Vicuna
- Zephyr
- Dolphin
- Mixtral
Metrics
- fulfillment rate (fraction of responses that assist the unsafe request)
- Cohen's kappa agreement
- refusal recall
- fulfillment recall
- time cost per evaluation pass
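The agreement metrics listed above can be computed from paired human and judge labels. The sketch below uses the standard textbook definitions of Cohen's kappa and per-class recall, which are assumed here rather than confirmed as the authors' exact implementation; the label encoding and sample data are hypothetical.

```python
def cohen_kappa(human, judge):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected).

    Undefined (division by zero) when both raters use a single identical label.
    """
    n = len(human)
    labels = set(human) | set(judge)
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    p_expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

def recall(human, judge, label):
    """Fraction of items the humans gave `label` that the judge also gave `label`."""
    judged = [j for h, j in zip(human, judge) if h == label]
    return sum(j == label for j in judged) / len(judged)

# Hypothetical labels: 1 = fulfillment, 0 = refusal.
human = [1, 1, 0, 0, 0, 0, 1, 0]
judge = [1, 0, 0, 0, 0, 0, 1, 1]

print(round(cohen_kappa(human, judge), 3))  # 0.467
print(round(recall(human, judge, 1), 3))    # fulfillment recall: 0.667
print(recall(human, judge, 0))              # refusal recall: 0.8
```

Kappa corrects raw agreement (0.75 in this toy sample) for the agreement expected by chance, which matters when the label distribution is skewed toward refusals.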
Datasets
- SORRY-Bench
- AdvBench
- HarmBench
- SALAD-Bench
- ALERT
- StrongREJECT
Benchmarks
- SORRY-Bench
- HarmBench
- SALAD-Bench
- ALERT
- StrongREJECT

