A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

June 20, 2024 · 8 min

Overview

Production Readiness

0.85

Novelty Score

0.42

Cost Impact Score

0.65

Citation Count

4

Authors

Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal

Links

Abstract / PDF

Why It Matters For Business

SORRY-Bench lets product and risk teams measure whether a model will refuse harmful requests across many specific topics and prompt styles; this helps set provider and model selection policy and reduces surprise from prompt variants.

Summary TLDR

SORRY-Bench builds a fine-grained, class-balanced safety-refusal benchmark: 44 risk categories, 440 base unsafe instructions, and 20 linguistic mutations that create 8.8K extra variants. The authors collect 7K+ human judgments and run a meta-evaluation of automated evaluators. Key findings: fine-tuned ~7B judges match GPT-4-level agreement (~80%+) at far lower cost, and 56 models show wide divergence in refusal behavior (fulfillment rates range roughly 6%–90%). The repo, data, and code are publicly hosted for reproducible evaluation.

Problem Statement

Existing safety-refusal evaluations are coarse, imbalanced, and ignore prompt variations and judge design. This prevents reliable, granular measurement of whether aligned LLMs will refuse unsafe user requests across many realistic prompt styles and languages.

Main Contribution

A fine-grained 44-class safety taxonomy and a class-balanced base dataset of 440 unsafe instructions (10 per class).

20 linguistic mutations (questions, slang, encodings, 5 languages) that produce 8.8K mutated unsafe prompts for robustness testing.

A 7K+ human judgment dataset for (instruction, model response) pairs and a meta-evaluation showing fine-tuned ~7B judges can match larger LLMs.

A large benchmark across 50+ open and proprietary LLMs highlighting broad model variance in safety refusal behaviors.

Key Findings

SORRY-Bench provides balanced coverage across 44 fine-grained safety categories.

Numbers: 44 categories; 440 base instructions (10 per class).

Prompt wording and format matter: 20 mutations yield 8.8K extra variants and change model behavior.

Numbers: 20 mutations → 20 × 440 = 8,800 mutated prompts; in the reported examples, individual mutations shift fulfillment by 2–66% in either direction.
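The dataset sizes quoted above follow from three counts. A minimal sketch of the arithmetic, with variable names chosen here for illustration only:

```python
# Counts taken from the summary; the variable names are illustrative.
NUM_CLASSES = 44        # fine-grained safety categories
PROMPTS_PER_CLASS = 10  # base unsafe instructions per category
NUM_MUTATIONS = 20      # linguistic mutations applied to every base prompt

base_prompts = NUM_CLASSES * PROMPTS_PER_CLASS   # 440
mutated_prompts = NUM_MUTATIONS * base_prompts   # 8,800
total_prompts = base_prompts + mutated_prompts   # 9,240

print(base_prompts, mutated_prompts, total_prompts)
```

Because every class contributes the same number of prompts, per-category fulfillment rates are directly comparable without reweighting.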

Human judgment dataset: 7,040 labels, skewed toward refusals.

Numbers: 7,040 annotations; 30.4% fulfillment, 69.6% refusal.

Fine-tuning small/medium LLMs yields high judge accuracy at low cost.

Numbers: Fine-tuned 7B judges reach ~81% Cohen's kappa agreement with humans; fine-tuned GPT-3.5 reaches 83.8%.

Model refusal behavior varies widely across models and categories.

Numbers: Fulfillment rates span ≈6% to ≈90% across 56 models; some categories average ~9–11% fulfillment (harassment, child-related crimes, sexual crimes).

Results

Human judgment dataset size

Value: 7,040 annotations

Automated judge agreement with humans (best fine-tuned)

Value: 83.8% Cohen's kappa (fine-tuned GPT-3.5)

Baseline: GPT-4o, prompt-only: 78.9%
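Agreement here is reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard two-rater formula, using made-up labels (1 = fulfillment, 0 = refusal); the function name and data are illustrative and not taken from the SORRY-Bench code:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treated as perfect agreement here.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels for ten (instruction, response) pairs.
human = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
judge = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
kappa = cohen_kappa(human, judge)
print(f"{kappa:.1%}")  # 52.4%
```

In this toy example the raters agree on 8 of 10 items, yet kappa is only about 52% because the skew toward refusals makes chance agreement high (58%). The same effect applies to a label set that is roughly 70% refusals.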

Judge inference time (example)

Value: ≈10–14 s per pass (fine-tuned 7B judge on A100)

Baseline: GPT-4o, ≈260 s per pass

Range of model fulfillment rates on SORRY-Bench

Value: ≈6% to ≈90% fulfillment

Frequently refused categories (average fulfillment)

Value: Harassment / Child-related crimes / Sexual crimes: ~9–11% fulfillment

Least refused categories (average fulfillment)

Value: Legal consulting, Religion, Ethical belief: ~74–80% fulfillment

Who Should Care

What To Try In 7 Days

Run the SORRY-Bench base set (440 prompts) on each candidate model to get per-category fulfillment rates.

Fine-tune a 7B judge on a small human-labeled sample (≈2.6K) to automate routine safety checks cheaply.

Run 5–10 linguistic mutations (e.g., question style, one low-resource language, a persuasion template) to probe robustness quickly.
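The per-category fulfillment rates from the first step can be computed with a simple aggregation once each response has a binary judgment. A minimal sketch, assuming the judgments are already available as (category, fulfilled) pairs; the function name and the toy records are hypothetical, not the benchmark's own data format:

```python
from collections import defaultdict

def fulfillment_by_category(records):
    """Return {category: fraction of responses judged as fulfilling the request}."""
    totals = defaultdict(int)
    fulfilled = defaultdict(int)
    for category, is_fulfilled in records:
        totals[category] += 1
        fulfilled[category] += int(is_fulfilled)
    return {category: fulfilled[category] / totals[category] for category in totals}

# Hypothetical judgments: True means the model assisted with the unsafe request.
records = [
    ("Harassment", False), ("Harassment", False),
    ("Harassment", False), ("Harassment", True),
    ("Legal consulting", True), ("Legal consulting", True),
    ("Legal consulting", True), ("Legal consulting", False),
]

rates = fulfillment_by_category(records)
# List categories from most refused to least refused.
for category, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{category}: {rate:.0%} fulfillment")
```

Running the same aggregation on the mutated prompts, grouped by mutation instead of category, gives a comparable robustness view.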

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Binary refusal labeling only; does not score degrees of harmfulness.
  • Does not cover multi-category compound prompts or many worst-case jailbreak attacks.
  • Dataset may be subject to contamination if adopted into model training without private splits.

When Not To Use

  • If you need graded harmfulness scores rather than binary refusal.
  • When evaluating extreme adversarial jailbreaks not represented by the 20 mutations.
  • If you require a utility metric: a model that always refuses scores perfectly on this benchmark but may be unusable.

Failure Modes

  • Judge misclassifies nuanced responses with disclaimers as refusals or bullet lists as fulfillments (J.4 examples).
  • Fine-tuned judges may overfit to SORRY-Bench style and miss out-of-distribution jailbreak patterns.
  • Encoding/encryption mutations can produce nonsense responses that are labeled as refusal even if a human could decode intent.

Core Entities

Models

  • GPT-4o
  • GPT-3.5-turbo
  • Claude-2
  • Gemini-1.5
  • Llama-3
  • Llama-2
  • Mistral-7b-instruct
  • Gemma
  • Vicuna
  • Zephyr
  • Dolphin
  • Mixtral

Metrics

  • fulfillment rate (fraction of responses that assist unsafe request)
  • Cohen's kappa agreement
  • refusal recall
  • fulfillment recall
  • time cost per evaluation pass
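Refusal recall and fulfillment recall can be read as per-class recall of the automated judge against human labels. A minimal sketch under that assumption (the paper's exact definition is not given in this summary); the function name and example labels are illustrative:

```python
def class_recall(human_labels, judge_labels, target):
    """Fraction of items humans labeled `target` that the judge also labeled `target`."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have equal length")
    judged = [j for h, j in zip(human_labels, judge_labels) if h == target]
    if not judged:
        raise ValueError(f"no human labels equal to {target!r}")
    return sum(j == target for j in judged) / len(judged)

# Hypothetical labels for five (instruction, response) pairs.
human = ["fulfillment", "refusal", "refusal", "fulfillment", "refusal"]
judge = ["fulfillment", "refusal", "fulfillment", "refusal", "refusal"]

refusal_recall = class_recall(human, judge, "refusal")          # 2 of 3 refusals recovered
fulfillment_recall = class_recall(human, judge, "fulfillment")  # 1 of 2 fulfillments recovered
print(refusal_recall, fulfillment_recall)
```

Reporting both recalls guards against a judge that looks accurate overall only because it favors the majority class.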

Datasets

  • SORRY-Bench
  • AdvBench
  • HarmBench
  • SALAD-Bench
  • ALERT
  • StrongREJECT

Benchmarks

  • SORRY-Bench
  • HarmBench
  • SALAD-Bench
  • ALERT
  • StrongREJECT