Safety Benchmarks Papers — Parsed & Scored for Practitioners

A single-source survey of how we test LLMs: benchmarks, gaps, and practical directions

0.60

0.40

0.60

61

LLM evaluations show accuracy alone is insufficient: businesses must test truthfulness, bias, tool use, and robustness to avoid legal risks, bad UX, or harmful outputs.

Key finding

Public adoption exploded: ChatGPT reached 100 million users within two months of launch.

Numbers: 100M users in two months

Llama Guard — an adaptable LLM filter that flags unsafe user prompts and AI responses

0.70

0.35

0.45

44

Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.

Key finding

High in-policy classification performance on internal test set.

Numbers: AUPRC prompt=0.945; response=0.953 (Table 2)
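
To make the AUPRC figures concrete, here is a minimal sketch of one common estimator, the step-wise average-precision sum. The summary does not say which estimator the paper uses, so treat this as an assumption. The function name `average_precision` and the sample labels and scores are hypothetical illustrations, not Llama Guard data.

```python
def average_precision(labels, scores):
    """Estimate area under the precision-recall curve (average precision).

    labels: 1 for unsafe (positive), 0 for safe.
    scores: classifier confidence that the item is unsafe.
    Tied scores are not treated specially in this sketch.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")

    # Rank examples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where recall increases.
            precision_sum += true_positives / rank
    return precision_sum / total_positives


# Hypothetical example: five prompts, three of them unsafe.
example_labels = [1, 0, 1, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_ap = average_precision(example_labels, example_scores)
```

A value near 1.0 means unsafe items are ranked above safe ones almost everywhere along the curve, which is how the 0.945 and 0.953 figures above should be read.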

Decouple helpfulness and harmlessness, then use a Lagrangian Safe-RL step to trade off both during RLHF

0.50

0.60

0.40

20

Safe RLHF lets you improve usefulness without sacrificing safety by separating helpfulness and harmlessness labels and using a dynamic constraint; this sharply reduces harmful outputs while preserving or increasing helpfulness, lowering moderation load and risk.

Key finding

Iterative Safe RLHF reduced measured harmful responses from Alpaca-7B's 53.08% to 2.45% on the paper's evaluation set.

Numbers: Harmful probability 53.08% → 2.45%
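
The "dynamic constraint" idea can be illustrated with a generic Lagrangian relaxation: maximize helpfulness reward subject to a cap on harmfulness cost, with a multiplier that grows while the cap is violated. The sketch below is an assumption about the general technique, not the paper's exact training code. The names `lagrangian_objective`, `update_multiplier`, `cost_limit`, and `step_size` are hypothetical.

```python
def lagrangian_objective(reward, cost, lam):
    """Scalar objective the policy would maximize: reward minus weighted cost."""
    return reward - lam * cost


def update_multiplier(lam, mean_cost, cost_limit, step_size):
    """Dual-ascent step: raise lam when mean cost exceeds the limit, lower it
    otherwise, and never let it drop below zero."""
    return max(0.0, lam + step_size * (mean_cost - cost_limit))


# Hypothetical trace: measured mean cost falls as the policy becomes safer.
lam = 1.0
history = []
for mean_cost in [0.5, 0.3, 0.1, -0.2, -0.4]:
    lam = update_multiplier(lam, mean_cost, cost_limit=0.0, step_size=0.1)
    history.append(lam)
```

In this trace the multiplier rises while the measured cost is above the limit and falls once it drops below, so the penalty on harmfulness adapts during training rather than staying at a fixed weight.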

SafetyBench: a bilingual 11,435-question multiple-choice benchmark to measure LLM safety across 7 categories

0.70

0.50

0.70

18

SafetyBench offers a fast, low-cost way to detect safety weaknesses across many categories and languages, helping teams find generation risks before user exposure.

Key finding

SafetyBench size and coverage

Numbers: 11,435 multiple-choice questions across 7 safety categories

A Chinese LLM safety benchmark plus 100k augmented safety prompts

0.50

0.60

16

A simple, repeatable Chinese safety benchmark and a 100k prompt library let product and security teams run systematic red-teaming and compare model choices quickly.

Key finding

Instruction attacks are consistently harder for models than typical safety prompts.

Chain-of-Utterances prompts reliably jailbreak LLMs; fine-tuning on curated safe conversations reduces harm.

0.50

0.60

0.40

16

CoU-style prompts often bypass deployed guardrails; test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.

Key finding

RED-EVAL jailbreaks widely deployed closed-source APIs frequently.

Numbers: GPT-4 ASR 0.651; ChatGPT ASR 0.728 on tested harmful prompts

CYBERSECEVAL: a broad benchmark measuring insecure code and malicious compliance in code-capable LLMs

0.70

0.80

0.70

15

Code-capable LLMs frequently suggest insecure code and may comply with malicious requests, so firms should test models automatically and add safety controls before deployment.

Key finding

Models produced vulnerable code a substantial fraction of the time.

Numbers: 30% of completions were vulnerable on CYBERSECEVAL tests

A module-oriented survey that maps safety risks, defenses, and benchmarks across input, model, toolchain, and output components

0.60

0.40

0.50

13

Mapping risks to system modules lets teams prioritize fixes (input guards, data curation, toolchain hardening, output filters) and reduce privacy, legal, and outage risks.

Key finding

LLM risks are multi-source and map cleanly to system modules.

Numbers: taxonomy: 4 modules, 12 risks, 44 sub-topics

WMDP: a public 3,668-question benchmark plus RMU unlearning to measure and remove hazardous LLM knowledge

0.60

0.70

13

WMDP + RMU let providers reduce hazardous knowledge in served models and demonstrate a practical mitigation that preserves most useful capabilities, lowering legal and reputational risk from malicious model use.

Key finding

WMDP is a sizable, vetted public benchmark for hazardous knowledge.

Numbers: 3,668 multiple-choice questions; development cost >$200K

CVALUES: a Chinese benchmark that measures LLMs on safety (rejecting harms) and responsibility (giving helpful, caring guidance).

0.60

0.50

0.40

13

Safety tuning reduces obvious harms, but models still fail to give responsible, empathetic, or legally careful answers; firms should test both rejection (safety) and guidance (responsibility) before deployment.

Key finding

Instruction‑tuned Chinese LLMs score high on human‑annotated safety.

Numbers: ChatGPT 96.9; Chinese‑Alpaca‑Plus‑7B 95.3; ChatGLM‑6B 95 (Table 2)

Latent-jailbreak benchmark: test if hidden malicious text breaks model safety or instruction following

0.50

0.60

0.50

11

Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.

Key finding

Different models have very different jailbreak success rates on the same prompts.

Numbers: P1 jailbreak success: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6% (Table 4)

Trainable watermarking that injects more bits, preserves meaning, and resists removal

0.75

0.60

9

A practical watermarking layer lets API owners tag model outputs with recoverable signatures to prove origin, deter plagiarism, and monitor misuse without breaking text quality or adding large latency.

Key finding

REMARK-LLM embeds more signature bits per text than prior neural watermarking.

Numbers: ~2× more bits vs AWT on evaluated segments

GPT-4 agents autonomously exploit sandboxed website vulnerabilities (11/15) and find at least one real XSS

0.20

0.60

8

High-capability LLM agents can automate complex web attacks at lower estimated cost than manual analysts, increasing the risk surface for companies that expose web interfaces.

Key finding

GPT-4 agent succeeded on most sandboxed vulnerabilities

Numbers: Pass@5 = 73.3%; overall success = 42.7% (Table 2)
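
The two figures measure different things: Pass@5 counts a vulnerability as solved if any of five attempts succeeds (11 of 15 is 73.3%), while overall success is the share of individual attempts that succeed. A minimal sketch of both, assuming per-attempt boolean outcomes; the function names and the sample outcomes are hypothetical, not the paper's data.

```python
def pass_at_k(attempts_by_task):
    """Fraction of tasks where at least one attempt succeeded."""
    passed = sum(1 for attempts in attempts_by_task if any(attempts))
    return passed / len(attempts_by_task)


def overall_success(attempts_by_task):
    """Fraction of all individual attempts that succeeded."""
    flat = [ok for attempts in attempts_by_task for ok in attempts]
    return sum(flat) / len(flat)


# Hypothetical outcomes for three tasks with two attempts each.
sample = [[True, False], [False, False], [False, True]]
sample_pass = pass_at_k(sample)           # 2 of 3 tasks solved at least once
sample_overall = overall_success(sample)  # 2 of 6 attempts succeeded

# Arithmetic behind the headline figure: 11 of 15 vulnerabilities.
headline = f"{11 / 15:.1%}"
```

Because one success in five tries is enough for Pass@5, it will always be at least as high as the per-attempt success rate.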

A public benchmark that measures prompt injection, interpreter abuse, exploit generation, and a safety-utility tradeoff for LLMs

0.70

0.60

0.40

8

LLMs can disobey system instructions and help abuse attached interpreters; measuring these behaviors helps product and security teams choose models, add guardrails, and quantify user-experience tradeoffs.

Key finding

Prompt injections still succeed on modern models.

Numbers: Average injection success ≈ 28%; per-model range reported 13%–47%

AnimaLLM: a prototype that scores LLM outputs for truthfulness and how well they consider animals' interests

0.20

0.50

0.20

7

LLMs can embed species and welfare biases that affect products (education, vet advice, policy tools); measuring these biases early helps avoid reputational, legal, or welfare harms.

Key finding

AnimaLLM produced comprehensive score sets for two commercial LLMs.

Numbers: 3,264 S1 and 3,264 S2 scores per model

OR-Bench: a large, automated dataset to measure when LLMs wrongly refuse safe prompts

0.70

0.60

0.50

7

Over-refusal hurts user experience: safety tuning that blocks more toxic content can also reduce helpfulness and raise support costs. Measure both safety and false refusals to protect product usability.

Key finding

Safety and over-refusal are highly correlated.

Numbers: Spearman ρ = 0.89 (OR-Bench-Hard-1K)
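
Spearman's ρ is the Pearson correlation of the ranks, so a value of 0.89 means models that rank higher on safety also tend to rank higher on over-refusal. Below is a minimal pure-Python sketch; the function names and the per-model numbers are hypothetical, not OR-Bench results.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores: safety rate and over-refusal rate.
safety = [0.2, 0.5, 0.7, 0.9]
over_refusal = [0.1, 0.4, 0.3, 0.8]
rho = spearman_rho(safety, over_refusal)
```

Because only ranks matter, the statistic is insensitive to the scale of either metric, which suits comparisons across models scored on different ranges.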

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

0.60

0.40

6

MEDIC gives practical, faster checks for clinical readiness: it flags operational failures and hallucinations that standard exams miss, reducing deployment risk before costly pilots.

Key finding

Static knowledge benchmarks are saturated, but operational tasks lag far behind.

Numbers: Knowledge median >75% vs operational median <40% (Fig. 4a)

Sallm: an automated benchmark, dataset, and metrics for measuring code-security of LLMs

0.60

0.65

0.45

6

Generated code can be functionally correct but insecure; automated security benchmarks reveal trade-offs between correctness and security so teams can pick models and pipelines that match risk tolerance.

Key finding

Repair component greatly increases executability of model outputs.

Numbers: compilation rate from 15% to 75% (avg); GPT-4 from <1% to 89%

Benchmark: Vision LLMs handle odd images but break on counterfactual text and simple ViT attacks

0.60

0.50

0.60

4

If you deploy image+text models, simple visual attacks and text changes can break behavior; test both inputs and add safety-aware visual instruction tuning before release.

Key finding

VLLMs answer OOD visual yes/no questions very well but fail when text is counterfactual.

Numbers: Yes/No accuracy ≥95% on OOD images; counterfactual overall drop 17.1%, Yes/No drop 33.2% (Table 5)

FFT: a 2,116-instance benchmark that measures LLM factuality, fairness, and toxicity

0.40

0.55

0.30

4

FFT shows models can spread wrong facts, make biased decisions, or appear safe out of context; companies must test models for factual errors and context-aware toxicity before using them in products.

Key finding

Factuality is weak, especially on counterfactual prompts.

Numbers: Table 4: GPT-4 overall factuality 0.54; counterfacts accuracy 0.254

SafeEdit benchmark plus a one-example editing method (DINM) that erases toxic model regions to reduce jailbreaks

0.60

0.70

4

You can materially reduce many jailbreak-style safety failures by editing a single model layer with one curated example, saving compute and time compared to full re-alignment while keeping most capabilities.

Key finding

DINM strongly improves generalized detoxification on two tested models.

Numbers: DG-Avg LLaMA2-7B-Chat: 43.51% → 86.74%; Mistral-7B-v0.1: 47.30% → 96.84%

A living, structured review of 144 open LLM safety datasets and gaps to close

0.60

0.50

0.40

4

Model safety claims are often evaluated on a narrow, inconsistent set of datasets (sometimes proprietary), so businesses should adopt a broader, open suite of safety tests to make reliable, comparable claims.

Key finding

Total datasets reviewed: 144 open text datasets.

Numbers: n=144 datasets (published Jun 2018–Dec 2024)

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

0.85

0.42

0.65

4

SORRY-Bench lets product and risk teams measure whether a model will refuse harmful requests across many specific topics and prompt styles; this helps set provider and model selection policy and reduces surprise from prompt variants.

Key finding

SORRY-Bench provides balanced coverage across 44 fine-grained safety categories.

Numbers: 44 categories; 440 base instructions (10 per class).

LocalValueBench: a lightweight benchmark to test LLM alignment with Australian values

0.40

0.45

0.30

4

Models deployed in a region must match local legal and cultural expectations; using a local benchmark uncovers misalignment, refusal behaviors, and reviewer subjectivity before real users encounter them.

Key finding

Claude 3 Sonnet scored highest on average for Australian value alignment.

Numbers: mean=3.725 (scale 1–5)