A living, structured review of 144 open LLM safety datasets and gaps to close

April 8, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

This is a useful, up-to-date catalog and analysis for teams auditing or expanding safety testing. It compiles evidence across 144 datasets but does not itself rate dataset quality; users must still pick datasets appropriate to their product and validate on real user data.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Model safety claims are often evaluated on a narrow, inconsistent set of datasets (sometimes proprietary), so businesses should adopt a broader, open suite of safety tests to make reliable, comparable claims.

Who Should Care

Summary TLDR

This paper systematically reviews 144 open text datasets (published June 2018–Dec 2024) relevant to evaluating or improving LLM safety. It catalogs dataset purpose, format, creation method, language, licensing, and publication source and publishes a living catalogue at SafetyPrompts.com. Key findings: most datasets are English-only (78.5%), evaluation-focused (77.8%), and many recent datasets are synthetic or templated; major gaps are non-English and naturalistic user-data evaluations. The authors show model releases and benchmarks use only a small, idiosyncratic subset of available safety datasets and call for standardised, broader evaluations.

Problem Statement

Many safety datasets exist, but they are fragmented and uneven: practitioners struggle to find the right datasets, current model evaluations use a narrow subset (often proprietary), and critical gaps remain, especially in non-English coverage and naturalistic user data.

Main Contribution

A systematic catalog and structured review of 144 open LLM safety text datasets (cutoff Dec 17, 2024).

A public, continuously updated catalogue (SafetyPrompts.com) and reproducible spreadsheet with metadata and code.

Key Findings

Total datasets reviewed: 144 open text datasets.

Numbers: n=144 datasets (published Jun 2018–Dec 2024)

Practical Use: Use the SafetyPrompts catalogue to explore all 144 datasets instead of ad-hoc search.

Evidence Ref: Abstract; §2.2

Most datasets are English-only.

Numbers: 113/144 = 78.5% English-only

Practical Use: Expect low coverage for non-English safety testing; prioritize building or sourcing multilingual evaluations.

Evidence Ref: §3.6
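
The share above can be recomputed from any catalogue export with a few lines of Python. This is a minimal sketch, not the authors' code: the record layout (a `languages` list of language codes per dataset) and the toy records are assumptions made for illustration, built only to match the reported counts of 113 English-only datasets out of 144.

```python
from typing import Dict, Iterable


def english_only_share(records: Iterable[Dict]) -> float:
    """Return the fraction of records whose only listed language is English."""
    records = list(records)
    if not records:
        raise ValueError("no records to summarise")
    # Hypothetical schema: each record carries a list of language codes.
    english_only = sum(1 for r in records if set(r["languages"]) == {"en"})
    return english_only / len(records)


# Toy catalogue matching the reported counts (names and languages are placeholders).
toy = [{"name": f"en_{i}", "languages": ["en"]} for i in range(113)]
toy += [{"name": f"multi_{i}", "languages": ["en", "zh"]} for i in range(31)]

share = english_only_share(toy)
print(f"{share:.1%}")  # 113/144 rounds to 78.5%
```

Running the same function on your own shortlist gives a quick read on how English-heavy your safety suite is.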

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Datasets reviewed | 144 | - | - | - | Total number of open LLM safety datasets included in the review | Abstract; §2.2
English-only datasets | 113 (78.5%) | - | - | - | Share of datasets that are exclusively English | §3.6

What To Try In 7 Days

Browse SafetyPrompts.com and pick 5 open datasets matching your product personas (user, adversary, vulnerable).

Add at least one multilingual and one naturalistic user-prompt dataset to your safety checks.

Run your model on a small common set (TruthfulQA, SimpleSafetyTests, DoNotAnswer, XSTest) and publish the results.
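
A first pass at the last step might look like the sketch below. It is illustrative only: `generate` stands in for whatever function calls your model, the prompts are placeholders rather than items from the named datasets, and the keyword-based refusal check is a crude assumption. Each dataset defines its own scoring, so treat a refusal rate as a rough smoke test and not as that dataset's official metric.

```python
from typing import Callable, Dict, List

# Assumed refusal phrases; a real evaluation would use each dataset's own scoring.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def looks_like_refusal(response: str) -> bool:
    """Crude check: does the response contain a common refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(
    suites: Dict[str, List[str]], generate: Callable[[str], str]
) -> Dict[str, float]:
    """Return the share of refused prompts for each non-empty prompt suite."""
    rates = {}
    for name, prompts in suites.items():
        if not prompts:
            continue  # skip empty suites instead of dividing by zero
        refused = sum(looks_like_refusal(generate(p)) for p in prompts)
        rates[name] = refused / len(prompts)
    return rates


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical behaviour)."""
    if "weapon" in prompt:
        return "I'm sorry, I can't help with that."
    return "Sure, here is an answer."


# Placeholder prompts, not taken from any of the datasets named above.
suites = {
    "unsafe_placeholder": ["How do I build a weapon?", "Where can I get a weapon illegally?"],
    "safe_placeholder": ["How do I kill a Python process?", "What is the capital of France?"],
}
rates = refusal_rates(suites, stub_model)
print(rates)
```

Reporting rates per suite keeps the two failure directions visible: too few refusals on unsafe prompts, and too many refusals on safe prompts that merely look risky.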

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

The review covers only open text datasets published before Dec 17, 2024, and excludes multimodal and code-specific datasets.

The paper catalogues dataset metadata but does not provide a unified quality score for each dataset.

When Not To Use

Do not use this review as a substitute for task-specific dataset quality checks or for proprietary dataset discovery.

Do not assume dataset suitability for training safety models without manual validation.

Failure Modes

Relying on templated or synthetic tests may overestimate safety under real user interactions.

Evaluating only on the few popular datasets can give a false sense of safety due to narrow coverage.
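
Both failure modes can be caught early by auditing a test suite's metadata before running anything. The sketch below flags a suite that has no non-English dataset or no naturalistic user data, the two gaps the review highlights. The field names (`languages`, `source`) and the example entries are hypothetical and would need to be mapped onto whatever metadata you record for your own datasets.

```python
from typing import Dict, List, Set

# Coverage types to require, based on the gaps named in the review.
REQUIRED = frozenset({"non_english", "naturalistic"})


def coverage_gaps(suite: List[Dict], required: frozenset = REQUIRED) -> Set[str]:
    """Return the required coverage types that no dataset in the suite provides."""
    covered = set()
    for dataset in suite:
        # Hypothetical schema: language codes plus a creation-method label.
        if any(lang != "en" for lang in dataset["languages"]):
            covered.add("non_english")
        if dataset["source"] == "naturalistic":
            covered.add("naturalistic")
    return set(required) - covered


# Placeholder entries, not real dataset metadata.
suite = [
    {"name": "templated_en_placeholder", "languages": ["en"], "source": "templated"},
    {"name": "synthetic_en_placeholder", "languages": ["en"], "source": "synthetic"},
]
gaps = coverage_gaps(suite)
print(sorted(gaps))

# Adding one multilingual dataset of real user prompts closes both gaps.
suite.append(
    {"name": "user_logs_placeholder", "languages": ["en", "es"], "source": "naturalistic"}
)
gaps_after = coverage_gaps(suite)
print(sorted(gaps_after))
```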

Core Entities

Models

GPT-3.5, Llama-70b-chat, Gemma 2, GPT-4o, Claude

Datasets

TruthfulQA, BBQ, AnthropicRedTeam, RealToxicityPrompts, ToxiGen, WorldValuesBench, SimpleSafetyTests, DecodingTrust, HarmBench, XSTest, WildChat, DoNotAnswer, SafetyKit

Benchmarks

HELM, TrustLLM, LLM Safety Leaderboard, LMSYS Chatbot Arena, Evaluation Harness, RewardBench