A living, structured review of 144 open LLM safety datasets and gaps to close

April 8, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

4

Authors

Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy

Links

Abstract / PDF

Why It Matters For Business

Model safety claims are often evaluated on a narrow, inconsistent set of datasets (sometimes proprietary), so businesses should adopt a broader, open suite of safety tests to make reliable, comparable claims.

Summary TLDR

This paper systematically reviews 144 open text datasets (published June 2018–Dec 2024) relevant to evaluating or improving LLM safety. It catalogs dataset purpose, format, creation method, language, licensing, and publication source and publishes a living catalogue at SafetyPrompts.com. Key findings: most datasets are English-only (78.5%), evaluation-focused (77.8%), and many recent datasets are synthetic or templated; major gaps are non-English and naturalistic user-data evaluations. The authors show model releases and benchmarks use only a small, idiosyncratic subset of available safety datasets and call for standardised, broader evaluations.

Problem Statement

Many safety datasets exist, but they are fragmented and uneven: practitioners struggle to find the right datasets, current model evaluations use a narrow subset (often proprietary), and critical gaps remain, especially in non-English coverage and naturalistic user data.

Main Contribution

A systematic catalog and structured review of 144 open LLM safety text datasets (cutoff Dec 17, 2024).

A public, continuously updated catalogue (SafetyPrompts.com) and reproducible spreadsheet with metadata and code.

Analysis of dataset properties (purpose, format, creation, language, license, publication) and how datasets are used in model releases and benchmarks.

Concrete diagnosis of gaps: dominance of English and lack of naturalistic datasets, and recommendations for more standardised evaluation practices.
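As a rough illustration of how structured metadata of this kind can be queried, the sketch below filters a small in-memory table for non-English datasets usable for evaluation. The column names (`name`, `purpose`, `language`, `license`) and the three rows are hypothetical placeholders, not the actual layout or contents of the SafetyPrompts.com spreadsheet.

```python
import csv
import io

# Hypothetical extract: column names and rows are illustrative only and
# do not reproduce the real SafetyPrompts.com spreadsheet.
SAMPLE_CSV = """name,purpose,language,license
ExampleSetA,eval only,English,MIT
ExampleSetB,train and eval,Multilingual,CC BY 4.0
ExampleSetC,eval only,Chinese,Apache 2.0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def non_english_eval_sets(rows):
    """Return names of datasets that are not English-only and support evaluation."""
    return [
        row["name"]
        for row in rows
        if row["language"] != "English" and "eval" in row["purpose"]
    ]


rows = load_rows(SAMPLE_CSV)
selected = non_english_eval_sets(rows)
print(selected)
```

With the real spreadsheet, the same pattern would apply after swapping in its actual column names and value conventions.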

Key Findings

Total datasets reviewed: 144 open text datasets.

Numbers: n=144 datasets (published Jun 2018–Dec 2024)

Most datasets are English-only.

Numbers: 113/144 = 78.5% English-only

Most datasets are for evaluation, not training.

Numbers: 112/144 = 77.8% intended for eval only

Recent growth and reuse: 2023–24 saw rapid dataset production and higher reuse.

Numbers: 43 datasets in 2023 (29.9% of 144); 59 datasets in 2024, of which 34 (57.6%) reused older data

Synthetic and templated generation is common in recent datasets.

Numbers: 21 of 102 datasets since 2023 (20.6%) fully synthetic; 26/144 (18.1%) use templates

Model release safety evaluations are narrow and often proprietary.

Numbers: Only 14 open datasets used across 29 model releases; 11/16 model releases that report safety use undisclosed proprietary data
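The percentages in the findings above follow directly from the reported counts. A minimal sketch that recomputes them is shown here; the dictionary keys are labels chosen for this example, and the denominator of 102 for "since 2023" is the sum of the 2023 and 2024 counts (43 + 59).

```python
# Counts as reported in the Key Findings.
TOTAL = 144

counts = {
    "english_only": (113, TOTAL),
    "eval_only": (112, TOTAL),
    "published_2023": (43, TOTAL),
    "reused_older_data_2024": (34, 59),
    "fully_synthetic_since_2023": (21, 43 + 59),
    "templated": (26, TOTAL),
}


def share(numerator, denominator):
    """Return the percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


shares = {name: share(n, d) for name, (n, d) in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```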

Results

Datasets reviewed

Value: 144

English-only datasets

Value: 113 (78.5%)

Eval-only datasets

Value: 112 (77.8%)

Datasets published in 2024

Value: 59

Baseline (2023): 43

Open datasets used in model releases

Value: 14

Who Should Care

What To Try In 7 Days

Browse SafetyPrompts.com and pick 5 open datasets matching your product personas (user, adversary, vulnerable).

Add at least one multilingual and one naturalistic user-prompt dataset to your safety checks.

Run your model on a small common set (TruthfulQA, SimpleSafetyTests, DoNotAnswer, XSTest) and publish the results.
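One way to start on the last step is a small loop that sends each prompt to the model and reports a per-dataset refusal rate. The sketch below is only a scaffold under stated assumptions: the prompts are hypothetical placeholders rather than items from the real test sets, `stub_model` stands in for an actual model call, and the keyword-based `is_refusal` check is a crude heuristic, not the scoring method used by any of the named datasets.

```python
from collections import defaultdict

# Hypothetical placeholder prompts; in practice, load the real test sets
# from the sources listed on SafetyPrompts.com.
PROMPTS = [
    {"dataset": "SimpleSafetyTests", "prompt": "placeholder unsafe request 1"},
    {"dataset": "SimpleSafetyTests", "prompt": "placeholder unsafe request 2"},
    {"dataset": "XSTest", "prompt": "placeholder benign request"},
]

# Assumption: a crude keyword heuristic for spotting refusals.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; refuses prompts containing 'unsafe'."""
    if "unsafe" in prompt:
        return "I'm sorry, but I can't help with that."
    return "Sure, here is some information."


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(prompts, model):
    """Compute the fraction of refused prompts for each dataset."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for item in prompts:
        totals[item["dataset"]] += 1
        if is_refusal(model(item["prompt"])):
            refusals[item["dataset"]] += 1
    return {name: refusals[name] / totals[name] for name in totals}


rates = refusal_rates(PROMPTS, stub_model)
print(rates)
```

How to read a refusal rate depends on the dataset: a high rate is desirable on prompts that are genuinely unsafe, while on benign prompts that merely look unsafe it signals over-refusal. Keyword matching will also miss many refusals and flag some compliant answers, so manual review or a stronger classifier is advisable before publishing results.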

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • The review only covers open text datasets published before Dec 17, 2024 and excludes multimodal and code-specific datasets.
  • The paper catalogues dataset metadata but does not provide a unified quality score for each dataset.
  • Despite a community-driven search, the review may still miss some relevant datasets.

When Not To Use

  • Do not use this review as a substitute for task-specific dataset quality checks or for proprietary dataset discovery.
  • Do not assume dataset suitability for training safety models without manual validation.

Failure Modes

  • Relying on templated or synthetic tests may overestimate safety under real user interactions.
  • Evaluating only on the few popular datasets can give a false sense of safety due to narrow coverage.
  • Running only English tests leaves multilingual safety blind spots.

Core Entities

Models

  • GPT-3.5
  • Llama-70b-chat
  • Gemma 2
  • GPT-4o
  • Claude

Datasets

  • TruthfulQA
  • BBQ
  • AnthropicRedTeam
  • RealToxicityPrompts
  • ToxiGen
  • WorldValuesBench
  • SimpleSafetyTests
  • DecodingTrust
  • HarmBench
  • XSTest
  • WildChat
  • DoNotAnswer
  • SafetyKit

Benchmarks

  • HELM
  • TrustLLM
  • LLM Safety Leaderboard
  • LMSYS Chatbot Arena
  • Evaluation Harness
  • RewardBench