A Chinese LLM safety benchmark plus 100k augmented safety prompts

April 20, 20237 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

16

Authors

Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang

Links

Abstract / PDF

Why It Matters For Business

A simple, repeatable Chinese safety benchmark and a 100k prompt library let product and security teams run systematic red-teaming and compare model choices quickly.

Summary TLDR

The paper builds a Chinese-focused safety benchmark that tests 14 safety dimensions (8 typical scenarios + 6 instruction-attacks), evaluates 15 LLMs using an LLM-based automatic judge, and releases SAFETYPROMPTS — a 100k augmented prompt/response library. Key findings: instruction attacks reveal more failures than typical prompts, ChatGPT ranks highest especially on instruction attacks, and automated evaluation relies on an LLM evaluator (InstructGPT) with simple verbalization rules. Use the benchmark for red-teaming and comparative checks, but not as a final safety certification because the evaluator and single-turn tests have known blind spots.

Problem Statement

Large Chinese LLMs can produce insulting, biased, illegal, or harmful outputs. There was no large, public Chinese safety benchmark or prompt library for red-teaming and comparison. The paper fills that gap with a taxonomy, test prompts, automatic LLM-based scoring, a leaderboard, and a 100k augmented prompt dataset.

Main Contribution

A structured safety taxonomy: 8 typical scenarios and 6 instruction-attack types for Chinese LLMs.

A manually written test prompt set (∼8.9k prompts) split into public and private test sets.

An automatic safety evaluator that uses an LLM (InstructGPT) to label responses as safe/unsafe.

A public leaderboard that ranks 15 evaluated LLMs across 14 safety dimensions.

SAFETYPROMPTS: a released library of 100k augmented prompts and model responses for red-teaming and tuning.

Key Findings

Instruction attacks are consistently harder for models than typical safety prompts.

ChatGPT ranks highest on overall safety and especially on instruction attacks.

NumbersGoal Hijacking: ChatGPT > next model by >20 points

The authors released a 100k augmented prompt and response library named SAFETYPROMPTS.

Numbers100k augmented prompts; per-category counts in Table 2

Evaluation is automated with an LLM evaluator (InstructGPT) and a simple verbalizer rule.

NumbersEvaluator uses greedy decoding and flags unsafe when generation contains '不' (no)

Manually collected test prompts total about 8,881 examples across 14 scenarios.

Numbers8,881 prompts (Table 1 sums per category)

Even ChatGPT produced unsafe responses in the augmented dataset at low rates.

NumbersChatGPT unsafe ratio ≈ 1.6%

Results

Instruction-attack vs typical safety score

ValueInstruction-attack scores are lower than typical scenario scores for every evaluated model

Goal Hijacking gap

ValueChatGPT leads by >20 points

Baselinesecond-best model

Size of supplemented prompts

Value100k augmented prompts and responses

Baselinepublic test prompts (~2k)

ChatGPT unsafe response rate (on augmented set)

Value≈1.6% unsafe

Who Should Care

What To Try In 7 Days

Run your model against SAFETYPROMPTS to find easy-to-trigger failures.

Add the six instruction-attack types (goal hijacking, role-play, etc.) to your QA checks.

Use an LLM evaluator to triage candidate failures, then verify high-risk cases with humans.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation relies on an LLM judge (InstructGPT) and a simple verbalizer, which can introduce bias and false labels.
  • Tests are mostly single-turn; multi-turn conversations are not covered.
  • Sensitive Topics prompts were partially withheld, reducing full transparency.
  • Decoding choices and randomness can change model outputs and scores.

When Not To Use

  • Do not use benchmark scores as the only evidence of safety for deployment decisions.
  • Avoid relying solely on the LLM evaluator for high-risk content without human review.
  • Do not treat single-turn results as representative of multi-turn interactions.

Failure Modes

  • LLM evaluator misses subtle unsafe content (false negatives).
  • Adversarially crafted multi-turn prompts bypass single-turn tests.
  • Augmentation process can produce highly toxic examples that bias tuning if used blindly.

Core Entities

Models

  • ChatGPT
  • GPT-4
  • InstructGPT (text-davinci-003)
  • GPT-3.5-turbo
  • ChatGLM
  • MiniChat
  • OPD
  • EVA

Metrics

  • Per-scenario safety score (proportion of safe responses)
  • Macro-average typical-scenarios score (Ā)
  • Macro-average instruction-attacks score (Ḃ)
  • Overall safety score (S)

Datasets

  • Manually-written safety prompts (∼8.9k, Table 1)
  • SAFETYPROMPTS (100k augmented prompts and responses)

Benchmarks

  • Chinese LLM Safety Assessment (8 scenarios + 6 instruction attacks)
  • Leaderboard at http://coai.cs.tsinghua.edu.cn/leaderboard/