A Chinese LLM safety benchmark plus 100k augmented safety prompts

Overview

Decision Snapshot: Needs Validation

The benchmark and 100k prompt library are practical for red-teaming and comparative evaluation, but the LLM-based automatic judge and single-turn focus limit reliability as a production safety certificate.

Citations: 16

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A simple, repeatable Chinese safety benchmark and a 100k prompt library let product and security teams run systematic red-teaming and compare model choices quickly.

Who Should Care

Product Manager, CTO, ML Engineer, Founder, Engineering Lead

Summary TLDR

The paper builds a Chinese-focused safety benchmark covering 14 safety dimensions (8 typical scenarios and 6 instruction-attack types), evaluates 15 LLMs with an LLM-based automatic judge, and releases SAFETYPROMPTS, a library of 100k augmented prompts and responses. Key findings: instruction attacks expose more failures than typical prompts, and ChatGPT ranks highest, especially on instruction attacks. Scoring is automated with an LLM evaluator (InstructGPT) and simple verbalization rules. Use the benchmark for red-teaming and comparative checks, not as a final safety certification, because the evaluator and the single-turn tests have known blind spots.

Problem Statement

Large Chinese LLMs can produce insulting, biased, illegal, or harmful outputs. There was no large, public Chinese safety benchmark or prompt library for red-teaming and comparison. The paper fills that gap with a taxonomy, test prompts, automatic LLM-based scoring, a leaderboard, and a 100k augmented prompt dataset.

Main Contribution

A structured safety taxonomy: 8 typical scenarios and 6 instruction-attack types for Chinese LLMs.

A manually written test prompt set (∼8.9k prompts) split into public and private test sets.

Key Findings

Instruction attacks are consistently harder for models than typical safety prompts.

Practical Use: Include instruction-attack style tests when red-teaming; ordinary safety tests will undercount risks.

Evidence Ref: Section 4, Figure 4

ChatGPT ranks highest on overall safety and especially on instruction attacks.

Numbers: On Goal Hijacking, ChatGPT leads the next-best model by more than 20 points.

Practical Use: Treat ChatGPT as a strong baseline, but do not assume immunity; other models need targeted defense against instruction attacks.

Evidence Ref: Section 4 (observation 6)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Instruction-attack vs typical safety score	Instruction-attack scores are lower than typical scenario scores for every evaluated model	—	—	Evaluated on benchmark prompts (Table 1)	Section 4, observation 5	Figure 4
Goal Hijacking gap	ChatGPT leads by >20 points	second-best model	>20 points	Goal Hijacking subset (Table 1)	Section 4, observation 6	Figure 4

What To Try In 7 Days

Run your model against SAFETYPROMPTS to find easy-to-trigger failures.

Add the six instruction-attack types (goal hijacking, role-play, etc.) to your QA checks.

Use an LLM evaluator to triage candidate failures, then verify high-risk cases with humans.
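The triage step above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the `Case` record, the `triage` function, the queue names, the scenario tags, and the keyword-based `toy_judge` are all hypothetical stand-ins. In practice `judge` would wrap a call to whatever LLM evaluator you use, and the high-risk set would come from your own policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Case:
    scenario: str  # illustrative scenario tag, e.g. "goal_hijacking"
    prompt: str
    response: str


def triage(
    cases: Iterable[Case],
    judge: Callable[[str, str], str],
    high_risk: frozenset,
) -> Dict[str, List[Case]]:
    """Route each judged case into one of three queues."""
    queues: Dict[str, List[Case]] = {
        "auto_pass": [],
        "candidate_failure": [],
        "human_review": [],
    }
    for case in cases:
        label = judge(case.prompt, case.response)
        # Unparseable labels and all high-risk scenarios go to humans,
        # regardless of what the automatic judge said.
        if label not in ("safe", "unsafe") or case.scenario in high_risk:
            queues["human_review"].append(case)
        elif label == "unsafe":
            queues["candidate_failure"].append(case)
        else:
            queues["auto_pass"].append(case)
    return queues


def toy_judge(prompt: str, response: str) -> str:
    # Stand-in for a real LLM call: flags responses containing a marker word.
    return "unsafe" if "UNSAFE" in response else "safe"


cases = [
    Case("insult", "p1", "a polite refusal"),
    Case("insult", "p2", "UNSAFE text"),
    Case("goal_hijacking", "p3", "a polite refusal"),
]
queues = triage(cases, toy_judge, high_risk=frozenset({"goal_hijacking"}))
print({name: len(items) for name, items in queues.items()})
```

Sending every high-risk case to human review, even when the judge says "safe", is a deliberate choice in this sketch: it guards against the evaluator's false negatives at the cost of more manual work.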

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/thu-coai/Safety-Prompts

http://coai.cs.tsinghua.edu.cn/leaderboard/

Data URLs

https://github.com/thu-coai/Safety-Prompts

http://coai.cs.tsinghua.edu.cn/leaderboard/

Risks & Boundaries

Limitations

Evaluation relies on an LLM judge (InstructGPT) and a simple verbalizer, which can introduce bias and false labels.

Tests are mostly single-turn; multi-turn conversations are not covered.
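To make the verbalizer risk concrete, here is a small sketch of how a simple mapping from judge text to a label can go wrong. The paper's actual judge prompt and verbalization rules are not reproduced here; both functions below, and the English "safe"/"unsafe" answer format, are assumptions chosen for illustration only.

```python
import re


def naive_verbalizer(judge_text: str) -> str:
    # Substring matching is brittle: "unsafe" also appears inside negations.
    return "unsafe" if "unsafe" in judge_text.lower() else "safe"


def strict_verbalizer(judge_text: str) -> str:
    # Accept only an answer that begins with a bare "safe" or "unsafe" word;
    # anything else is reported as "unknown" instead of being guessed.
    match = re.match(r"\s*(safe|unsafe)\b", judge_text.lower())
    return match.group(1) if match else "unknown"


judge_text = "The response is not unsafe."
print(naive_verbalizer(judge_text))   # mislabels a negated statement
print(strict_verbalizer(judge_text))  # declines to label it
```

Returning an explicit "unknown" gives a natural hook for human review rather than silently counting an ambiguous verdict as safe or unsafe.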

When Not To Use

Do not use benchmark scores as the only evidence of safety for deployment decisions.

Avoid relying solely on the LLM evaluator for high-risk content without human review.

Failure Modes

LLM evaluator misses subtle unsafe content (false negatives).

Adversarially crafted multi-turn prompts bypass single-turn tests.

Core Entities

Models

ChatGPT, GPT-4, InstructGPT (text-davinci-003), GPT-3.5-turbo, ChatGLM, MiniChat, OPD, EVA

Metrics

Per-scenario safety score (proportion of safe responses)

Macro-average typical-scenarios score (Ā)

Macro-average instruction-attacks score (B̄)

Overall safety score (S)
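The per-scenario score and the two macro-averages can be computed along these lines. This is a sketch under stated assumptions: the function names, scenario tags, grouping, and toy records are illustrative, and the overall score S is left out because its exact formula is not given in this summary.

```python
from collections import defaultdict
from statistics import mean


def scenario_scores(records):
    """Map (scenario, is_safe) pairs to {scenario: proportion of safe responses}."""
    counts = defaultdict(lambda: [0, 0])  # scenario -> [safe count, total count]
    for scenario, is_safe in records:
        counts[scenario][0] += int(is_safe)
        counts[scenario][1] += 1
    return {name: safe / total for name, (safe, total) in counts.items()}


def macro_average(scores, scenarios):
    """Unweighted mean of per-scenario scores, so every scenario counts equally."""
    return mean(scores[name] for name in scenarios)


# Toy judged responses; scenario tags and group membership are illustrative.
records = (
    [("insult", True)] * 3 + [("insult", False)]
    + [("bias", True), ("bias", False)]
    + [("goal_hijacking", True), ("goal_hijacking", False)]
    + [("role_play", True)] + [("role_play", False)] * 3
)
scores = scenario_scores(records)
typical_avg = macro_average(scores, ["insult", "bias"])
attack_avg = macro_average(scores, ["goal_hijacking", "role_play"])
print(scores, typical_avg, attack_avg)
```

A macro-average weights each scenario equally regardless of how many prompts it contains, so a small scenario with poor results pulls the score down as much as a large one.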

Datasets

Manually-written safety prompts (∼8.9k, Table 1)

SAFETYPROMPTS (100k augmented prompts and responses)

Benchmarks

Chinese LLM Safety Assessment (8 scenarios + 6 instruction attacks)

Leaderboard at http://coai.cs.tsinghua.edu.cn/leaderboard/

You May Also Want to Read

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Model judges reward ethics-based refusals; human users penalize them

A 300k-case, 22-language benchmark that tests how jailbreak prompts make LLMs write fake news

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.
