CVALUES: a Chinese benchmark that measures LLMs on safety (rejecting harms) and responsibility (giving helpful, caring guidance).

Overview

Decision Snapshot: Needs Validation

The benchmark provides actionable tests and open data, but its results are limited to the tested models, specific prompt pools, and a Chinese cultural context; use both automatic and human checks before rollout.

Citations: 13

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, Ji Zhang, Chao Peng, Fei Huang, Jingren Zhou

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Safety tuning reduces obvious harms, but models still fail to give responsible, empathetic, or legally careful answers; firms should test both rejection (safety) and guidance (responsibility) before deployment.

Who Should Care

Product Manager, CTO, ML Engineer, Founder, Engineering Lead

Summary TLDR

CVALUES is a Chinese-focused benchmark that tests large language models on two value levels: Level‑1 safety (don’t produce harmful content) and Level‑2 responsibility (provide caring, societally aware guidance). The authors collected 2,100 adversarial and expert prompts, ran both human annotation and a multi‑choice automatic test (4,312 items), and evaluated popular Chinese and multilingual LLMs. Results: most instruction‑tuned models reject unsafe content well, but they often fail the higher bar of responsibility. The paper releases datasets and code for follow-up evaluation and model hardening.

Problem Statement

Existing Chinese LLM benchmarks measure knowledge and reasoning but not whether models follow human values. There is no comprehensive Chinese benchmark that tests both safety (avoid harm) and responsibility (give positive, socially aware guidance). This gap makes it hard to find value alignment failures before model release.

Main Contribution

CVALUES: first Chinese benchmark that explicitly tests two ascending value levels — safety and responsibility.

A mixed data collection: 1,300 adversarial safety prompts from crowdworkers and 800 responsibility prompts from domain experts.

Key Findings

Instruction‑tuned Chinese LLMs score high on human‑annotated safety.

Numbers: ChatGPT 96.9; Chinese‑Alpaca‑Plus‑7B 95.3; ChatGLM‑6B 95 (Table 2)

Practical Use: If you instruction‑tune on safety data (or use RLHF), expect solid performance rejecting overtly harmful prompts; use this as a baseline for release gating.

Evidence Ref: Table 2 (human safety scores)

Responsibility (helpful, empathetic, societally aware answers) is weaker than safety.

Numbers: ChatPLUG‑13B mean responsibility 6.5/10; Law 5.2; Social Science 2.2 (Table 3)

Practical Use: Don't rely on rejection alone. Add domain expert review and targeted instruction/data for responsibility before deploying in sensitive areas (law, social science).

Evidence Ref: Table 3 (human responsibility scores)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Human safety score (proportion safe)	ChatGPT 96.9; Chinese-Alpaca-Plus-7B 95.3; ChatGLM-6B 95; ChatPLUG-13B 94.7; Chinese-LLaMA-13B 53	—	—	Human eval on 1,300 safety prompts (Section 2.2; Table 6; Table 2)	Table 2: human evaluation safety scores	Table 2
Human responsibility score (1-10)	ChatPLUG-13B mean 6.5; Environmental Science 8.7; Law 5.2; Social Science 2.2	—	—	Expert annotation for 800 responsibility prompts (Section 2.2; Table 7; Table 3)	Table 3: human responsibility scores by domain	Table 3
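
The domain scores in the table can be turned into a quick triage list. The following is a minimal sketch that uses only the ChatPLUG-13B numbers quoted above from Table 3. The function name `weak_domains` and the choice of the reported mean as the cutoff are illustrative assumptions, not something the paper prescribes, and the mean of 6.5 covers more domains than the three listed here.

```python
# Responsibility scores (1-10) for ChatPLUG-13B, as quoted from Table 3 above.
DOMAIN_SCORES = {
    "Environmental Science": 8.7,
    "Law": 5.2,
    "Social Science": 2.2,
}
# Mean responsibility score reported for ChatPLUG-13B across all domains.
OVERALL_MEAN = 6.5


def weak_domains(scores, cutoff):
    """Return (domain, score) pairs below the cutoff, weakest first.

    Using the overall mean as the cutoff is an assumption made for this sketch.
    """
    below = [(name, score) for name, score in scores.items() if score < cutoff]
    return sorted(below, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, score in weak_domains(DOMAIN_SCORES, OVERALL_MEAN):
        print(f"{name}: {score:.1f} (below mean {OVERALL_MEAN})")
```

Domains returned first are the natural candidates for expert review and targeted supervision data.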

What To Try In 7 Days

Run the CVALUES multi-choice suite to catch obvious comprehension failures quickly.

Collect 100 domain‑specific responsibility prompts from your product teams and run human review on model outputs.

Add targeted supervision data (expert responses) for the weakest domains (e.g., law, social science) and re-evaluate.
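
Scoring a multi-choice run can be sketched as below. This is a hedged illustration, not the benchmark's own evaluation script: the item keys (`prompt`, `choices`, `answer`, `group`) and the `predict` callable are hypothetical, and the toy items are not CVALUES data. Adapt the field names to whatever the released files actually contain.

```python
from collections import defaultdict


def accuracy_by_group(items, predict):
    """Score a multi-choice run; return (overall accuracy, per-group accuracy).

    Each item is a dict with hypothetical keys: "prompt", "choices",
    "answer" (the gold choice label) and "group" (e.g. "safety").
    `predict` is any callable that maps an item to a choice label.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["group"]] += 1
        if predict(item) == item["answer"]:
            correct[item["group"]] += 1
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    per_group = {group: correct[group] / total[group] for group in total}
    return overall, per_group


# Toy items for illustration only; these are not CVALUES data.
TOY_ITEMS = [
    {"prompt": "q1", "choices": ["A", "B"], "answer": "A", "group": "safety"},
    {"prompt": "q2", "choices": ["A", "B"], "answer": "B", "group": "safety"},
    {"prompt": "q3", "choices": ["A", "B"], "answer": "A", "group": "responsibility"},
    {"prompt": "q4", "choices": ["A", "B"], "answer": "A", "group": "responsibility"},
]


def always_first(item):
    """Stand-in for a model call: always picks the first choice."""
    return item["choices"][0]


if __name__ == "__main__":
    overall, per_group = accuracy_by_group(TOY_ITEMS, always_first)
    print(f"overall accuracy: {overall:.2f}")
    for group, acc in per_group.items():
        print(f"{group}: {acc:.2f}")
```

Reporting accuracy per group rather than as one number keeps a strong safety result from hiding a weak responsibility result.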

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/X-PLUG/CValues
https://www.modelscope.cn/datasets/damo/CValuesComparison/summary

Data URLs

https://www.modelscope.cn/datasets/damo/CValuesComparison/summary
https://github.com/X-PLUG/CValues
https://modelscope.cn/datasets/damo/100PoisonMpts/summary

Risks & Boundaries

Limitations

Responsibility labels were gathered for ChatPLUG‑13B only; cross‑model human responsibility data is limited.

Automatic multi‑choice prompts test comprehension, not real generation quality.

When Not To Use

As the sole check for model safety or responsibility; multi‑choice accuracy alone is not sufficient.

For languages or cultures outside Chinese without revalidation.

Failure Modes

High multi‑choice accuracy while model still generates unsafe content in free‑form outputs.

Models default to over‑helpfulness and provide actionable guidance for illegal or unsafe requests.
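
The first failure mode can be monitored by comparing each model's multi-choice accuracy with its human-rated free-form safe rate. The sketch below is an assumption-laden illustration: the function `flag_divergence`, the 10-point default gap, and the two placeholder models with their numbers are all hypothetical and are not results from the paper.

```python
def flag_divergence(mc_accuracy, free_form_safe, max_gap=10.0):
    """List models whose multi-choice accuracy overstates free-form safety.

    Both inputs map model name -> percentage (0-100). A model is flagged when
    its multi-choice accuracy exceeds its human-rated free-form safe rate by
    more than `max_gap` percentage points. The default gap is arbitrary.
    """
    flagged = []
    for model, acc in mc_accuracy.items():
        if model not in free_form_safe:
            continue  # no human rating available, so nothing to compare
        gap = acc - free_form_safe[model]
        if gap > max_gap:
            flagged.append((model, gap))
    # Largest gap first, so the most misleading scores surface at the top.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers for two hypothetical models; not results from the paper.
MC_ACCURACY = {"model-a": 90.0, "model-b": 85.0}
FREE_FORM_SAFE = {"model-a": 60.0, "model-b": 82.0}

if __name__ == "__main__":
    for model, gap in flag_divergence(MC_ACCURACY, FREE_FORM_SAFE):
        print(f"{model}: multi-choice accuracy is {gap:.1f} points above free-form safety")
```

A flagged model is one where the automatic score should not be trusted for release gating without further human review of generated outputs.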

Core Entities

Models

ChatGPT, ChatGLM-6B, BELLE-7B-2M, ChatPLUG-3.7B, ChatPLUG-13B, MOSS, Chinese-LLaMA-13B, Chinese-Alpaca-Plus-7B, Chinese-Alpaca-Plus-13B, Ziya-LLaMA-13B

Metrics

Human safety score (proportion safe), human responsibility score (1-10), accuracy

Datasets

CVALUES, CVALUES-COMPARISON, 100PoisonMpts

Benchmarks

CVALUES (safety & responsibility)
