Overview
The benchmark provides actionable tests and open data, but its results are limited to the tested models, the specific prompt pools, and the Chinese cultural context; use both automatic and human checks before rollout.
Citations: 13
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Safety tuning reduces obvious harms, but models still fail to give responsible, empathetic, or legally careful answers; firms should test both rejection (safety) and guidance (responsibility) before deployment.
Summary TLDR
CVALUES is a Chinese-focused benchmark that tests large language models on two value levels: Level‑1 safety (don’t produce harmful content) and Level‑2 responsibility (provide caring, societally aware guidance). The authors collected 2,100 adversarial and expert prompts, ran both human annotation and a multi‑choice automatic test (4,312 items), and evaluated popular Chinese and multilingual LLMs. Results: most instruction‑tuned models reject unsafe content well, but they often fail the higher bar of responsibility. The paper releases datasets and code for follow-up evaluation and model hardening.
Problem Statement
Existing Chinese LLM benchmarks measure knowledge and reasoning but not whether models follow human values. There is no comprehensive Chinese benchmark that tests both safety (avoid harm) and responsibility (give positive, socially aware guidance). This gap makes it hard to find value alignment failures before model release.
Main Contribution
CVALUES: the first Chinese benchmark that explicitly tests two ascending value levels, safety and responsibility.
A mixed data collection: 1,300 adversarial safety prompts from crowdworkers and 800 responsibility prompts from domain experts.
Key Findings
Instruction‑tuned Chinese LLMs score high on human‑annotated safety.
Responsibility (helpful, empathetic, societally aware answers) is weaker than safety.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Human safety score (% of responses rated safe) | ChatGPT 96.9; Chinese-Alpaca-Plus-7B 95.3; ChatGLM-6B 95; ChatPLUG-13B 94.7; Chinese-LLaMA-13B 53 | — | — | Human eval on 1,300 safety prompts (Section 2.2; Table 6; Table 2) | Table 2: human evaluation safety scores | Table 2 |
| Human responsibility score (1-10) | ChatPLUG-13B mean 6.5; Environmental Science 8.7; Law 5.2; Social Science 2.2 | — | — | Expert annotation for 800 responsibility prompts (Section 2.2; Table 7; Table 3) | Table 3: human responsibility scores by domain | Table 3 |
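
To make the domain spread in the responsibility row easier to read, the short sketch below computes how far each listed domain sits from the reported ChatPLUG-13B mean. The numbers are copied from the table above; the function name `domains_below_mean` is purely illustrative and not part of the CVALUES release.

```python
# Scores copied from the Results table (ChatPLUG-13B, expert-rated, 1-10 scale).
OVERALL_MEAN = 6.5
DOMAIN_SCORES = {
    "Environmental Science": 8.7,
    "Law": 5.2,
    "Social Science": 2.2,
}


def domains_below_mean(scores, mean):
    """Return (domain, gap) pairs for domains under the mean, worst first."""
    gaps = [(d, round(s - mean, 1)) for d, s in scores.items() if s < mean]
    return sorted(gaps, key=lambda pair: pair[1])


for domain, gap in domains_below_mean(DOMAIN_SCORES, OVERALL_MEAN):
    print(f"{domain}: {gap:+.1f} vs. mean")
```

Only the three domains listed in the table are included, so this is a partial view of the full per-domain breakdown.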
What To Try In 7 Days
Run the CVALUES multi-choice suite to catch obvious comprehension failures quickly.
Collect 100 domain‑specific responsibility prompts from your product teams and run human review on model outputs.
Add targeted supervision data (expert responses) for the weakest domains (e.g., law, social science) and re-evaluate.
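
As a rough starting point for the first and third steps, the sketch below scores multi-choice predictions per domain and ranks the weakest domains for targeted supervision. The record keys (`domain`, `gold`, `pred`) are an assumed layout for illustration, not the actual CVALUES file schema; adapt them to whatever format the released data uses.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: accuracy} for multi-choice records.

    Each record is a dict with assumed keys:
    'domain', 'gold' (correct option label), 'pred' (model's chosen label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        # Compare labels case-insensitively and ignore stray whitespace.
        if r["pred"].strip().upper() == r["gold"].strip().upper():
            correct[r["domain"]] += 1
    return {d: correct[d] / total[d] for d in total}


def weakest_domains(scores, k=2):
    """Return the k domains with the lowest accuracy, lowest first."""
    return sorted(scores, key=scores.get)[:k]


# Toy records, made up for illustration only.
sample = [
    {"domain": "law", "gold": "A", "pred": "B"},
    {"domain": "law", "gold": "B", "pred": "b "},
    {"domain": "psychology", "gold": "A", "pred": "A"},
]
scores = accuracy_by_domain(sample)
print(scores, weakest_domains(scores, k=1))
```

Multi-choice accuracy only measures whether the model can pick the better answer, so the ranking is best treated as a pointer to where human review of free-form outputs should start.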
Risks & Boundaries
Limitations
Responsibility labels were gathered for ChatPLUG‑13B only; cross‑model human responsibility data is limited.
Automatic multi‑choice prompts test comprehension, not real generation quality.
When Not To Use
As the sole check for model safety or responsibility; do not rely only on multi‑choice accuracy.
For languages or cultures outside Chinese without revalidation.
Failure Modes
High multi‑choice accuracy while model still generates unsafe content in free‑form outputs.
Models default to over‑helpfulness and provide actionable guidance for illegal or unsafe requests.
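
One way to guard against the first failure mode is a release gate that requires the automatic and human signals to pass together. The sketch below is a minimal illustration; the `EvalSummary` fields and all three thresholds are hypothetical placeholders, not values recommended by the paper.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    multi_choice_accuracy: float      # fraction of multi-choice items correct, 0-1
    human_safe_rate: float            # fraction of free-form outputs rated safe, 0-1
    human_responsibility_mean: float  # mean expert score on the 1-10 scale


def release_blockers(s, min_mc=0.90, min_safe=0.95, min_resp=7.0):
    """List the checks that fail; an empty list means the gate passes.

    The default thresholds are assumptions for illustration only.
    """
    checks = [
        ("multi_choice_accuracy", s.multi_choice_accuracy, min_mc),
        ("human_safe_rate", s.human_safe_rate, min_safe),
        ("human_responsibility_mean", s.human_responsibility_mean, min_resp),
    ]
    return [name for name, value, floor in checks if value < floor]


# High multi-choice accuracy does not rescue weak free-form results.
summary = EvalSummary(
    multi_choice_accuracy=0.97,
    human_safe_rate=0.88,
    human_responsibility_mean=6.5,
)
print(release_blockers(summary))
```

Keeping the checks separate, rather than averaging them into one score, means a strong multi-choice result cannot mask unsafe or irresponsible free-form generations.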

