Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
13
Why It Matters For Business
Safety tuning reduces obvious harms, but models can still fail to give responsible, empathetic, or legally careful answers. Firms should test both refusal of harmful requests (safety) and quality of guidance (responsibility) before deployment.
Summary TLDR
CVALUES is a Chinese-focused benchmark that tests large language models on two value levels: Level‑1 safety (don’t produce harmful content) and Level‑2 responsibility (provide caring, societally aware guidance). The authors collected 2,100 adversarial and expert prompts, ran both human annotation and a multi‑choice automatic test (4,312 items), and evaluated popular Chinese and multilingual LLMs. Results: most instruction‑tuned models reject unsafe content well, but they often fail the higher bar of responsibility. The paper releases datasets and code for follow-up evaluation and model hardening.
Problem Statement
Existing Chinese LLM benchmarks measure knowledge and reasoning but not whether models follow human values. There is no comprehensive Chinese benchmark that tests both safety (avoid harm) and responsibility (give positive, socially aware guidance). This gap makes it hard to find value alignment failures before model release.
Main Contribution
CVALUES: the first Chinese benchmark that explicitly tests two ascending value levels, safety and responsibility.
A mixed data collection: 1,300 adversarial safety prompts from crowdworkers and 800 responsibility prompts from domain experts.
Two evaluation pipelines: human annotation of raw responses and multi‑choice automatic evaluation using 4,312 constructed QA items.
Public release of the benchmark, code, and a 145k paired comparison set (CVALUES-COMPARISON) to support automatic training/evaluation.
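The multi‑choice pipeline can be pictured with a minimal sketch. It assumes each item pairs a question with candidate responses and a gold index for the safer one. The `ChoiceItem` layout, the prompt wording, and the `model` callable are hypothetical illustrations, not the released CVALUES format or code.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

LETTERS = "ABCD"


@dataclass
class ChoiceItem:
    # Hypothetical item layout; the released data may use different fields.
    question: str
    options: List[str]  # candidate responses shown to the model
    gold: int           # index of the response annotators judged safer


def build_prompt(item: ChoiceItem) -> str:
    """Render one item as a plain-text multiple-choice question."""
    lines = [f"Question: {item.question}"]
    for letter, option in zip(LETTERS, item.options):
        lines.append(f"Response {letter}: {option}")
    lines.append("Which response is safer and more responsible? Reply with one letter.")
    return "\n".join(lines)


def parse_choice(reply: str, n_options: int) -> Optional[int]:
    """Return the index of the first standalone capital letter A-D, if valid."""
    match = re.search(r"\b([A-D])\b", reply)
    if match is None:
        return None
    index = LETTERS.index(match.group(1))
    return index if index < n_options else None


def accuracy(items: List[ChoiceItem], model: Callable[[str], str]) -> float:
    """Fraction of items where the parsed choice equals the gold index.

    Replies that cannot be parsed count as wrong.
    """
    if not items:
        raise ValueError("no items to evaluate")
    correct = sum(
        parse_choice(model(build_prompt(item)), len(item.options)) == item.gold
        for item in items
    )
    return correct / len(items)
```

Here `model` stands in for any function that maps a prompt string to a reply string, such as a thin wrapper around an inference endpoint.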
Key Findings
Instruction‑tuned Chinese LLMs score high on human‑annotated safety.
Responsibility (helpful, empathetic, societally aware answers) is weaker than safety.
Automatic multi‑choice tests and human evaluation measure different capabilities; models may score well on one and poorly on the other.
Results
Human safety score (proportion safe)
Human responsibility score (1-10)
Accuracy
Who Should Care
What To Try In 7 Days
Run the CVALUES multi-choice suite to catch obvious comprehension failures quickly.
Collect 100 domain‑specific responsibility prompts from your product teams and run human review on model outputs.
Add targeted supervision data (expert responses) for the weakest domains (e.g., law, social science) and re-evaluate.
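To decide which domains need extra supervision data, reviewer scores can be grouped by domain and ranked. The sketch below assumes each review is a `(domain, score)` pair on the 1-10 responsibility scale; the function name and the sample data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple


def weakest_domains(
    reviews: Iterable[Tuple[str, float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k domains with the lowest mean reviewer score."""
    by_domain = defaultdict(list)
    for domain, score in reviews:
        by_domain[domain].append(score)
    averages = {domain: mean(scores) for domain, scores in by_domain.items()}
    # Sort ascending so the weakest domains come first.
    return sorted(averages.items(), key=lambda pair: pair[1])[:k]


# Illustrative scores only, not results from the paper.
sample_reviews = [
    ("law", 4), ("law", 5),
    ("social science", 6),
    ("psychology", 8),
]
print(weakest_domains(sample_reviews))
```

Re-running the same function after adding expert responses gives a before/after comparison on the same prompts.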
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Responsibility labels were gathered for ChatPLUG‑13B only; cross‑model human responsibility data is limited.
- Automatic multi‑choice prompts test comprehension, not real generation quality.
- Expert scoring is subjective and depends on the selected experts and domains.
When Not To Use
- As the sole check for model safety or responsibility; multi‑choice accuracy alone is not sufficient.
- For languages or cultures outside Chinese without revalidation.
- As a replacement for legal compliance review in regulated products.
Failure Modes
- High multi‑choice accuracy while the model still generates unsafe content in free‑form outputs.
- Models default to over‑helpfulness and provide actionable guidance for illegal or unsafe requests.
- False rejections: benign prompts are refused due to conservative safety tuning.
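Two of these failure modes can be tracked as simple rates once reviewers have labeled free‑form outputs. The sketch below assumes three reviewer judgments per output; the `ReviewedOutput` fields and the rate names are hypothetical and not part of the CVALUES release.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ReviewedOutput:
    # All three fields are reviewer judgments (assumed labeling scheme).
    prompt_is_benign: bool  # the request itself was legitimate
    model_refused: bool     # the reply was a refusal
    reply_is_unsafe: bool   # the reply contained unsafe or actionable harmful content


def failure_rates(outputs: Iterable[ReviewedOutput]) -> Dict[str, float]:
    """Compute false-rejection and unsafe-compliance rates.

    A rate is reported as 0.0 when there are no prompts of that kind.
    """
    outputs = list(outputs)
    benign = [o for o in outputs if o.prompt_is_benign]
    risky = [o for o in outputs if not o.prompt_is_benign]
    false_rejection = (
        sum(o.model_refused for o in benign) / len(benign) if benign else 0.0
    )
    unsafe_compliance = (
        sum(o.reply_is_unsafe for o in risky) / len(risky) if risky else 0.0
    )
    return {
        "false_rejection_rate": false_rejection,
        "unsafe_compliance_rate": unsafe_compliance,
    }
```

Comparing these rates against multi‑choice accuracy for the same model is one way to surface the first failure mode, where a high accuracy coexists with unsafe free‑form generations.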
Core Entities
Models
- ChatGPT
- ChatGLM-6B
- BELLE-7B-2M
- ChatPLUG-3.7B
- ChatPLUG-13B
- MOSS
- Chinese-LLaMA-13B
- Chinese-Alpaca-Plus-7B
- Chinese-Alpaca-Plus-13B
- Ziya-LLaMA-13B
Metrics
- human safety score (proportion safe)
- human responsibility score (1-10)
- Accuracy
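The two human metrics can be sketched as simple aggregates over annotations. The proportion for safety follows from the metric's name. Averaging the 1-10 expert scores is an assumption here, since the summary does not state the aggregation rule, and both function names are illustrative.

```python
from statistics import mean
from typing import Sequence


def human_safety_score(is_safe: Sequence[bool]) -> float:
    """Proportion of responses that annotators labeled safe."""
    if not is_safe:
        raise ValueError("no annotations provided")
    return sum(1 for flag in is_safe if flag) / len(is_safe)


def human_responsibility_score(scores: Sequence[float]) -> float:
    """Mean expert score on the 1-10 scale (aggregation by mean is assumed)."""
    if not scores:
        raise ValueError("no scores provided")
    if any(not 1 <= score <= 10 for score in scores):
        raise ValueError("scores must lie in the range 1-10")
    return mean(scores)
```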
Datasets
- CVALUES
- CVALUES-COMPARISON
- 100PoisonMpts
Benchmarks
- CVALUES (safety & responsibility)

