Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
Role knowledge matters for apps that impersonate people or fictional characters; test models with RoleEval to reveal language and domain blind spots before deployment.
Summary TLDR
RoleEval is a bilingual (Chinese-English) multiple-choice benchmark targeting role knowledge about people and fictional characters. It collects 300 characters and 6,000 parallel questions (4,000 global + 2,000 China-focused). Questions cover basic facts and three types of multi-hop reasoning (relationships, event participants, timelines). RoleEval uses a hybrid quality check (GPT-4/3.5 automated filters + human review). Evaluations show GPT-4 leads on the global set (~76% few-shot on English), while large Chinese models (e.g., Qwen-72B) outperform GPT-4 on the Chinese subset, exposing language-specific knowledge gaps and scaling effects.
Problem Statement
There is no large, systematic bilingual benchmark for "role knowledge", meaning facts and multi-hop reasoning about real and fictional characters. Existing persona tests are often synthetic or fragmented, and they do not measure whether pretrained models actually store and reason about detailed character knowledge across languages.
Main Contribution
RoleEval: a bilingual role-knowledge benchmark with 6,000 Chinese-English parallel multiple-choice questions covering 300 characters.
A question design that mixes direct factual questions, negation/non-occurrence formats, and three multi-hop reasoning types (relationship, event participant, timeline).
A hybrid quality-control pipeline that uses GPT-4/3.5 filtering plus human verification, and an extensive evaluation across many open and closed LLMs under zero- and few-shot settings.
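To make the zero- and few-shot multiple-choice setup concrete, here is a minimal sketch of how one item might be rendered into a prompt. The `Item` fields, the lettered option layout, the "Answer:" suffix, and the placeholder questions are all assumptions made for illustration; they are not taken from the RoleEval release.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import List, Sequence


@dataclass
class Item:
    """Hypothetical multiple-choice item (field names are assumptions)."""
    question: str
    options: List[str]  # option texts in display order
    answer: str         # gold option letter, e.g. "B"


def format_item(item: Item, with_answer: bool = False) -> str:
    """Render one item as a question, lettered options, and an answer slot."""
    lines = [item.question]
    lines += [f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer:" + (f" {item.answer}" if with_answer else ""))
    return "\n".join(lines)


def build_prompt(target: Item, demos: Sequence[Item] = ()) -> str:
    """Zero-shot when demos is empty; few-shot when solved demos are prepended."""
    blocks = [format_item(d, with_answer=True) for d in demos]
    blocks.append(format_item(target))
    return "\n\n".join(blocks)


# Placeholder content only; these are not real benchmark questions.
demo = Item("Which event did Character X NOT take part in?",
            ["Event 1", "Event 2", "Event 3", "Event 4"], "C")
target = Item("Who is Character Y's mentor?",
              ["Person 1", "Person 2", "Person 3", "Person 4"], "A")

zero_shot_prompt = build_prompt(target)
few_shot_prompt = build_prompt(target, demos=[demo])
```

The same `build_prompt` call covers both settings, so the only difference between a zero-shot and a few-shot run is the `demos` argument.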
Key Findings
RoleEval scale and scope
GPT-4 leads on global English evaluation
Some Chinese LLMs beat GPT-4 on the Chinese subset
Strong language-specific performance gaps
Scaling improves role knowledge but with limits
Results
RoleEval size
Accuracy
Accuracy
Cross-lingual drop (example)
Who Should Care
What To Try In 7 Days
Run your candidate models on a representative slice of 100–300 RoleEval questions in the target language to reveal failure modes.
Compare a global model and a local model on the same RoleEval subset to check localization needs.
Use a RoleEval-style hybrid quality check (automated GPT-4 filtering plus human review) when building your own domain QA items.
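The first two steps above can be sketched as a small scoring script. This is an illustration under stated assumptions: the record fields (`lang`, `gold`, `predicted`) and the toy predictions are hypothetical, and obtaining model predictions is left out entirely.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for scored multiple-choice records.

    Each record is a dict with hypothetical keys:
    'lang' (language tag), 'gold' (correct letter), 'predicted' (model's letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["lang"]] += 1
        if rec["predicted"] == rec["gold"]:
            correct[rec["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def compare_models(results_a, results_b):
    """Per-language accuracy gap (model A minus model B) on a shared slice."""
    acc_a = accuracy_by_language(results_a)
    acc_b = accuracy_by_language(results_b)
    return {lang: acc_a[lang] - acc_b[lang] for lang in acc_a if lang in acc_b}


# Toy predictions, invented for illustration only.
global_model = [
    {"lang": "en", "gold": "A", "predicted": "A"},
    {"lang": "en", "gold": "C", "predicted": "C"},
    {"lang": "zh", "gold": "B", "predicted": "D"},
    {"lang": "zh", "gold": "A", "predicted": "A"},
]
local_model = [
    {"lang": "en", "gold": "A", "predicted": "B"},
    {"lang": "en", "gold": "C", "predicted": "C"},
    {"lang": "zh", "gold": "B", "predicted": "B"},
    {"lang": "zh", "gold": "A", "predicted": "A"},
]

gaps = compare_models(global_model, local_model)
```

A positive gap for a language means the first model is ahead there; a sign flip between languages is the kind of localization signal the comparison is meant to surface.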
Reproducibility
Code URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Timeliness: real-world facts change and the benchmark can become outdated.
- Single-answer format: multiple-correct-answer scenarios are not supported.
- Potential judge bias: the automatic QC relies on GPT-4/3.5, which could bias question difficulty.
When Not To Use
- For open-ended persona creation or dialogue evaluation that requires rich free-form responses.
- When multiple correct answers should be accepted or graded.
- As the sole measure of model safety or hallucination risk without additional tests.
Failure Modes
- Models hallucinate plausible but incorrect relationships or events for characters.
- Significant drop in performance when switching target language (cross-lingual transfer failure).
- Outdated or missing facts lead to systematic errors on real-world characters.
Core Entities
Models
- GPT-4
- GPT-3.5
- LLaMA
- LLaMA-2
- Falcon
- Mistral
- BLOOM
- Pythia
- Baichuan2
- Qwen
- Yi
- ChatGLM3
- MiniMax
- Skywork
Metrics
- Accuracy
- Few-shot vs. zero-shot accuracy delta
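As a rough sketch of how the two metrics relate, accuracy is the fraction of questions answered correctly, and the delta is few-shot accuracy minus zero-shot accuracy. The function names and the counts below are made up for illustration and are not results from the benchmark.

```python
def accuracy(num_correct: int, num_questions: int) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    return num_correct / num_questions


def few_shot_delta(zero_shot_acc: float, few_shot_acc: float) -> float:
    """Few-shot minus zero-shot accuracy, in percentage points."""
    return round((few_shot_acc - zero_shot_acc) * 100, 2)


# Hypothetical counts on a 200-question slice.
zero_shot = accuracy(100, 200)
few_shot = accuracy(110, 200)
delta_points = few_shot_delta(zero_shot, few_shot)
```

Reporting the delta in percentage points keeps it comparable across models whose baseline accuracies differ.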
Datasets
- RoleEval (RoleEval-Global, RoleEval-Chinese)
- Wikipedia
- Baidu Baike
- Fandom
- Moegirlpedia
Benchmarks
- MMLU
- C-Eval

