RoleEval — 6,000 bilingual multiple-choice questions testing LLMs' knowledge of 300 real and fictional characters

December 26, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Tianhao Shen, Sun Li, Quan Tu, Deyi Xiong

Links

Abstract / PDF

Why It Matters For Business

Role knowledge matters for apps that impersonate people or fictional characters; test models with RoleEval to reveal language and domain blind spots before deployment.

Summary TLDR

RoleEval is a bilingual (Chinese-English) multiple-choice benchmark targeting role knowledge about people and fictional characters. It collects 300 characters and 6,000 parallel questions (4,000 global + 2,000 China-focused). Questions cover basic facts and three types of multi-hop reasoning (relationships, event participants, timelines). RoleEval uses a hybrid quality check (GPT-4/3.5 automated filters + human review). Evaluations show GPT-4 leads on the global set (~76% few-shot on English), while large Chinese models (e.g., Qwen-72B) outperform GPT-4 on the Chinese subset, exposing language-specific knowledge gaps and scaling effects.

Problem Statement

There is no large, systematic bilingual benchmark for 'role knowledge'—facts and multi-hop reasoning about real and fictional characters. Existing persona tests are often synthetic or fragmented and do not measure whether pretrained models actually store and reason about detailed character knowledge across languages.

Main Contribution

RoleEval: a bilingual role-knowledge benchmark with 6,000 Chinese-English parallel multiple-choice questions covering 300 characters.

A question design that mixes direct factual questions, negation/non-occurrence formats, and three multi-hop reasoning types (relationship, event participant, timeline).

A hybrid quality-control pipeline that uses GPT-4/3.5 filtering plus human verification, and an extensive evaluation across many open and closed LLMs under zero- and few-shot settings.
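To make the question design above concrete, here is a minimal sketch of how one parallel Chinese-English item could be represented and rendered as a prompt. The `RoleItem` schema, its field names, and `format_prompt` are assumptions made for illustration and are not the benchmark's official data format. The example question is invented and is not drawn from RoleEval.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    """Basic facts plus the three multi-hop reasoning types named above."""
    BASIC_FACT = "basic_fact"
    RELATIONSHIP = "relationship"
    EVENT_PARTICIPANT = "event_participant"
    TIMELINE = "timeline"


@dataclass(frozen=True)
class RoleItem:
    """Hypothetical schema for one parallel item; not the official format."""
    character: str
    qtype: QuestionType
    question: dict         # language code -> question text
    options: dict          # language code -> list of answer options
    answer_index: int      # index of the correct option, shared by both languages
    negated: bool = False  # True for "which did NOT happen" style questions


def format_prompt(item: RoleItem, lang: str) -> str:
    """Render one item as a lettered multiple-choice prompt in one language."""
    lines = [item.question[lang]]
    for i, option in enumerate(item.options[lang]):
        lines.append(f"{chr(ord('A') + i)}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


# Invented example for illustration only; it is not drawn from RoleEval.
example = RoleItem(
    character="Sherlock Holmes",
    qtype=QuestionType.RELATIONSHIP,
    question={
        "en": "Who shares rooms with Sherlock Holmes at 221B Baker Street?",
        "zh": "谁与夏洛克·福尔摩斯合住在贝克街221B号？",
    },
    options={
        "en": ["Dr. Watson", "Professor Moriarty", "Inspector Lestrade", "Mycroft Holmes"],
        "zh": ["华生医生", "莫里亚蒂教授", "雷斯垂德探长", "迈克罗夫特·福尔摩斯"],
    },
    answer_index=0,
)

print(format_prompt(example, "en"))
```

Storing both languages on one record with a shared `answer_index` keeps the two versions aligned, so the same item can be scored in either language.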

Key Findings

RoleEval scale and scope

Numbers: 6,000 questions; 300 characters (200 global + 100 Chinese)

GPT-4 leads on global English evaluation

Numbers: GPT-4-1106 few-shot avg ≈ 76.00% on RoleEval-Global (en)

Some Chinese LLMs beat GPT-4 on the Chinese subset

Numbers: Qwen-72B few-shot avg 66.20% vs GPT-4-1106 62.75% on RoleEval-Chinese (zh)

Strong language-specific performance gaps

Numbers: GPT-3.5-0613 few-shot: en 58.43% vs zh 48.73% (≈ −9.7pp)

Scaling improves role knowledge but with limits

Numbers: Accuracy improves with parameter and token scaling for LLaMA and Qwen; checkpoints show steady gains after ~500B tokens

Results

RoleEval size

Value: 6,000 questions across 300 characters

Accuracy (GPT-4-1106, few-shot, RoleEval-Global, en)

Value: 76.00%

Baseline: GPT-3.5 few-shot 56.82%

Accuracy (Qwen-72B, few-shot, RoleEval-Chinese, zh)

Value: 66.20%

Baseline: GPT-4-1106 few-shot 62.75%

Cross-lingual drop (example)

Value: GPT-3.5-0613 few-shot en 58.43% → zh 48.73%
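The gaps implied by these reported accuracies can be checked with simple arithmetic. The sketch below uses only the percentages quoted in this summary; the helper name `pp_gap` is introduced here for illustration.

```python
def pp_gap(value: float, baseline: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return round(value - baseline, 2)


# Top reported accuracy vs the GPT-3.5 few-shot baseline listed with it.
print(pp_gap(76.00, 56.82))  # 19.18

# Qwen-72B vs GPT-4-1106 on RoleEval-Chinese (zh), few-shot.
print(pp_gap(66.20, 62.75))  # 3.45

# GPT-3.5-0613 few-shot, Chinese relative to English.
print(pp_gap(48.73, 58.43))  # -9.7
```

These are percentage-point differences, not relative changes: the 9.7-point cross-lingual drop corresponds to roughly a 17% relative decrease from the 58.43% English score.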

Who Should Care

What To Try In 7 Days

Run your candidate models on a representative slice (100–300 RoleEval questions) in the target language to reveal failure modes.

Compare a global model and a local model on the same RoleEval subset to check localization needs.

Use RoleEval-style hybrid quality control (GPT-4 automated filtering plus human review) when building your own domain QA items.
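The first two steps could be sketched roughly as follows. This is a minimal outline under stated assumptions: each question is a dict with hypothetical `prompt` and `answer` keys, and each model is wrapped in a plain Python callable that takes a prompt and returns an option letter. Neither is part of any official RoleEval tooling, and the stub models stand in for real API or inference calls.

```python
import random
from typing import Callable, Dict, List


def sample_slice(questions: List[dict], k: int, seed: int = 0) -> List[dict]:
    """Draw a reproducible random slice of at most k questions."""
    rng = random.Random(seed)
    return rng.sample(questions, min(k, len(questions)))


def accuracy(predict: Callable[[str], str], questions: List[dict]) -> float:
    """Fraction of questions where the predicted letter matches the key."""
    if not questions:
        raise ValueError("no questions to score")
    correct = sum(
        predict(q["prompt"]).strip().upper() == q["answer"] for q in questions
    )
    return correct / len(questions)


def compare(models: Dict[str, Callable[[str], str]], questions: List[dict]) -> Dict[str, float]:
    """Score every model on the same slice so results are comparable."""
    return {name: accuracy(fn, questions) for name, fn in models.items()}


# Toy data and stub models, for illustration only.
toy_questions = [
    {"prompt": "Q1 ...", "answer": "A"},
    {"prompt": "Q2 ...", "answer": "B"},
    {"prompt": "Q3 ...", "answer": "A"},
    {"prompt": "Q4 ...", "answer": "A"},
]
stub_models = {
    "global_model": lambda prompt: "A",  # always answers A
    "local_model": lambda prompt: "b ",  # always answers B, untidy formatting
}

slice_ = sample_slice(toy_questions, k=4)
print(compare(stub_models, slice_))  # {'global_model': 0.75, 'local_model': 0.25}
```

Fixing the seed and scoring all models on the identical slice keeps the global-versus-local comparison fair. Running the same routine once per target language exposes cross-lingual gaps.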

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Timeliness: real-world facts change and the benchmark can become outdated.
  • Single-answer format: multiple-correct-answer scenarios are not supported.
  • Potential judge bias: the automatic QC relies on GPT-4/3.5, which could bias question difficulty.

When Not To Use

  • For open-ended persona creation or dialogue evaluation that requires rich free-form responses.
  • When multiple correct answers should be accepted or graded.
  • As the sole measure of model safety or hallucination risk without additional tests.

Failure Modes

  • Models hallucinate plausible but incorrect relationships or events for characters.
  • Significant drop in performance when switching target language (cross-lingual transfer failure).
  • Outdated or missing facts lead to systematic errors on real-world characters.

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • LLaMA
  • LLaMA-2
  • Falcon
  • Mistral
  • BLOOM
  • Pythia
  • Baichuan2
  • Qwen
  • Yi
  • ChatGLM3
  • MiniMax
  • Skywork

Metrics

  • Accuracy
  • few-shot vs zero-shot delta
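
As a minimal sketch of the second metric, the few-shot vs zero-shot delta can be computed from per-question correctness flags collected under each setting. The function names and the toy outcome lists are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def accuracy_pct(outcomes: Sequence[bool]) -> float:
    """Accuracy as a percentage from per-question correctness flags."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def shot_delta(zero_shot: Sequence[bool], few_shot: Sequence[bool]) -> float:
    """Few-shot accuracy minus zero-shot accuracy, in percentage points."""
    return accuracy_pct(few_shot) - accuracy_pct(zero_shot)


# Toy outcomes on the same four questions under each prompting setting.
zero = [True, False, False, True]
few = [True, True, False, True]
print(shot_delta(zero, few))  # 25.0
```

A positive delta means the in-context examples helped; both lists should cover the same questions for the difference to be meaningful.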

Datasets

  • RoleEval (RoleEval-Global, RoleEval-Chinese)
  • Wikipedia
  • Baidu Baike
  • Fandom
  • Moegirlpedia

Benchmarks

  • MMLU
  • C-Eval