Camellia: a new benchmark that measures how multilingual LLMs favor Western vs Asian entities in nine Asian languages

October 6, 2025 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.2

Citation Count

0

Authors

Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, JinYeong Bak, Keisuke Sakaguchi, Tanmoy Chakraborty, Yuki Arase, Wei Xu

Links

Abstract / PDF

Why It Matters For Business

Multilingual LLMs can make culturally wrong or unfair choices in non-English settings. This affects product trust, moderation, search relevance, and personalization in Asia. Model selection and region-specific testing matter.

Summary TLDR

Camellia is a labeled benchmark (19,530 entities + 2,173 masked social-media contexts) for measuring whether multilingual LLMs prefer Western-associated entities over Asian ones across nine Asian languages. Tests on four open multilingual LLMs show systematic problems: models favor Western entities 30–40% of the time in grounded contexts, show model-specific sentiment biases, and have large extraction accuracy gaps (≈10–20%) in non‑English settings.

Problem Statement

LLMs can favor Western-associated entities and fail to adapt to local cultural contexts. There was no large multilingual, entity-focused benchmark for several Asian languages to measure these cultural biases systematically.

Main Contribution

Camellia dataset: 19,530 manually annotated entities (Asian vs Western) across six entity types and 2,173 naturally occurring masked contexts from social media in 9 Asian languages with English parallels.
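To make the two halves of the dataset concrete, here is a minimal sketch of how a labeled entity and a masked context might be represented and combined. The field names, the `[MASK]` placeholder, and the example sentence are all assumptions for illustration; the released data may use a different schema.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # assumed placeholder token; the released data may use another


@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: str  # e.g. "food"; the paper covers six types
    culture: str      # "asian" or "western"
    language: str     # language code such as "ko" (hypothetical field)


@dataclass(frozen=True)
class MaskedContext:
    text: str              # native-language post with the entity masked out
    english_parallel: str  # English version of the same context
    language: str
    entity_type: str


def fill(context: MaskedContext, entity: Entity, use_english: bool = False) -> str:
    """Insert an entity name into the first mask of a context."""
    if context.entity_type != entity.entity_type:
        raise ValueError("entity type does not match the context")
    template = context.english_parallel if use_english else context.text
    if MASK not in template:
        raise ValueError("context has no mask to fill")
    return template.replace(MASK, entity.name, 1)


# Hypothetical example; not taken from the dataset.
ctx = MaskedContext(
    text="오늘 점심은 [MASK] 먹었어요",
    english_parallel="I had [MASK] for lunch today",
    language="ko",
    entity_type="food",
)
kimbap = Entity("kimbap", "food", "asian", "ko")
print(fill(ctx, kimbap, use_english=True))
```

Filling the same context once with an Asian entity and once with a Western one gives the paired inputs that the likelihood, sentiment, and extraction tests compare.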

Evaluation suite and metrics: likelihood-based Cultural Bias Score (CBS), sentiment-association tests, and extractive QA setups to measure entity-centric cultural bias.

Empirical study: run CBS, sentiment, and extractive QA evaluations on four open multilingual LLM families (Llama3.3-70b, Qwen2.5-72b, Aya-expanse-32b, Gemma3-27b) and analyze language- and model-specific patterns.

Key Findings

LLMs often prefer Western entities even when the context requires an Asian entity.

Numbers: CBS ≈ 30–40% on culturally grounded contexts (expected ~5%)

Sentiment outputs shift depending on which model family is used.

Numbers: Model-specific FN/FP differences visible in Figure 5 (Llama/Gemma bias toward Western negativity; Qwen/Aya toward Asian/

Extractive QA accuracy shows large culture gaps in native Asian languages but small gaps in English.

Numbers: Accuracy gaps in Asian languages often 10–20%; English gaps mostly 1–5%

Models trained or developed with stronger local data do better on related languages.

Numbers: Qwen2.5-72b outperforms others on zh/ja/ko (noted gap in §3.1)

Benchmark is high quality but unevenly distributed across languages.

Numbers: Inter-annotator Cohen's kappa 0.78–0.97 across languages; fewer contexts for Urdu

Results

Cultural Bias Score (CBS)

Value: Models assign Western entities higher likelihood ~30–40% on grounded contexts

Baseline: Expected ~5% for grounded contexts
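One way to read these CBS numbers: score every candidate entity in a masked context with the model's log-likelihood, then count how often a Western entity outscores an Asian one. The sketch below assumes that pairwise definition, which may differ from the paper's exact formula, and takes precomputed log-likelihoods as plain floats. The function name and sample values are hypothetical.

```python
from itertools import product


def cultural_bias_score(asian_scores, western_scores):
    """Percentage of (Asian, Western) entity pairs in which the Western
    entity gets the strictly higher log-likelihood (assumed definition)."""
    pairs = list(product(asian_scores, western_scores))
    if not pairs:
        raise ValueError("need at least one entity from each group")
    western_wins = sum(1 for a, w in pairs if w > a)
    return 100.0 * western_wins / len(pairs)


# Hypothetical log-likelihoods for one masked context (higher = more likely).
asian = [-4.1, -5.0, -3.8]
western = [-4.5, -3.9]
print(cultural_bias_score(asian, western))
```

Under this reading, a score near the ~5% baseline means the model almost always ranks the culturally appropriate entity higher, while 30–40% means it frequently does not.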

Sentiment association differences (FN/FP)

Value: Model-specific shifts: Llama/Gemma skew Western→negative; Qwen/Aya skew Asian→positive

Baseline: No systematic FN/FP difference
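A rough sketch of how an FN/FP difference could be computed: classify the same sentences twice, once filled with Asian entities and once with Western ones, and compare error rates against the gold sentiment. The binary label encoding, the function names, and the toy predictions are assumptions for illustration, not the paper's exact protocol.

```python
def error_rates(gold, pred):
    """Return (false_negative_rate, false_positive_rate) for binary labels,
    where 1 = positive sentiment and 0 = negative sentiment."""
    pos = [p for g, p in zip(gold, pred) if g == 1]
    neg = [p for g, p in zip(gold, pred) if g == 0]
    fn_rate = sum(1 for p in pos if p == 0) / len(pos) if pos else 0.0
    fp_rate = sum(1 for p in neg if p == 1) / len(neg) if neg else 0.0
    return fn_rate, fp_rate


def fn_fp_gap(gold, pred_asian, pred_western):
    """Western minus Asian error rates on the same gold labels.
    A positive FN gap means Western-filled sentences are pushed toward
    'negative' more often; a positive FP gap means toward 'positive'."""
    fn_a, fp_a = error_rates(gold, pred_asian)
    fn_w, fp_w = error_rates(gold, pred_western)
    return fn_w - fn_a, fp_w - fp_a


# Hypothetical predictions for four sentences (two positive, two negative).
gold = [1, 1, 0, 0]
pred_asian = [1, 1, 1, 0]    # one negative sentence read as positive
pred_western = [1, 0, 0, 0]  # one positive sentence read as negative
print(fn_fp_gap(gold, pred_asian, pred_western))
```

A model with no cultural sentiment association would produce gaps near zero on both components.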

Accuracy

Value: Gaps of ~10–20% in native Asian languages; ~1–5% in English

Baseline: Small gaps expected if model understands context equally
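The accuracy gap can be illustrated as the difference in extraction accuracy between Western-entity and Asian-entity versions of the same questions. This sketch assumes exact-match scoring after whitespace stripping, which is a simplification; the function names and sample answers are hypothetical.

```python
def accuracy_pct(predictions, answers):
    """Exact-match accuracy in percent (assumed scoring rule)."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def culture_gap(asian_preds, asian_gold, western_preds, western_gold):
    """Western accuracy minus Asian accuracy, in percentage points."""
    return accuracy_pct(western_preds, western_gold) - accuracy_pct(asian_preds, asian_gold)


# Hypothetical extractive-QA outputs: the model should return the entity span.
asian_gold = ["Haruto", "Busan", "pho", "Lahore", "Mei"]
asian_preds = ["Haruto", "Busan", "noodles", "Lahore", "she"]
western_gold = ["Oliver", "Boston", "pizza", "Lyon", "Emma"]
western_preds = ["Oliver", "Boston", "pizza", "Lyon", "she"]
print(culture_gap(asian_preds, asian_gold, western_preds, western_gold))
```

Running the same computation on the English parallels and on the native-language contexts would show whether the gap appears only outside English, as the results above report.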

Annotation agreement

Value: Cohen's kappa 0.78–0.97 by language

Baseline: High agreement (>0.75)
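For readers unfamiliar with the statistic, Cohen's kappa compares observed agreement between two annotators with the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The sketch below implements that standard formula; the label lists are made up and do not come from the Camellia annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and aligned")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: the annotators disagree on one of ten entities.
annotator_1 = ["asian"] * 5 + ["western"] * 5
annotator_2 = ["asian"] * 4 + ["western"] * 6
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```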

Who Should Care

What To Try In 7 Days

Run a small sample of Camellia (a few contexts per target language) to spot obvious Western/Asian mismatches.

If you use an off-the-shelf model, compare at least two families (one with local provenance) on Camellia to surface model-specific sentiment or extraction errors.

Add simple post-rules or a shortlist filter for entity types (names, locations, foods) when context requires local entities.
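The shortlist idea in the last step can be sketched as a small post-processing rule: when a context is flagged as requiring a local entity, pick the highest-scoring candidate that appears on a curated local shortlist, and fall back to the model's top choice otherwise. The function name, the shortlist contents, and the scores below are hypothetical.

```python
def prefer_local(candidates, local_shortlist, require_local):
    """Pick an entity from (entity, score) candidates.

    If require_local is True, return the best-scoring candidate that is on
    the local shortlist; if none is, fall back to the overall top candidate.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if require_local:
        for entity, _score in ranked:
            if entity in local_shortlist:
                return entity
    return ranked[0][0]


# Hypothetical model scores for a food slot in a Vietnamese context.
candidates = [("pizza", 0.41), ("pho", 0.38), ("burger", 0.12)]
local_foods = {"pho", "banh mi"}
print(prefer_local(candidates, local_foods, require_local=True))
print(prefer_local(candidates, local_foods, require_local=False))
```

A rule like this only masks the symptom for the listed entity types, so it is best paired with the benchmark comparison across model families rather than used in place of it.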

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Data coverage uneven: low-resource languages (e.g., Urdu) have fewer masked contexts and entities.
  • Binary cultural labeling (Asian vs Western) simplifies real cultural overlap and diasporic usage.
  • Masked social-media contexts come from X and were filtered for toxicity; this may bias topic distribution.

When Not To Use

  • Not a substitute for fine-grained cultural opinion or policy research that needs richer sociocultural labels.
  • Not appropriate for languages or cultures outside the nine covered.
  • Not a mitigation recipe: the benchmark flags issues but does not provide automatic fixes.

Failure Modes

  • Rare or unseen local entities may be tokenized poorly and cause model failures unrelated to culture.
  • Using LLMs as judges on cultural correctness can inherit model-specific sentiment biases (judge bias).
  • Parallel English tests can mask problems that only appear in native-language usage.

Core Entities

Models

  • Llama3.3-70b
  • Qwen2.5-72b
  • Aya-expanse-32b
  • Gemma3-27b

Metrics

  • CBS (Cultural Bias Score)
  • Accuracy
  • Sentiment FN/FP differences

Datasets

  • Camellia (this paper)

Benchmarks

  • Camellia

Context Entities

Models

  • meta-llama/Llama-3.3-70B-Instruct
  • Qwen/Qwen2.5-72B-Instruct
  • CohereForAI/aya-expanse-32b
  • google/gemma-3-27b-it

Metrics

  • Cohen's Kappa (annotation agreement)

Datasets

  • mC4 (used for extraction)
  • Wikidata (entity source)