Camellia: a new benchmark that measures how multilingual LLMs favor Western vs Asian entities in nine Asian languages

October 6, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The dataset is a large, well-annotated benchmark with solid annotation agreement. Results cover four open multilingual models and multiple evaluation setups. Low-resource languages have fewer contexts, which reduces statistical power for those languages.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 40%

Novelty: 30%

Authors

Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, JinYeong Bak, Keisuke Sakaguchi, Tanmoy Chakraborty, Yuki Arase, Wei Xu

Links

Abstract / PDF / Data

Why It Matters For Business

Multilingual LLMs can make culturally wrong or unfair choices in non-English settings. This affects product trust, moderation, search relevance, and personalization in Asia. Model selection and region-specific testing matter.

Who Should Care

Summary TLDR

Camellia is a labeled benchmark (19,530 entities + 2,173 masked social-media contexts) for measuring whether multilingual LLMs prefer Western-associated entities over Asian ones across nine Asian languages. Tests on four open multilingual LLMs show systematic problems: models favor Western entities 30–40% of the time in grounded contexts, show model-specific sentiment biases, and have large extraction accuracy gaps (≈10–20%) in non‑English settings.

Problem Statement

LLMs can favor Western-associated entities and fail to adapt to local cultural contexts. There was no large multilingual, entity-focused benchmark for several Asian languages to measure these cultural biases systematically.

Main Contribution

Camellia dataset: 19,530 manually annotated entities (Asian vs Western) across six entity types and 2,173 naturally occurring masked contexts from social media in 9 Asian languages with English parallels.

Evaluation suite and metrics: likelihood-based Cultural Bias Score (CBS), sentiment-association tests, and extractive QA setups to measure entity-centric cultural bias.
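As a rough illustration of a likelihood-based score of this kind, here is a minimal sketch. It assumes CBS for one masked context is the percentage of (Asian, Western) entity pairs in which the model assigns the Western entity the higher log-likelihood; the paper's exact formulation may differ. The function names and the log-likelihood values are hypothetical, and obtaining the log-likelihoods from a model is left out.

```python
from typing import Sequence


def cultural_bias_score(
    asian_logliks: Sequence[float],
    western_logliks: Sequence[float],
) -> float:
    """Percentage of (Asian, Western) entity pairs in which the Western
    entity gets the strictly higher log-likelihood for the same context.

    Assumed definition for illustration only.
    """
    if not asian_logliks or not western_logliks:
        raise ValueError("need at least one entity per group")
    western_wins = sum(
        1
        for a in asian_logliks
        for w in western_logliks
        if w > a
    )
    total_pairs = len(asian_logliks) * len(western_logliks)
    return 100.0 * western_wins / total_pairs


def mean_cbs(per_context_scores: Sequence[float]) -> float:
    """Average the per-context scores into one number per language or model."""
    if not per_context_scores:
        raise ValueError("need at least one context score")
    return sum(per_context_scores) / len(per_context_scores)


# Hypothetical log-likelihoods for entities filled into one masked context.
asian = [-4.0, -6.0, -5.0]
western = [-5.5, -7.0]
score = cultural_bias_score(asian, western)
print(f"CBS for this context: {score:.1f}%")
```

Under this assumed definition, a context that clearly calls for an Asian entity should yield a score near zero, so averages well above that point to a Western preference.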

Key Findings

LLMs often prefer Western entities even when the context requires an Asian entity.

Numbers: CBS ≈ 30–40% on culturally grounded contexts (expected ~5%)

Practical Use: Don't assume multilingual LLMs pick locally appropriate people/places/foods; add checks or local-data fine-tuning before deploying in non-Western languages.

Evidence Ref: Figure 3; §3.1

Sentiment outputs shift depending on which model family is used.

Numbers: Model-specific FN/FP differences visible in Figure 5 (Llama/Gemma bias toward Western negativity; Qwen/Aya toward Asian/

Practical Use: Model choice matters for content-moderation or sentiment tasks: test the exact model + language combination on local data before trusting automated moderation.

Evidence Ref: Figure 5; §3.2
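One way to check for this kind of sentiment skew on your own data is to compare false-negative and false-positive rates by entity group. The sketch below assumes a false negative is a positive-labeled context predicted as negative and a false positive is a negative-labeled context predicted as positive; the paper's exact definitions may differ. The record layout, group names, and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (entity_group, gold_label, predicted_label); layout is an assumption.
Record = Tuple[str, str, str]


def fn_fp_rates(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Compute per-group false-negative and false-positive rates."""
    counts = defaultdict(lambda: {"pos": 0, "neg": 0, "fn": 0, "fp": 0})
    for group, gold, pred in records:
        c = counts[group]
        if gold == "positive":
            c["pos"] += 1
            if pred == "negative":
                c["fn"] += 1
        elif gold == "negative":
            c["neg"] += 1
            if pred == "positive":
                c["fp"] += 1
    rates: Dict[str, Dict[str, float]] = {}
    for group, c in counts.items():
        rates[group] = {
            # Guard against groups with no positive or no negative contexts.
            "fn_rate": c["fn"] / c["pos"] if c["pos"] else 0.0,
            "fp_rate": c["fp"] / c["neg"] if c["neg"] else 0.0,
        }
    return rates


# Hypothetical predictions for the same contexts filled with each group.
records = [
    ("asian", "positive", "positive"),
    ("asian", "positive", "negative"),
    ("asian", "negative", "negative"),
    ("asian", "negative", "positive"),
    ("western", "positive", "negative"),
    ("western", "positive", "negative"),
    ("western", "negative", "negative"),
    ("western", "negative", "negative"),
]
rates = fn_fp_rates(records)
fn_gap = rates["western"]["fn_rate"] - rates["asian"]["fn_rate"]
print(rates, fn_gap)
```

A positive `fn_gap` here would mean the model more often reads positive contexts as negative when a Western entity is filled in; the sign and size of the gap are what to compare across model families.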

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Cultural Bias Score (CBS) | Models assign Western entities higher likelihood ~30–40% of the time on grounded contexts | Expected ~5% for grounded contexts | ≈+25–35pp over expectation | Camellia-Grounded across 9 languages | Figure 3; §3.1 | Figure 3
Sentiment association differences (FN/FP) | Model-specific shifts: Llama/Gemma skew Western→negative; Qwen/Aya skew Asian→positive | No systematic FN/FP difference | Visual differences reported in Figure 5 (aggregated across languages) | Camellia-Grounded + Camellia-Neutral filled with 50 sampled entities | Figure 5; §3.2 | Figure 5

What To Try In 7 Days

Run Camellia's small sample (a few contexts) in your target languages to spot obvious Western/Asian mismatches.

If you use an off-the-shelf model, compare at least two families (one with local provenance) on Camellia to surface model-specific sentiment or extraction errors.

Add simple post-rules or a shortlist filter for entity types (names, locations, foods) when context requires local entities.
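The shortlist filter in the last item could look something like the sketch below. It assumes you already have a ranked list of candidate entities from the model and a hand-maintained shortlist of locally appropriate entities per entity type. The shortlist contents, entity-type keys, and function names are hypothetical and are not drawn from Camellia.

```python
from typing import Dict, List, Optional, Sequence, Set

# Hypothetical shortlist of locally appropriate entities, keyed by entity
# type and stored case-folded. In practice this would come from local data.
LOCAL_SHORTLIST: Dict[str, Set[str]] = {
    "food": {"kimchi", "biryani", "pho"},
    "location": {"busan", "lahore", "hanoi"},
}


def filter_candidates(
    entity_type: str,
    candidates: Sequence[str],
    shortlist: Dict[str, Set[str]] = LOCAL_SHORTLIST,
) -> List[str]:
    """Keep only candidates on the shortlist for this entity type,
    preserving the model's original ranking."""
    allowed = shortlist.get(entity_type, set())
    return [c for c in candidates if c.casefold() in allowed]


def pick_entity(
    entity_type: str,
    ranked_candidates: Sequence[str],
    shortlist: Dict[str, Set[str]] = LOCAL_SHORTLIST,
) -> Optional[str]:
    """Return the top-ranked shortlisted candidate, or None so the caller
    can fall back to another rule or to human review."""
    kept = filter_candidates(entity_type, ranked_candidates, shortlist)
    return kept[0] if kept else None


# Hypothetical model output, ranked from most to least likely.
choice = pick_entity("food", ["Pizza", "Kimchi", "Pho"])
print(choice)
```

Returning `None` rather than the model's top choice keeps the rule conservative: when nothing on the shortlist matches, the decision is deferred instead of silently accepting a possibly mismatched entity.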

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Data coverage uneven: low-resource languages (e.g., Urdu) have fewer masked contexts and entities.

Binary cultural labeling (Asian vs Western) simplifies real cultural overlap and diasporic usage.

When Not To Use

Not a substitute for fine-grained cultural opinion or policy research that needs richer sociocultural labels.

Not appropriate for languages or cultures outside the nine covered.

Failure Modes

Rare or unseen local entities may be tokenized poorly and cause model failures unrelated to culture.

Using LLMs as judges on cultural correctness can inherit model-specific sentiment biases (judge bias).

Core Entities

Models

Llama3.3-70b, Qwen2.5-72b, Aya-expanse-32b, Gemma3-27b

Metrics

CBS (Cultural Bias Score), Accuracy, Sentiment FN/FP differences

Datasets

Camellia (this paper)

Benchmarks

Camellia

Context Entities

Models

meta-llama/Llama-3.3-70B-Instruct, Qwen/Qwen2.5-72B-Instruct, CohereForAI/aya-expanse-32b, google/gemma-3-27b-it

Metrics

Cohen's Kappa (annotation agreement)

Datasets

mC4 (used for extraction), Wikidata (entity source)