Camellia: a new benchmark that measures how multilingual LLMs favor Western vs Asian entities in nine Asian languages

Overview

Decision Snapshot: Needs Validation

The dataset is a large, well-annotated benchmark with solid annotation agreement. Results cover four open multilingual models and multiple evaluation setups. Low-resource languages have fewer contexts, which reduces statistical power for those languages.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 40%

Novelty: 30%

Authors

Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, JinYeong Bak, Keisuke Sakaguchi, Tanmoy Chakraborty, Yuki Arase, Wei Xu

Links

Abstract / PDF / Data

Why It Matters For Business

Multilingual LLMs can make culturally wrong or unfair choices in non-English settings. This affects product trust, moderation, search relevance, and personalization in Asia. Model selection and region-specific testing matter.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

Camellia is a labeled benchmark (19,530 entities + 2,173 masked social-media contexts) for measuring whether multilingual LLMs prefer Western-associated entities over Asian ones across nine Asian languages. Tests on four open multilingual LLMs show systematic problems: models favor Western entities 30–40% of the time in grounded contexts, show model-specific sentiment biases, and have large extraction accuracy gaps (≈10–20%) in non‑English settings.

Problem Statement

LLMs can favor Western-associated entities and fail to adapt to local cultural contexts. There was no large multilingual, entity-focused benchmark for several Asian languages to measure these cultural biases systematically.

Main Contribution

Camellia dataset: 19,530 manually annotated entities (Asian vs Western) across six entity types and 2,173 naturally occurring masked contexts from social media in 9 Asian languages with English parallels.

Evaluation suite and metrics: likelihood-based Cultural Bias Score (CBS), sentiment-association tests, and extractive QA setups to measure entity-centric cultural bias.
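A minimal sketch of how a likelihood-based score of this kind could be computed, assuming CBS is the percentage of Asian/Western entity comparisons in which the model assigns the higher likelihood to the Western fill. The function name, the `[MASK]` placeholder, and the `log_likelihood` callable are illustrative assumptions, not the paper's code; in practice `log_likelihood` would wrap a real model's sequence scoring.

```python
from typing import Callable, Sequence


def cultural_bias_score(
    contexts: Sequence[str],
    asian_entities: Sequence[str],
    western_entities: Sequence[str],
    log_likelihood: Callable[[str], float],
    mask: str = "[MASK]",  # hypothetical placeholder token for the masked entity
) -> float:
    """Return the percentage of comparisons where the Western fill scores higher.

    Assumed definition: every masked context is filled once with each Asian
    entity and once with each Western entity, and the two likelihoods are
    compared pairwise. Ties do not count as a Western preference.
    """
    western_wins = 0
    total = 0
    for context in contexts:
        for asian in asian_entities:
            ll_asian = log_likelihood(context.replace(mask, asian))
            for western in western_entities:
                ll_western = log_likelihood(context.replace(mask, western))
                if ll_western > ll_asian:
                    western_wins += 1
                total += 1
    # Guard against empty inputs so the function never divides by zero.
    return 100.0 * western_wins / total if total else 0.0
```

On a culturally grounded context, a model with no Western preference would score close to the low expected value reported for that setting, so the gap between the measured score and that expectation is the quantity of interest.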

Key Findings

LLMs often prefer Western entities even when the context requires an Asian entity.

Numbers: CBS ≈ 30–40% on culturally-grounded contexts (expected ~5%)

Practical Use: Don't assume multilingual LLMs pick locally appropriate people/places/foods; add checks or local-data fine-tuning before deploying in non-Western languages.

Evidence Ref: Figure 3; §3.1

Sentiment outputs shift depending on which model family is used.

Numbers: Model-specific FN/FP differences visible in Figure 5 (Llama/Gemma bias toward Western negativity; Qwen/Aya toward Asian/

Practical Use: Model choice matters for content-moderation or sentiment tasks: test the exact model + language combination on local data before trusting automated moderation.

Evidence Ref: Figure 5; §3.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Cultural Bias Score (CBS)	Models assign Western entities higher likelihood ~30–40% on grounded contexts	Expected ~5% for grounded contexts	≈+25–35pp over expectation	Camellia-Grounded across 9 languages	Figure 3; §3.1	Figure 3
Sentiment association differences (FN/FP)	Model-specific shifts: Llama/Gemma skew Western→negative; Qwen/Aya skew Asian→positive	No systematic FN/FP difference	Visual differences reported in Figure 5 (aggregated across languages)	Camellia-Grounded + Camellia-Neutral filled with 50 sampled entities	Figure 5; §3.2	Figure 5
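The FN/FP comparison in the second row can be illustrated with a small sketch. It assumes FN means a positive-labelled context predicted as negative and FP means a negative-labelled context predicted as positive, and that the same contexts are filled once with Asian and once with Western entities. The function names and label strings are hypothetical, and the paper's exact protocol may differ.

```python
from typing import Dict, Sequence


def fn_fp_rates(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Compute false-negative and false-positive rates for binary sentiment labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    positives = sum(1 for g in gold if g == "positive")
    negatives = sum(1 for g in gold if g == "negative")
    fn = sum(1 for g, p in zip(gold, pred) if g == "positive" and p == "negative")
    fp = sum(1 for g, p in zip(gold, pred) if g == "negative" and p == "positive")
    return {
        "fn_rate": fn / positives if positives else 0.0,
        "fp_rate": fp / negatives if negatives else 0.0,
    }


def western_minus_asian(
    gold: Sequence[str],
    pred_asian_fill: Sequence[str],
    pred_western_fill: Sequence[str],
) -> Dict[str, float]:
    """Rate differences (Western fill minus Asian fill) on the same gold labels.

    A positive fn_rate difference means Western-filled contexts are pushed
    toward negative sentiment more often than Asian-filled ones.
    """
    asian = fn_fp_rates(gold, pred_asian_fill)
    western = fn_fp_rates(gold, pred_western_fill)
    return {key: western[key] - asian[key] for key in asian}
```

Reporting the two differences separately keeps a shift toward negativity for one group distinct from a shift toward positivity for the other, which is the distinction the row draws between model families.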

What To Try In 7 Days

Run Camellia's small sample (a few contexts) in your target languages to spot obvious Western/Asian mismatches.

If you use an off-the-shelf model, compare at least two families (one with local provenance) on Camellia to surface model-specific sentiment or extraction errors.

Add simple post-rules or a shortlist filter for entity types (names, locations, foods) when context requires local entities.
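One way to read the shortlist-filter suggestion is as a post-rule over the model's ranked entity candidates. The sketch below is an illustration under that assumption; the function name and the idea of a hand-maintained shortlist per entity type are hypothetical and not taken from the paper.

```python
from typing import Iterable, Optional


def pick_local_entity(
    ranked_candidates: Iterable[str],
    local_shortlist: Iterable[str],
) -> Optional[str]:
    """Return the highest-ranked candidate that appears in the local shortlist.

    Matching is case-insensitive. Returns None when no candidate is on the
    shortlist, so the caller can fall back to human review or a default.
    """
    allowed = {name.casefold() for name in local_shortlist}
    for candidate in ranked_candidates:
        if candidate.casefold() in allowed:
            return candidate
    return None


# Usage: the model ranks a Western food first, but the context calls for a local one.
choice = pick_local_entity(["pizza", "Pho", "kimchi"], ["pho", "kimchi"])
```

Returning None instead of the top candidate makes the mismatch visible rather than silently accepting a culturally inappropriate entity.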

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://arxiv.org/abs/2510.05291 (paper states benchmark will be publicly released)

Risks & Boundaries

Limitations

Data coverage uneven: low-resource languages (e.g., Urdu) have fewer masked contexts and entities.

Binary cultural labeling (Asian vs Western) simplifies real cultural overlap and diasporic usage.

When Not To Use

Not a substitute for fine-grained cultural opinion or policy research that needs richer sociocultural labels.

Not appropriate for languages or cultures outside the nine covered.

Failure Modes

Rare or unseen local entities may be tokenized poorly and cause model failures unrelated to culture.

Using LLMs as judges on cultural correctness can inherit model-specific sentiment biases (judge bias).

Core Entities

Models

Llama3.3-70b, Qwen2.5-72b, Aya-expanse-32b, Gemma3-27b

Metrics

CBS (Cultural Bias Score), Accuracy, Sentiment FN/FP differences

Datasets

Camellia (this paper)

Benchmarks

Camellia

Context Entities

Models

meta-llama/Llama-3.3-70B-Instruct, Qwen/Qwen2.5-72B-Instruct, CohereForAI/aya-expanse-32b, google/gemma-3-27b-it

Metrics

Cohen's Kappa (annotation agreement)

Datasets

mC4 (used for extraction), Wikidata (entity source)
