Overview
The dataset is a large, well-annotated benchmark with solid annotation agreement. Results cover four open multilingual models and multiple evaluation setups. Low-resource languages have fewer contexts, which reduces statistical power for those languages.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 40%
Novelty: 30%
Why It Matters For Business
Multilingual LLMs can make culturally wrong or unfair choices in non-English settings. This affects product trust, moderation, search relevance, and personalization in Asia. Model selection and region-specific testing matter.
Who Should Care
Summary TLDR
Camellia is a labeled benchmark (19,530 entities + 2,173 masked social-media contexts) for measuring whether multilingual LLMs prefer Western-associated entities over Asian ones across nine Asian languages. Tests on four open multilingual LLMs show systematic problems: models favor Western entities 30–40% of the time in grounded contexts, show model-specific sentiment biases, and have large extraction accuracy gaps (≈10–20%) in non‑English settings.
Problem Statement
LLMs can favor Western-associated entities and fail to adapt to local cultural contexts. There was no large multilingual, entity-focused benchmark for several Asian languages to measure these cultural biases systematically.
Main Contribution
Camellia dataset: 19,530 manually annotated entities (Asian vs Western) across six entity types and 2,173 naturally occurring masked contexts from social media in 9 Asian languages with English parallels.
Evaluation suite and metrics: likelihood-based Cultural Bias Score (CBS), sentiment-association tests, and extractive QA setups to measure entity-centric cultural bias.
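The summary says only that CBS is likelihood-based and does not give the exact formula. As a minimal sketch, assume CBS is the percentage of (Asian, Western) entity pairs for which the model assigns the Western entity the higher likelihood when each is filled into the same masked context. The function name `cultural_bias_score` and the log-likelihood values below are hypothetical, for illustration only.

```python
from itertools import product


def cultural_bias_score(asian_scores, western_scores):
    """Percentage of (Asian, Western) pairs where the Western entity scores higher.

    Assumption: each score is the model's log-likelihood of the context
    with that entity filled into the mask. This pairwise definition is an
    illustrative guess, not the benchmark's confirmed formula.
    """
    pairs = list(product(asian_scores, western_scores))
    if not pairs:
        raise ValueError("need at least one Asian and one Western score")
    western_wins = sum(1 for a, w in pairs if w > a)
    return 100.0 * western_wins / len(pairs)


# Hypothetical log-likelihoods for one masked context (not real data).
asian = [-4.0, -5.5, -6.0]
western = [-5.0, -7.0]
score = cultural_bias_score(asian, western)
print(f"CBS for this context: {score:.1f}%")
```

Under this assumed definition, a score near 50% would mean the model does not separate the two groups, and a score well above the expected rate for a grounded context would indicate a Western preference.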
Key Findings
LLMs often prefer Western entities even when the context requires an Asian entity.
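Sentiment outputs shift by model family: Llama and Gemma skew Western entities toward negative sentiment, while Qwen and Aya skew Asian entities toward positive sentiment.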
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Cultural Bias Score (CBS) | Models assign Western entities higher likelihood in ~30–40% of grounded contexts | Expected ~5% for grounded contexts | ≈+25–35pp over expectation | Camellia-Grounded across 9 languages | Figure 3; §3.1 | Figure 3 |
| Sentiment association differences (FN/FP) | Model-specific shifts: Llama/Gemma skew Western→negative; Qwen/Aya skew Asian→positive | No systematic FN/FP difference | Visual differences reported in Figure 5 (aggregated across languages) | Camellia-Grounded + Camellia-Neutral filled with 50 sampled entities | Figure 5; §3.2 | Figure 5 |
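The FN/FP comparison in the sentiment row can be illustrated with a short sketch. It assumes that a false negative is a gold-positive context predicted negative and a false positive is a gold-negative context predicted positive, with rates computed separately per entity culture. These definitions, the function name `fn_fp_rates`, and the toy records are assumptions for illustration, not the benchmark's confirmed procedure or data.

```python
from collections import defaultdict


def fn_fp_rates(records):
    """Compute per-culture false-negative and false-positive rates.

    records: iterable of (culture, gold, pred) with labels "pos" or "neg".
    Assumed definitions: FN = gold "pos" predicted "neg";
    FP = gold "neg" predicted "pos".
    """
    counts = defaultdict(lambda: {"pos": 0, "neg": 0, "fn": 0, "fp": 0})
    for culture, gold, pred in records:
        c = counts[culture]
        c[gold] += 1
        if gold == "pos" and pred == "neg":
            c["fn"] += 1
        elif gold == "neg" and pred == "pos":
            c["fp"] += 1
    return {
        culture: {
            "fn_rate": c["fn"] / c["pos"] if c["pos"] else 0.0,
            "fp_rate": c["fp"] / c["neg"] if c["neg"] else 0.0,
        }
        for culture, c in counts.items()
    }


# Toy predictions (not real results): same contexts, different filled entities.
records = [
    ("asian", "pos", "pos"), ("asian", "pos", "neg"),
    ("asian", "neg", "neg"), ("asian", "neg", "neg"),
    ("western", "pos", "neg"), ("western", "pos", "neg"),
    ("western", "neg", "pos"), ("western", "neg", "neg"),
]
rates = fn_fp_rates(records)
fn_gap = rates["western"]["fn_rate"] - rates["asian"]["fn_rate"]
print(rates, fn_gap)
```

Because the contexts are identical apart from the filled entity, a nonzero gap between the two cultures' rates would point to the entity, rather than the sentence, driving the sentiment shift.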
What To Try In 7 Days
Run a small sample of Camellia contexts (a few per target language) to spot obvious Western/Asian mismatches.
If you use an off-the-shelf model, compare at least two families (one with local provenance) on Camellia to surface model-specific sentiment or extraction errors.
Add simple post-rules or a shortlist filter for entity types (names, locations, foods) when context requires local entities.
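The shortlist filter in the last step could look like the following sketch. It assumes the model returns a ranked list of candidate entities and that you maintain a shortlist of locally appropriate entities per entity type. The function `pick_local_entity` and the example entities are hypothetical.

```python
def pick_local_entity(candidates, shortlist):
    """Return the highest-ranked candidate that appears in the shortlist.

    candidates: model outputs ordered from most to least preferred.
    shortlist: locally appropriate entities for this entity type.
    Falls back to the top candidate (or None) when nothing matches,
    so the filter never blocks an answer outright.
    """
    allowed = {name.casefold() for name in shortlist}
    for cand in candidates:
        if cand.casefold() in allowed:
            return cand
    return candidates[0] if candidates else None


# Hypothetical example: a food mention in a context that calls for a local dish.
ranked = ["pizza", "Biryani", "burger"]
local_foods = {"biryani", "nihari"}
choice = pick_local_entity(ranked, local_foods)
print(choice)
```

Matching is case-insensitive via `casefold()`. Real deployments would likely also need script and transliteration normalization, which this sketch does not attempt.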
Risks & Boundaries
Limitations
Data coverage uneven: low-resource languages (e.g., Urdu) have fewer masked contexts and entities.
Binary cultural labeling (Asian vs Western) simplifies real cultural overlap and diasporic usage.
When Not To Use
Not a substitute for fine-grained cultural opinion or policy research that needs richer sociocultural labels.
Not appropriate for languages or cultures outside the nine covered.
Failure Modes
Rare or unseen local entities may be tokenized poorly and cause model failures unrelated to culture.
Using LLMs as judges on cultural correctness can inherit model-specific sentiment biases (judge bias).

