Overview
Production Readiness
0.4
Novelty Score
0.3
Cost Impact Score
0.2
Citation Count
0
Why It Matters For Business
Multilingual LLMs can make culturally wrong or unfair choices in non-English settings. This affects product trust, moderation, search relevance, and personalization in Asia. Model selection and region-specific testing matter.
Summary TLDR
Camellia is a labeled benchmark (19,530 entities + 2,173 masked social-media contexts) for measuring whether multilingual LLMs prefer Western-associated entities over Asian ones across nine Asian languages. Tests on four open multilingual LLMs show systematic problems: models favor Western entities 30–40% of the time in grounded contexts, show model-specific sentiment biases, and have large extraction accuracy gaps (≈10–20%) in non‑English settings.
Problem Statement
LLMs can favor Western-associated entities and fail to adapt to local cultural contexts. There was no large multilingual, entity-focused benchmark for several Asian languages to measure these cultural biases systematically.
Main Contribution
Camellia dataset: 19,530 manually annotated entities (Asian vs Western) across six entity types and 2,173 naturally occurring masked contexts from social media in 9 Asian languages with English parallels.
Evaluation suite and metrics: likelihood-based Cultural Bias Score (CBS), sentiment-association tests, and extractive QA setups to measure entity-centric cultural bias.
Empirical study: CBS, sentiment, and extractive QA evaluations on four open multilingual LLM families (Llama3.3-70b, Qwen2.5-72b, Aya-expanse-32b, Gemma3-27b), with an analysis of language- and model-specific patterns.
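A minimal sketch of how a likelihood-based bias score could be computed. This summary does not give the exact CBS formula, so the definition below is an assumption: for each masked context, every Western candidate is compared against every Asian candidate, and the score is the percentage of pairs in which the Western entity scores at least as high. The names `cultural_bias_score`, `score`, and `toy_score` are hypothetical; in practice `score` would return a model log-likelihood for the context with the entity filled in.

```python
from itertools import product
from typing import Callable, Iterable, Sequence


def cultural_bias_score(
    contexts: Iterable[str],
    asian: Sequence[str],
    western: Sequence[str],
    score: Callable[[str, str], float],
) -> float:
    """Percentage of (Asian, Western) pairs where the Western entity scores >= the Asian one.

    Assumed definition for illustration; not taken from the paper.
    """
    wins = 0
    total = 0
    for ctx in contexts:
        a_scores = [score(ctx, e) for e in asian]
        w_scores = [score(ctx, e) for e in western]
        # Compare every Western score against every Asian score for this context.
        for a, w in product(a_scores, w_scores):
            wins += w >= a
            total += 1
    if total == 0:
        raise ValueError("need at least one context and one entity per group")
    return 100.0 * wins / total


def toy_score(context: str, entity: str) -> float:
    # Hypothetical stand-in for a model log-likelihood: favours shorter fillers.
    return -float(len(context.replace("[MASK]", entity)))
```

Under this assumed definition, a score near 50 would mean no systematic preference, and higher values would mean the model tends to rank Western entities above Asian ones in the same context.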
Key Findings
LLMs often prefer Western entities even when the context requires an Asian entity.
Sentiment outputs shift depending on which model family is used.
Extractive QA accuracy shows large culture gaps in native Asian languages but small gaps in English.
Models trained or developed with stronger local data do better on related languages.
The benchmark is high quality, but its coverage is unevenly distributed across languages.
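The culture gap in extractive QA accuracy can be read as a per-language difference between accuracy on Western-entity items and accuracy on Asian-entity items. The sketch below makes that concrete. The record layout `(language, culture, correct)`, the function name `accuracy_gap`, and the Western-minus-Asian direction are assumptions for illustration, not the paper's stated definition.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_gap(records: Iterable[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Return Western-minus-Asian accuracy per language, in percentage points."""
    # counts[language][culture] = [num_correct, num_total]
    counts = defaultdict(lambda: {"asian": [0, 0], "western": [0, 0]})
    for language, culture, correct in records:
        cell = counts[language][culture]
        cell[0] += int(correct)
        cell[1] += 1
    gaps = {}
    for language, by_culture in counts.items():
        if any(total == 0 for _, total in by_culture.values()):
            continue  # skip languages missing one of the two groups
        acc = {c: 100.0 * ok / total for c, (ok, total) in by_culture.items()}
        gaps[language] = acc["western"] - acc["asian"]
    return gaps
```

Computing the gap separately for the native-language and English-parallel versions of the same items would show whether a gap appears only in native-language use.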
Results
Cultural Bias Score (CBS)
Sentiment association differences (FN/FP)
Accuracy
Annotation agreement
Who Should Care
What To Try In 7 Days
Run a small sample of Camellia (a few contexts) in your target languages to spot obvious Western/Asian mismatches.
If you use an off-the-shelf model, compare at least two families (one with local provenance) on Camellia to surface model-specific sentiment or extraction errors.
Add simple post-rules or a shortlist filter for entity types (names, locations, foods) when context requires local entities.
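The shortlist idea in the last step might look like the following sketch: a per-entity-type shortlist of locally appropriate entities is used to re-rank model candidates so that shortlisted ones come first. The function name `prefer_local` and the shortlist contents are hypothetical, and this is a simple post-rule, not a mitigation method from the paper.

```python
from typing import Dict, List, Set


def prefer_local(
    candidates: List[str],
    entity_type: str,
    shortlist: Dict[str, Set[str]],
) -> List[str]:
    """Stable re-rank: shortlisted (local) candidates first, order otherwise kept."""
    local = shortlist.get(entity_type, set())
    # sorted() is stable, so the model's original ranking is preserved within each group.
    return sorted(candidates, key=lambda c: c not in local)


# Hypothetical shortlist for a locale where local foods should be preferred.
shortlist = {"food": {"pho", "bibimbap"}}
ranked = prefer_local(["burger", "pho", "pizza", "bibimbap"], "food", shortlist)
```

Re-ranking rather than dropping non-shortlisted candidates keeps Western entities available when the context does call for them.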
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Data coverage uneven: low-resource languages (e.g., Urdu) have fewer masked contexts and entities.
- Binary cultural labeling (Asian vs Western) simplifies real cultural overlap and diasporic usage.
- Masked social-media contexts come from X and were filtered for toxicity; this may bias topic distribution.
When Not To Use
- Not a substitute for fine-grained cultural opinion or policy research that needs richer sociocultural labels.
- Not appropriate for languages or cultures outside the nine covered.
- Not a mitigation recipe: the benchmark flags issues but does not provide automatic fixes.
Failure Modes
- Rare or unseen local entities may be tokenized poorly and cause model failures unrelated to culture.
- Using LLMs as judges on cultural correctness can inherit model-specific sentiment biases (judge bias).
- Parallel English tests can mask problems that only appear in native-language usage.
Core Entities
Models
- Llama3.3-70b
- Qwen2.5-72b
- Aya-expanse-32b
- Gemma3-27b
Metrics
- CBS (Cultural Bias Score)
- Accuracy
- Sentiment FN/FP differences
Datasets
- Camellia (this paper)
Benchmarks
- Camellia
Context Entities
Models
- meta-llama/Llama-3.3-70B-Instruct
- Qwen/Qwen2.5-72B-Instruct
- CohereForAI/aya-expanse-32b
- google/gemma-3-27b-it
Metrics
- Cohen's Kappa (annotation agreement)
Datasets
- mC4 (used for extraction)
- Wikidata (entity source)

