Overview
The benchmark is practical and the experiments are clear; evidence shows strong effects on concrete items and consistent English-anchoring on opinion surveys, but scope is limited to 11 languages and selected objects.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Unlocalized LLM outputs frustrate non-English users, harm trust and product adoption, and can cause reputational or regulatory risk if cultural mismatches appear in customer-facing content.
Summary TLDR
The paper builds a multilingual benchmark to show that popular GPT-family models (text-davinci-003, ChatGPT, GPT-4) tend to produce English-culture answers when prompted in non-English languages. The authors quantify this with an In-Culture Score (for concrete items such as holidays) and Euclidean distances on two cross-cultural surveys (for abstract values). Findings: ChatGPT is heavily English-dominated for non-English queries (average In-Culture Score 1.4/10), GPT-4 is even more so (1.2), and the older text-davinci-003 is less dominated (3.1). Two practical fixes work: (1) pretraining on more balanced non-English data (for example, ERNIE greatly improved Chinese outputs), and (2) a cheap deployment trick: explicit
Problem Statement
LLMs are trained mostly on English data. When non-English users ask subjective or culture-specific questions, the models often reply with items and opinions tied to English culture rather than the user’s culture. This mismatch risks poor user experience, cultural erasure, and biased downstream decisions.
Main Contribution
Constructed a multilingual benchmark for cultural dominance: 8 concrete object types (holidays, songs, books, movies, celebrities, heroes, history, mountains) across 11 languages and two public opinion surveys (World Values Survey and Political Coordinates Test).
Measured cultural dominance with two simple metrics: In-Culture Score for concrete items and Euclidean distance to human-survey baselines for abstract values.
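As a rough illustration, the two metrics could be computed as in the Python sketch below. It assumes the In-Culture Score is the number of listed items (out of 10) that belong to the culture of the query language, and that survey answers are encoded as equal-length numeric vectors. The function names, the reference set, and the toy inputs are hypothetical and are not taken from the authors' code.

```python
import math
from typing import Iterable, Sequence


def in_culture_score(items: Iterable[str], in_culture_items: Iterable[str]) -> int:
    """Count how many listed items appear in a reference set for the target culture.

    Matching is case-insensitive on the stripped item text (an assumption of
    this sketch; a real audit would likely need human or fuzzy matching).
    """
    reference = {name.strip().lower() for name in in_culture_items}
    return sum(1 for item in items if item.strip().lower() in reference)


def euclidean_distance(model_answers: Sequence[float], human_answers: Sequence[float]) -> float:
    """Euclidean distance between a model's survey answers and a human baseline."""
    if len(model_answers) != len(human_answers):
        raise ValueError("answer vectors must have the same length")
    return math.sqrt(sum((m - h) ** 2 for m, h in zip(model_answers, human_answers)))


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    listed = ["Christmas", "Spring Festival", "Thanksgiving"]
    chinese_holidays = {"Spring Festival", "Mid-Autumn Festival"}
    print(in_culture_score(listed, chinese_holidays))
    print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))
```

A lower distance means the model's answers sit closer to the human baseline for that culture, while a higher In-Culture Score means more of the listed items are local.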
Key Findings
ChatGPT’s concrete outputs are English-dominated for non-English queries
The GPT family became more English-dominated across model versions (non-English In-Culture Score: text-davinci-003 3.1, ChatGPT 1.4, GPT-4 1.2)
Results
| Metric | Value | Scale / Direction | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| In-Culture Score (ChatGPT, avg over concrete objects) | English 7.3, Non-English avg 1.4 | Higher is better (max 10 per item list) | Non-English is 5.9 points below English | Concrete objects (holidays, songs, books, movies, celebrities, heroes, history, mountains) | Table 3(a), Table 12 | Table 3(a) |
| In-Culture Score (text-davinci-003 vs ChatGPT vs GPT-4, Non-English avg) | text-davinci-003 3.1; ChatGPT 1.4; GPT-4 1.2 | Higher is better | text-davinci-003 is 1.7 above ChatGPT and 1.9 above GPT-4; the older model is less dominated | Concrete objects, Non-English languages combined | Table 3(a) | Table 3(a) |
What To Try In 7 Days
Run the paper’s In-Culture Score on your product languages for a quick audit.
Add an explicit culture token in prompts (e.g., 'In the culture of [Chinese], {query}') for concrete info tasks and measure improvement.
Review your training/finetune data mix by language; flag critical user languages for extra data or fine-tuning.
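The first two steps above could be wired together roughly as follows. This is a minimal sketch, not the paper's tooling: `generate` stands in for whatever function calls your model and returns a list of items, `score` stands in for an In-Culture Score function, and both are hypothetical placeholders you would supply. Only the prompt template comes from the example in the list above.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def add_culture_prefix(query: str, language_name: str) -> str:
    """Prepend the explicit culture cue, following the template quoted above."""
    return f"In the culture of {language_name}, {query}"


def audit_language(
    queries: Sequence[str],
    language_name: str,
    generate: Callable[[str], List[str]],  # hypothetical: prompt -> listed items
    score: Callable[[List[str]], float],   # hypothetical: items -> In-Culture Score
) -> Dict[str, float]:
    """Compare average scores with and without the culture prefix."""
    plain = mean(score(generate(q)) for q in queries)
    prefixed = mean(
        score(generate(add_culture_prefix(q, language_name))) for q in queries
    )
    return {"plain": plain, "prefixed": prefixed, "delta": prefixed - plain}
```

A positive `delta` would suggest the prefix helps for that language. Running the audit per product language keeps the comparison aligned with the benchmark, which treats language as the unit of analysis.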
Risks & Boundaries
Limitations
Scope: only eight concrete object types and eleven languages; not exhaustive of world cultures.
Abstract-value benchmarks rely on existing surveys (WVS, PCT) that have their own sampling and topical biases.
When Not To Use
When you need dialect-level cultural nuance inside a single language, because this benchmark treats language as a proxy for culture.
To evaluate highly technical factual tasks where culture is irrelevant.
Failure Modes
Producing culturally inappropriate items (e.g., listing Thanksgiving for Chinese queries).
Prompting that is ambiguous (P2) can be ignored by the model.

