Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
Unlocalized LLM outputs frustrate non-English users, harm trust and product adoption, and can cause reputational or regulatory risk if cultural mismatches appear in customer-facing content.
Summary TLDR
The paper builds a multilingual benchmark to show that popular GPT-family models (text-davinci-003, ChatGPT, GPT-4) tend to produce English-culture answers when prompted in non-English languages. The authors quantify this with an In-Culture Score (concrete items like holidays) and Euclidean distances on two cross-cultural surveys (abstract values). Findings: ChatGPT is heavily English-dominated for non-English queries (avg in-culture 1.4/10), GPT-4 is even more peaked toward English, and older text-davinci-003 is less dominated. Two practical fixes work: (1) pretrain on more balanced non-English data (example: ERNIE greatly improved Chinese outputs), and (2) a cheap deployment trick—explicit
Problem Statement
LLMs are trained mostly on English data. When non-English users ask subjective or culture-specific questions, the models often reply with items and opinions tied to English culture rather than the user’s culture. This mismatch risks poor user experience, cultural erasure, and biased downstream decisions.
Main Contribution
Constructed a multilingual benchmark for cultural dominance: 8 concrete object types (holidays, songs, books, movies, celebrities, heroes, history, mountains) across 11 languages and two public opinion surveys (World Values Survey and Political Coordinates Test).
Measured cultural dominance with two simple metrics: In-Culture Score for concrete items and Euclidean distance to human-survey baselines for abstract values.
Empirical analysis across GPT-family models showing English cultural dominance, its evolution across model versions, and two mitigation strategies (diverse pretraining and culture-aware prompting).
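The In-Culture Score can be illustrated with a minimal sketch. This summary does not reproduce the paper's exact definition, so the code assumes the score is the number of items in a model's answer whose tagged cultural origin matches the culture of the query language (consistent with the "1.4/10" figure above). The function name, the origin lookup, and the example data are hypothetical.

```python
from typing import Mapping, Sequence


def in_culture_score(
    items: Sequence[str],
    origin_of: Mapping[str, str],
    target_culture: str,
) -> int:
    """Count answer items whose tagged origin equals the target culture.

    Assumption: items missing from the origin lookup count as out-of-culture.
    """
    return sum(1 for item in items if origin_of.get(item) == target_culture)


# Hypothetical model answer to a Chinese-language query asking for holidays.
answers = ["Spring Festival", "Thanksgiving", "Christmas", "Mid-Autumn Festival"]

# Hypothetical origin tags (the paper is reported to use Wikipedia for tagging).
origins = {
    "Spring Festival": "Chinese",
    "Thanksgiving": "English",
    "Christmas": "English",
    "Mid-Autumn Festival": "Chinese",
}

score = in_culture_score(answers, origins, "Chinese")
print(f"In-culture items: {score} of {len(answers)}")
```

Averaging this count over object types and languages would give a per-model figure comparable in spirit to the averages reported in the findings.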
Key Findings
ChatGPT’s concrete outputs are English-dominated for non-English queries
The GPT family became more English-dominant across successive model versions
Simple mitigation methods substantially reduce English dominance
Abstract cultural opinions remain close to English anchors despite language
Results
In-Culture Score (ChatGPT, avg over concrete objects)
In-Culture Score (text-davinci-003 vs GPT family, Non-English avg)
Prompting effect (ChatGPT)
Pretraining effect (ERNIE vs GPT-4, Chinese)
Abstract opinion alignment (ChatGPT, WVS/PCT)
Who Should Care
What To Try In 7 Days
Run the paper’s In-Culture Score on your product languages for a quick audit.
Add an explicit culture token in prompts (e.g., 'In the culture of [Chinese], {query}') for concrete info tasks and measure improvement.
Review your training/finetune data mix by language; flag critical user languages for extra data or fine-tuning.
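The culture-aware prompting step above can be sketched as a small wrapper. This assumes the bracketed word in the template is a placeholder for the culture name and that the query is plain text; the function name and the example query are hypothetical, and the actual model call is left out.

```python
def culture_aware_prompt(query: str, culture: str) -> str:
    """Prefix a query with an explicit culture cue, following the template
    'In the culture of [culture], {query}' quoted in this summary."""
    return f"In the culture of {culture}, {query}"


# Hypothetical concrete-information query.
baseline_query = "please list 10 well-known holidays."
prompted_query = culture_aware_prompt(baseline_query, "Chinese")

# Send both variants to the same model and compare their in-culture scores.
print(baseline_query)
print(prompted_query)
```

Running the baseline and the prefixed query side by side on the same model gives a simple A/B measurement of how much the cue helps.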
Optimization Features
Training Optimization
- Increase non-English pretraining data share
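A rough way to check the current language share before rebalancing is sketched below. It assumes each training document already carries a language tag; the tags, the list of critical languages, and the 15% threshold are hypothetical values chosen for illustration, not figures from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, List


def language_share(lang_tags: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of documents per language tag."""
    counts = Counter(lang_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


def flag_underrepresented(
    shares: Dict[str, float],
    critical_langs: Iterable[str],
    min_share: float,
) -> List[str]:
    """List critical languages whose share falls below min_share."""
    return sorted(
        lang for lang in critical_langs if shares.get(lang, 0.0) < min_share
    )


# Hypothetical corpus: 8 English, 1 Chinese, 1 Spanish document.
tags = ["en"] * 8 + ["zh"] + ["es"]
shares = language_share(tags)
flagged = flag_underrepresented(shares, ["zh", "es", "ar"], min_share=0.15)
print(shares)
print(flagged)
```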
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Scope: only eight concrete object types and eleven languages; not exhaustive of world cultures.
- Abstract-value benchmarks rely on existing surveys (WVS, PCT) that have their own sampling and topical biases.
- Black-box evaluation on closed models; causes of bias (data vs alignment) are discussed but not causally isolated.
When Not To Use
- When you need dialect-level cultural nuance inside a single language: this benchmark treats language as a proxy for culture.
- To evaluate highly technical factual tasks where culture is irrelevant.
Failure Modes
- Producing culturally inappropriate items (e.g., listing Thanksgiving for Chinese queries).
- Ambiguous prompts (P2) can be ignored by the model.
- Abstract opinion alignment resists simple prompting; model still echoes English-centered opinion anchors.
Core Entities
Models
- text-davinci-003
- ChatGPT
- GPT-4
- GPT-3.5-turbo
- ERNIE (Baidu/Yiyan)
- GPT-4-1106
Metrics
- In-Culture Score (concrete items)
- Euclidean distance to human-survey results (abstract opinions)
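The distance metric for abstract opinions can be sketched as follows. It assumes model answers and human-survey baselines are encoded as equal-length numeric vectors (for example, one value per survey question); the encoding and the example vectors are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(
    model_answers: Sequence[float],
    human_baseline: Sequence[float],
) -> float:
    """Euclidean distance between a model's answers and a survey baseline."""
    if len(model_answers) != len(human_baseline):
        raise ValueError("answer vectors must have the same length")
    return math.sqrt(
        sum((m - h) ** 2 for m, h in zip(model_answers, human_baseline))
    )


# Hypothetical two-question example.
model_vector = [3.0, 4.0]
baseline_a = [3.0, 4.0]  # a baseline the model matches exactly
baseline_b = [0.0, 0.0]  # a baseline the model is far from

print(euclidean_distance(model_vector, baseline_a))
print(euclidean_distance(model_vector, baseline_b))
```

A smaller distance to one culture's baseline than to another's indicates which human population the model's stated opinions sit closer to.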
Datasets
- World Values Survey (WVS)
- Political Coordinates Test (PCT)
- Wikipedia (used to tag item cultural origin)
- FLoRes (translation BLEU used for language competence checks)
Benchmarks
- Multilingual concrete cultural objects set (8 object types, 11 languages)
- WVS/PCT-based abstract values benchmark
Context Entities
Models
- RLHF-trained GPT family models
Datasets
- Common Crawl (cited as pretraining data background for the GPT family)

