LLMs often answer with English-culture content when asked in other languages

October 19, 2023 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is practical and the experiments are clear; evidence shows strong effects on concrete items and consistent English-anchoring on opinion surveys, but scope is limited to 11 languages and selected objects.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, Michael R. Lyu


Why It Matters For Business

Unlocalized LLM outputs frustrate non-English users, harm trust and product adoption, and can cause reputational or regulatory risk if cultural mismatches appear in customer-facing content.

Summary TLDR

The paper builds a multilingual benchmark to show that popular GPT-family models (text-davinci-003, ChatGPT, GPT-4) tend to produce English-culture answers when prompted in non-English languages. The authors quantify this with an In-Culture Score (concrete items like holidays) and Euclidean distances on two cross-cultural surveys (abstract values). Findings: ChatGPT is heavily English-dominated for non-English queries (avg in-culture 1.4/10), GPT-4 is even more peaked toward English, and older text-davinci-003 is less dominated. Two practical fixes work: (1) pretrain on more balanced non-English data (example: ERNIE greatly improved Chinese outputs), and (2) a cheap deployment trick—explicit

Problem Statement

LLMs are trained mostly on English data. When non-English users ask subjective or culture-specific questions, the models often reply with items and opinions tied to English culture rather than the user's own culture. This mismatch risks poor user experience, cultural erasure, and biased downstream decisions.

Main Contribution

Constructed a multilingual benchmark for cultural dominance: 8 concrete object types (holidays, songs, books, movies, celebrities, heroes, history, mountains) across 11 languages and two public opinion surveys (World Values Survey and Political Coordinates Test).

Measured cultural dominance with two simple metrics: In-Culture Score for concrete items and Euclidean distance to human-survey baselines for abstract values.
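
The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes the In-Culture Score counts how many items in a model's answer list originate from the culture of the query language (consistent with the reported maximum of 10 per item list), and that survey answers are encoded as equal-length numeric vectors. The `ITEM_ORIGIN` lookup, its example entries, and both function names are hypothetical; the paper tags item origin using Wikipedia.

```python
import math

# Hypothetical, illustrative origin tags; a real audit would derive these
# from a tagging step (the paper uses Wikipedia for this).
ITEM_ORIGIN = {
    "Thanksgiving": "English",
    "Halloween": "English",
    "Spring Festival": "Chinese",
    "Mid-Autumn Festival": "Chinese",
}

def in_culture_score(items, target_culture, origin=ITEM_ORIGIN):
    """Count the items whose tagged origin matches the target culture."""
    return sum(1 for item in items if origin.get(item) == target_culture)

def euclidean_distance(model_answers, human_answers):
    """Euclidean distance between two equal-length answer vectors."""
    if len(model_answers) != len(human_answers):
        raise ValueError("answer vectors must have the same length")
    return math.sqrt(
        sum((m - h) ** 2 for m, h in zip(model_answers, human_answers))
    )
```

A higher In-Culture Score means more of the listed items belong to the query language's culture; a smaller distance means the model's survey answers sit closer to the human baseline for that culture.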

Key Findings

ChatGPT’s concrete outputs are English-dominated for non-English queries

Numbers: In-Culture Score: English 7.3 vs non-English avg 1.4 (ChatGPT, holidays & related objects)

Practical Use: If you deploy ChatGPT without localization, non-English users will often get English-centric answers; test and patch before shipping.

Evidence Ref: Table 3(a), Table 12

GPT family became more English-dominant over model versions

Numbers: Non-English avg In-Culture: text-davinci-003 3.1 → ChatGPT 1.4 → GPT-4 1.2

Practical Use: Newer base models may be more aligned to English-heavy safety/clarity data; do not assume newer = better for localization.

Evidence Ref: Table 3(a)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
In-Culture Score (ChatGPT, avg over concrete objects) | English 7.3, Non-English avg 1.4 | Higher is better (max 10 per item list) | Non-English much lower than English | Concrete objects (holidays, songs, books, movies, celebrities, heroes, history, mountains) | Table 3(a), Table 12 | Table 3(a)
In-Culture Score (text-davinci-003 vs GPT family, Non-English avg) | text-davinci-003 3.1; ChatGPT 1.4; GPT-4 1.2 | Higher is better | Older model less dominated | Concrete objects, Non-English languages combined | Table 3(a) | Table 3(a)
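
The qualitative deltas in these results can be made numeric with simple arithmetic on the reported scores. The snippet below is only a convenience calculation over the figures quoted in this summary; the variable names are illustrative.

```python
# In-Culture Scores as reported in this summary (max 10 per item list).
english_chatgpt = 7.3
non_english = {"text-davinci-003": 3.1, "ChatGPT": 1.4, "GPT-4": 1.2}

# Gap between English and non-English queries for ChatGPT.
gap = round(english_chatgpt - non_english["ChatGPT"], 1)

# Change in the non-English average relative to text-davinci-003.
baseline = non_english["text-davinci-003"]
deltas = {model: round(score - baseline, 1) for model, score in non_english.items()}

print(gap)     # 5.9
print(deltas)  # {'text-davinci-003': 0.0, 'ChatGPT': -1.7, 'GPT-4': -1.9}
```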

What To Try In 7 Days

Run the paper’s In-Culture Score on your product languages for a quick audit.

Add an explicit culture token in prompts (e.g., 'In the culture of [Chinese], {query}') for concrete info tasks and measure improvement.

Review your training/finetune data mix by language; flag critical user languages for extra data or fine-tuning.
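
The culture-token step can be tried with a small wrapper that compares scores with and without the cue. This is a sketch under stated assumptions: `ask_model` and `score_fn` are hypothetical callables you would supply (a model client that returns a list of items, and an In-Culture-style scorer), and the stubs below exist only to make the example runnable. Whether the cue should be written in English or in the query language is not specified here, so treat the English wording as a placeholder.

```python
from statistics import mean

def with_culture_prefix(query, culture):
    """Prepend the cue, following the 'In the culture of [X], {query}' pattern."""
    return f"In the culture of {culture}, {query}"

def compare_prompts(queries, culture, ask_model, score_fn):
    """Return (average plain score, average cued score) over the queries."""
    plain = mean(score_fn(ask_model(q), culture) for q in queries)
    cued = mean(
        score_fn(ask_model(with_culture_prefix(q, culture)), culture)
        for q in queries
    )
    return plain, cued

# Stub model and scorer, for illustration only (not real model behavior).
def fake_model(prompt):
    if prompt.startswith("In the culture of Chinese"):
        return ["Spring Festival", "Mid-Autumn Festival"]
    return ["Thanksgiving", "Spring Festival"]

def fake_score(items, culture):
    chinese_items = {"Spring Festival", "Mid-Autumn Festival"}
    return sum(item in chinese_items for item in items)

result = compare_prompts(["List two major holidays."], "Chinese", fake_model, fake_score)
```

Replacing the stubs with a real model client and a real origin tagger turns this into the quick audit described in the list above.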

Optimization Features

Training Optimization
Increase non-English pretraining data share
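
Before rebalancing data, it helps to know the current language mix. The sketch below computes per-language shares from a list of language tags and flags critical languages that fall under a chosen threshold. It assumes each training document already carries a language tag from some language-identification step; the function names and the 15% threshold are hypothetical, not values from the paper.

```python
from collections import Counter

def language_shares(language_tags):
    """Return each language's fraction of the tagged documents."""
    counts = Counter(language_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}

def underrepresented(shares, critical_langs, min_share):
    """List the critical languages whose share is below min_share."""
    return sorted(
        lang for lang in critical_langs if shares.get(lang, 0.0) < min_share
    )

# Toy corpus: 8 English, 1 Chinese, 1 German document.
tags = ["en"] * 8 + ["zh"] + ["de"]
shares = language_shares(tags)
flagged = underrepresented(shares, ["zh", "de", "ja"], min_share=0.15)
```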

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Scope: only eight concrete object types and eleven languages; not exhaustive of world cultures.

Abstract-value benchmarks rely on existing surveys (WVS, PCT) that have their own sampling and topical biases.

When Not To Use

When you need dialect-level cultural nuance within a single language: this benchmark treats language as a proxy for culture.

When evaluating highly technical factual tasks where culture is irrelevant.

Failure Modes

Producing culturally inappropriate items (e.g., listing Thanksgiving for Chinese queries).

Prompting that is ambiguous (P2) can be ignored by the model.

Core Entities

Models

text-davinci-003, ChatGPT, GPT-4, GPT-3.5-turbo, ERNIE (Baidu/Yiyan), GPT-4-1106

Metrics

In-Culture Score (concrete items); Euclidean distance to human-survey results (abstract opinions)

Datasets

World Values Survey (WVS); Political Coordinates Test (PCT); Wikipedia (used to tag item cultural origin); FLoRes (translation BLEU used for language competence checks)

Benchmarks

Multilingual concrete cultural objects set (8 object types, 11 languages); WVS/PCT-based abstract values benchmark

Context Entities

Models

RLHF-trained GPT family models

Datasets

Common Crawl (cited as pretraining background for the GPT family)