LLMs often answer with English-culture content when asked in other languages

Overview

Decision Snapshot: Ready For Pilot

The benchmark is practical and the experiments are clear; evidence shows strong effects on concrete items and consistent English-anchoring on opinion surveys, but scope is limited to 11 languages and selected objects.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, Michael R. Lyu

Why It Matters For Business

Unlocalized LLM outputs frustrate non-English users, harm trust and product adoption, and can cause reputational or regulatory risk if cultural mismatches appear in customer-facing content.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Founder

Summary TLDR

The paper builds a multilingual benchmark to show that popular GPT-family models (text-davinci-003, ChatGPT, GPT-4) tend to produce English-culture answers when prompted in non-English languages. The authors quantify this with an In-Culture Score (concrete items like holidays) and Euclidean distances on two cross-cultural surveys (abstract values). Findings: ChatGPT is heavily English-dominated for non-English queries (avg in-culture 1.4/10), GPT-4 is even more peaked toward English, and older text-davinci-003 is less dominated. Two practical fixes work: (1) pretrain on more balanced non-English data (example: ERNIE greatly improved Chinese outputs), and (2) a cheap deployment trick—explicit

Problem Statement

Large LLMs are trained mostly on English data. When non-English users ask subjective or culture-specific questions, the models often reply with items and opinions tied to English culture rather than the user’s culture. This mismatch risks poor user experience, cultural erasure, and biased downstream decisions.

Main Contribution

Constructed a multilingual benchmark for cultural dominance: 8 concrete object types (holidays, songs, books, movies, celebrities, heroes, history, mountains) across 11 languages and two public opinion surveys (World Values Survey and Political Coordinates Test).

Measured cultural dominance with two simple metrics: In-Culture Score for concrete items and Euclidean distance to human-survey baselines for abstract values.
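
As a rough illustration of the two metrics, the sketch below assumes the In-Culture Score is a simple count of returned items whose tagged culture matches the culture of the query language (so a 10-item list scores between 0 and 10), and that survey answers are encoded as equal-length numeric vectors. The function names, the item-to-culture tags, and the sample numbers are hypothetical; the paper's exact scoring procedure may differ.

```python
import math


def in_culture_score(items, item_culture, target_culture):
    """Count items whose tagged culture matches the target culture.

    Assumption: one point per matching item; untagged items score zero.
    """
    return sum(1 for item in items if item_culture.get(item) == target_culture)


def survey_distance(model_answers, human_answers):
    """Euclidean distance between two equal-length answer vectors."""
    if len(model_answers) != len(human_answers):
        raise ValueError("answer vectors must have the same length")
    return math.sqrt(
        sum((m - h) ** 2 for m, h in zip(model_answers, human_answers))
    )


# Hypothetical tags for a Chinese-language query about holidays.
item_culture = {
    "Spring Festival": "zh",
    "Mid-Autumn Festival": "zh",
    "Thanksgiving": "en",
}
holidays = ["Thanksgiving", "Spring Festival", "Mid-Autumn Festival"]

score = in_culture_score(holidays, item_culture, "zh")
distance = survey_distance([1.0, 2.0, 2.0], [1.0, 5.0, 6.0])
```

A higher `score` means more in-culture items; a smaller `distance` means the model's survey answers sit closer to the human baseline for that culture.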

Key Findings

ChatGPT’s concrete outputs are English-dominated for non-English queries

Numbers: In-Culture Score: English 7.3 vs non-English avg 1.4 (ChatGPT, holidays & related objects)

Practical Use: If you deploy ChatGPT without localization, non-English users will often get English-centric answers; test and patch before shipping.

Evidence Ref: Table 3(a), Table 12

GPT family became more English-dominant over model versions

Numbers: Non-English avg In-Culture: text-davinci-003 3.1 → ChatGPT 1.4 → GPT-4 1.2

Practical Use: Newer base models may be more aligned to English-heavy safety/clarity data; do not assume newer = better for localization.

Evidence Ref: Table 3(a)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
In-Culture Score (ChatGPT, avg over concrete objects)	English 7.3, Non-English avg 1.4	Higher is better (max 10 per item list)	Non-English much lower than English	Concrete objects (holidays, songs, books, movies, celebrities, heroes, history, mountains)	Table 3(a), Table 12	Table 3(a)
In-Culture Score (text-davinci-003 vs GPT family, Non-English avg)	text-davinci-003 3.1; ChatGPT 1.4; GPT-4 1.2	Higher is better	Older model less dominated	Concrete objects, Non-English languages combined	Table 3(a)	Table 3(a)

What To Try In 7 Days

Run the paper’s In-Culture Score on your product languages for a quick audit.

Add an explicit culture token in prompts (e.g., 'In the culture of [Chinese], {query}') for concrete info tasks and measure improvement.

Review your training/finetune data mix by language; flag critical user languages for extra data or fine-tuning.
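
The culture-prefix step in the list above could be scripted roughly as follows. This is a minimal sketch: the helper name `add_culture_prefix` is hypothetical, and it assumes the bracketed `[Chinese]` in the template is a placeholder that is filled in without the brackets.

```python
def add_culture_prefix(query, culture):
    """Prefix a query with an explicit culture hint.

    Assumption: follows the 'In the culture of [X], {query}' template,
    with X substituted as plain text.
    """
    query = query.strip()
    if not query:
        raise ValueError("query must not be empty")
    return f"In the culture of {culture}, {query}"


prompt = add_culture_prefix("please list 10 public holidays.", "Chinese")
```

Sending both the plain query and the prefixed `prompt` to the same model, then scoring both outputs, gives a before/after comparison for each product language.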

Optimization Features

Training Optimization

Increase non-English pretraining data share
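
Before changing the data mix, it may help to measure the current share per language. The sketch below assumes each training or fine-tuning example already carries a language tag; the function names, the threshold, and the sample tags are hypothetical.

```python
from collections import Counter


def language_shares(language_tags):
    """Return the fraction of examples per language tag."""
    counts = Counter(language_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


def underrepresented(shares, critical_languages, min_share):
    """List critical languages whose share falls below min_share."""
    return sorted(
        lang for lang in critical_languages if shares.get(lang, 0.0) < min_share
    )


# Hypothetical corpus: 8 English, 1 Chinese, 1 Spanish example.
tags = ["en"] * 8 + ["zh"] + ["es"]
shares = language_shares(tags)
flagged = underrepresented(shares, {"zh", "es", "de"}, min_share=0.15)
```

Languages in `flagged` are candidates for extra data collection or targeted fine-tuning; the right threshold depends on the product's user base.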

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Scope: only eight concrete object types and eleven languages; not exhaustive of world cultures.

Abstract-value benchmarks rely on existing surveys (WVS, PCT) that have their own sampling and topical biases.

When Not To Use

When you need dialect-level cultural nuance inside a single language; this benchmark treats language as a proxy for culture.

When evaluating highly technical factual tasks where culture is irrelevant.

Failure Modes

Producing culturally inappropriate items (e.g., listing Thanksgiving for Chinese queries).

Prompting that is ambiguous (P2) can be ignored by the model.

Core Entities

Models

text-davinci-003

ChatGPT

GPT-4

GPT-3.5-turbo

ERNIE (Baidu/Yiyan)

GPT-4-1106

Metrics

In-Culture Score (concrete items)

Euclidean distance to human-survey results (abstract opinions)

Datasets

World Values Survey (WVS)

Political Coordinates Test (PCT)

Wikipedia (used to tag item cultural origin)

FLoRes (translation BLEU used for language competence checks)

Benchmarks

Multilingual concrete cultural objects set (8 object types, 11 languages)

WVS/PCT-based abstract values benchmark

