LLMs often answer with English-culture content when asked in other languages

October 19, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

5

Authors

Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, Michael R. Lyu

Links

Abstract / PDF

Why It Matters For Business

Unlocalized LLM outputs frustrate non-English users, harm trust and product adoption, and can cause reputational or regulatory risk if cultural mismatches appear in customer-facing content.

Summary TLDR

The paper builds a multilingual benchmark showing that popular GPT-family models (text-davinci-003, ChatGPT, GPT-4) tend to produce English-culture answers when prompted in non-English languages. The authors quantify this with an In-Culture Score for concrete items such as holidays, and with Euclidean distances on two cross-cultural surveys for abstract values. Findings: ChatGPT is heavily English-dominated for non-English queries (average In-Culture Score 1.4/10), GPT-4 is even more skewed toward English, and the older text-davinci-003 is less dominated. Two practical fixes work: (1) pretraining on more balanced non-English data (for example, ERNIE greatly improved Chinese outputs), and (2) a cheap deployment trick: explicitly naming the target culture in the prompt.

Problem Statement

Large LLMs are trained mostly on English data. When non-English users ask subjective or culture-specific questions, the models often reply with items and opinions tied to English culture rather than the user’s culture. This mismatch risks poor user experience, cultural erasure, and biased downstream decisions.

Main Contribution

Constructed a multilingual benchmark for cultural dominance: 8 concrete object types (holidays, songs, books, movies, celebrities, heroes, history, mountains) across 11 languages and two public opinion surveys (World Values Survey and Political Coordinates Test).

Measured cultural dominance with two simple metrics: In-Culture Score for concrete items and Euclidean distance to human-survey baselines for abstract values.

Empirical analysis across GPT-family models showing English cultural dominance, its evolution across model versions, and two mitigation strategies (diverse pretraining and culture-aware prompting).
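The In-Culture Score for concrete items can be sketched as a simple count. This is a minimal illustration, not the authors' code: it assumes the score is the number of items in a 10-item answer whose tagged origin matches the culture of the query language. The function name `in_culture_score`, the `origin_of` lookup, and the sample answer are all hypothetical.

```python
from typing import Dict, List


def in_culture_score(items: List[str], origin_of: Dict[str, str], target_culture: str) -> int:
    """Count the listed items whose tagged origin is the target culture.

    Items missing from `origin_of` are treated as out-of-culture (an assumption).
    """
    return sum(1 for item in items if origin_of.get(item) == target_culture)


# Hypothetical model answer to a Chinese-language "list 10 holidays" query.
answer = [
    "Christmas", "Thanksgiving", "Spring Festival", "Halloween", "Easter",
    "Mid-Autumn Festival", "Valentine's Day", "New Year's Day",
    "Independence Day", "Black Friday",
]

# Hypothetical origin tags; the summary says the paper used Wikipedia to tag origins.
origin = {
    "Spring Festival": "Chinese",
    "Mid-Autumn Festival": "Chinese",
    "Thanksgiving": "English",
    "Independence Day": "English",
}

score = in_culture_score(answer, origin, "Chinese")
print(f"In-Culture Score: {score}/{len(answer)}")
```

Averaging this count over object types and languages would give figures comparable in form to the per-model averages reported in this summary.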

Key Findings

ChatGPT’s concrete outputs are English-dominated for non-English queries

Numbers: In-Culture Score: English 7.3 vs non-English avg 1.4 (ChatGPT, holidays & related objects)

GPT family became more English-dominant over model versions

Numbers: Non-English avg In-Culture: text-davinci-003 3.1 → ChatGPT 1.4 → GPT-4 1.2

Simple mitigation methods substantially reduce English dominance

Numbers: Prompting P1 raised ChatGPT non-English In-Culture avg from 1.4 to 9.9; ERNIE (diverse pretrain) Chinese holiday score 7.

Abstract cultural opinions remain close to English anchors despite language

Numbers: ChatGPT non-English Euclidean distance to human reference (WVS) 0.39 vs to English human 0.10

Results

In-Culture Score (ChatGPT, avg over concrete objects)

Value: English 7.3, Non-English avg 1.4

Baseline: Higher is better (max 10 per item list)

In-Culture Score (text-davinci-003 vs GPT family, Non-English avg)

Value: text-davinci-003 3.1; ChatGPT 1.4; GPT-4 1.2

Baseline: Higher is better

Prompting effect (ChatGPT)

Value: None → P1: English 7.3 → 10.0; Non-English 1.4 → 9.9

Baseline: In-Culture Score

Pretraining effect (ERNIE vs GPT-4, Chinese)

Value: Concrete In-Culture Chinese: ERNIE 7.6 vs GPT-4 1.8; Abstract WVS distance Chinese: ERNIE 0.24 vs GPT-4 0.34

Baseline: Lower Euclidean distance is better for abstract; higher In-Culture is better for concrete

Abstract opinion alignment (ChatGPT, WVS/PCT)

Value: Non-English outputs are closer to English human/model anchors than to local human references (e.g., WVS HRef 0.39 vs HEn 0.10)

Baseline: Euclidean distance to human survey results
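The abstract-value comparison can be sketched as a distance between score vectors. This is a minimal illustration under stated assumptions: each survey result is reduced to a short numeric vector, and the coordinates below are made-up values chosen only to show the pattern, not figures from the paper. The names `model_answer`, `local_reference`, and `english_reference` are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length score vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical two-dimensional survey scores (illustrative values only).
model_answer = (0.6, 0.5)       # model queried in a non-English language
local_reference = (0.3, 0.1)    # human respondents from that language's culture
english_reference = (0.6, 0.4)  # English-speaking human respondents

d_local = euclidean_distance(model_answer, local_reference)
d_english = euclidean_distance(model_answer, english_reference)

# English dominance shows up as the model sitting nearer the English anchor.
print(f"to local reference: {d_local:.2f}, to English reference: {d_english:.2f}")
print("closer to English anchor:", d_english < d_local)
```

A smaller distance to the English reference than to the local reference is the pattern the summary reports for ChatGPT on WVS.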

Who Should Care

What To Try In 7 Days

Run the paper’s In-Culture Score on your product languages for a quick audit.

Add an explicit culture token in prompts (e.g., 'In the culture of [Chinese], {query}') for concrete info tasks and measure improvement.

Review your training/finetune data mix by language; flag critical user languages for extra data or fine-tuning.
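The culture-token step above can be prototyped with a small prompt wrapper. This is a sketch, assuming the template quoted in the list ("In the culture of [Chinese], {query}") is applied as a plain prefix; the function name `culture_aware_prompt` and the sample queries are hypothetical, and a production version would likely phrase the prefix in the target language.

```python
from typing import Dict


def culture_aware_prompt(query: str, culture: str) -> str:
    """Prefix a query with an explicit culture cue."""
    return f"In the culture of {culture}, {query}"


# Hypothetical audit set: one concrete-item query per product culture.
queries: Dict[str, str] = {
    "Chinese": "list ten public holidays.",
    "Spanish": "list ten famous songs.",
}

# Build the culture-aware variant of each query for a before/after comparison.
prompts = {
    culture: culture_aware_prompt(query, culture)
    for culture, query in queries.items()
}

for culture, prompt in prompts.items():
    print(f"{culture}: {prompt}")
```

Sending both the plain and the wrapped query to the model, then scoring each answer with the In-Culture Score, gives the improvement measurement the second step calls for.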

Optimization Features

Training Optimization

  • Increase non-English pretraining data share
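A quick way to see how far a corpus is from a balanced mix is to compute per-language shares and flag critical languages that fall below a threshold. This is a rough sketch, assuming each training example already carries a language tag; the function names, the tags, and the 15% threshold are illustrative assumptions, not values from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, List


def language_shares(lang_tags: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of examples per language tag."""
    counts = Counter(lang_tags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no examples provided")
    return {lang: n / total for lang, n in counts.items()}


def underrepresented(shares: Dict[str, float], critical: Iterable[str], min_share: float) -> List[str]:
    """List critical languages whose share is below min_share (absent ones count as 0)."""
    return sorted(lang for lang in critical if shares.get(lang, 0.0) < min_share)


# Hypothetical fine-tuning set: 8 English, 1 Chinese, 1 Spanish example.
tags = ["en"] * 8 + ["zh"] + ["es"]
shares = language_shares(tags)

# Flag critical user languages holding less than an assumed 15% of the data.
flagged = underrepresented(shares, critical=["zh", "es", "ar"], min_share=0.15)
print(shares)
print("needs more data:", flagged)
```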

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Scope: only eight concrete object types and eleven languages; not exhaustive of world cultures.
  • Abstract-value benchmarks rely on existing surveys (WVS, PCT) that have their own sampling and topical biases.
  • Black-box evaluation on closed models; causes of bias (data vs alignment) are discussed but not causally isolated.

When Not To Use

  • When you need dialect-level cultural nuance inside a single language; this benchmark treats language as a proxy for culture.
  • To evaluate highly technical factual tasks where culture is irrelevant.

Failure Modes

  • Producing culturally inappropriate items (e.g., listing Thanksgiving for Chinese queries).
  • Prompting that is ambiguous (P2) can be ignored by the model.
  • Abstract opinion alignment resists simple prompting; model still echoes English-centered opinion anchors.

Core Entities

Models

  • text-davinci-003
  • ChatGPT
  • GPT-4
  • GPT-3.5-turbo
  • ERNIE (Baidu/Yiyan)
  • GPT-4-1106

Metrics

  • In-Culture Score (concrete items)
  • Euclidean distance to human-survey results (abstract opinions)

Datasets

  • World Values Survey (WVS)
  • Political Coordinates Test (PCT)
  • Wikipedia (used to tag item cultural origin)
  • FLoRes (translation BLEU used for language competence checks)

Benchmarks

  • Multilingual concrete cultural objects set (8 object types, 11 languages)
  • WVS/PCT-based abstract values benchmark

Context Entities

Models

  • RLHF-trained GPT family models

Datasets

  • Common Crawl (cited as pretraining data background for the GPT family)