LLMs give less accurate answers, refuse more often, and are sometimes condescending toward users with low English proficiency, less formal education, or non-US origin.

June 25, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The study uses public datasets and clear experiments across three popular models, giving moderately strong evidence that personalization-like cues change outcomes; limitations in simulated personas and model coverage reduce generality.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 30%

Authors

Elinor Poole-Dayan, Deb Roy, Jad Kabbara

Links

Abstract / PDF / Data

Why It Matters For Business

Personalization or stored user profiles can make models give worse, withholding, or patronizing answers to already vulnerable groups, risking harm, trust loss, and regulatory exposure.

Who Should Care

Summary TLDR

The authors test GPT-4, Claude 3 Opus, and Llama 3‑8B on two public QA datasets (TruthfulQA and SciQ) while prepending short user bios that vary by English proficiency, education, and country. All three models often drop accuracy and/or increase refusals for users who are non‑native English speakers, have lower formal education, or come from outside the US. Claude shows the largest, most consistent drops and a much higher refusal rate for low‑education foreign users (≈10.9% refusals vs 3.6% control). The work highlights a realistic risk: personalization or stored user profiles could systematically give worse or patronizing answers to vulnerable groups.

Problem Statement

Do widely used LLMs change answer quality when told simple user traits (English proficiency, education level, country)? Specifically, do accuracy, truthfulness, and refusal behavior degrade for users who are less educated, are non-native English speakers, or come from outside the US, and do the effects compound where these traits intersect?

Main Contribution

Designed controlled "bios" (user personas) to test how simple user traits affect LLM answers.

Evaluated GPT‑4, Claude 3 Opus, and Llama 3‑8B on TruthfulQA (truthfulness) and SciQ (factuality) with and without bios.
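The bio-prepending setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Persona` fields, the bio wording, and the sample question are all assumptions made for the example, and the study's actual bios are longer and partly LLM-generated.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record; field names are illustrative, not from the paper."""
    english: str    # e.g. "native" or "non-native"
    education: str  # e.g. "high" or "low"
    country: str    # e.g. "US" or "Iran"

    def bio(self) -> str:
        # Placeholder wording, assumed for this sketch only.
        return (
            f"About me: I am from {self.country}. "
            f"My English is {self.english} and my formal education level is {self.education}."
        )


def build_prompt(question: str, persona: Optional[Persona] = None) -> str:
    """Return the question alone (control) or with the persona bio prepended."""
    if persona is None:
        return question
    return f"{persona.bio()}\n\n{question}"


def persona_grid(english, education, countries):
    """Enumerate every intersection of the three traits."""
    return [Persona(e, d, c) for e, d, c in product(english, education, countries)]


personas = persona_grid(["native", "non-native"], ["high", "low"], ["US", "Iran"])
question = "What is the boiling point of water at sea level?"  # illustrative question
prompts = [build_prompt(question)] + [build_prompt(question, p) for p in personas]
```

Keeping the control prompt in the same list as the persona prompts makes it easy to send every variant of a question through the same model call and compare the answers side by side.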

Key Findings

Accuracy falls for lower‑education bios across models, with Claude showing large drops.

Numbers: Claude TruthfulQA: control 78.17% → Iran low-edu 66.22% (−11.95 pts)

Practical Use: If your system personalizes or stores education-like traits, expect measurable accuracy loss for some users; audit model outputs by education level and add verification for answers shown to lower-education personas.

Evidence Ref: Table 1 (Claude percent correct) and Section 5.1

Non‑native English bios reduce accuracy for all models on TruthfulQA.

Numbers: TruthfulQA: all models show statistically significant lower accuracy for non-native speakers (p<0.05)

Practical Use: Test your model with non-native English prompts. Do not assume English-centric tuning generalizes; add multilingual or robustness checks.

Evidence Ref: Figure 1a and Section 5.2
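To check whether an accuracy gap between a control run and a persona run is statistically significant, a chi-square test on the correct/incorrect counts is one option. The sketch below uses a plain Pearson chi-square on a 2x2 table with no continuity correction; the summary only says chi-square p-values were reported, so the exact variant is an assumption, and the counts in the example are invented for illustration rather than taken from the paper.

```python
import math


def chi_square_2x2(correct_a: int, total_a: int, correct_b: int, total_b: int):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are conditions (e.g. control vs persona); columns are correct / incorrect.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    a, b = correct_a, total_a - correct_a
    c, d = correct_b, total_b - correct_b
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("a row or column total is zero; the test is undefined")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts for illustration only: 780/1000 correct vs 662/1000 correct.
stat, p = chi_square_2x2(780, 1000, 662, 1000)
significant = p < 0.05
```

With many personas tested at once, a correction for multiple comparisons would be a sensible addition before treating any single p-value as conclusive.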

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-4 81.00%, Claude 78.17%, Llama3 44.11% | | | TruthfulQA (overall) | Table 1 and Table 3 | Table 1
Accuracy | GPT-4 96.17%, Claude 95.60%, Llama3 88.70% | | | SciQ (overall) | Table 1 and Table 3 | Table 1

What To Try In 7 Days

Run persona tests: prepend bios reflecting different education, English levels, and countries and compare accuracy and refusals.

Log and audit refusal rates and answer wording by persona; flag high refusal + condescension patterns.

Disable or standardize personalization for high‑risk info channels until fairness checks pass.
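The logging and auditing steps above could start from something like the following sketch. It assumes each model response has already been graded and stored as a record with hypothetical keys `persona`, `correct`, and `refused`; the 5-point thresholds are arbitrary placeholders, and reviewing answer wording for condescension is left out because it needs human or model-based judgment.

```python
from collections import defaultdict


def audit_by_persona(records, control="control"):
    """Aggregate accuracy and refusal rate (%) per persona and compare with control.

    Each record is a dict with hypothetical keys:
    'persona' (str), 'correct' (bool), 'refused' (bool).
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for r in records:
        c = counts[r["persona"]]
        c["n"] += 1
        c["correct"] += bool(r["correct"])
        c["refused"] += bool(r["refused"])

    if control not in counts:
        raise ValueError(f"no records for control persona {control!r}")

    rates = {
        name: {
            "accuracy": 100.0 * c["correct"] / c["n"],
            "refusal": 100.0 * c["refused"] / c["n"],
        }
        for name, c in counts.items()
    }
    base = rates[control]
    return {
        name: {
            **r,
            # Differences are in percentage points relative to the control run.
            "accuracy_delta_pts": r["accuracy"] - base["accuracy"],
            "refusal_delta_pts": r["refusal"] - base["refusal"],
        }
        for name, r in rates.items()
    }


def flag_personas(report, control="control", max_accuracy_drop=5.0, max_refusal_rise=5.0):
    """Return persona names whose gap to control exceeds either threshold (in points)."""
    return sorted(
        name
        for name, r in report.items()
        if name != control
        and (r["accuracy_delta_pts"] < -max_accuracy_drop
             or r["refusal_delta_pts"] > max_refusal_rise)
    )


# Toy records, invented for illustration.
records = (
    [{"persona": "control", "correct": True, "refused": False}] * 8
    + [{"persona": "control", "correct": False, "refused": False}] * 2
    + [{"persona": "low_edu_non_us", "correct": True, "refused": False}] * 6
    + [{"persona": "low_edu_non_us", "correct": False, "refused": False}] * 2
    + [{"persona": "low_edu_non_us", "correct": False, "refused": True}] * 2
)
report = audit_by_persona(records)
flagged = flag_personas(report)
```

Reporting gaps in percentage points against the control run mirrors how the findings above are stated, which makes an internal audit directly comparable with the paper's numbers.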

Agent Features

Memory
persona/memory prompts (bios used to simulate stored user info)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

TruthfulQA, SciQ

Risks & Boundaries

Limitations

Controlled bios are partly LLM‑generated and may caricature or exaggerate traits.

Only three models were tested (two closed, one open), so results may not generalize to all models.

When Not To Use

Do not assume fairness if a model passes generic benchmarks without persona testing.

Avoid deploying aggressive personalization or memory features for high‑stakes information without audits.

Failure Modes

Withholding correct answers selectively for certain personas.

Producing incorrect or misleading answers more often for low-education users or non-native English speakers.

Core Entities

Models

GPT-4, Claude 3 Opus, Llama 3-8B

Metrics

Accuracy, refusal rate (%), statistical significance (Chi-square p-values)

Datasets

TruthfulQA, SciQ