LLMs give less accurate, more withholding, and sometimes condescending answers to users with low English proficiency, less formal education, or non‑US origin.

Overview

Decision Snapshot: Needs Validation

The study uses public datasets and clear experiments across three popular models, giving moderately strong evidence that personalization-like cues change outcomes; limitations in simulated personas and model coverage reduce generality.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 30%

Authors

Elinor Poole-Dayan, Deb Roy, Jad Kabbara

Links

Abstract / PDF / Data

Why It Matters For Business

Personalization or stored user profiles can make models give worse, withholding, or patronizing answers to already vulnerable groups, risking harm, trust loss, and regulatory exposure.

Who Should Care

CTO, Product Manager, ML Engineer, CEO, Founder

Summary TLDR

The authors test GPT-4, Claude 3 Opus, and Llama 3‑8B on two public QA datasets (TruthfulQA and SciQ) while prepending short user bios that vary by English proficiency, education, and country. All three models often drop accuracy and/or increase refusals for users who are non‑native English speakers, have lower formal education, or come from outside the US. Claude shows the largest, most consistent drops and a much higher refusal rate for low‑education foreign users (≈10.9% refusals vs 3.6% control). The work highlights a realistic risk: personalization or stored user profiles could systematically give worse or patronizing answers to vulnerable groups.

Problem Statement

Do widely used LLMs change answer quality when told simple user traits (English skill, education level, country)? Specifically, do accuracy, truthfulness, and refusal behavior degrade for less educated, non‑native English, or non‑US users, and do effects compound at intersections?

Main Contribution

Designed controlled "bios" (user personas) to test how simple user traits affect LLM answers.

Evaluated GPT‑4, Claude 3 Opus, and Llama 3‑8B on TruthfulQA (truthfulness) and SciQ (factuality) with and without bios.
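The setup described above can be sketched as a small prompt builder. This is a minimal sketch and not the authors' code: the `Persona` fields, the bio wording, and the multiple-choice prompt layout are all hypothetical placeholders chosen for illustration, and the study's actual bios and templates should be taken from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Persona:
    """Hypothetical user traits; field names are assumptions for illustration."""
    english: str     # e.g. "native" or "non-native"
    education: str   # e.g. "high" or "low"
    country: str     # e.g. "US" or "Iran"

    def bio(self) -> str:
        # Placeholder wording, not the bios used in the study.
        return (
            f"About me: I am from {self.country}. "
            f"My English is {self.english} and my formal education level is {self.education}."
        )


def build_prompt(question: str, choices: Sequence[str],
                 persona: Optional[Persona] = None) -> str:
    """Prepend an optional bio to a multiple-choice question; None is the control."""
    lines: List[str] = []
    if persona is not None:
        lines.append(persona.bio())
        lines.append("")
    lines.append(f"Question: {question}")
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def persona_grid(english: Sequence[str], education: Sequence[str],
                 countries: Sequence[str]) -> List[Persona]:
    """Cross all trait values so that intersections can be tested too."""
    return [Persona(e, ed, c) for e, ed, c in product(english, education, countries)]
```

Keeping the question text identical between the control and every persona means any change in accuracy or refusals can be attributed to the bio alone.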

Key Findings

Accuracy falls for lower‑education bios across models, with Claude showing large drops.

Numbers: Claude TruthfulQA: control 78.17% → Iran low‑edu 66.22% (−11.95 pts)

Practical Use: If your system personalizes or stores education-like traits, expect measurable accuracy loss for some users; audit model outputs by education level and add verification for answers shown to lower‑education personas.

Evidence Ref: Table 1 (Claude percent correct) and Section 5.1

Non‑native English bios reduce accuracy for all models on TruthfulQA.

Numbers: On TruthfulQA, all models show significantly lower accuracy for non‑native speakers (p < 0.05)

Practical Use: Test your model with non‑native English prompts. Do not assume English‑centric tuning generalizes; add multilingual or robustness checks.

Evidence Ref: Figure 1a and Section 5.2
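Significance claims like the one above are reported as chi-square p-values. As a rough illustration of how such a comparison might be run on your own persona tests, the sketch below applies a Pearson chi-square test (no continuity correction) to a 2x2 table of correct and incorrect counts. The counts are made up for the example and are not taken from the paper, and the exact test configuration the authors used is not specified here.

```python
import math


def chi_square_2x2(correct_a: int, total_a: int, correct_b: int, total_b: int):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are conditions (a = control, b = persona); columns are correct / incorrect.
    Returns (statistic, p_value) with one degree of freedom.
    Assumes every expected cell count is nonzero.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts for illustration only; not taken from the paper.
stat, p = chi_square_2x2(correct_a=640, total_a=800, correct_b=560, total_b=800)
print(f"chi2={stat:.2f}, p={p:.4g}, significant at 0.05: {p < 0.05}")
```

When many personas are compared against the same control, a multiple-comparison correction is worth considering before reading individual p-values as findings.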

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 81.00%, Claude 78.17%, Llama3 44.11%	—	—	TruthfulQA (overall)	Table 1 and Table 3	Table 1
Accuracy	GPT-4 96.17%, Claude 95.60%, Llama3 88.70%	—	—	SciQ (overall)	Table 1 and Table 3	Table 1

What To Try In 7 Days

Run persona tests: prepend bios reflecting different education, English levels, and countries and compare accuracy and refusals.

Log and audit refusal rates and answer wording by persona; flag high refusal + condescension patterns.

Disable or standardize personalization for high‑risk info channels until fairness checks pass.
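The audit steps above could start from something like the following. It is a minimal sketch under stated assumptions: the record keys (`persona`, `correct`, `refused`), the `"control"` label, and the flagging thresholds are hypothetical and should be replaced with whatever your logging and risk tolerance dictate.

```python
from collections import defaultdict


def audit_by_persona(records, control="control"):
    """Summarize accuracy and refusal rate per persona and the gap to the control.

    Each record is a dict with hypothetical keys:
      "persona" (str), "correct" (bool), "refused" (bool).
    Rates are percentages; deltas are in percentage points.
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for r in records:
        c = counts[r["persona"]]
        c["n"] += 1
        c["correct"] += int(r["correct"])
        c["refused"] += int(r["refused"])

    if control not in counts:
        raise ValueError(f"no records for control persona {control!r}")

    summary = {}
    for persona, c in counts.items():
        summary[persona] = {
            "accuracy": 100.0 * c["correct"] / c["n"],
            "refusal_rate": 100.0 * c["refused"] / c["n"],
        }

    base_accuracy = summary[control]["accuracy"]
    base_refusal = summary[control]["refusal_rate"]
    for s in summary.values():
        s["accuracy_delta_pts"] = s["accuracy"] - base_accuracy
        s["refusal_delta_pts"] = s["refusal_rate"] - base_refusal
    return summary


def flag_personas(summary, max_accuracy_drop=5.0, max_refusal_rise=3.0):
    """Return personas whose gaps exceed the thresholds (arbitrary, illustrative values)."""
    return sorted(
        p for p, s in summary.items()
        if s["accuracy_delta_pts"] < -max_accuracy_drop
        or s["refusal_delta_pts"] > max_refusal_rise
    )
```

Reporting gaps in percentage points relative to the no-bio control matches how the findings above are expressed (for example, 78.17% → 66.22% is −11.95 pts). Condescension in answer wording is not captured by these counts and would need a separate review of the logged responses.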

Agent Features

Memory

persona/memory prompts (bios used to simulate stored user info)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

TruthfulQA, SciQ

Risks & Boundaries

Limitations

Controlled bios are partly LLM‑generated and may caricature or exaggerate traits.

Only three models were tested (two closed, one open), so results may not generalize to all models.

When Not To Use

Do not assume fairness if a model passes generic benchmarks without persona testing.

Avoid deploying aggressive personalization or memory features for high‑stakes information without audits.

Failure Modes

Withholding correct answers selectively for certain personas.

Producing incorrect or misleading answers more often for low‑education or non‑native English users.

Core Entities

Models

GPT-4, Claude 3 Opus, Llama 3-8B

Metrics

Accuracy, refusal rate (%), statistical significance (chi-square p-values)

Datasets

TruthfulQA, SciQ
