LLMs give less accurate, more withholding, and sometimes condescending answers to users with lower English proficiency, less formal education, or non‑US origin.

June 25, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

1

Authors

Elinor Poole-Dayan, Deb Roy, Jad Kabbara

Why It Matters For Business

Personalization or stored user profiles can make models give worse, withholding, or patronizing answers to already vulnerable groups, risking harm, trust loss, and regulatory exposure.

Summary TLDR

The authors test GPT-4, Claude 3 Opus, and Llama 3‑8B on two public QA datasets (TruthfulQA and SciQ) while prepending short user bios that vary by English proficiency, education, and country. All three models are often less accurate, refuse more often, or both for users who are non‑native English speakers, have less formal education, or come from outside the US. Claude shows the largest, most consistent accuracy drops and a much higher refusal rate for low‑education foreign users (≈10.9% refusals vs 3.6% for the control). The work highlights a realistic risk: personalization or stored user profiles could systematically give worse or patronizing answers to vulnerable groups.

Problem Statement

Do widely used LLMs change answer quality when given simple user traits (English proficiency, education level, country)? Specifically, do accuracy, truthfulness, and refusal behavior degrade for less educated, non‑native English‑speaking, or non‑US users, and do the effects compound when these traits intersect?

Main Contribution

Designed controlled "bios" (user personas) to test how simple user traits affect LLM answers.

Evaluated GPT‑4, Claude 3 Opus, and Llama 3‑8B on TruthfulQA (truthfulness) and SciQ (factuality) with and without bios.

Showed systematic drops in accuracy and higher refusal/condescension rates for low‑education, non‑native English, and some non‑US bios.

Documented compounded harms at intersections (e.g., low education + non‑native English + certain countries).

Manually analyzed refusal language and identified frequent condescending or patronizing responses in one model.
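The bio-prepending setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: the trait values, the bio wording, the persona key format, and the sample question are all hypothetical placeholders, and the paper's actual bios are not reproduced here.

```python
from itertools import product

# Hypothetical trait values chosen for illustration only.
EDUCATION = ["high", "low"]
ENGLISH = ["native", "non-native"]
COUNTRY = ["US", "Iran"]


def make_bio(education: str, english: str, country: str) -> str:
    """Return a short, illustrative first-person bio for one trait combination."""
    schooling = (
        "I have a graduate degree" if education == "high" else "I left school early"
    )
    language = (
        "English is my first language"
        if english == "native"
        else "English is not my first language"
    )
    return f"I am from {country}. {schooling}. {language}."


def build_prompts(question: str) -> dict:
    """Map each persona key, plus a no-bio control, to the full prompt text."""
    prompts = {"control": question}  # control: the question with no bio
    for edu, eng, country in product(EDUCATION, ENGLISH, COUNTRY):
        key = f"{country}|{edu}-edu|{eng}"
        prompts[key] = f"{make_bio(edu, eng, country)}\n\n{question}"
    return prompts


question = "What happens if you swallow gum?"  # placeholder question
prompts = build_prompts(question)
print(len(prompts))  # 8 persona prompts plus 1 control
```

Each prompt would then be sent to the model under test, with the control answer serving as the baseline for that question.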

Key Findings

Accuracy falls for lower‑education bios across models, with Claude showing large drops.

Numbers: Claude on TruthfulQA, control 78.17% → Iran low‑edu 66.22% (−11.95 pts)

Non‑native English bios reduce accuracy for all models on TruthfulQA.

Numbers: on TruthfulQA, all models show statistically significantly lower accuracy for non‑native speakers (p<0.05)

Refusal rate and condescending responses rise sharply for low‑education foreign bios in Claude.

Numbers: Claude refusals, control 3.61% → foreign low‑edu 10.9% (+7.29 pts); 43.74% of those refusals labeled condescending

Effects compound at intersections: worst performance when low education and non‑native English co‑occur.

Numbers: largest drops observed for users who are both non‑native and less educated (multiple significance tests reported)
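The "pts" figures above are percentage-point differences, which are easy to confuse with relative changes. The short sketch below recomputes both from the reported percentages; the helper names are hypothetical, and the relative-change values are derived here for illustration rather than taken from the paper.

```python
def point_change(control: float, treated: float) -> float:
    """Difference in percentage points between two percentages."""
    return round(treated - control, 2)


def relative_change(control: float, treated: float) -> float:
    """Change relative to the control value, expressed in percent."""
    return round(100 * (treated - control) / control, 1)


# Claude on TruthfulQA: control vs Iran low-education bio
print(point_change(78.17, 66.22))     # -11.95
print(relative_change(78.17, 66.22))  # -15.3

# Claude refusal rate: control vs foreign low-education bios
print(point_change(3.61, 10.9))       # 7.29
print(relative_change(3.61, 10.9))    # 201.9
```

A 7.29-point rise sounds small, but relative to a 3.61% baseline it means the refusal rate roughly triples.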

Results

Accuracy (TruthfulQA, control)

Value: GPT-4 81.00%, Claude 78.17%, Llama 3‑8B 44.11%

Accuracy (SciQ)

Value: GPT-4 96.17%, Claude 95.60%, Llama 3‑8B 88.70%

Claude refusal rate

Value: control 3.61% → foreign low‑edu 10.9%

Baseline: control 3.61%

Claude TruthfulQA worst drop example

Value: control 78.17% → Iran low‑edu 66.22%

Baseline: control 78.17%

Who Should Care

What To Try In 7 Days

Run persona tests: prepend bios reflecting different education, English levels, and countries and compare accuracy and refusals.

Log and audit refusal rates and answer wording by persona; flag high refusal + condescension patterns.

Disable or standardize personalization for high‑risk info channels until fairness checks pass.
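The first two steps can be sketched as a small audit over logged results. This is a minimal sketch that assumes responses have already been collected and labeled as correct or refused; the record schema, the persona names, and the 3-point gap thresholds are hypothetical choices, not values from the paper.

```python
from collections import defaultdict


def audit(records, control="control", max_refusal_gap=3.0, max_accuracy_gap=3.0):
    """Compute per-persona accuracy and refusal rates (in percent) and flag
    personas whose gap to the control exceeds the given thresholds.

    records: iterable of dicts with keys "persona", "correct", "refused".
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for rec in records:
        t = totals[rec["persona"]]
        t["n"] += 1
        t["correct"] += int(rec["correct"])
        t["refused"] += int(rec["refused"])

    rates = {
        persona: {
            "accuracy": 100 * t["correct"] / t["n"],
            "refusal": 100 * t["refused"] / t["n"],
        }
        for persona, t in totals.items()
    }

    base = rates[control]
    flagged = sorted(
        persona
        for persona, r in rates.items()
        if persona != control
        and (
            r["refusal"] - base["refusal"] > max_refusal_gap
            or base["accuracy"] - r["accuracy"] > max_accuracy_gap
        )
    )
    return rates, flagged


def toy_records(persona, n, n_correct, n_refused):
    """Build n made-up log records for one persona (illustration only)."""
    return [
        {"persona": persona, "correct": i < n_correct, "refused": i >= n - n_refused}
        for i in range(n)
    ]


records = (
    toy_records("control", 10, 8, 0)
    + toy_records("high-edu", 10, 8, 0)
    + toy_records("low-edu", 10, 6, 2)
)
rates, flagged = audit(records)
print(flagged)  # ['low-edu']
```

In practice the thresholds would need to reflect sample size; with few questions per persona, a gap of a few points can arise by chance.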

Agent Features

Memory

  • persona/memory prompts (bios used to simulate stored user info)

Reproducibility

Data URLs

  • TruthfulQA
  • SciQ

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Controlled bios are partly LLM‑generated and may caricature or exaggerate traits.
  • Only three models were tested (two closed, one open), so results may not generalize to all models.
  • Limited country set and education/English axes; real users have richer, noisier signals.
  • Setup (explicit bios prepended) is a simplified proxy for real personalization or inferred traits.

When Not To Use

  • Do not assume fairness if a model passes generic benchmarks without persona testing.
  • Avoid deploying aggressive personalization or memory features for high‑stakes information without audits.
  • Don't rely on a single model's behavior; test across multiple model versions and settings.

Failure Modes

  • Withholding correct answers selectively for certain personas.
  • Producing incorrect or misleading answers more often for low‑education or non‑native English users.
  • Using patronizing or mocking language in refusals or explanations.
  • Compound harms when multiple vulnerable traits co‑occur.

Core Entities

Models

  • GPT-4
  • Claude 3 Opus
  • Llama 3-8B

Metrics

  • Accuracy
  • refusal rate (%)
  • statistical significance (Chi-square p-values)
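For readers unfamiliar with the chi-square metric listed above, the sketch below runs a Pearson chi-square test on a 2×2 table of correct and incorrect answers for a control and a persona condition. The counts are made up for illustration, and the absence of a continuity correction is an assumption; the summary does not say which variant the authors used. With one degree of freedom, the p-value has the closed form erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (no continuity correction) on the table
    [[a, b], [c, d]]. Returns (statistic, p_value) for 1 degree of freedom."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 dof.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows are conditions, columns are (correct, incorrect).
stat, p_value = chi_square_2x2(780, 220, 660, 340)
print(round(stat, 2))  # 35.71
print(p_value < 0.05)  # True
```

When many persona-versus-control comparisons are run, the raw p-values should be read with multiple comparisons in mind.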

Datasets

  • TruthfulQA
  • SciQ