LLMs link age, beauty, school, and nationality to unrelated good/bad traits

September 16, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Mahammed Kamruzzaman, Md. Minul Islam Shovon, Gene Louis Kim

Why It Matters For Business

LLMs systematically link group cues (age, looks, school, nationality) to unrelated good/bad traits; using them without checks can introduce subtle discrimination into hiring, evaluation, or recommendation workflows.

Summary TLDR

The authors build a templated dataset (11,940 fill-in-the-blank items) to test whether modern LLMs make broad positive/negative associations (not just narrow stereotypes) across four understudied domains: age, beauty, academic institutions, and nationality. They run two directions: SAI (stimulus→attribute) and ASA (attribute→stimulus). Four large models (GPT-4, PaLM-2, Llama-2-13B, Mistral-7B) plus GPT-3.5 show statistically significant correlations between stimulus polarity and generated attribute polarity. Beauty and institution biases are strongest; ageism varies by model; nationality bias is stronger when inferring nationality from traits. Dataset and code are on GitHub for benchmarking.

Problem Statement

LLMs can subtly transfer human social biases into outputs by linking a person’s group (age, looks, school, nationality) to unrelated positive or negative traits. These general polarity associations are understudied and can cause representational harms in real decisions like hiring.

Main Contribution

A semi-automated templating pipeline and a released dataset of 11,940 fill-in-the-blank items testing generalized positive/negative associations across age, beauty, institutions, and nationality.

A two-way task setup: SAI (stimulus→attribute) and ASA (attribute→stimulus) to measure bidirectional associations.
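As a rough illustration of the two directions, the sketch below builds one SAI item (stimulus fixed, model picks an attribute) and one ASA item (attribute fixed, model picks a stimulus) from a template. The word lists, the template sentences, and the `Item`, `make_sai`, and `make_asa` names are hypothetical and are not taken from the released dataset.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Item:
    direction: str      # "SAI" or "ASA"
    prompt: str         # sentence containing one BLANK to fill
    options: List[str]  # candidate fillers offered to the model

# Hypothetical vocabulary, for illustration only.
STIMULI = {"positive": "attractive", "negative": "unattractive"}
ATTRIBUTES = {"positive": "friendly", "negative": "hostile", "neutral": "quiet"}

def make_sai(stimulus: str, rng: random.Random) -> Item:
    # SAI: the stimulus appears in the sentence; the blank is an attribute.
    options = list(ATTRIBUTES.values())
    rng.shuffle(options)  # shuffle so option position does not leak polarity
    return Item("SAI", f"He looks {stimulus}; he is probably BLANK.", options)

def make_asa(attribute: str, rng: random.Random) -> Item:
    # ASA: the attribute appears in the sentence; the blank is a stimulus.
    options = list(STIMULI.values())
    rng.shuffle(options)
    return Item("ASA", f"He is {attribute}; he probably looks BLANK.", options)

rng = random.Random(0)
items = [make_sai(s, rng) for s in STIMULI.values()]
items += [make_asa(a, rng) for a in ATTRIBUTES.values()]
```

Shuffling the options is a design choice of this sketch, meant to keep option order from acting as a cue.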

Empirical evaluation on four main LLMs (GPT-4, PaLM-2, Llama-2-13B, Mistral-7B), plus GPT-3.5, showing statistically significant bias patterns and documenting model-dependent differences.

Key Findings

All evaluated models show non-random associations between stimulus polarity and generated attribute polarity.

Numbers: Table 1, GPT-4 Kendall's τ, SAI / ASA = 0.407 / 0.372 (p ≈ 4.7e-235 / 1.18e-145)

Beauty-related bias is the strongest and most consistent across models.

Numbers: Table 2, GPT-4 beauty SAI τ = 0.870 (p ≈ 9.6e-147); ASA τ = 0.772

Institutional bias is large: models predict positive traits for elite universities and predict elite institutions for positive traits.

Numbers: Table 2, GPT-4 institutional SAI τ = 0.573 (p ≈ 2.9e-147); Llama-2 ASA τ = 0.786

Ageism appears but is model-dependent; one model (Mistral) often shows weaker or non-significant age effects.

Numbers: Table 2, GPT-4 age SAI τ = 0.192 (p ≈ 5.2e-09); Mistral age SAI p ≈ 0.48 (not significant)

Nationality bias is asymmetric: predicting attributes from nationality (SAI) is weaker for some models, but predicting nationality from traits (ASA) is significant across models.

Numbers: Table 2, ASA nationality τ values are significant for all models (e.g., Mistral τ = 0.232, p ≈ 1.3e-09)

Dataset size and composition: 11,940 instances covering age (2,154), beauty (3,684), institutions (3,600), nationality (2,502), with SAI and ASA splits.

Numbers: Section 5.1, total 11,940; per-type counts listed

Results

Kendall's τ (overall)

Value: GPT-4 SAI = 0.407, ASA = 0.372

Beauty bias effect size

Value: GPT-4 SAI τ = 0.870 (p ≈ 9.6e-147)

Dataset size

Value: 11,940 instances
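The τ values above are rank correlations between the polarity of the stimulus and the polarity of the attribute the model chose. The sketch below computes Kendall's τ-b in plain Python, assuming polarities are coded as -1 (negative), 0 (neutral), and +1 (positive). Both the coding and the choice of the tie-corrected τ-b variant are assumptions of this example, not details confirmed by the summary.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numeric codes."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both variables: ignored by tau-b
        elif dx == 0:
            ties_x += 1           # tied on x only
        elif dy == 0:
            ties_y += 1           # tied on y only
        elif dx * dy > 0:
            concordant += 1       # pair ordered the same way on x and y
        else:
            discordant += 1       # pair ordered in opposite ways
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Toy data: stimulus polarity vs. polarity of the attribute the model picked.
stimulus = [1, 1, -1, -1]
attribute = [1, 0, -1, -1]
tau = kendall_tau_b(stimulus, attribute)
```

This pairwise loop is quadratic in the number of items. For a full run over thousands of items, `scipy.stats.kendalltau` computes the same statistic faster and also returns a p-value.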

Who Should Care

What To Try In 7 Days

Run the released dataset against your production LLM to get baseline bias scores.

Add a simple guardrail: block or flag trait inferences that depend on sensitive tokens (age, nationality, institution, appearance).

Compare two model variants (original vs. instruction-tuned or filtered) using SAI/ASA tests to measure change.
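The guardrail step can start as a simple keyword flagger that marks prompts containing age, appearance, institution, or nationality cues for review. The patterns and the `flag_sensitive_cues` name below are illustrative assumptions; a real deployment would need a much broader vocabulary and would still miss paraphrases.

```python
import re
from typing import List

# Hypothetical cue patterns, one per domain studied in the paper.
SENSITIVE_PATTERNS = {
    "age": r"\b(\d{1,3}[- ]year[- ]old|elderly|teenager|young|old)\b",
    "appearance": r"\b(attractive|unattractive|beautiful|ugly)\b",
    "institution": r"\b(university|college|institute)\b",
    "nationality": r"\b(nationality|citizen of|born in)\b",
}
COMPILED = {name: re.compile(pattern, re.IGNORECASE)
            for name, pattern in SENSITIVE_PATTERNS.items()}

def flag_sensitive_cues(prompt: str) -> List[str]:
    """Return the domains whose cue patterns appear in the prompt."""
    return [name for name, pattern in COMPILED.items() if pattern.search(prompt)]

flags = flag_sensitive_cues("Rate this 62-year-old candidate from a small college.")
# flags == ["age", "institution"]
```

A flag only routes the request for extra checking (for example, rerunning it with the cue masked and comparing outputs); it does not remove the bias by itself.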

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Only four main LLM checkpoints plus GPT-3.5 tested; other models may differ.
  • English-only templates; cross-lingual behavior is unknown.
  • Stimuli grouping uses noisy proxies (GDP, college rankings, wage bands) that can misrepresent nuance.
  • Template design and limited vocabulary choices may affect generality of results.

When Not To Use

  • As a parity test for culture-specific biases in non-English settings.
  • To claim that a given model is safe for all sensitive downstream tasks without additional auditing.
  • As a mitigation; the benchmark detects bias but does not fix it.

Failure Modes

  • Models produce non-option or out-of-context completions (numeric selection, free-text completions).
  • Templates may induce grammatical cues that influence model choice rather than semantics.
  • Proxy groupings (e.g., GDP → nationality positivity) can mask other confounds.
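The first failure mode matters when scoring: a response must be mapped back to one of the offered options before any polarity can be assigned. The sketch below handles numeric selections and free-text completions, and returns `None` when the answer is ambiguous or off-list. The `parse_choice` name and the 1-based numbering convention are assumptions of this example.

```python
import re
from typing import List, Optional

def parse_choice(response: str, options: List[str]) -> Optional[str]:
    """Map a raw model response to one of the options, or None if unclear."""
    text = response.strip().lower()
    # Case 1: numeric selection such as "2", "2.", "2)" or "option 2".
    match = re.fullmatch(r"(?:option\s*)?(\d+)[.)]?", text)
    if match:
        index = int(match.group(1)) - 1  # assumes options were numbered from 1
        return options[index] if 0 <= index < len(options) else None
    # Case 2: free text; accept only if exactly one option appears as a word.
    hits = [o for o in options
            if re.search(rf"\b{re.escape(o.lower())}\b", text)]
    return hits[0] if len(hits) == 1 else None

options = ["friendly", "hostile", "quiet"]
```

Responses that parse to `None` are best counted and reported separately rather than silently dropped, since a high rate of them can distort the correlation scores.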

Core Entities

Models

  • GPT-4
  • PaLM-2
  • Llama-2-13B
  • Mistral-7B
  • GPT-3.5

Metrics

  • Kendall's τ
  • conditional likelihoods (PPL, PNL, NPL, NNL, PNuL, NNuL)
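The six abbreviations are not expanded in this summary. One plausible reading, assumed here, is that the first letter is the polarity of the given side (P or N) and the second is the polarity of the model's choice (P, N, or Nu for neutral), so that PNL would be the share of positive-given items answered with a negative choice. Under that assumption, the likelihoods reduce to conditional frequencies:

```python
from collections import Counter

def conditional_likelihoods(pairs):
    """pairs: iterable of (given_polarity, chosen_polarity),
    with given in {'P', 'N'} and chosen in {'P', 'N', 'Nu'}."""
    pairs = list(pairs)
    totals = Counter(given for given, _ in pairs)  # items per given polarity
    joint = Counter(pairs)                         # items per (given, chosen)
    return {
        f"{given}{chosen}L": joint[(given, chosen)] / totals[given]
        for given in ("P", "N")
        for chosen in ("P", "N", "Nu")
        if totals[given]  # skip a given polarity with no items
    }

# Toy observations: four positive-given items, two negative-given items.
observed = [("P", "P"), ("P", "P"), ("P", "N"), ("P", "Nu"),
            ("N", "N"), ("N", "P")]
scores = conditional_likelihoods(observed)
```

With this reading, an unbiased model would show similar values for PPL and NPL, and for PNL and NNL.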

Datasets

  • SubtlerBiases dataset (11,940 instances)