LLMs link age, beauty, school, and nationality to unrelated good/bad traits

Overview

Decision Snapshot: Needs Validation

The dataset and tests are ready for benchmarking but do not provide a mitigation solution; the findings are statistically robust across models and splits.

Citations: 2

Evidence Strength: 0.90

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Mahammed Kamruzzaman, Md. Minul Islam Shovon, Gene Louis Kim

Why It Matters For Business

LLMs systematically link group cues (age, looks, school, nationality) to unrelated good/bad traits; using them without checks can introduce subtle discrimination into hiring, evaluation, or recommendation workflows.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The authors build a templated dataset (11,940 fill-in-the-blank items) to test whether modern LLMs make broad positive/negative associations (not just narrow stereotypes) across four understudied domains: age, beauty, academic institutions, and nationality. They run two directions: SAI (stimulus→attribute) and ASA (attribute→stimulus). Four large models (GPT-4, PaLM-2, Llama-2-13B, Mistral-7B) plus GPT-3.5 show statistically significant correlations between stimulus polarity and generated attribute polarity. Beauty and institution biases are strongest; ageism varies by model; nationality bias is stronger when inferring nationality from traits. Dataset and code are on GitHub for benchmarking.

Problem Statement

LLMs can subtly transfer human social biases into outputs by linking a person’s group (age, looks, school, nationality) to unrelated positive or negative traits. These general polarity associations are understudied and can cause representational harms in real decisions like hiring.

Main Contribution

A semi-automated templating pipeline and a released dataset of 11,940 fill-in-the-blank items testing generalized positive/negative associations across age, beauty, institutions, and nationality.

A two-way task setup: SAI (stimulus→attribute) and ASA (attribute→stimulus) to measure bidirectional associations.
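To make the two directions concrete, here is a minimal sketch of how one fill-in-the-blank item per direction could be represented. The template wording, the `Item` class, and the option words are hypothetical illustrations, not items or code from the released dataset.

```python
from dataclasses import dataclass

BLANK = "BLANK"  # placeholder the model is asked to fill in


@dataclass(frozen=True)
class Item:
    direction: str            # "SAI" or "ASA"
    prompt: str               # sentence containing exactly one BLANK
    options: tuple[str, ...]  # candidate fillers offered to the model


def make_sai_item(stimulus: str, attribute_options: tuple[str, ...]) -> Item:
    # SAI: the stimulus (group cue) is fixed; the model picks an attribute.
    prompt = f"A person who looks {stimulus} is probably {BLANK}."
    return Item("SAI", prompt, attribute_options)


def make_asa_item(attribute: str, stimulus_options: tuple[str, ...]) -> Item:
    # ASA: the attribute is fixed; the model picks the stimulus.
    prompt = f"A person who is {attribute} probably looks {BLANK}."
    return Item("ASA", prompt, stimulus_options)


# Hypothetical beauty-domain examples (wording is illustrative only).
sai = make_sai_item("attractive", ("friendly", "unfriendly"))
asa = make_asa_item("friendly", ("attractive", "unattractive"))
```

Keeping the direction on each item lets the two task types be scored separately, which matches how the SAI and ASA results are reported.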

Key Findings

All evaluated models show non-random associations between stimulus polarity and generated attribute polarity.

Numbers: Table 1: Kendall's τ for GPT-4, SAI / ASA = 0.407 / 0.372 (p ≈ 4.7e-235 / 1.18e-145)

Practical Use: Expect LLM outputs to reflect broad positive/negative associations; audit generated traits when models influence decisions.

Evidence Ref: Table 1

Beauty-related bias is the strongest and most consistent across models.

Numbers: Table 2: GPT-4 beauty τ (SAI) = 0.870 (p ≈ 9.6e-147); τ (ASA) = 0.772

Practical Use: Avoid using raw LLM persona inferences about attractiveness in screening or profiling; apply filters or human review.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Kendall's τ (overall)	GPT-4 SAI=0.407, ASA=0.372	—	—	All bias types combined	Table 1 shows τ and p-values rejecting null in all model/direction settings	Table 1
Beauty bias effect size	GPT-4 SAI τ=0.870 (p≈9.6e-147)	—	—	Beauty domain	Table 2 reports strongest τ values for beauty across models	Table 2
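The Kendall's τ values in the table measure how closely the polarity of the stimulus tracks the polarity of the attribute the model chose. The sketch below computes the tau-b variant in plain Python. The choice of tau-b, the numeric polarity coding (+1 positive, 0 neutral, -1 negative), and the toy data are assumptions for illustration; the digest does not state which variant or coding the authors used.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numeric labels."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1          # pair tied on x (possibly on y too)
        if dy == 0:
            tied_y += 1          # pair tied on y (possibly on x too)
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1  # both move in the same direction
            else:
                discordant += 1  # they move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: stimulus polarity vs. polarity of the attribute the model picked.
stimulus_polarity = [1, 1, -1, -1]
attribute_polarity = [1, 1, -1, -1]
print(kendall_tau_b(stimulus_polarity, attribute_polarity))  # 1.0
```

A value near 0 would indicate that the chosen attribute's polarity is unrelated to the stimulus polarity; values near 1 indicate that positive cues consistently draw positive traits.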

What To Try In 7 Days

Run the released dataset against your production LLM to get baseline bias scores.

Add a simple guardrail: block or flag trait inferences that depend on sensitive tokens (age, nationality, institution, appearance).

Compare two model variants (original vs. instruction-tuned or filtered) using SAI/ASA tests to measure change.
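The guardrail step above could start as a simple pattern-based flagger that marks prompts containing group cues so any trait inference can be routed to review. This is a rough sketch: the pattern lists and the `flag_sensitive_cues` helper are hypothetical placeholders, not part of the released code, and a real deployment would need far broader vocabularies (for example, a list of demonyms for nationality).

```python
import re

# Hypothetical, deliberately small cue lists; extend before real use.
SENSITIVE_PATTERNS = {
    "age": r"\b(?:\d{1,3}[- ]year[- ]old|elderly|teenager|middle-aged|young|old)\b",
    "appearance": r"\b(?:attractive|unattractive|beautiful|ugly|handsome)\b",
    "institution": r"\b(?:university|college|institute|school)\b",
    "nationality": r"\b(?:nationality|citizen|immigrant|foreigner)\b",
}

COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in SENSITIVE_PATTERNS.items()
}


def flag_sensitive_cues(text: str) -> list[str]:
    """Return the cue categories whose patterns appear in the text."""
    return [name for name, rx in COMPILED.items() if rx.search(text)]


flags = flag_sensitive_cues(
    "The 62-year-old applicant studied at a community college."
)
print(flags)  # ['age', 'institution']
```

Keyword matching like this over-flags (for instance, "old" in unrelated contexts), so it is better suited to routing outputs to human review than to blocking them outright.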

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/kamruzzaman15/Identifying-Subtler-Biases-in-LLMs

Data URLs

https://github.com/kamruzzaman15/Identifying-Subtler-Biases-in-LLMs

Risks & Boundaries

Limitations

Only four main LLM checkpoints plus GPT-3.5 tested; other models may differ.

English-only templates; cross-lingual behavior is unknown.

When Not To Use

As a parity test for culture-specific biases in non-English settings.

To claim that a given model is safe for all sensitive downstream tasks without additional auditing.

Failure Modes

Models produce non-option or out-of-context completions (numeric selection, free-text completions).

Templates may induce grammatical cues that influence model choice rather than semantics.
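When re-running the benchmark, the first failure mode can be handled by checking each completion against the offered options before scoring it. The helper below is a minimal sketch under the assumption that each item offers a small set of single-word or short-phrase options; `match_option` is a hypothetical name, not a function from the released code.

```python
import re


def match_option(completion: str, options: list[str]):
    """Return the one option named in the completion, or None.

    None covers non-option answers (e.g. a bare number, free text)
    and ambiguous answers that mention more than one option.
    """
    text = completion.lower()
    hits = [
        opt for opt in options
        # Whole-word match so "friendly" does not match inside "unfriendly".
        if re.search(rf"\b{re.escape(opt.lower())}\b", text)
    ]
    return hits[0] if len(hits) == 1 else None


options = ["friendly", "unfriendly"]
print(match_option("He is probably unfriendly.", options))  # unfriendly
print(match_option("2", options))                           # None
```

Counting how many completions come back as `None` per model gives a quick sense of how much of the data is being excluded from the correlation scores.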

Core Entities

Models

GPT-4, PaLM-2, Llama-2-13B, Mistral-7B, GPT-3.5

Metrics

Kendall's τ, conditional likelihoods (PPL, PNL, NPL, NNL, PNuL, NNuL)

Datasets

SubtlerBiases dataset (11,940 instances)
