Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
LLMs systematically link group cues (age, looks, school, nationality) to unrelated good/bad traits; using them without checks can introduce subtle discrimination into hiring, evaluation, or recommendation workflows.
Summary TLDR
The authors build a templated dataset (11,940 fill-in-the-blank items) to test whether modern LLMs make broad positive/negative associations (not just narrow stereotypes) across four understudied domains: age, beauty, academic institutions, and nationality. They test two directions: SAI (stimulus→attribute), where the model fills in a trait given a group cue, and ASA (attribute→stimulus), where it fills in the group cue given a trait. Four large models (GPT-4, PaLM-2, Llama-2-13B, Mistral-7B) plus GPT-3.5 show statistically significant correlations between stimulus polarity and generated attribute polarity. Beauty and institution biases are strongest; ageism varies by model; nationality bias is stronger when inferring nationality from traits. The dataset and code are available on GitHub for benchmarking.
Problem Statement
LLMs can subtly transfer human social biases into outputs by linking a person’s group (age, looks, school, nationality) to unrelated positive or negative traits. These general polarity associations are understudied and can cause representational harms in real decisions like hiring.
Main Contribution
A semi-automated templating pipeline and a released dataset of 11,940 fill-in-the-blank items testing generalized positive/negative associations across age, beauty, institutions, and nationality.
A two-way task setup: SAI (stimulus→attribute) and ASA (attribute→stimulus) to measure bidirectional associations.
Empirical evaluation of four modern LLMs (GPT-4, PaLM-2, Llama-2-13B, Mistral-7B), plus GPT-3.5, showing statistically significant bias patterns and documenting model-dependent differences.
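A minimal sketch of how the two task directions could be templated. The sentence frames, the word lists, and the `Item` class are hypothetical illustrations, not the released templates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    direction: str   # "SAI" or "ASA"
    prompt: str      # sentence containing exactly one BLANK
    options: tuple   # candidate fill-ins offered to the model

# Hypothetical stimuli and attributes, for illustration only.
STIMULI = {"positive": "attractive", "negative": "unattractive"}
ATTRIBUTES = {"positive": "friendly", "negative": "unfriendly"}

def make_sai(stimulus: str) -> Item:
    # SAI: the stimulus is given, the model picks an attribute.
    prompt = f"He looks {stimulus}; he must be BLANK."
    return Item("SAI", prompt, tuple(ATTRIBUTES.values()))

def make_asa(attribute: str) -> Item:
    # ASA: the attribute is given, the model picks a stimulus.
    prompt = f"He is {attribute}; he probably looks BLANK."
    return Item("ASA", prompt, tuple(STIMULI.values()))

items = [make_sai(s) for s in STIMULI.values()]
items += [make_asa(a) for a in ATTRIBUTES.values()]
```

Keeping both directions in one item type makes it easy to score them with the same code path later.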
Key Findings
All evaluated models show non-random associations between stimulus polarity and generated attribute polarity.
Beauty-related bias is the strongest and most consistent across models.
Institutional bias is large: models predict positive traits for elite universities and predict elite institutions for positive traits.
Ageism appears but is model-dependent; one model (Mistral-7B) often shows weaker or non-significant age effects.
Nationality bias is asymmetric: predicting attributes from nationality (SAI) is weaker for some models, but predicting nationality from traits (ASA) is significant across models.
Dataset size and composition: 11,940 instances covering age (2,154), beauty (3,684), institutions (3,600), nationality (2,502), with SAI and ASA splits.
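The association between stimulus polarity and attribute polarity can be illustrated with a small Kendall's τ computation. This is a sketch only: it assumes polarities are encoded as -1, 0, +1 and uses the tau-b variant because such labels produce many ties; the summary does not state which variant or encoding the authors used.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numeric labels."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied in both variables: ignored
        elif dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + only_x_tied) * (pairs + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")

# Made-up polarities: +1 positive, 0 neutral, -1 negative.
stimulus = [1, 1, 1, -1, -1, -1]
attribute = [1, 1, 0, -1, -1, 1]
tau = kendall_tau_b(stimulus, attribute)
```

A value near +1 would mean positive stimuli mostly draw positive attributes; a value near 0 would mean no monotonic association.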
Results
Kendall's τ (overall)
Beauty bias effect size
Dataset size: 11,940 instances
Who Should Care
What To Try In 7 Days
Run the released dataset against your production LLM to get baseline bias scores.
Add a simple guardrail: block or flag trait inferences that depend on sensitive tokens (age, nationality, institution, appearance).
Compare two model variants (original vs. instruction-tuned or filtered) using SAI/ASA tests to measure change.
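The guardrail idea can be sketched as a keyword check that flags prompts carrying sensitive cues before any trait inference is made. The patterns and the `flag_sensitive_cues` name are hypothetical; a production list would need to be far broader and reviewed.

```python
import re
from typing import List

# Hypothetical keyword patterns, for illustration only.
SENSITIVE_PATTERNS = {
    "age": re.compile(r"\b(\d{1,3}[- ]year[- ]old|elderly|young|old)\b", re.I),
    "appearance": re.compile(r"\b(attractive|unattractive|ugly|beautiful|handsome)\b", re.I),
    "institution": re.compile(r"\b(university|college|institute)\b", re.I),
    "nationality": re.compile(r"\b(nationality|citizen of|born in)\b", re.I),
}

def flag_sensitive_cues(prompt: str) -> List[str]:
    """Return the names of the sensitive categories mentioned in the prompt."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(prompt)]

flags = flag_sensitive_cues("Rate this 62-year-old candidate from a top university")
```

Flagged prompts could then be blocked, routed to human review, or logged for a later bias audit.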
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Only four main LLM checkpoints plus GPT-3.5 tested; other models may differ.
- English-only templates; cross-lingual behavior is unknown.
- Stimulus grouping relies on coarse proxies (GDP, college rankings, wage bands) that can miss nuance.
- Template design and limited vocabulary choices may affect generality of results.
When Not To Use
- As a parity test for culture-specific biases in non-English settings.
- To claim that a given model is safe for all sensitive downstream tasks without additional auditing.
- As a mitigation; the benchmark detects bias but does not fix it.
Failure Modes
- Models produce non-option or out-of-context completions (numeric selection, free-text completions).
- Templates may introduce grammatical cues that drive the model's choice instead of semantics.
- Proxy groupings (e.g., GDP → nationality positivity) can mask other confounds.
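One way to keep the first failure mode from contaminating scores is to classify each completion before counting it. A minimal sketch, assuming each item comes with a fixed option list; the function name and category labels are hypothetical.

```python
import string

def classify_completion(completion, options):
    """Return (kind, matched_option); kind is 'option', 'numeric', or 'other'."""
    cleaned = completion.strip().strip(string.punctuation + " ").lower()
    for option in options:
        if cleaned == option.lower():
            return ("option", option)
    if cleaned.isdigit():
        # The model answered with a number instead of one of the words.
        return ("numeric", None)
    # Free text or anything else outside the option list.
    return ("other", None)

options = ["friendly", "unfriendly"]
results = [classify_completion(c, options)
           for c in [" Friendly. ", "2", "He is probably nice"]]
```

Only completions classified as `option` would feed into the bias metrics; the other two kinds are better reported as a separate invalid-response rate.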
Core Entities
Models
- GPT-4
- PaLM-2
- Llama-2-13B
- Mistral-7B
- GPT-3.5
Metrics
- Kendall's τ
- conditional likelihoods (PPL, PNL, NPL, NNL, PNuL, NNuL)
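A sketch of how conditional likelihoods of this kind could be computed from scored responses. It assumes each abbreviation reads as stimulus polarity followed by attribute polarity (for example, PPL as the share of positive-stimulus items that received a positive attribute); that reading is an assumption, since the summary does not spell the abbreviations out.

```python
from collections import Counter

def conditional_likelihoods(pairs):
    """Estimate P(attribute polarity | stimulus polarity) from observed pairs.

    pairs: iterable of (stimulus_polarity, attribute_polarity) labels.
    """
    pairs = list(pairs)
    joint = Counter(pairs)
    per_stimulus = Counter(stimulus for stimulus, _ in pairs)
    return {(s, a): count / per_stimulus[s] for (s, a), count in joint.items()}

# Made-up observations, for illustration only.
observed = (
    [("positive", "positive")] * 3
    + [("positive", "negative")]
    + [("negative", "negative")] * 2
)
likelihoods = conditional_likelihoods(observed)
```

Under the assumed reading, `likelihoods[("positive", "positive")]` would correspond to PPL and `likelihoods[("positive", "negative")]` to PNL.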
Datasets
- SubtlerBiases dataset (11,940 instances)

