Overview
The dataset and tests are ready for benchmarking but are not a mitigation solution; the findings are statistically robust across models and splits.
Citations: 2
Evidence Strength: 0.90
Confidence: 0.82
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
LLMs systematically link group cues (age, looks, school, nationality) to unrelated good/bad traits; using them without checks can introduce subtle discrimination into hiring, evaluation, or recommendation workflows.
Summary TLDR
The authors build a templated dataset (11,940 fill-in-the-blank items) to test whether modern LLMs make broad positive/negative associations (not just narrow stereotypes) across four understudied domains: age, beauty, academic institutions, and nationality. They evaluate two directions: SAI (stimulus→attribute), where the model fills in an attribute given a stimulus, and ASA (attribute→stimulus), where it fills in a stimulus given an attribute. Four main models (GPT-4, PaLM-2, Llama-2-13B, Mistral-7B) plus GPT-3.5 all show statistically significant correlations between stimulus polarity and generated attribute polarity. Beauty and institution biases are strongest; ageism varies by model; nationality bias is stronger when inferring nationality from traits. The dataset and code are on GitHub for benchmarking.
Problem Statement
LLMs can subtly transfer human social biases into outputs by linking a person’s group (age, looks, school, nationality) to unrelated positive or negative traits. These general polarity associations are understudied and can cause representational harms in real decisions like hiring.
Main Contribution
A semi-automated templating pipeline and a released dataset of 11,940 fill-in-the-blank items testing generalized positive/negative associations across age, beauty, institutions, and nationality.
A two-way task setup: SAI (stimulus→attribute) and ASA (attribute→stimulus) to measure bidirectional associations.
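The two directions can be illustrated with a minimal sketch. The template wording, the stimulus and attribute words, the `BLANK` marker, and the `Item` structure are all hypothetical placeholders, not items or code from the released dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    direction: str   # "SAI" or "ASA"
    prompt: str      # sentence containing a single BLANK to fill
    options: tuple   # candidate fillers offered to the model

# Hypothetical word lists; +1 marks positive polarity, -1 negative polarity.
STIMULI = {"attractive": 1, "unattractive": -1}
ATTRIBUTES = {"trustworthy": 1, "dishonest": -1}

def make_sai(stimulus: str) -> Item:
    """SAI: the stimulus is given and the attribute is blanked out."""
    return Item("SAI", f"The {stimulus} person seems BLANK.", tuple(ATTRIBUTES))

def make_asa(attribute: str) -> Item:
    """ASA: the attribute is given and the stimulus is blanked out."""
    return Item("ASA", f"The BLANK person seems {attribute}.", tuple(STIMULI))

def build_items() -> list:
    """Expand every stimulus and attribute into one item per direction."""
    items = [make_sai(s) for s in STIMULI]
    items += [make_asa(a) for a in ATTRIBUTES]
    return items
```

Keeping the polarity labels next to the word lists means the polarity of whatever option the model picks can be looked up directly when scoring.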
Key Findings
All evaluated models show non-random associations between stimulus polarity and generated attribute polarity.
Beauty-related bias is the strongest and most consistent across models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Kendall's τ (overall) | GPT-4 SAI=0.407, ASA=0.372 | — | — | All bias types combined | Table 1 shows τ and p-values rejecting null in all model/direction settings | Table 1 |
| Beauty bias effect size | GPT-4 SAI τ=0.870 (p≈9.6e-147) | — | — | Beauty domain | Table 2 reports strongest τ values for beauty across models | Table 2 |
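To make the τ values above concrete, the following sketch computes Kendall's τ between stimulus polarity and the polarity of the attribute a model selected. It assumes polarities are coded as +1/-1 and uses the tie-corrected τ-b variant; the summary does not state which variant or coding the authors used, so treat both as assumptions.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_only_x = tied_only_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied in both: drops out of both terms
        elif dx == 0:
            tied_only_x += 1
        elif dy == 0:
            tied_only_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # Pairs untied in x include those tied only in y, and vice versa.
    untied_x = concordant + discordant + tied_only_y
    untied_y = concordant + discordant + tied_only_x
    denom = sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else float("nan")

# Illustrative polarities only (not data from the paper).
stimulus_polarity = [1, 1, 1, -1, -1, -1]
attribute_polarity = [1, 1, -1, -1, -1, 1]
tau = kendall_tau_b(stimulus_polarity, attribute_polarity)
```

A τ near 0 would indicate that attribute polarity is unrelated to stimulus polarity, while values near 1 indicate that positive stimuli are consistently paired with positive attributes. A significance test (the p-values in the table) would need to be added separately.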
What To Try In 7 Days
Run the released dataset against your production LLM to get baseline bias scores.
Add a simple guardrail: block or flag trait inferences that depend on sensitive tokens (age, nationality, institution, appearance).
Compare two model variants (original vs. instruction-tuned or filtered) using SAI/ASA tests to measure change.
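The guardrail step could start from something as simple as the keyword check below. The term lists and the `flag_sensitive_cues` helper are hypothetical and deliberately crude; keyword matching will miss paraphrases and flag harmless uses, so it is a starting point for review rather than a complete filter.

```python
import re

# Hypothetical cue lists per sensitive category; extend for your own domain.
SENSITIVE_TERMS = {
    "age": ["old", "young", "elderly", "teenager"],
    "appearance": ["attractive", "unattractive", "ugly", "beautiful"],
    "institution": ["university", "college", "alumnus", "graduate"],
    "nationality": ["nationality", "citizen", "immigrant"],
}

# One whole-word, case-insensitive pattern per category.
_PATTERNS = {
    category: re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    for category, terms in SENSITIVE_TERMS.items()
}

def flag_sensitive_cues(text: str) -> dict:
    """Return {category: sorted matched terms} for each cue found in text."""
    hits = {}
    for category, pattern in _PATTERNS.items():
        found = sorted({match.lower() for match in pattern.findall(text)})
        if found:
            hits[category] = found
    return hits
```

A non-empty result can be used to route the request to human review or to a prompt variant with the sensitive cue removed, before any trait inference is produced.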
Risks & Boundaries
Limitations
Only four main LLM checkpoints plus GPT-3.5 were tested; other models may differ.
English-only templates; cross-lingual behavior is unknown.
When Not To Use
As a parity test for culture-specific biases in non-English settings.
To claim that a given model is safe for all sensitive downstream tasks without additional auditing.
Failure Modes
Models produce non-option or out-of-context completions (numeric selection, free-text completions).
Templates may induce grammatical cues that influence model choice rather than semantics.
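The first failure mode can be handled by classifying each completion before scoring it. The sketch below is one possible approach, not the authors' procedure: the `classify_completion` helper, its category names, and the assumption that numeric answers are 1-based indexes into the option list are all hypothetical.

```python
import re

def classify_completion(completion: str, options: list) -> tuple:
    """Map a raw completion to (option or None, kind).

    kind is one of "exact", "numeric", "free_text", "invalid".
    """
    text = completion.strip().strip(" .\"'").lower()
    by_lower = {opt.lower(): opt for opt in options}

    # Case 1: the model returned exactly one of the offered options.
    if text in by_lower:
        return by_lower[text], "exact"

    # Case 2: numeric selection such as "2", "2)" or "option 2"
    # (assumed to be a 1-based index into options).
    match = re.fullmatch(r"(?:option\s*)?(\d+)[.)]?", text)
    if match:
        index = int(match.group(1)) - 1
        if 0 <= index < len(options):
            return options[index], "numeric"
        return None, "invalid"

    # Case 3: free text; accept only if exactly one option appears as a word.
    found = [
        opt for opt in options
        if re.search(rf"\b{re.escape(opt.lower())}\b", text)
    ]
    if len(found) == 1:
        return found[0], "free_text"
    return None, "invalid"
```

Reporting the share of "numeric", "free_text", and "invalid" completions alongside the bias scores shows how much of the result depends on recovered answers; a stricter analysis could keep only the "exact" cases.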

