Two prompt-based tests uncover widespread implicit stereotypes in value-aligned LLMs that pass standard bias benchmarks

February 6, 20248 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

14

Authors

Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, Thomas L. Griffiths

Links

Abstract / PDF

Why It Matters For Business

Even value-aligned, safety-trained LLMs can hold hidden associations that change outcomes in hiring, recommendations, or role assignments; prompt-based behavioral tests let you find risks without model internals.

Summary TLDR

The authors introduce two psychology-inspired, prompt-based tests for LLMs: LLM Implicit Bias (IAT-style word-association) and LLM Decision Bias (relative decision tasks). Running 33,600+ prompts across 8 value-aligned models, they find pervasive implicit stereotype associations in 19/21 tested stereotype types and show that implicit scores predict subtle discriminatory decisions better than embedding-based measures. Methods are prompt-only and work on API-access models; code and data are on GitHub.

Problem Statement

Current bias benchmarks focus on blatant or explicit bias and often show modern aligned LLMs as unbiased. Yet subtle, automatic associations—implicit biases—can still shape model decisions. We need measurement methods that work with API-only (no-embedding) models and that predict consequential behaviors.

Main Contribution

Two prompt-based measurement tools: LLM Implicit Bias (an IAT-like word-association task) and LLM Decision Bias (relative decision prompts).

Large-scale evaluation (33,600+ prompts) across 8 value-aligned LLMs showing widespread implicit stereotype associations across race, gender, religion, and health.

Empirical comparison showing prompt-based implicit bias correlates with and better predicts downstream decision bias than embedding-based measures.

Robustness checks: multiple prompt templates, synonym variations, automated prompt generation, replication across two evaluation windows.

Key Findings

Prompt-based LLM Implicit Bias finds stereotype associations in 19 of 21 tested stereotype types across models.

Numbers19/21 stereotype types

LLM Implicit Bias scores are highly statistically different from unbiased baseline.

Numbersone-sample t(33,599)=76.39, p<.001

LLM Decision Bias detects discriminatory choices tied to implicit associations.

Numbersone-sample t(26,528)=36.25, p<.001

Prompt-based implicit bias predicts discriminatory decisions better than embedding bias.

Numberslogit coef b≈0.986 (95% CI [0.753,1.219]); odds ≈2.68 per unit

Embedding vs prompt bias correlation: moderate at prompt level, stronger at category level.

Numbersr=0.36 prompt-level; r=0.72 category-level

Implicit bias tends to increase with model size, but decision bias and rejection rate do not.

Numbersscaling analysis: implicit ↑ with size; decision bias ↛ size

Results

LLM Implicit Bias significance

Valuet(33599)=76.39, p<.001

Baseline0 (unbiased)

LLM Decision Bias significance

Valuet(26528)=36.25, p<.001

Baseline0.5 (unbiased)

Implicit vs embedding correlation (prompt)

Valuer=0.36

Implicit vs embedding correlation (category)

Valuer=0.72

Implicit bias → decision bias (logistic)

Valuecoef b≈0.986 (95% CI [0.753,1.219]); odds≈2.68

Stereotypes showing bias

Value19/21 stereotype types

Baseline21 types tested

Who Should Care

What To Try In 7 Days

Run the provided LLM Implicit Bias prompts on your deployed models to surface hidden associations.

Run the LLM Decision Bias decision suite using tasks matching your product (hiring, recommendations).

Compare prompt-based results to any available embedding-based bias scores and prioritize cases where prompt tests predict bad decisions.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Predictive value of implicit measures is debated; correlation with behavior varies by context.
  • LLM Implicit Bias is not an exact analog of human IAT (no reaction-time signal).
  • Decision tasks probe a subset of possible real-world harms and are not exhaustive.
  • Some models reject prompts (20% rejection rate in tests), which can mask biases.
  • Proprietary models limit access to internal embeddings for cross-checks.

When Not To Use

  • Do not use as sole proof of legal discrimination or causation.
  • Do not interpret scores as model 'intent' or consciousness.
  • Do not rely only on these tests for safety certification; combine with domain-specific audits.

Failure Modes

  • Prompt phrasing can change measured bias; variation tests reduce but do not eliminate this risk.
  • Model refusals or content moderation responses can hide discriminatory tendencies.
  • High heterogeneity across prompts and models can produce unstable single-prompt conclusions.
  • Embedding and prompt measures can disagree; relying on one may miss signals.

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • Claude-3-Sonnet
  • Claude-3-Opus
  • Alpaca-7B
  • LLaMA2Chat-7B
  • LLaMA2Chat-13B
  • LLaMA2Chat-70B

Metrics

  • LLM Implicit Bias score (range -1 to 1)
  • LLM Decision Bias score (range 0 to 1)
  • Embedding bias (WEAT/CEAT)
  • Correlation r (prompt/category level)
  • Logistic regression coef (predicting decision bias)

Datasets

  • IAT study materials (Millisecond test library seed set)
  • Automated prompt generations (authors' synthesized prompts)

Benchmarks

  • BBQ
  • BOLD
  • 70 Decisions (prior decision scenarios)