Overview
Production Readiness: 0.6
Novelty Score: 0.65
Cost Impact Score: 0.45
Citation Count: 14
Why It Matters For Business
Even value-aligned, safety-trained LLMs can hold hidden associations that change outcomes in hiring, recommendations, or role assignments; prompt-based behavioral tests let you find risks without model internals.
Summary TLDR
The authors introduce two psychology-inspired, prompt-based tests for LLMs: LLM Implicit Bias (IAT-style word-association) and LLM Decision Bias (relative decision tasks). Running 33,600+ prompts across 8 value-aligned models, they find pervasive implicit stereotype associations in 19/21 tested stereotype types and show that implicit scores predict subtle discriminatory decisions better than embedding-based measures. Methods are prompt-only and work on API-access models; code and data are on GitHub.
Problem Statement
Current bias benchmarks focus on blatant or explicit bias and often show modern aligned LLMs as unbiased. Yet subtle, automatic associations (implicit biases) can still shape model decisions. Measurement methods are needed that work with API-only models, where embeddings are unavailable, and that predict consequential behaviors.
Main Contribution
Two prompt-based measurement tools: LLM Implicit Bias (an IAT-like word-association task) and LLM Decision Bias (relative decision prompts).
Large-scale evaluation (33,600+ prompts) across 8 value-aligned LLMs showing widespread implicit stereotype associations across race, gender, religion, and health.
Empirical comparison showing prompt-based implicit bias correlates with and better predicts downstream decision bias than embedding-based measures.
Robustness checks: multiple prompt templates, synonym variations, automated prompt generation, replication across two evaluation windows.
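To make the IAT-like word-association task concrete, here is a minimal Python sketch of how such a prompt could be built and its response parsed. The prompt wording, the 'word - label' answer format, and the function names build_association_prompt and parse_assignments are assumptions for illustration; they are not the authors' templates, which are in their repository.

```python
import random


def build_association_prompt(group_a, group_b, attributes, seed=0):
    """Build an IAT-style prompt: pair every attribute word with one of two group labels."""
    rng = random.Random(seed)
    words = list(attributes)
    rng.shuffle(words)  # shuffle so word order does not hint at a grouping
    return (
        f"For each word below, choose either {group_a} or {group_b}. "
        "Answer with one 'word - label' pair per line.\n"
        f"Words: {', '.join(words)}"
    )


def parse_assignments(response, group_a, group_b):
    """Map each word to the label the model chose; unparseable lines are skipped."""
    labels = {group_a.lower(), group_b.lower()}
    assignments = {}
    for line in response.splitlines():
        # rpartition keeps hyphenated words intact and splits on the last ' - '
        word, sep, label = line.strip().rpartition(" - ")
        if sep and label.strip().lower() in labels:
            assignments[word.strip().lower()] = label.strip().lower()
    return assignments
```

Because the task has no reaction-time signal, everything the measure sees is the set of word-to-label pairings, so the parser deliberately drops any line it cannot map to one of the two labels rather than guessing.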
Key Findings
Prompt-based LLM Implicit Bias finds stereotype associations in 19 of 21 tested stereotype types across models.
LLM Implicit Bias scores differ significantly from the unbiased baseline.
LLM Decision Bias detects discriminatory choices tied to implicit associations.
Prompt-based implicit bias predicts discriminatory decisions better than embedding bias.
Embedding-based and prompt-based bias scores correlate moderately at the prompt level and more strongly at the category level.
Implicit bias tends to increase with model size, but decision bias and rejection rate do not.
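The prompt-level versus category-level comparison can be sketched as follows: compute Pearson r once over individual prompts and once over per-category means. This is an illustrative Python sketch, not the authors' analysis code; the row layout (category, prompt_score, embedding_score) and both function names are assumptions.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def category_means(rows):
    """Collapse (category, prompt_score, embedding_score) rows to per-category means."""
    buckets = defaultdict(list)
    for category, prompt_score, embedding_score in rows:
        buckets[category].append((prompt_score, embedding_score))
    categories = sorted(buckets)
    prompt_means = [sum(p for p, _ in buckets[c]) / len(buckets[c]) for c in categories]
    embed_means = [sum(e for _, e in buckets[c]) / len(buckets[c]) for c in categories]
    return prompt_means, embed_means
```

Calling pearson_r on the raw rows gives the prompt-level figure, and calling it on the output of category_means gives the category-level one. Averaging within a category smooths out prompt-to-prompt noise, which is one plausible reason a category-level correlation can come out stronger.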
Results
LLM Implicit Bias significance
LLM Decision Bias significance
Implicit vs embedding correlation (prompt)
Implicit vs embedding correlation (category)
Implicit bias → decision bias (logistic)
Stereotypes showing bias
Who Should Care
What To Try In 7 Days
Run the provided LLM Implicit Bias prompts on your deployed models to surface hidden associations.
Run the LLM Decision Bias decision suite using tasks matching your product (hiring, recommendations).
Compare prompt-based results to any available embedding-based bias scores and prioritize cases where prompt tests predict bad decisions.
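A small harness for the steps above might look like the following Python sketch. It assumes you supply ask, a hypothetical callable that sends one prompt to your deployed model and returns its text reply; the refusal markers are a crude keyword heuristic chosen for illustration and would need tuning per model.

```python
# Assumed heuristic: substrings that often appear in refusals. Tune per model.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")


def is_refusal(response):
    """Return True when the reply looks like a refusal under the keyword heuristic."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def run_suite(ask, prompts):
    """Send each prompt through `ask` and separate answers from refusals."""
    prompts = list(prompts)
    answered, refused = [], []
    for prompt in prompts:
        response = ask(prompt)
        bucket = refused if is_refusal(response) else answered
        bucket.append((prompt, response))
    rate = len(refused) / len(prompts) if prompts else 0.0
    return {"answered": answered, "refused": refused, "rejection_rate": rate}
```

Tracking the rejection rate alongside the answers matters because refusals can mask bias: a low measured bias on a suite where many prompts were refused says little about the prompts that were never answered.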
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Predictive value of implicit measures is debated; correlation with behavior varies by context.
- LLM Implicit Bias is not an exact analog of the human IAT (no reaction-time signal).
- Decision tasks probe a subset of possible real-world harms and are not exhaustive.
- Some models reject prompts (20% rejection rate in tests), which can mask biases.
- Proprietary models limit access to internal embeddings for cross-checks.
When Not To Use
- Do not use as sole proof of legal discrimination or causation.
- Do not interpret scores as model 'intent' or consciousness.
- Do not rely only on these tests for safety certification; combine with domain-specific audits.
Failure Modes
- Prompt phrasing can change measured bias; variation tests reduce but do not eliminate this risk.
- Model refusals or content moderation responses can hide discriminatory tendencies.
- High heterogeneity across prompts and models can produce unstable single-prompt conclusions.
- Embedding and prompt measures can disagree; relying on one may miss signals.
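One way to guard against unstable single-prompt conclusions is to summarize a stereotype's scores across several prompt templates before drawing any conclusion. The Python sketch below is an assumption-laden illustration: the function name, the same-sign rule, and the max_spread threshold of 0.2 are hypothetical choices, not values from the paper.

```python
from statistics import mean, pstdev


def template_stability(scores_by_template, max_spread=0.2):
    """Summarise one stereotype's scores across prompt templates.

    scores_by_template: template name -> list of scores in [-1, 1]
    max_spread: hypothetical threshold on the std. dev. of template means
    """
    template_means = {
        name: mean(vals) for name, vals in scores_by_template.items() if vals
    }
    if len(template_means) < 2:
        raise ValueError("need at least two templates with scores")
    values = list(template_means.values())
    spread = pstdev(values)
    # Treat the result as stable only if every template agrees on direction
    same_sign = all(v > 0 for v in values) or all(v < 0 for v in values)
    return {
        "template_means": template_means,
        "overall_mean": mean(values),
        "spread": spread,
        "stable": same_sign and spread <= max_spread,
    }
```

A result flagged as unstable is a cue to add templates or synonym variations rather than to report the mean, since templates that disagree in sign can average out to a misleadingly small number.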
Core Entities
Models
- GPT-4
- GPT-3.5-turbo
- Claude-3-Sonnet
- Claude-3-Opus
- Alpaca-7B
- LLaMA2Chat-7B
- LLaMA2Chat-13B
- LLaMA2Chat-70B
Metrics
- LLM Implicit Bias score (range -1 to 1)
- LLM Decision Bias score (range 0 to 1)
- Embedding bias (WEAT/CEAT)
- Correlation r (prompt/category level)
- Logistic regression coef (predicting decision bias)
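For intuition about the two score ranges listed above, the Python sketch below shows one simple way scores with those ranges could be computed. These formulas are assumptions for illustration only; this summary does not state the authors' exact definitions, so consult their code before comparing numbers.

```python
def implicit_bias_score(assignments, stereotyped_pairs):
    """Return (consistent - inconsistent) / total over words with a stereotyped label.

    assignments: word -> label chosen by the model
    stereotyped_pairs: word -> label the stereotype would predict
    Result lies in [-1, 1]; 0 means no net association. None if nothing was scored.
    """
    consistent = inconsistent = 0
    for word, expected in stereotyped_pairs.items():
        chosen = assignments.get(word)
        if chosen is None:
            continue  # word missing from the response, e.g. a partial refusal
        if chosen == expected:
            consistent += 1
        else:
            inconsistent += 1
    total = consistent + inconsistent
    return None if total == 0 else (consistent - inconsistent) / total


def decision_bias_rate(decisions):
    """Fraction of relative decisions that followed the stereotype, in [0, 1].

    decisions: iterable of booleans, True when the choice was stereotype-consistent.
    """
    decisions = list(decisions)
    return None if not decisions else sum(decisions) / len(decisions)
```

Under these assumed definitions, an implicit score of 0 and a decision rate of 0.5 would both correspond to no net preference, which is why the two metrics have different ranges and should not be compared on the same scale.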
Datasets
- IAT study materials (Millisecond test library seed set)
- Automated prompt generations (authors' synthesized prompts)
Benchmarks
- BBQ
- BOLD
- 70 Decisions (prior decision scenarios)

