Bias Evaluation Papers — Parsed & Scored for Practitioners

LLMs write biased recommendation letters: women as warm, men as leaders

0.30

0.45

0.40

17

Automatically generated recommendation letters can embed gendered tone and hallucinated details that harm applicants, lead to unfair hiring decisions, and expose organizations to reputational or legal risk.

Key finding

Model-generated letters for men score far higher on agency than for women.

Numbers: ChatGPT agency t=10.47, p=1.02e-25 (Table 4).

Two prompt-based tests uncover widespread implicit stereotypes in value-aligned LLMs that pass standard bias benchmarks

0.60

0.65

0.45

14

Even value-aligned, safety-trained LLMs can hold hidden associations that change outcomes in hiring, recommendations, or role assignments; prompt-based behavioral tests let you find risks without model internals.

Key finding

Prompt-based LLM Implicit Bias finds stereotype associations in 19 of 21 tested stereotype types across models.

Numbers: 19/21 stereotype types

A practical review of where LLM bias comes from, how to test it, and common fixes

0.50

0.30

0.60

13

Biased LLM outputs can cause legal risk, reputational harm, and unfair customer outcomes; fixing bias early reduces downstream remediation cost and regulatory exposure.

Key finding

Toxicity can emerge quickly from benign prompts in generative LLMs.

Numbers: toxicity > 0.5 within <100 generations

Open benchmark and a tuned LLM (CALM) show GPT-4-level credit scoring but expose measurable bias

0.60

0.50

0.40

12

LLMs can cut prototype time: GPT-4 comes close to expert pipelines on some credit tasks, and a tuned open model (CALM) can match closed models, but fairness checks are mandatory before any customer-facing use.

Key finding

GPT-4 can reach near-expert accuracy on some credit tasks.

Numbers: Lending Club Acc 0.762 vs SOTA 0.777; Travel Insurance F1 0.897 vs SOTA 0.912

AnimaLLM: a prototype that scores LLM outputs for truthfulness and how well they consider animals' interests

0.20

0.50

0.20

7

LLMs can embed species and welfare biases that affect products (education, vet advice, policy tools); measuring these biases early helps avoid reputational, legal, or welfare harms.

Key finding

AnimaLLM produced comprehensive score sets for two commercial LLMs.

Numbers: 3,264 S1 and 3,264 S2 scores per model

FairPy: Open toolkit to measure and reduce token-level bias in common language models

0.60

0.40

0.50

6

FairPy makes bias audits repeatable and faster across multiple metrics and models, but mitigation effects are metric-dependent so teams must validate fixes with several tests before deployment.

Key finding

FairPy collects common bias metrics and mitigation methods into one toolkit.

Fast, low-cost debiasing by estimating harmful training samples and 'unlearning' them

0.60

0.70

0.80

6

FMD lets teams reduce model bias quickly and cheaply by changing only a small external counterfactual set and a few classifier parameters, avoiding costly full retraining or large-scale relabeling.

Key finding

On Colored MNIST (bias ratio 0.99) FMD attains nearly the same accuracy as strong baselines while lowering measured counterfactual bias.

Numbers: Acc 80.04% vs 80.41%; Bias 0.2042 vs 0.2302; Time 48s vs 1658s; Samples 5k vs 50k
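
The four pairs of figures above are easier to compare as ratios. A minimal sketch in Python that uses only the quoted numbers; the variable names are illustrative, and treating the first value of each pair as FMD and the second as the baseline is an assumption based on the order in the key finding.

```python
# Figures quoted above. Assumption: the first value in each pair is FMD,
# the second is the baseline it is compared against.
fmd = {"acc": 80.04, "bias": 0.2042, "time_s": 48, "samples": 5_000}
baseline = {"acc": 80.41, "bias": 0.2302, "time_s": 1658, "samples": 50_000}

# Accuracy given up by FMD, in percentage points.
acc_gap_pts = baseline["acc"] - fmd["acc"]
# Relative reduction in measured counterfactual bias.
bias_cut_pct = 100 * (baseline["bias"] - fmd["bias"]) / baseline["bias"]
# How many times faster and how many times fewer samples.
speedup = baseline["time_s"] / fmd["time_s"]
sample_ratio = baseline["samples"] / fmd["samples"]

print(f"accuracy gap: {acc_gap_pts:.2f} points")
print(f"bias reduction: {bias_cut_pct:.1f}%")
print(f"speedup: {speedup:.1f}x, samples: {sample_ratio:.0f}x fewer")
```

Read this way, the quoted figures amount to roughly a 0.4-point accuracy cost for an 11% lower bias score, at about 35 times less time and 10 times fewer samples.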

WinoQueer: a community-sourced benchmark showing many LLMs encode anti-LGBTQ+ bias

0.50

0.70

0.60

5

LLMs used in products can reproduce harmful queer stereotypes; auditing with WinoQueer identifies risks before release and community-derived finetuning reduces those risks.

Key finding

Off-the-shelf LLMs show substantial anti-LGBTQ+ bias.

Numbers: Average WQ bias score = 66.50 (50 is ideal)
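
A score whose ideal value is 50 can be read as a percentage of paired comparisons. The sketch below assumes that reading: the score is the share of sentence pairs in which the model rates the stereotyping sentence above its counterfactual. The function name, the half-credit handling of ties, and the toy log-likelihoods are all hypothetical and are not taken from the paper.

```python
def paired_bias_score(pairs):
    """Percentage of pairs where the stereotyping sentence scores higher.

    `pairs` is a list of (stereo_score, counter_score) tuples, for example
    model log-likelihoods. 50 means no preference either way; 100 means the
    stereotyping sentence always wins. Ties count as half a win (a choice
    made for this sketch).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(
        1.0 if stereo > counter else 0.5 if stereo == counter else 0.0
        for stereo, counter in pairs
    )
    return 100 * wins / len(pairs)


# Made-up log-likelihoods for four sentence pairs.
toy_pairs = [(-12.1, -13.4), (-9.8, -9.2), (-15.0, -16.3), (-11.0, -11.9)]
score = paired_bias_score(toy_pairs)
print(score)  # 75.0: the stereotyping sentence wins 3 of 4 pairs
```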

LLMs often answer with English-culture content when asked in other languages

0.50

0.60

5

Unlocalized LLM outputs frustrate non-English users, harm trust and product adoption, and can cause reputational or regulatory risk if cultural mismatches appear in customer-facing content.

Key finding

ChatGPT’s concrete outputs are English-dominated for non-English queries.

Numbers: In-Culture Score: English 7.3 vs non-English avg 1.4 (ChatGPT, holidays & related objects)

LocalValueBench: a lightweight benchmark to test LLM alignment with Australian values

0.40

0.45

0.30

4

Models deployed in a region must match local legal and cultural expectations; using a local benchmark uncovers misalignment, refusal behaviors, and reviewer subjectivity before real users encounter them.

Key finding

Claude 3 Sonnet scored highest on average for Australian value alignment.

Numbers: mean=3.725 (scale 1–5)

Alignment reshapes who LLMs serve: widens English dialect gaps, helps some languages, and skews country opinions.

0.60

0.30

3

Alignment choices change who a model helps: biased SFT/PT can reduce utility for non‑US dialects, misrepresent global opinions, and harm product adoption in key markets.

Key finding

Alignment raises English dialect performance unevenly, favoring US English.

Numbers: Dialect disparity grew from ~1% before alignment to up to 17.1% after alignment

LABE: a benchmark, dataset, and rewrite method to find and reduce agency (leader vs helper) bias in LLM outputs

0.55

0.65

0.45

3

LLM-generated bios, reviews, and letters can systematically understate leadership for minority groups; this risks reputational harm, unfair downstream decisions, and regulatory scrutiny. Measuring agency bias and applying targeted rewrites reduces that risk.

Key finding

LLM outputs show larger gender agency bias than comparable human texts.

Numbers: Gender bias avg (ChatGPT) 34.62, human biographies gender diff 10.12

A behavioral-economics framework that measures LLM risk, probability weighting, and loss aversion, and how demographics change those choices

0.60

0.50

0.20

3

If you use LLMs for advice or automation, models differ in how they treat risk and rare events and demographic prompts can shift behavior; test and calibrate model-specific risk parameters before putting them in decision workflows.

Key finding

All three LLMs show average risk-aversion in the context-free setting.

Numbers: σ means: ChatGPT 0.6031, Claude 0.3085, Gemini 0.4959 (Table 5)

Large LLMs show predictable moral shifts under different ethical prompts; fairness, altruism, and virtue prompts hit a practical 'sweet spot'

0.35

0.65

0.40

2

LLMs change their moral choices and explanations depending on ethical prompts; pick prompt frames (fairness/altruism/virtue) and add consistency checks before using LLMs in policy, legal, or clinical workflows.

Key finding

Reasoning prompts increase decisiveness but do not ensure human alignment.

Numbers: Reasoning variants raise Yes rates (e.g., +7 pp for Qwen/Gemini) but best public match ~59%

ROBBIE: a multi-dataset, multi-metric bias benchmark plus new adversarial prompts and mitigation tests

0.60

0.30

2

ROBBIE helps teams quantify which user groups a generative model may mistreat and compare mitigations quickly; use it to reduce legal, PR, and product harm before deployment.

Key finding

Self-debiasing substantially reduces toxicity in a smaller base model (GPT2-XL).

Numbers: 46% mean reduction on evaluated prompting datasets

Standard short-form bias tests fail to predict gender–occupation bias in realistic long-form outputs

0.30

0.55

0.20

2

Short-form bias tests can mislead model selection for real products; test models on the actual task and prompts you deploy to avoid unexpected biased outputs.

Key finding

Standard short-form benchmarks poorly predict long-form bias.

Numbers: Mean Spearman correlation = 0.12; range -0.39 to 0.57
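
To run the same kind of check on your own models, rank them by a short-form benchmark score and by a long-form bias score, then correlate the two rankings. A minimal pure-Python sketch of Spearman correlation with average ranks for ties; the function names and the five model scores are hypothetical, not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five models (not from the paper).
short_form = [0.10, 0.25, 0.40, 0.55, 0.70]
long_form = [0.30, 0.10, 0.50, 0.20, 0.40]
rho = spearman(short_form, long_form)
print(f"{rho:.2f}")  # 0.30
```

A value near 0, such as the 0.12 mean reported above, means the short-form ranking says little about which model is more biased on the long-form task.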

Language models show gender bias even on sentences without gendered or stereotyped words

0.60

0.70

0.40

2

Even neutral-sounding text can produce gender-skewed outputs; companies must audit models beyond obvious gender words to avoid biased user-facing content.

Key finding

Models are neutral on only a small fraction of stereotype-free sentences.

Numbers: US (fairness) ranges 9%–41% across tested models and filtered datasets

JBBQ — a Japanese multiple-choice QA benchmark to measure and reduce social bias in Japanese LLMs

0.60

0.30

0.40

2

Japanese LLMs can be more accurate as they scale, but they also amplify harmful stereotypes; use JBBQ to measure bias and apply instruction tuning, bias-aware prompts, or chain-of-thought (CoT) prompting before deploying in user-facing systems.

Key finding

Larger model size raises both accuracy and bias.

Numbers: Acc Avg: 48.6 (13B INST) → 82.7 (SWL3-70B-INST); Diff-bias Amb: +23.1 for SWL3-70B-INST

A decision tree + open-source toolkit that maps your LLM use case and prompt sample to concrete bias and fairness metrics.

0.70

0.50

0.60

2

Fairness risk depends on your real prompts. Running the right metrics on your own prompt sample gives more realistic risk estimates than off-the-shelf benchmarks and helps avoid costly deployment errors or reputational harm.

Key finding

Fairness risk depends far more on prompt population than on model choice.

Numbers: Toxic Fraction varied up to 60× and 129× across prompt sets (GPT-4o: 0.181→0.003; Gemini‑2.5‑Flash‑Lite: 0.645→0.005).
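
The 60× and 129× figures follow directly from the quoted Toxic Fraction ranges. A short check in Python, using only the numbers above; the dictionary layout is illustrative.

```python
# Highest and lowest Toxic Fraction per model across prompt sets, as quoted above.
toxic_fraction = {
    "GPT-4o": (0.181, 0.003),
    "Gemini-2.5-Flash-Lite": (0.645, 0.005),
}

# Fold change between the most and least toxic prompt set for each model.
fold_change = {
    model: high / low for model, (high, low) in toxic_fraction.items()
}

for model, ratio in fold_change.items():
    print(f"{model}: {ratio:.0f}x")
```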

LLMs link age, beauty, school, and nationality to unrelated good/bad traits

0.40

0.50

0.60

2

LLMs systematically link group cues (age, looks, school, nationality) to unrelated good/bad traits; using them without checks can introduce subtle discrimination into hiring, evaluation, or recommendation workflows.

Key finding

All evaluated models show non-random associations between stimulus polarity and generated attribute polarity.

Numbers: Table 1: Kendall's τ SAI/ASA GPT-4 = 0.407 / 0.372 (p≈4.7e-235, 1.18e-145)

A framework and automatic metric to detect when LLM 'simulations' turn people into caricatures

0.45

0.65

0.30

2

If you use LLMs to simulate user groups, broad prompts can produce stereotype-like outputs that mislead product decisions; test simulations with this metric and favor specific prompts.

Key finding

Every tested persona was distinguishable from a default persona (individuation > random).

Numbers: mean individuation > 0.5 for every persona (95% CI)

Tune lightweight prompts with counterfactual contrastive loss to reduce gender bias on downstream tasks

0.60

0.50

0.60

2

Co^2PT offers a low-cost way to reduce downstream gender bias: it freezes the main model, tunes small prompts, and avoids costly full-model retraining while lowering fairness gaps on real tasks.

Key finding

On Bias-STS-B, Co^2PT cuts the average absolute similarity-difference (Diff) from PT's 0.321 to 0.058.

Numbers: Diff: PT 0.321 → Co^2PT 0.058 (Table 2)
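
Diff is described above as an average absolute similarity difference. The sketch below assumes each test item yields two similarity scores, one for each gendered variant of the same sentence, and averages the absolute gap. The function name and toy scores are hypothetical; only the 0.321 and 0.058 figures come from the text.

```python
def mean_abs_diff(score_pairs):
    """Average absolute gap between paired similarity scores.

    Assumption: each pair holds the model's similarity score for the two
    gendered variants of one test sentence. 0 means both are treated alike.
    """
    if not score_pairs:
        raise ValueError("need at least one pair")
    return sum(abs(a - b) for a, b in score_pairs) / len(score_pairs)


# Made-up similarity scores for four test items.
toy_scores = [(3.2, 2.9), (4.1, 4.1), (1.5, 2.0), (2.8, 2.6)]
diff = mean_abs_diff(toy_scores)

# Relative reduction implied by the quoted figures (PT 0.321, Co^2PT 0.058).
reduction_pct = 100 * (0.321 - 0.058) / 0.321

print(f"toy Diff: {diff:.2f}")
print(f"quoted reduction: {reduction_pct:.1f}%")
```

On the quoted figures, the drop from 0.321 to 0.058 is a reduction of about 82%.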

Agentic chatbots need an 'interactional' ethics that centres on respect

0.30

0.60

0.35

1

Agentic conversational features can damage user trust, engagement, and wellbeing if systems ignore context and treat people as data points; fixing this protects brand trust and long-term product adoption.

Key finding

Semantic-focused HHH alignment (helpful, honest, harmless) can miss situational disrespect.

LLMs give worse, withholding, and sometimes condescending answers to users with low English, less formal education, or non‑US origin.

0.40

0.30

0.60

1

Personalization or stored user profiles can make models give worse, withholding, or patronizing answers to already vulnerable groups, risking harm, trust loss, and regulatory exposure.

Key finding

Accuracy falls for lower‑education bios across models, with Claude showing large drops.

Numbers: Claude TruthfulQA: control 78.17% → Iran low‑edu 66.22% (−11.95 pts)