WinoQueer: a community-sourced benchmark showing many LLMs encode anti-LGBTQ+ bias

June 26, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

5

Authors

Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May

Links

Abstract / PDF

Why It Matters For Business

LLMs used in products can reproduce harmful stereotypes about queer people. Auditing with WinoQueer can surface these risks before release, and finetuning on community-written data reduces them.

Summary TLDR

WinoQueer is a 45,540-pair benchmark built from a survey of 295 LGBTQ+ respondents to capture real harms and stereotypes. An evaluation of 20 open models (BERT, RoBERTa, ALBERT, BART, GPT-2, OPT, BLOOM) finds an average bias score of 66.50, where 50 is unbiased. Finetuning on community-written Twitter data cuts bias more (avg -17.98 points) than finetuning on news (-10.28 points). The dataset is English-only, and its survey sample skews young and US-based; use it to audit models and to test targeted debiasing.

Problem Statement

Current bias benchmarks rarely target anti-LGBTQ+ harms or use input from affected people. That leaves LLMs untested for real-world homophobic and transphobic stereotypes and makes debiasing less effective for specific queer subgroups.

Main Contribution

WinoQueer: a community-sourced paired-sentence benchmark (45,540 pairs) for anti-LGBTQ+ bias

A repeatable community-in-the-loop method: build templates from a survey of harmed people

Baseline audits of 20 off-the-shelf LLMs showing widespread anti-queer bias

Finetuning experiments showing debiasing via community data (Twitter > news)

Key Findings

Off-the-shelf LLMs show substantial anti-LGBTQ+ bias.

Numbers: Average WQ bias score = 66.50 (50 is ideal)

Finetuning on community-written Twitter text reduced bias more than news.

Numbers: Mean ∆: QueerNews = -10.28 pts, QueerTwitter = -17.98 pts

Bias severity differs a lot across LGBTQ+ subgroups.

Numbers: Asexual avg = 75.85, Queer avg = 60.03

Model architecture matters more than parameter count in this benchmark.

Numbers: Weak correlation with size (R^2 = 0.203); masked models trend lower than autoregressive on WQ

Finetuning can overshoot and flip bias toward non-LGBTQ+ targets.

Numbers: Some finetuned models scored <50 (e.g., BERT-base-unc WQ-News = 45.71, WQ-Twitter = 41.05)
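
The overshoot in the last finding can be read as a signed distance from the unbiased value of 50. The sketch below is only an illustration of that arithmetic: it uses the two BERT-base-unc scores quoted above, and the helper names `distance_from_parity` and `describe` are hypothetical, not part of any WinoQueer tooling.

```python
def distance_from_parity(score: float, parity: float = 50.0) -> float:
    """Signed offset of a WQ bias score from the unbiased value (50)."""
    return round(score - parity, 2)


def describe(score: float) -> str:
    """Label the direction of the skew; wording is illustrative only."""
    offset = distance_from_parity(score)
    if offset > 0:
        return "skews toward LGBTQ+ targets"
    if offset < 0:
        return "skews toward non-LGBTQ+ targets (overshoot)"
    return "at parity"


# Scores for BERT-base-unc after finetuning, as quoted in the finding above.
finetuned = {"WQ-News": 45.71, "WQ-Twitter": 41.05}

offsets = {name: distance_from_parity(s) for name, s in finetuned.items()}
for name, score in finetuned.items():
    print(f"{name}: {score} ({offsets[name]:+.2f}) {describe(score)}")
```

Both finetuned variants land below 50, and the Twitter-finetuned one sits roughly twice as far from parity as the news-finetuned one.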

Results

WQ average bias score (all tested models)

Value: 66.50

Baseline: 50 (unbiased)

Finetuning effect (mean over 16 models)

Value: QueerNews -10.28 pts; QueerTwitter -17.98 pts

Baseline: WQ baseline per model

Subgroup extremes

Value: Asexual avg 75.85; Queer avg 60.03

Baseline: 50 (unbiased)

Model range (example)

Value: Lowest 55.93 (ALBERT-xxl-v2) to highest 79.83 (BART-base)

Baseline: 50

Who Should Care

What To Try In 7 Days

Run WinoQueer on your models to get a baseline bias score

Finetune a small model on community-curated Twitter snippets and re-evaluate

Add subgroup checks (e.g., asexual, nonbinary) to your fairness tests
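
One possible shape for the subgroup check in the last step is sketched below. It assumes every sentence pair has already been scored by your model and tagged with the identity it targets, and that a subgroup's score is the percentage of its pairs where the LGBTQ+-targeted sentence scores higher. The row layout, the function names, the 10-point tolerance, and the sample scores are all hypothetical.

```python
from collections import defaultdict


def subgroup_bias_scores(rows):
    """Per-subgroup bias score on a 0-100 scale.

    rows: iterable of (subgroup, score_lgbtq_sentence, score_counterpart),
    where higher scores mean the model finds the sentence more likely.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for subgroup, lgbtq, counterpart in rows:
        totals[subgroup] += 1
        if lgbtq > counterpart:
            wins[subgroup] += 1
    return {g: 100.0 * wins[g] / totals[g] for g in totals}


def flag_subgroups(scores, tolerance=10.0):
    """Subgroups whose score is further than `tolerance` from 50."""
    return sorted(g for g, s in scores.items() if abs(s - 50.0) > tolerance)


# Made-up log-likelihoods for illustration; not real model output.
rows = [
    ("asexual", -10.0, -11.5), ("asexual", -9.0, -9.7),
    ("asexual", -12.0, -12.4), ("asexual", -8.0, -7.5),
    ("nonbinary", -10.0, -9.0), ("nonbinary", -11.0, -12.0),
    ("nonbinary", -9.5, -9.1), ("nonbinary", -8.2, -8.9),
]

scores = subgroup_bias_scores(rows)
flagged = flag_subgroups(scores)
```

Using an absolute distance from 50 means the check also catches subgroups where finetuning has overshot below parity.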

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey sample is English-speaking, skewed young, and US-heavy; not globally representative
  • Templates, names, and pronouns are limited (three pronouns, US-centric names)
  • WQ focuses on gender and sexual orientation only; it omits intersectional axes
  • Masked vs autoregressive scoring functions differ and may be imperfectly calibrated
  • Finetuning experiments cover only open-source models with ≤1.5B parameters

When Not To Use

  • As the sole safety check for deployed systems or downstream tasks
  • To claim absence of all queer-related stereotypes (low WQ ≠ no harm)
  • For non-English models or global populations without further localization

Failure Modes

  • Finetuning overshoots and makes models apply stereotypes to non-LGBTQ+ people
  • Unbalanced finetuning data yields unequal improvements across subgroups
  • Metric differences between model types obscure true changes in behavior
  • Sampling noise in Twitter/news corpora introduces irrelevant or spam signals

Core Entities

Models

  • BERT
  • RoBERTa
  • ALBERT
  • BART
  • GPT-2
  • OPT
  • BLOOM

Metrics

  • WQ bias score (0-100; 50 ideal)
  • pseudo-log-likelihood (masked models)
  • autoregressive token prediction score (autoregressive models)
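
As a rough illustration of how these metrics fit together, the sketch below assumes each sentence pair has already been scored by the model (pseudo-log-likelihood for masked models, a summed token-prediction score for autoregressive ones) and that the WQ bias score is the percentage of pairs in which the LGBTQ+-targeted sentence receives the higher score. The function name, the half-credit rule for ties, and the sample numbers are assumptions made for this example, not details taken from the benchmark's code.

```python
from typing import Iterable, Tuple


def wq_bias_score(pair_scores: Iterable[Tuple[float, float]]) -> float:
    """Percentage of pairs where the LGBTQ+-targeted sentence scores higher.

    Each item is (score_lgbtq_sentence, score_counterpart_sentence); higher
    means more likely under the model. Ties count as half a win, which is
    an assumption of this sketch.
    """
    pairs = list(pair_scores)
    if not pairs:
        raise ValueError("need at least one scored pair")
    wins = 0.0
    for lgbtq, counterpart in pairs:
        if lgbtq > counterpart:
            wins += 1.0
        elif lgbtq == counterpart:
            wins += 0.5
    return 100.0 * wins / len(pairs)


# Hypothetical log-likelihoods for four pairs (not real model output).
example = [(-12.1, -13.4), (-9.8, -9.2), (-15.0, -16.3), (-11.0, -11.9)]
score = wq_bias_score(example)
print(f"WQ bias score: {score:.2f} (50 is ideal)")
```

Under this definition a model that prefers each sentence in a pair equally often scores 50, which matches the ideal value listed above.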

Datasets

  • WinoQueer
  • QueerNews
  • QueerTwitter

Benchmarks

  • WinoQueer