WinoQueer: a community-sourced benchmark showing many LLMs encode anti-LGBTQ+ bias

June 26, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

WinoQueer is a practical, community-grounded benchmark with clear baselines and debiasing signals. It is ready for auditing and small-scale finetuning, but limited by English-only data, sample bias, and evaluation calibration.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs used in products can reproduce harmful queer stereotypes; auditing with WinoQueer identifies risks before release and community-derived finetuning reduces those risks.

Who Should Care

Summary TLDR

WinoQueer is a 45,540-pair benchmark built from a survey of 295 LGBTQ+ respondents to capture real harms and stereotypes. An evaluation of 20 open models (BERT, RoBERTa, ALBERT, BART, GPT-2, OPT, BLOOM) finds an average bias score of 66.50, where 50 is unbiased. Fine-tuning on community-written Twitter data lowers the bias score more (mean -17.98 points) than fine-tuning on news data (mean -10.28 points). The dataset is English-only and the survey sample is not representative; use it to audit models and to test targeted debiasing.

Problem Statement

Current bias benchmarks rarely target anti-LGBTQ+ harms or use input from affected people. That leaves LLMs untested for real-world homophobic and transphobic stereotypes and makes debiasing less effective for specific queer subgroups.

Main Contribution

WinoQueer: a community-sourced paired-sentence benchmark (45,540 pairs) for anti-LGBTQ+ bias

A repeatable community-in-the-loop method: build templates from a survey of harmed people
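The paired-sentence construction described above can be sketched as a template expansion. This is a minimal illustration, not the authors' pipeline: the templates, identity terms, counterpart mapping, and the placeholder predicate below are all hypothetical stand-ins for material that the real benchmark derives from survey responses.

```python
from itertools import product

# Hypothetical placeholders. The real templates, identity terms, and predicates
# come from the community survey and are not reproduced here.
TEMPLATES = [
    "{subject} people are {predicate}.",
    "Everyone says {subject} people are {predicate}.",
]
IDENTITIES = ["gay", "nonbinary"]
COUNTERPARTS = {"gay": "straight", "nonbinary": "cisgender"}  # assumed mapping
PREDICATES = ["<survey-derived predicate>"]


def build_pairs(templates, identities, counterparts, predicates):
    """Expand every template/identity/predicate combination into a sentence pair."""
    pairs = []
    for template, identity, predicate in product(templates, identities, predicates):
        pairs.append({
            "identity": identity,
            # Sentence that applies the predicate to the LGBTQ+ identity term.
            "stereo": template.format(subject=identity, predicate=predicate),
            # Counterfactual sentence: same text, identity term swapped.
            "counter": template.format(subject=counterparts[identity], predicate=predicate),
        })
    return pairs


pairs = build_pairs(TEMPLATES, IDENTITIES, COUNTERPARTS, PREDICATES)
print(len(pairs))  # 2 templates x 2 identities x 1 predicate = 4
```

Because the pair count is the product of the list sizes, a modest number of templates, identity terms, and predicates can expand into tens of thousands of pairs.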

Key Findings

Off-the-shelf LLMs show substantial anti-LGBTQ+ bias.

Numbers: Average WQ bias score = 66.50 (50 is ideal)

Practical Use: Run WinoQueer on your models; a score well above 50 flags likely harmful stereotype associations in outputs.

Evidence Ref: Table 5 (mean over all models and overall WQ scores)

Finetuning on community-written Twitter text reduced bias more than finetuning on news.

Numbers: Mean ∆ for QueerNews = -10.28 pts; mean ∆ for QueerTwitter = -17.98 pts

Practical Use: If you need quick debiasing, fine-tune on community-curated social text; expect a larger effect than from generic news.

Evidence Ref: Table 6 (finetuning deltas averaged over 16 models)
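Both findings rest on the WQ bias score. The sketch below assumes the score is the percentage of sentence pairs in which the model scores the stereotyped sentence strictly higher than its counterfactual, which is consistent with 50 being the unbiased value. The function name and the example scores are illustrative, and the per-sentence scores are assumed to be precomputed log-likelihoods.

```python
def wq_bias_score(pair_scores):
    """Return the percentage of pairs where the stereotyped sentence scores higher.

    pair_scores: iterable of (stereo_score, counter_score) tuples, where each
    score is a sentence-level log-likelihood from the model under test.
    """
    pair_scores = list(pair_scores)
    if not pair_scores:
        raise ValueError("need at least one sentence pair")
    stereo_wins = sum(1 for stereo, counter in pair_scores if stereo > counter)
    return 100.0 * stereo_wins / len(pair_scores)


# Made-up log-likelihoods for four pairs; three favor the stereotyped sentence.
example_scores = [(-12.1, -13.4), (-9.8, -9.2), (-20.0, -22.5), (-15.3, -15.9)]
print(wq_bias_score(example_scores))  # 75.0
```

Under this definition a finetuning delta is simply the score after finetuning minus the score before, so a negative delta means the score fell.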

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| WQ average bias score (all tested models) | 66.50 | 50 (unbiased) | 16.50 above ideal | WinoQueer overall | Table 5 aggregated mean | Table 5 |
| Finetuning effect (mean over 16 models) | QueerNews -10.28 pts; QueerTwitter -17.98 pts | WQ baseline per model | QueerTwitter reduces more than News by ~7.7 pts | Finetuned models (16) | Table 6 mean deltas | Table 6 |

What To Try In 7 Days

Run WinoQueer on your models to get a baseline bias score

Finetune a small model on community-curated Twitter snippets and re-evaluate

Add subgroup checks (e.g., asexual, nonbinary) to your fairness tests
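The subgroup check in the last step could look like the following sketch. It assumes each evaluated pair carries a subgroup label plus a model score for the stereotyped and counterfactual sentence; the record fields, function name, and numbers are hypothetical and are not part of the released benchmark format.

```python
from collections import defaultdict


def subgroup_bias_scores(records):
    """Compute a per-subgroup bias score (percentage of pairs favoring the stereotype)."""
    totals = defaultdict(int)
    stereo_wins = defaultdict(int)
    for rec in records:
        totals[rec["subgroup"]] += 1
        if rec["stereo_score"] > rec["counter_score"]:
            stereo_wins[rec["subgroup"]] += 1
    return {group: 100.0 * stereo_wins[group] / totals[group] for group in totals}


# Made-up scores for illustration only.
records = [
    {"subgroup": "asexual", "stereo_score": -10.0, "counter_score": -11.0},
    {"subgroup": "asexual", "stereo_score": -9.0, "counter_score": -9.5},
    {"subgroup": "nonbinary", "stereo_score": -12.0, "counter_score": -11.0},
    {"subgroup": "nonbinary", "stereo_score": -8.0, "counter_score": -8.4},
]
by_group = subgroup_bias_scores(records)
print(by_group)  # {'asexual': 100.0, 'nonbinary': 50.0}
```

Reporting one number per subgroup keeps a low overall average from hiding a subgroup whose score stays well above 50.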

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey sample is English-speaking, skewed young, and US-heavy; not globally representative

Templates, names, and pronouns are limited (three pronouns, US-centric names)

When Not To Use

As the sole safety check for deployed systems or downstream tasks

To claim absence of all queer-related stereotypes (low WQ ≠ no harm)

Failure Modes

Finetuning overshoots and makes models apply stereotypes to non-LGBTQ+ people

Unbalanced finetuning data yields unequal improvements across subgroups
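Both failure modes can be screened for by comparing per-subgroup scores before and after finetuning. The sketch below is one possible check, not a procedure from the paper: the 5-point tolerance below the ideal of 50, the function names, and all scores are assumptions chosen for illustration.

```python
def audit_finetuning(before, after, ideal=50.0, tolerance=5.0):
    """Report the score change per subgroup and flag scores that fall well below ideal.

    before / after: dicts mapping subgroup name to bias score (0-100).
    tolerance: assumed margin; scores below ideal - tolerance count as overshoot.
    """
    report = {}
    for group, base in before.items():
        tuned = after[group]
        report[group] = {
            "delta": round(tuned - base, 2),
            "overshoot": tuned < ideal - tolerance,
        }
    return report


def delta_spread(report):
    """Gap between the largest and smallest per-subgroup delta (0 means equal change)."""
    deltas = [entry["delta"] for entry in report.values()]
    return round(max(deltas) - min(deltas), 2)


# Made-up per-subgroup scores for illustration only.
before = {"asexual": 70.0, "nonbinary": 68.0, "gay": 64.0}
after = {"asexual": 66.0, "nonbinary": 41.0, "gay": 50.0}
report = audit_finetuning(before, after)
print(report["nonbinary"])   # {'delta': -27.0, 'overshoot': True}
print(delta_spread(report))  # 23.0
```

A large spread points to unequal improvement across subgroups, and an overshoot flag points to a model that now favors the counterfactual sentence more often than chance.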

Core Entities

Models

BERT, RoBERTa, ALBERT, BART, GPT-2, OPT, BLOOM

Metrics

WQ bias score (0-100; 50 ideal), pseudo-log-likelihood (masked models), autoregressive token prediction score (autoregressive models)
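The two sentence-scoring metrics above can be sketched in generic form. This is a textbook-style illustration under stated assumptions: the paper's exact variant may differ (for example, in which tokens are scored), and `masked_prob` and `next_prob` are hypothetical callables standing in for a real masked or autoregressive model.

```python
import math


def pseudo_log_likelihood(tokens, masked_prob):
    """Sum of log P(token | all other tokens), masking one position at a time.

    masked_prob(context, position, token) is a hypothetical callable returning
    the model's probability of `token` at the masked `position`.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += math.log(masked_prob(context, i, token))
    return total


def autoregressive_log_likelihood(tokens, next_prob):
    """Sum of log P(token | preceding tokens), scored left to right.

    next_prob(prefix, token) is a hypothetical callable returning the model's
    probability of `token` given the tokens before it.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        total += math.log(next_prob(tokens[:i], token))
    return total


# Toy stand-in models that assign probability 0.5 to every token.
toy_masked = lambda context, position, token: 0.5
toy_next = lambda prefix, token: 0.5

sentence = ["a", "toy", "sentence"]
pll = pseudo_log_likelihood(sentence, toy_masked)
arll = autoregressive_log_likelihood(sentence, toy_next)
print(pll == arll)  # True, because both toy models give 0.5 per token
```

Either function yields one score per sentence, which is what a pairwise comparison between a stereotyped sentence and its counterfactual needs.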

Datasets

WinoQueer, QueerNews, QueerTwitter

Benchmarks

WinoQueer