WinoQueer: a community-sourced benchmark showing many LLMs encode anti-LGBTQ+ bias

Overview

Decision Snapshot: Needs Validation

WinoQueer is a practical, community-grounded benchmark with clear baselines and debiasing signals. It is ready for auditing and small-scale finetuning, but limited by English-only data, sample bias, and evaluation calibration.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs used in products can reproduce harmful queer stereotypes; auditing with WinoQueer identifies risks before release and community-derived finetuning reduces those risks.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Data Scientist, Founder

Summary TLDR

WinoQueer is a 45,540-pair benchmark built from a survey of 295 LGBTQ+ respondents to capture real harms and stereotypes. An evaluation of 20 open models from the BERT, RoBERTa, ALBERT, BART, GPT-2, OPT, and BLOOM families finds an average bias score of 66.50 (50 is unbiased). Finetuning on community-written Twitter data cuts bias more (mean -17.98 points) than finetuning on news (mean -10.28 points). The dataset is English-only and its survey sample is not globally representative; use it to audit models and to test targeted debiasing.

Problem Statement

Current bias benchmarks rarely target anti-LGBTQ+ harms or use input from affected people. That leaves LLMs untested for real-world homophobic and transphobic stereotypes and makes debiasing less effective for specific queer subgroups.

Main Contribution

WinoQueer: a community-sourced paired-sentence benchmark (45,540 pairs) for anti-LGBTQ+ bias

A repeatable community-in-the-loop method: build templates from a survey of harmed people

Key Findings

Off-the-shelf LLMs show substantial anti-LGBTQ+ bias.

Numbers: Average WQ bias score = 66.50 (50 is ideal)

Practical Use: Run WinoQueer on your models; a score well above 50 flags a likely harmful stereotype association in outputs.

Evidence Ref: Table 5 (mean over all models and overall WQ scores)
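To make the 0-100 scale concrete, here is a minimal sketch of how a paired-sentence bias score can be computed. It assumes the score is the percentage of pairs in which the model assigns a higher score to the stereotypical sentence than to its counterfactual, which is why 50 is the unbiased point. The function name `wq_bias_score`, the tuple layout, the tie handling, and the example numbers are all illustrative assumptions, not taken from the WinoQueer code.

```python
from typing import Iterable, Tuple


def wq_bias_score(pair_scores: Iterable[Tuple[float, float]]) -> float:
    """Percent of pairs where the stereotypical sentence scores higher.

    Each item is (score_stereo, score_counter): a model score (for example
    a log-likelihood) for the sentence that names an LGBTQ+ identity and
    for its counterfactual. Ties count as "not preferred" (an assumption).
    """
    pairs = list(pair_scores)
    if not pairs:
        raise ValueError("need at least one sentence pair")
    preferred = sum(1 for stereo, counter in pairs if stereo > counter)
    return 100.0 * preferred / len(pairs)


# Hypothetical scores for four pairs (made-up numbers, not from the paper).
example = [(-12.1, -13.4), (-20.0, -19.2), (-8.7, -9.9), (-15.3, -15.9)]
score = wq_bias_score(example)
print(f"WQ-style bias score: {score:.2f} (50 is ideal)")
```

In this toy input the stereotypical sentence wins in three of four pairs, so the score lands well above 50 and would be flagged for review.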

Finetuning on community-written Twitter text reduced bias more than news.

Numbers: Mean ∆: QueerNews = -10.28 pts, QueerTwitter = -17.98 pts

Practical Use: If you need quick debiasing, finetune on community-curated social text; expect a larger effect than from generic news.

Evidence Ref: Table 6 (finetuning deltas averaged over 16 models)
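The comparison between the two finetuning corpora is simple arithmetic on mean deltas. The sketch below shows one way to compute a mean delta from per-model scores and reproduces the gap between the two reported means. The helper `mean_delta` and the two-model example are hypothetical; only -10.28 and -17.98 come from the summary above.

```python
from statistics import mean
from typing import Sequence


def mean_delta(before: Sequence[float], after: Sequence[float]) -> float:
    """Average change in bias score (after minus before) across models."""
    if len(before) != len(after) or not before:
        raise ValueError("need matching, non-empty score lists")
    return mean(a - b for b, a in zip(before, after))


# Reported mean deltas over 16 finetuned models (from the summary above).
delta_news = -10.28
delta_twitter = -17.98

# How many more points QueerTwitter removes than QueerNews, on average.
extra_reduction = delta_news - delta_twitter
print(f"QueerTwitter reduces bias by {extra_reduction:.2f} more points")

# Hypothetical per-model scores, only to show the helper in use.
print(mean_delta(before=[70.0, 60.0], after=[55.0, 50.0]))
```

A negative delta means the bias score dropped after finetuning, so the more negative value is the larger improvement.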

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WQ average bias score (all tested models)	66.50	50 (unbiased)	16.50 above ideal	WinoQueer overall	Table 5 aggregated mean	Table 5
Finetuning effect (mean over 16 models)	QueerNews -10.28 pts; QueerTwitter -17.98 pts	WQ baseline per model	QueerTwitter reduces more than News by ~7.7 pts	Finetuned models (16)	Table 6 mean deltas	Table 6

What To Try In 7 Days

Run WinoQueer on your models to get a baseline bias score

Finetune a small model on community-curated Twitter snippets and re-evaluate

Add subgroup checks (e.g., asexual, nonbinary) to your fairness tests
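The subgroup check in the last step could be sketched as follows. This assumes each benchmark pair carries a subgroup label and that you already have a model score for both sentences in the pair; the row layout, the function names, and the 10-point tolerance are illustrative assumptions rather than anything prescribed by WinoQueer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float, float]  # (subgroup, score_stereo, score_counter)


def subgroup_scores(rows: Iterable[Row]) -> Dict[str, float]:
    """Bias score per subgroup: percent of pairs preferring the stereotype."""
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [preferred, total]
    for subgroup, stereo, counter in rows:
        counts[subgroup][0] += int(stereo > counter)
        counts[subgroup][1] += 1
    return {g: 100.0 * p / n for g, (p, n) in counts.items()}


def flag_subgroups(scores: Dict[str, float], tolerance: float = 10.0) -> List[str]:
    """Subgroups whose score is further than `tolerance` from the ideal 50."""
    return sorted(g for g, s in scores.items() if abs(s - 50.0) > tolerance)


# Hypothetical rows (made-up scores) for two subgroups.
rows = [
    ("asexual", -10.0, -11.0),
    ("asexual", -9.0, -9.5),
    ("nonbinary", -12.0, -11.0),
    ("nonbinary", -8.0, -9.0),
]
scores = subgroup_scores(rows)
print(scores, flag_subgroups(scores))
```

Reporting per-subgroup scores alongside the overall number helps catch cases where a good average hides a poorly served group.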

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/katyfelkner/winoqueer

Data URLs

https://github.com/katyfelkner/winoqueer

Risks & Boundaries

Limitations

Survey sample is English-speaking, skewed young, and US-heavy; not globally representative

Templates, names, and pronouns are limited (three pronouns, US-centric names)

When Not To Use

As the sole safety check for deployed systems or downstream tasks

To claim absence of all queer-related stereotypes (low WQ ≠ no harm)

Failure Modes

Finetuning can overshoot, leaving the model more likely to apply stereotypes to non-LGBTQ+ people

Unbalanced finetuning data yields unequal improvements across subgroups
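Both failure modes can be checked after finetuning with a small comparison of per-subgroup scores. The sketch below assumes that a score under 50 indicates overshoot and treats a large spread in per-subgroup gains as uneven improvement; the function `audit_finetune`, the 5-point `max_gap` threshold, and the example scores are hypothetical.

```python
from typing import Dict


def audit_finetune(
    before: Dict[str, float], after: Dict[str, float], max_gap: float = 5.0
) -> dict:
    """Compare per-subgroup bias scores before and after finetuning."""
    # Overshoot: score fell below the ideal 50 (assumed interpretation).
    overshoot = sorted(g for g, s in after.items() if s < 50.0)
    # Gain: how many points of bias were removed for each subgroup.
    gains = {g: before[g] - after[g] for g in before if g in after}
    gap = max(gains.values()) - min(gains.values()) if gains else 0.0
    return {"overshoot": overshoot, "gains": gains, "uneven": gap > max_gap}


# Hypothetical scores (not from the paper).
before = {"gay": 70.0, "asexual": 68.0, "nonbinary": 66.0}
after = {"gay": 46.0, "asexual": 60.0, "nonbinary": 52.0}
report = audit_finetune(before, after)
print(report)
```

In this toy case one subgroup overshoots while another barely improves, which is the pattern that unbalanced finetuning data would be expected to produce.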

Core Entities

Models

BERT, RoBERTa, ALBERT, BART, GPT-2, OPT, BLOOM

Metrics

WQ bias score (0-100; 50 ideal)

Pseudo-log-likelihood (masked models)

Autoregressive token prediction score (autoregressive models)
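The two sentence-scoring metrics differ in what context the model sees for each token. The following is a simplified sketch, not the benchmark's implementation: the real model is replaced by injected callables (`masked_logprob` and `next_logprob`, both hypothetical interfaces), and every token is scored, whereas a paired-sentence metric may restrict scoring to a subset of tokens.

```python
import math
from typing import Callable, List, Sequence

MASK = "[MASK]"  # placeholder mask symbol (assumption)

# Hypothetical interface: log P(target at position i | masked sentence).
MaskedLogProb = Callable[[List[str], int, str], float]
# Hypothetical interface: log P(target | preceding tokens).
NextLogProb = Callable[[List[str], str], float]


def pseudo_log_likelihood(tokens: Sequence[str], masked_logprob: MaskedLogProb) -> float:
    """Mask each token in turn and sum the log-probability of the original."""
    total = 0.0
    for i, target in enumerate(tokens):
        masked = list(tokens[:i]) + [MASK] + list(tokens[i + 1:])
        total += masked_logprob(masked, i, target)
    return total


def autoregressive_score(tokens: Sequence[str], next_logprob: NextLogProb) -> float:
    """Sum the log-probability of each token given only the tokens before it."""
    return sum(next_logprob(list(tokens[:i]), t) for i, t in enumerate(tokens))


# Toy stand-in models: uniform over a 4-word vocabulary.
def uniform_masked(masked: List[str], i: int, target: str) -> float:
    return math.log(1 / 4)


def uniform_next(prefix: List[str], target: str) -> float:
    return math.log(1 / 4)


tokens = ["some", "example", "sentence"]
pll = pseudo_log_likelihood(tokens, uniform_masked)
ar = autoregressive_score(tokens, uniform_next)
print(pll, ar)
```

With a real model, the callables would wrap a forward pass; either score can then feed the pairwise comparison that produces the WQ bias score.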

Datasets

WinoQueer, QueerNews, QueerTwitter

Benchmarks

WinoQueer

