Overview
Production Readiness: 0.5
Novelty Score: 0.7
Cost Impact Score: 0.6
Citation Count: 5
Why It Matters For Business
LLMs used in products can reproduce harmful stereotypes about queer people. Auditing with WinoQueer surfaces these risks before release, and finetuning on community-written data reduces them.
Summary TLDR
WinoQueer is a 45,540-pair benchmark built from a survey of 295 LGBTQ+ respondents to capture real harms and stereotypes. An evaluation of 20 open models from seven families (BERT, RoBERTa, ALBERT, BART, GPT-2, OPT, BLOOM) finds an average bias score of 66.50, where 50 is unbiased. Finetuning on community-written Twitter data lowers the bias score by 17.98 points on average, compared with 10.28 points for news data. The dataset is English-only and has sampling limits; use it to audit models and to test targeted debiasing.
Problem Statement
Current bias benchmarks rarely target anti-LGBTQ+ harms or use input from affected people. That leaves LLMs untested for real-world homophobic and transphobic stereotypes and makes debiasing less effective for specific queer subgroups.
Main Contribution
- WinoQueer: a community-sourced paired-sentence benchmark (45,540 pairs) for anti-LGBTQ+ bias
- A repeatable community-in-the-loop method: build templates from a survey of harmed people
- Baseline audits of 20 off-the-shelf LLMs showing widespread anti-queer bias
- Finetuning experiments showing debiasing via community data (Twitter > news)
Key Findings
- Off-the-shelf LLMs show substantial anti-LGBTQ+ bias.
- Finetuning on community-written Twitter text reduced bias more than finetuning on news text.
- Bias severity varies widely across LGBTQ+ subgroups.
- Model architecture matters more than parameter count in this benchmark.
- Finetuning can overshoot and flip bias toward non-LGBTQ+ targets.
Results
WQ average bias score (all tested models): 66.50 (50 is unbiased)
Finetuning effect (mean over 16 models): -17.98 points with Twitter data, -10.28 points with news data
Subgroup extremes
Model range (example)
Who Should Care
What To Try In 7 Days
- Run WinoQueer on your models to get a baseline bias score
- Finetune a small model on community-curated Twitter snippets and re-evaluate
- Add subgroup checks (e.g., asexual, nonbinary) to your fairness tests
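The baseline and subgroup checks above can be sketched as a small aggregation step. This is a minimal sketch, assuming the WQ bias score is the percentage of sentence pairs in which the model scores the LGBTQ+-targeted sentence higher than its counterpart (so 50 means no preference). The field names `subgroup`, `score_target`, and `score_counterpart` and the toy scores are hypothetical and do not come from the benchmark's own files.

```python
from collections import defaultdict

def wq_bias_score(pairs):
    """Percent of pairs where the LGBTQ+-targeted sentence gets the higher score.

    Assumed definition: 50 means no preference, above 50 means the model
    favors applying the stereotype to the LGBTQ+ target.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    preferred = sum(1 for p in pairs if p["score_target"] > p["score_counterpart"])
    return 100.0 * preferred / len(pairs)

def wq_bias_by_subgroup(pairs):
    """Compute the same score separately for each subgroup label."""
    groups = defaultdict(list)
    for p in pairs:
        groups[p["subgroup"]].append(p)
    return {name: wq_bias_score(members) for name, members in groups.items()}

# Hypothetical sentence-level log-likelihoods, for illustration only.
scored_pairs = [
    {"subgroup": "asexual", "score_target": -41.2, "score_counterpart": -43.0},
    {"subgroup": "asexual", "score_target": -39.5, "score_counterpart": -38.9},
    {"subgroup": "nonbinary", "score_target": -50.1, "score_counterpart": -52.7},
    {"subgroup": "nonbinary", "score_target": -47.3, "score_counterpart": -49.0},
]

overall = wq_bias_score(scored_pairs)
by_group = wq_bias_by_subgroup(scored_pairs)
print(overall, by_group)
```

Reporting the per-subgroup scores next to the overall number helps catch cases where an acceptable average hides a subgroup far from 50, or where finetuning has pushed a subgroup below 50.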
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey sample is English-speaking, skewed young, and US-heavy; not globally representative
- Templates, names, and pronouns are limited (three pronouns, US-centric names)
- WQ focuses on gender and sexual orientation only; it omits intersectional axes
- Scoring functions differ between masked and autoregressive models and may be imperfectly calibrated against each other
- Finetuning experiments cover only open-source models with at most 1.5B parameters
When Not To Use
- As the sole safety check for deployed systems or downstream tasks
- To claim absence of all queer-related stereotypes (low WQ ≠ no harm)
- For non-English models or global populations without further localization
Failure Modes
- Finetuning overshoots and makes models apply stereotypes to non-LGBTQ+ people
- Unbalanced finetuning data yields unequal improvements across subgroups
- Metric differences between model types obscure true changes in behavior
- Sampling noise in Twitter/news corpora introduces irrelevant or spam signals
Core Entities
Models
- BERT
- RoBERTa
- ALBERT
- BART
- GPT-2
- OPT
- BLOOM
Metrics
- WQ bias score (0-100; 50 ideal)
- pseudo-log-likelihood (masked models)
- autoregressive token prediction score (autoregressive models)
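The two sentence-scoring styles listed above can be illustrated with model-agnostic helpers. This is a rough sketch under stated assumptions, not the benchmark's implementation: `masked_logprob` and `next_logprob` are hypothetical callables standing in for a real model, the `[MASK]` token string is assumed, and the benchmark's own scoring may differ (for example, in which tokens it masks).

```python
import math
from typing import Callable, Sequence

MASK = "[MASK]"  # assumed mask token string

def pseudo_log_likelihood(
    tokens: Sequence[str],
    masked_logprob: Callable[[Sequence[str], int, str], float],
) -> float:
    """Mask each token in turn and sum the log-probability of the original token."""
    total = 0.0
    for i, original in enumerate(tokens):
        masked = list(tokens[:i]) + [MASK] + list(tokens[i + 1:])
        total += masked_logprob(masked, i, original)
    return total

def autoregressive_log_likelihood(
    tokens: Sequence[str],
    next_logprob: Callable[[Sequence[str], str], float],
) -> float:
    """Sum the log-probability of each token given only the tokens before it."""
    return sum(next_logprob(tokens[:i], tokens[i]) for i in range(len(tokens)))

# Toy stand-in models: uniform over a hypothetical 10-token vocabulary.
VOCAB_SIZE = 10

def uniform_masked(masked_tokens, position, original):
    return -math.log(VOCAB_SIZE)

def uniform_next(prefix, token):
    return -math.log(VOCAB_SIZE)

sentence = ["the", "cat", "sat"]
pll = pseudo_log_likelihood(sentence, uniform_masked)
arll = autoregressive_log_likelihood(sentence, uniform_next)
print(pll, arll)
```

Under the uniform toy models both functions return the same value, but with real models the masked score conditions on context from both sides while the autoregressive score conditions only on the left context, which is one reason scores from the two model types are hard to compare directly.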
Datasets
- WinoQueer
- QueerNews
- QueerTwitter
Benchmarks
- WinoQueer

