Overview
WinoQueer is a practical, community-grounded benchmark with clear baselines and debiasing signals. It is ready for auditing and small-scale finetuning, but limited by English-only data, sample bias, and evaluation calibration.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 70%
Why It Matters For Business
LLMs used in products can reproduce harmful stereotypes about queer people. Auditing with WinoQueer identifies these risks before release, and finetuning on community-derived text reduces them.
Summary TLDR
WinoQueer is a 45,540-pair benchmark built from a survey of 295 LGBTQ+ respondents to capture real harms and stereotypes. Evaluating 20 open models (from the BERT, RoBERTa, ALBERT, BART, GPT-2, OPT, and BLOOM families) gives an average bias score of 66.50, where 50 is unbiased. Finetuning on community-written Twitter data cuts bias more (average -17.98 points) than finetuning on news data (average -10.28 points). The dataset is English-only and has sampling limits; use it to audit models and to test targeted debiasing.
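To make the 50-point reference concrete, here is a minimal sketch of a paired-sentence bias score. It assumes the score is the percentage of pairs in which a model assigns a higher likelihood to the stereotype-targeting sentence than to its counterfactual, which is why 50 would read as unbiased. The `ScoredPair` type, the `bias_score` function, and the log-likelihood values are hypothetical illustrations, not WinoQueer's released code or data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPair:
    """Model scores for one sentence pair (hypothetical structure)."""
    stereo_loglik: float   # log-likelihood of the sentence naming an LGBTQ+ identity
    counter_loglik: float  # log-likelihood of the counterfactual sentence


def bias_score(pairs: list[ScoredPair]) -> float:
    """Percentage of pairs where the stereotypical sentence is scored as more likely.

    Ties count as not preferring the stereotype (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    preferred = sum(p.stereo_loglik > p.counter_loglik for p in pairs)
    return 100.0 * preferred / len(pairs)


# Made-up scores for four pairs; real values would come from a language model.
toy_pairs = [
    ScoredPair(-41.2, -43.0),
    ScoredPair(-38.5, -37.9),
    ScoredPair(-50.1, -52.4),
    ScoredPair(-29.7, -30.3),
]
score = bias_score(toy_pairs)
print(f"bias score: {score:.2f} (distance from 50: {score - 50:+.2f})")
```

In this toy run the stereotypical sentence wins three of four pairs, so the score sits 25 points above the unbiased reference.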
Problem Statement
Current bias benchmarks rarely target anti-LGBTQ+ harms or use input from affected people. That leaves LLMs untested for real-world homophobic and transphobic stereotypes and makes debiasing less effective for specific queer subgroups.
Main Contribution
WinoQueer: a community-sourced paired-sentence benchmark (45,540 pairs) for anti-LGBTQ+ bias
A repeatable community-in-the-loop method: build templates from a survey of harmed people
Key Findings
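Off-the-shelf LLMs show substantial anti-LGBTQ+ bias: the average WQ score across tested models is 66.50, against an unbiased value of 50.
Finetuning on community-written Twitter text reduced bias more than finetuning on news text (mean -17.98 vs. -10.28 points).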
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| WQ average bias score (all tested models) | 66.50 | 50 (unbiased) | 16.50 above ideal | WinoQueer overall | Table 5 aggregated mean | Table 5 |
| Finetuning effect (mean over 16 models) | QueerNews -10.28 pts; QueerTwitter -17.98 pts | WQ baseline per model | QueerTwitter reduces bias ~7.7 pts more than QueerNews | Finetuned models (16) | Table 6 mean deltas | Table 6 |
What To Try In 7 Days
Run WinoQueer on your models to get a baseline bias score
Finetune a small model on community-curated Twitter snippets and re-evaluate
Add subgroup checks (e.g., asexual, nonbinary) to your fairness tests
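The subgroup check in the last step can be sketched as a simple aggregation. This assumes you already have, for each sentence pair, the subgroup it targets and whether the model preferred the stereotypical sentence. The function names, the toy results, and the 60-point flagging threshold are hypothetical choices for illustration, not values from the benchmark.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def subgroup_scores(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Per-subgroup percentage of pairs where the stereotype was preferred.

    `results` yields (subgroup, stereotype_preferred) tuples.
    """
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [preferred, total]
    for subgroup, preferred in results:
        counts[subgroup][0] += int(preferred)
        counts[subgroup][1] += 1
    return {g: 100.0 * hit / total for g, (hit, total) in counts.items()}


def flag_subgroups(scores: dict[str, float], threshold: float = 60.0) -> list[str]:
    """Subgroups whose score exceeds the (hypothetical) review threshold."""
    return sorted(g for g, s in scores.items() if s > threshold)


# Made-up per-pair outcomes for two subgroups.
toy_results = [
    ("asexual", True), ("asexual", True), ("asexual", True), ("asexual", False),
    ("nonbinary", True), ("nonbinary", True), ("nonbinary", False), ("nonbinary", False),
]
scores = subgroup_scores(toy_results)
flagged = flag_subgroups(scores)
print(scores, flagged)
```

Reporting scores per subgroup rather than one overall number keeps a well-performing majority group from masking a poorly served smaller one.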
Risks & Boundaries
Limitations
Survey sample is English-speaking, skewed young, and US-heavy; not globally representative
Templates, names, and pronouns are limited (three pronouns, US-centric names)
When Not To Use
As the sole safety check for deployed systems or downstream tasks
To claim the absence of all queer-related stereotypes (a WQ score near 50 ≠ no harm)
Failure Modes
Finetuning overshoots and makes models apply stereotypes to non-LGBTQ+ people
Unbalanced finetuning data yields unequal improvements across subgroups
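Both failure modes can be watched for by comparing per-subgroup scores before and after finetuning. The sketch below assumes that a score meaningfully below 50 signals overshoot and that a wide spread in per-subgroup deltas signals unequal improvement. The `audit_finetune` and `delta_spread` helpers, the 2-point tolerance, and all scores are hypothetical.

```python
from __future__ import annotations


def audit_finetune(
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 2.0,
) -> dict[str, tuple[float, str]]:
    """Label each subgroup's change in bias score after finetuning.

    Returns {subgroup: (delta, status)}; negative delta means the score dropped.
    """
    if before.keys() != after.keys():
        raise ValueError("before and after must cover the same subgroups")
    report = {}
    for group, old in before.items():
        new = after[group]
        delta = new - old
        if new < 50.0 - tolerance:
            status = "overshoot"       # now favors the counterfactual sentence
        elif delta >= 0:
            status = "no improvement"
        else:
            status = "improved"
        report[group] = (delta, status)
    return report


def delta_spread(report: dict[str, tuple[float, str]]) -> float:
    """Gap between the largest and smallest per-subgroup delta."""
    deltas = [delta for delta, _ in report.values()]
    return max(deltas) - min(deltas)


# Made-up scores for three subgroups.
before = {"gay": 70.0, "asexual": 68.0, "nonbinary": 66.0}
after = {"gay": 44.0, "asexual": 67.0, "nonbinary": 55.0}
report = audit_finetune(before, after)
spread = delta_spread(report)
print(report, spread)
```

Here the toy "gay" score overshoots past 50 while "asexual" barely moves, so the 25-point spread would prompt a look at how balanced the finetuning data is.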

