Language models show gender bias even on sentences without gendered or stereotyped words

Overview

Decision Snapshot: Ready For Pilot

The paper provides wide empirical evaluation across 28 models and multiple datasets, clear metrics, and released code, but findings are limited to English binary pronouns and rely on one generator for dataset creation.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Even neutral-sounding text can produce gender-skewed outputs; companies must audit models beyond obvious gender words to avoid biased user-facing content.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, CEO

Summary TLDR

The paper introduces UnStereoEval (USE), a framework that finds sentence pairs free of strong gender-word associations (using PMI on the PILE pretraining corpus) and tests 28 language models on them. Using automated generation plus semantic filters, the authors produce diverse stereotype‑free benchmarks and measure fairness with an "Unstereo Score" (US). Across USE and filtered Winobias/Winogender splits, models are neutral on only 9%–41% of sentences and show a systematic male preference on the coreference benchmarks. Changing the PMI threshold, increasing model size, or applying simple deduplication does not reliably fix this bias. Code and datasets are released.

Problem Statement

Prior work links LM gender bias to gendered words in training data. This paper asks: do models still favor one gender when sentences contain no strongly gendered words? The authors build a benchmark and measure whether popular LMs remain neutral on such "stereotype-free" sentence pairs.

Main Contribution

UnStereoEval (USE): a framework to identify and test sentence pairs with minimal gender-word co-occurrence using PMI on PILE.

An automated pipeline using ChatGPT to generate diverse, gender-invariant sentence pairs and semantic filters to remove unnatural or offensive sentences.
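The PMI-based filtering idea can be illustrated with a minimal sketch. It assumes a per-word gender score defined as PMI(word, "she") minus PMI(word, "he"), takes the largest-magnitude score in a sentence as its MaxPMI, and keeps sentences with |MaxPMI| at or below a threshold (0.65 is the value reported for the filtered Winobias/Winogender splits). The function names, the toy counts, and the choice to treat unseen words as neutral are illustrative assumptions, not the authors' released implementation.

```python
import math

def pmi(joint, count_a, count_b, total):
    """Pointwise mutual information: log( p(a, b) / (p(a) * p(b)) )."""
    return math.log((joint * total) / (count_a * count_b))

def gender_score(word, counts, cooc, total):
    """PMI(word, 'she') - PMI(word, 'he'); positive means female-skewed."""
    joint_she = cooc.get((word, "she"), 0)
    joint_he = cooc.get((word, "he"), 0)
    # Simplifying assumption for this toy: words without co-occurrence
    # counts for both pronouns are treated as neutral.
    if word not in counts or joint_she == 0 or joint_he == 0:
        return 0.0
    she = pmi(joint_she, counts[word], counts["she"], total)
    he = pmi(joint_he, counts[word], counts["he"], total)
    return she - he

def max_pmi(sentence, counts, cooc, total):
    """Word-level score with the largest absolute value in the sentence."""
    scores = [gender_score(w, counts, cooc, total) for w in sentence.lower().split()]
    return max(scores, key=abs, default=0.0)

def is_stereotype_free(sentence, counts, cooc, total, eta=0.65):
    """Keep the sentence only if no word is strongly tied to either pronoun."""
    return abs(max_pmi(sentence, counts, cooc, total)) <= eta

# Hypothetical corpus statistics (made up for illustration, not PILE counts).
TOTAL = 1_000_000
COUNTS = {"she": 10_000, "he": 20_000, "nurse": 500, "table": 800}
COOC = {
    ("nurse", "she"): 60, ("nurse", "he"): 20,
    ("table", "she"): 40, ("table", "he"): 80,
}

print(is_stereotype_free("the table sat", COUNTS, COOC, TOTAL))
print(is_stereotype_free("the nurse sat", COUNTS, COOC, TOTAL))
```

In this toy data, "table" co-occurs with both pronouns in proportion to their base rates, so its score is zero, while "nurse" co-occurs with "she" six times more often than the base rates predict and is filtered out.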

Key Findings

Models are neutral on only a small fraction of stereotype-free sentences.

Numbers: US (fairness) ranges 9%–41% across tested models and filtered datasets

Practical Use: Do not assume absence of gender words implies unbiased outputs; audit models on stereotype-free sentences too.

Evidence Ref: Abstract; Table 1; Tables 8-11

Models systematically prefer male completions on filtered Winobias/Winogender.

Numbers: Preference disparity >40% (male skew) on WB and WG in many models

Practical Use: Coreference and related applications can produce male-skewed outputs even for neutral phrasing; add checks before deployment.

Evidence Ref: Section 4; Table 3; Tables 15-18

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Unstereo Score (US) fairness	9%–41%	—	—	USE-5/10/20 and filtered Winobias/Winogender (non-stereotypical subsets)	Tables 1, 8–11 show US values per model and dataset	Tables 1, 8-11
Preference disparity (male vs female)	>40% male skew on Winobias/Winogender for many models	—	—	Filtered Winobias and Winogender (|MaxPMI| ≤ 0.65)	Table 3 and Tables 15-18 report PD values exceeding 40%	Table 3; Tables 15-18

What To Try In 7 Days

Run UnStereoEval (or similar PMI-based filter) on your model to measure US and PD on neutral sentences.

Filter your existing test-suite for PMI-based gender co-occurrence to detect hidden biases.

Inspect high PD examples manually and log them to prioritize targeted mitigation (fine-tuning prompts or balanced SFT).
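As a rough starting point for the audit steps above, the sketch below computes a neutrality rate and a male/female preference gap from per-pair log-likelihoods. It assumes you have already scored the "he" and "she" version of each sentence with your model. The tolerance `eps`, the sign convention (male share minus female share), and the use of all pairs as the denominator are assumptions made for illustration; check the released code for the paper's exact US and PD definitions.

```python
def classify_pair(logp_male, logp_female, eps=0.5):
    """Label a pair 'neutral', 'male', or 'female' by log-likelihood gap."""
    gap = logp_male - logp_female
    if abs(gap) <= eps:
        return "neutral"
    return "male" if gap > 0 else "female"

def unstereo_score(pairs, eps=0.5):
    """Fraction of pairs on which the model shows no preference (within eps)."""
    labels = [classify_pair(m, f, eps) for m, f in pairs]
    return labels.count("neutral") / len(labels)

def preference_disparity(pairs, eps=0.5):
    """Share of male-preferred pairs minus share of female-preferred pairs."""
    labels = [classify_pair(m, f, eps) for m, f in pairs]
    return (labels.count("male") - labels.count("female")) / len(labels)

# Hypothetical (log p(he-version), log p(she-version)) values for four pairs.
PAIRS = [(-10.0, -10.2), (-8.0, -9.5), (-12.0, -11.0), (-7.0, -9.0)]

print(unstereo_score(PAIRS))        # 1 neutral pair out of 4
print(preference_disparity(PAIRS))  # (2 male - 1 female) / 4
```

Logging the output of `classify_pair` next to each sentence gives a simple way to pull out the strongly skewed examples for manual review.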

Agent Features

Tool Use

ChatGPT used as a controlled generator

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://ucinlp.github.io/unstereo-eval https://github.com/ucinlp/unstereo-eval

Data URLs

https://ucinlp.github.io/unstereo-eval https://github.com/ucinlp/unstereo-eval

Risks & Boundaries

Limitations

Focuses on binary English pronouns ('he'/'she'); non-binary and other languages not evaluated.

Non-stereotypical benchmarks were generated using one model (ChatGPT), which may introduce generation artifacts.

When Not To Use

When evaluating bias for non-binary genders or languages other than English.

When you need causal attribution of bias to specific pretraining data sources.

Failure Modes

Generation artifacts: ChatGPT may introduce subtle wording biases that affect results.

PMI blind spots: names or rare tokens may escape the single-pronoun PMI filter and still encode gender.

Core Entities

Models

Pythia (70M–12B), GPT-J-6B, OPT (125M–6.7B), Llama-2 (7B/13B/70B), MPT (7B/30B), Mistral-7B, Mixtral-8x7B, OLMo (1B/7B)

Metrics

Unstereo Score (US), Preference Disparity (PD), Fairness gap (∆η), Area under the Fairness Curve (AuFC)

Datasets

USE-5, USE-10, USE-20, Winobias (filtered), Winogender (filtered), PILE (pretraining corpus used for PMI)

Benchmarks

UnStereoEval (USE), Filtered Winobias, Filtered Winogender
