Language models show gender bias even on sentences without gendered or stereotyped words

May 1, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides a broad empirical evaluation across 28 models and multiple datasets, with clear metrics and released code, but its findings are limited to English binary pronouns and rely on a single generator for dataset creation.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Even neutral-sounding text can produce gender-skewed outputs; companies must audit models beyond obvious gender words to avoid biased user-facing content.

Who Should Care

Summary TLDR

The paper introduces UnStereoEval (USE), a framework that finds sentence pairs free of strong gender-word associations (using PMI on the PILE pretraining corpus) and uses them to test 28 language models. Using automated generation plus semantic filters, the authors produce diverse stereotype-free benchmarks and measure fairness with an "Unstereo Score" (US). Across USE and filtered Winobias/Winogender splits, models are neutral on only 9%–41% of sentence pairs and show a systematic male preference on the coreference benchmarks. Varying the PMI threshold, increasing model size, or applying simple deduplication does not reliably reduce this bias. Code and datasets are released.

Problem Statement

Prior work links LM gender bias to gendered words in training data. This paper asks: do models still favor one gender when sentences contain no strongly gendered words? The authors build a benchmark and measure whether popular LMs remain neutral on such "stereotype-free" sentence pairs.

Main Contribution

UnStereoEval (USE): a framework to identify and test sentence pairs with minimal gender-word co-occurrence using PMI on PILE.

An automated pipeline that uses ChatGPT to generate diverse, gender-invariant sentence pairs and applies semantic filters to remove unnatural or offensive sentences.
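As a rough illustration of the PMI-based filter described above, the sketch below scores each word by how much more strongly it co-occurs with "she" than with "he" and keeps a sentence only if its most gendered word stays within a threshold. The counts are hypothetical toy values rather than PILE statistics, the function names are invented for this example, and the exact scoring formula (a difference of two PMI values) is an assumption, not a confirmed detail of the paper. The default threshold of 0.65 mirrors the |MaxPMI| ≤ 0.65 setting mentioned for the filtered Winobias/Winogender splits.

```python
import math
import re

# Hypothetical toy counts; real counts would come from a corpus such as PILE.
TOTAL_WINDOWS = 1_000_000
WORD_COUNTS = {"nurse": 2_000, "engineer": 3_000, "table": 5_000, "walked": 8_000}
PRONOUN_COUNTS = {"she": 40_000, "he": 60_000}
CO_COUNTS = {
    ("nurse", "she"): 400, ("nurse", "he"): 100,
    ("engineer", "she"): 60, ("engineer", "he"): 540,
    ("table", "she"): 200, ("table", "he"): 300,
    ("walked", "she"): 320, ("walked", "he"): 480,
}


def pmi(word, pronoun):
    """Pointwise mutual information: log p(w, g) / (p(w) * p(g))."""
    p_joint = CO_COUNTS[(word, pronoun)] / TOTAL_WINDOWS
    p_word = WORD_COUNTS[word] / TOTAL_WINDOWS
    p_pron = PRONOUN_COUNTS[pronoun] / TOTAL_WINDOWS
    return math.log(p_joint / (p_word * p_pron))


def gender_skew(word):
    """Assumed word score: PMI(w, 'she') - PMI(w, 'he'); positive means female-skewed."""
    return pmi(word, "she") - pmi(word, "he")


def max_pmi(sentence):
    """Skew of the most strongly gendered known word (largest absolute value)."""
    words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w in WORD_COUNTS]
    return max((gender_skew(w) for w in words), key=abs, default=0.0)


def passes_filter(sentence, eta=0.65):
    """Keep a sentence only if |MaxPMI| <= eta."""
    return abs(max_pmi(sentence)) <= eta


print(passes_filter("She walked to the table."))        # True with these toy counts
print(passes_filter("The nurse walked to the table."))  # False: 'nurse' is female-skewed
```

Words missing from the count table are silently ignored here, which is one way rare tokens can slip past a filter of this kind.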

Key Findings

Models are neutral on only a small fraction of stereotype-free sentences.

Numbers: US (fairness) ranges 9%–41% across tested models and filtered datasets

Practical Use: Do not assume absence of gender words implies unbiased outputs; audit models on stereotype-free sentences too.

Evidence Ref: Abstract; Table 1; Tables 8-11

Models systematically prefer male completions on filtered Winobias/Winogender.

Numbers: Preference disparity >40% (male skew) on WB and WG in many models

Practical Use: Coreference and related applications can produce male-skewed outputs even for neutral phrasing; add checks before deployment.

Evidence Ref: Section 4; Table 3; Tables 15-18

Results

Metric: Unstereo Score (US) fairness
Value: 9%–41%
Baseline: (none listed)
Delta: (none listed)
Split / Dataset: USE-5/10/20 and filtered Winobias/Winogender (non-stereotypical subsets)
Evidence: Tables 1, 8–11 show US values per model and dataset
Evidence Ref: Tables 1, 8-11

Metric: Preference disparity (male vs female)
Value: ≥40% male skew on Winobias/Winogender for many models
Baseline: (none listed)
Delta: (none listed)
Split / Dataset: Filtered Winobias and Winogender (|MaxPMI| ≤ 0.65)
Evidence: Table 3 and Tables 15-18 report PD values exceeding 40%
Evidence Ref: Table 3; Tables 15-18
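To make the two reported metrics concrete, here is a minimal sketch of how they could be computed from per-pair log-probabilities. Everything specific in it is an assumption for illustration: the neutrality tolerance `epsilon`, the sign convention for PD (male share minus female share, so positive values mean male skew), the function names, and the toy scores. The paper's released code is the authoritative definition.

```python
def preference(logp_female, logp_male, epsilon=0.5):
    """Label one sentence pair; epsilon is an assumed neutrality tolerance."""
    gap = logp_female - logp_male
    if abs(gap) <= epsilon:
        return "neutral"
    return "female" if gap > 0 else "male"


def unstereo_score(pairs, epsilon=0.5):
    """Percentage of pairs on which the model shows no gender preference."""
    labels = [preference(f, m, epsilon) for f, m in pairs]
    return 100.0 * labels.count("neutral") / len(labels)


def preference_disparity(pairs, epsilon=0.5):
    """Assumed convention: % male-preferred minus % female-preferred."""
    labels = [preference(f, m, epsilon) for f, m in pairs]
    return 100.0 * (labels.count("male") - labels.count("female")) / len(labels)


# Hypothetical (log p(female variant), log p(male variant)) scores from a model.
PAIRS = [
    (-20.0, -20.2),  # neutral
    (-31.0, -29.0),  # male preferred
    (-15.0, -16.5),  # female preferred
    (-40.0, -38.0),  # male preferred
    (-22.3, -22.1),  # neutral
]

print(unstereo_score(PAIRS))        # 40.0
print(preference_disparity(PAIRS))  # 20.0
```

Under this convention, a model can have a moderate US and still a large PD if its non-neutral pairs lean almost entirely one way.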

What To Try In 7 Days

Run UnStereoEval (or a similar PMI-based filter) on your model to measure US and PD on neutral sentences.

Filter your existing test suite for PMI-based gender co-occurrence to detect hidden biases.

Manually inspect high-PD examples and log them to prioritize targeted mitigation (fine-tuning prompts or balanced SFT).
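The inspection step above could start from something like the following sketch, which ranks sentence pairs by the size of the model's log-probability gap and writes the top offenders to CSV for manual review. The record fields, templates, and scores are hypothetical, and the per-pair gap is used here as a stand-in for "high PD" since PD itself is a dataset-level number.

```python
import csv
import io

# Hypothetical audit records: a template plus model scores for each variant.
RECORDS = [
    {"template": "{pronoun} reviewed the quarterly report.",
     "logp_female": -24.1, "logp_male": -22.0},
    {"template": "{pronoun} watered the plants before leaving.",
     "logp_female": -19.8, "logp_male": -19.9},
    {"template": "{pronoun} repaired the bicycle chain.",
     "logp_female": -27.5, "logp_male": -24.0},
]


def top_disparity_examples(records, k=3):
    """Return the k records with the largest absolute log-probability gap."""
    scored = [
        {**r, "gap": round(r["logp_female"] - r["logp_male"], 3)}
        for r in records
    ]
    return sorted(scored, key=lambda r: abs(r["gap"]), reverse=True)[:k]


def to_csv(rows):
    """Serialize rows to CSV text so they can be logged or attached to a ticket."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["template", "logp_female", "logp_male", "gap"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(top_disparity_examples(RECORDS, k=2)))
```

A negative gap in this layout means the male variant was scored as more likely; sorting by absolute value surfaces skew in either direction.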

Agent Features

Tool Use
ChatGPT used as a controlled generator

Risks & Boundaries

Limitations

Focuses on binary English pronouns ('he'/'she'); non-binary pronouns and other languages are not evaluated.

Non-stereotypical benchmarks were generated using one model (ChatGPT), which may introduce generation artifacts.

When Not To Use

When evaluating bias for non-binary genders or languages other than English.

When you need causal attribution of bias to specific pretraining data sources.

Failure Modes

Generation artifacts: ChatGPT may introduce subtle wording biases that affect results.

PMI blind spots: names or rare tokens may escape the single-pronoun PMI filter and still encode gender.
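One inexpensive guard against the PMI blind spot is to flag tokens that the PMI table never scored, so that names and rare words get a manual look instead of passing silently. The sketch below assumes a simple set of scored words; the function name and the vocabulary are hypothetical.

```python
import re


def unscored_tokens(sentence, pmi_vocab):
    """Return tokens that have no PMI score and therefore bypass the filter."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return [t for t in tokens if t.lower() not in pmi_vocab]


# Hypothetical set of words for which PMI scores are available.
PMI_VOCAB = {"fixed", "the", "engine", "before", "lunch"}

print(unscored_tokens("Maria fixed the engine before lunch.", PMI_VOCAB))  # ['Maria']
```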

Core Entities

Models

Pythia (70M–12B), GPT-J-6B, OPT (125M–6.7B), Llama-2 (7B/13B/70B), MPT (7B/30B), Mistral-7B, Mixtral-8x7B, OLMo (1B/7B)

Metrics

Unstereo Score (US), Preference Disparity (PD), Fairness gap (∆η), Area under the Fairness Curve (AuFC)

Datasets

USE-5, USE-10, USE-20, Winobias (filtered), Winogender (filtered), PILE (pretraining corpus used for PMI)

Benchmarks

UnStereoEval (USE), Filtered Winobias, Filtered Winogender