Language models show gender bias even on sentences without gendered or stereotyped words

May 1, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

2

Authors

Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh

Links

Abstract / PDF

Why It Matters For Business

Even neutral-sounding text can produce gender-skewed outputs; companies must audit models beyond obvious gender words to avoid biased user-facing content.

Summary TLDR

The paper introduces UnStereoEval (USE), a framework that finds sentence pairs free of strong gender-word associations (using PMI on the PILE pretraining corpus) and tests 28 language models. Using automated generation plus semantic filters, the authors produce diverse stereotype‑free benchmarks and measure fairness with an "Unstereo Score" (US). Across USE and filtered Winobias/Winogender splits, models are neutral only 9%–41% of the time and show systematic male preference in coreference benchmarks. Changing the PMI threshold, model size, or simple deduplication does not reliably fix this bias. Code and datasets are released.

Problem Statement

Prior work links LM gender bias to gendered words in training data. This paper asks: do models still favor one gender when sentences contain no strongly gendered words? The authors build a benchmark and measure whether popular LMs remain neutral on such "stereotype-free" sentence pairs.

Main Contribution

UnStereoEval (USE): a framework to identify and test sentence pairs with minimal gender-word co-occurrence using PMI on PILE.

An automated pipeline using ChatGPT to generate diverse, gender-invariant sentence pairs and semantic filters to remove unnatural or offensive sentences.

Repurposing Winobias and Winogender to their non-stereotypical subsets and creating three new USE datasets (USE-5/10/20).

Large-scale evaluation of 28 LMs (Pythia, GPT-J, OPT, Llama-2, Mistral, OLMo, MPT, etc.) showing low fairness and consistent male skew in some benchmarks.

Public release of code and datasets to reproduce and extend the evaluation.
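The PMI-based filtering idea behind USE can be illustrated with a minimal sketch. It assumes a word's gender association is scored as PMI(word, "she") minus PMI(word, "he"), that a sentence is summarized by its most extreme word score (MaxPMI), and that a sentence is kept when that score is within a threshold η. The co-occurrence counts, the totals, the function names, and the handling of unknown words are all hypothetical; the value 0.65 in the usage note is only inferred from the "∆0.65" notation in the results and should be treated as an assumption.

```python
import math

# Hypothetical counts: word -> (contexts shared with "she", contexts shared with "he").
# Real counts would come from a pass over the pretraining corpus (PILE in the paper).
COOC = {
    "nurse": (900, 300),
    "engineer": (200, 800),
    "table": (500, 510),
    "walked": (1000, 980),
}
N_SHE, N_HE = 10_000, 10_000  # hypothetical total contexts containing each pronoun


def gender_score(word: str) -> float:
    """PMI(word, 'she') - PMI(word, 'he'); the p(word) term cancels in the difference."""
    c_she, c_he = COOC[word]
    return math.log(c_she / N_SHE) - math.log(c_he / N_HE)


def max_pmi(sentence: str) -> float:
    """Signed score of the most gender-associated word (0.0 if no word is known)."""
    scores = [gender_score(w) for w in sentence.lower().split() if w in COOC]
    return max(scores, key=abs, default=0.0)


def keep(sentence: str, eta: float) -> bool:
    """Keep a sentence only if its strongest word-level association is within eta."""
    return abs(max_pmi(sentence)) <= eta
```

With these toy counts, `keep("the table walked", 0.65)` is true and `keep("the nurse walked", 0.65)` is false; passing `math.inf` as `eta` disables the filter, which corresponds to the unfiltered baseline.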

Key Findings

Models are neutral on only a small fraction of stereotype-free sentences.

Numbers: US (fairness) ranges from 9% to 41% across tested models and filtered datasets.

Models systematically prefer male completions on filtered Winobias/Winogender.

Numbers: Preference disparity >40% (male skew) on Winobias (WB) and Winogender (WG) in many models.

Filtering out sentences that contain gender-co-occurring words (as measured by PMI) barely changes fairness scores.

Numbers: Fairness deltas |∆0.65| ≤ 1.33% (USE-5) and up to 11.84% (WG) in reported results.

Model size and basic pretraining deduplication do not consistently fix bias.

Numbers: No consistent fairness trend across sizes; deduplication effects vary (e.g., +6.73 for Pythia 70M, −17.14 for Pythia 410M).

Results

Unstereo Score (US) fairness

Value: 9%–41%

Preference disparity (male vs female)

Value: ≥40% male skew on Winobias/Winogender for many models

Effect of enforcing MaxPMI filter (fairness gap ∆0.65)

Value: |∆0.65| ≤ 1.33% (USE-5), up to 11.84% (WG), depending on dataset

Baseline: Orig. (η = ∞)

Impact of deduplication

Value: Mixed: improvements for some Pythia sizes, worse for others (e.g., +6.73 for 70M, −17.14 for 410M on USE-5)

Baseline: Original Pythia models
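To make the reported quantities concrete, here is a rough sketch of how US, preference disparity, and a fairness gap could be computed from per-pair log-probabilities. It assumes a model counts as neutral on a pair when the log-probabilities of the two gendered variants differ by at most a tolerance `eps`, that preference disparity is the male-preferred share minus the female-preferred share, and that the fairness gap is the filtered score minus the unfiltered one. The tolerance value, the sign conventions, and all function names are assumptions made for illustration and are not taken from the paper's code.

```python
def preference(logp_female: float, logp_male: float, eps: float) -> str:
    """Classify one sentence pair as 'neutral', 'female', or 'male'."""
    gap = logp_female - logp_male
    if abs(gap) <= eps:
        return "neutral"
    return "female" if gap > 0 else "male"


def unstereo_score(pairs, eps: float) -> float:
    """Percentage of pairs on which the model is neutral (assumed reading of US)."""
    labels = [preference(f, m, eps) for f, m in pairs]
    return 100.0 * labels.count("neutral") / len(labels)


def preference_disparity(pairs, eps: float) -> float:
    """Male-preferred % minus female-preferred %; positive means male skew (assumed sign)."""
    labels = [preference(f, m, eps) for f, m in pairs]
    return 100.0 * (labels.count("male") - labels.count("female")) / len(labels)


def fairness_gap(us_filtered: float, us_original: float) -> float:
    """Change in US after applying the PMI filter (assumed direction: filtered - original)."""
    return us_filtered - us_original


# Hypothetical (log p(female variant), log p(male variant)) values for four pairs.
pairs = [(-10.0, -10.1), (-12.0, -11.0), (-9.0, -9.9), (-8.0, -6.5)]
us = unstereo_score(pairs, eps=0.2)        # one neutral pair out of four
pd = preference_disparity(pairs, eps=0.2)  # two male-preferred, one female-preferred
```

On this toy input, one of four pairs is neutral and male-preferred pairs outnumber female-preferred ones by one, so both scores come out at 25.0.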

Who Should Care

What To Try In 7 Days

Run UnStereoEval (or a similar PMI-based filter) on your model to measure Unstereo Score (US) and Preference Disparity (PD) on neutral sentences.

Filter your existing test-suite for PMI-based gender co-occurrence to detect hidden biases.

Inspect high-PD examples manually and log them to prioritize targeted mitigation (fine-tuning prompts or balanced supervised fine-tuning, SFT).
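The inspection step above could start from something like the sketch below, which builds a she/he pair from a template and ranks templates by how far apart the two variants score. The `{pronoun}` template format, the helper names, and especially `toy_logprob` are hypothetical; the toy scorer only stands in for a real model's sentence log-probability so that the example runs without any model download.

```python
from typing import Callable, Dict, List, Tuple


def make_pair(template: str) -> Tuple[str, str]:
    """Fill the '{pronoun}' slot to get the female and male variants of a sentence."""
    return template.format(pronoun="she"), template.format(pronoun="he")


def rank_by_gap(
    templates: List[str],
    logprob: Callable[[str], float],
    top_k: int = 10,
) -> List[Dict[str, object]]:
    """Return the top_k templates with the largest absolute log-probability gap."""
    rows = []
    for template in templates:
        s_female, s_male = make_pair(template)
        rows.append({"template": template, "gap": logprob(s_female) - logprob(s_male)})
    rows.sort(key=lambda row: abs(row["gap"]), reverse=True)
    return rows[:top_k]


def toy_logprob(sentence: str) -> float:
    """Stand-in scorer (NOT a real model): penalizes 'she', more so next to 'repaired'."""
    words = sentence.split()
    score = -0.5 * len(words)
    if "she" in words:
        score -= 1.0 if "repaired" in words else 0.1
    return score


templates = ["{pronoun} repaired the fence", "{pronoun} read the letter"]
ranked = rank_by_gap(templates, toy_logprob, top_k=2)
```

In practice `toy_logprob` would be replaced by a function that sums token log-probabilities from the model under audit, and the ranked rows would be written to a log for manual review.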

Agent Features

Tool Use

  • ChatGPT used as a controlled generator

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Focuses on binary English pronouns ('he'/'she'); non-binary and other languages not evaluated.
  • Non-stereotypical benchmarks were generated using one model (ChatGPT), which may introduce generation artifacts.
  • PMI is computed on PILE and assumes it approximates models' pretraining distributions; model-specific pretraining differences exist.

When Not To Use

  • When evaluating bias for non-binary genders or languages other than English.
  • When you need causal attribution of bias to specific pretraining data sources.
  • To certify absence of bias in downstream task-specific settings without additional testing.

Failure Modes

  • Generation artifacts: ChatGPT may introduce subtle wording biases that affect results.
  • PMI blind spots: names or rare tokens may escape the single-pronoun PMI filter and still encode gender.
  • Small filtered datasets: stricter MaxPMI thresholds drastically shrink some benchmarks, reducing statistical power.

Core Entities

Models

  • Pythia (70M–12B)
  • GPT-J-6B
  • OPT (125M–6.7B)
  • Llama-2 (7B/13B/70B)
  • MPT (7B/30B)
  • Mistral-7B
  • Mixtral-8x7B
  • OLMo (1B/7B)

Metrics

  • Unstereo Score (US)
  • Preference Disparity (PD)
  • Fairness gap (∆η)
  • Area under the Fairness Curve (AuFC)

Datasets

  • USE-5
  • USE-10
  • USE-20
  • Winobias (filtered)
  • Winogender (filtered)
  • PILE (pretraining corpus used for PMI)

Benchmarks

  • UnStereoEval (USE)
  • Filtered Winobias
  • Filtered Winogender