LLMs write biased recommendation letters: women as warm, men as leaders

Overview

Decision Snapshot: Needs Validation

The paper reports multiple quantitative metrics and clear statistical significance on two popular LLMs, but the evaluation is limited to binary gender, a few models, and synthetic agentic labels.

Citations: 17

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 45%

Authors

Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, Nanyun Peng

Why It Matters For Business

Automatically generated recommendation letters can embed gendered tone and hallucinated details that harm applicants and expose organizations to unfair hiring decisions and reputational or legal risk.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Founder

Summary TLDR

This paper measures gender bias in recommendation letters produced by large language models. The authors build a testbed of lexical and style metrics (formality, positivity, agency), a hallucination-detection pipeline (Context-Sentence NLI), and a balanced biography dataset (WikiBias-Aug). Evaluating ChatGPT and Alpaca, they find that male-targeted letters are consistently more agentic, more positive, and often more formal; the models also hallucinate biased content that amplifies those gaps. Results are statistically strong for agency and positivity (e.g., ChatGPT agency t=10.47, p≈1e-25). Use LLM outputs for real letters only after a careful audit and human editing.

Problem Statement

People use LLMs to draft recommendation letters, but model outputs may embed gendered language and hallucinated facts that disadvantage applicants. The paper asks: do LLMs produce different word choices and styles for male vs female candidates, and do hallucinations amplify those biases?

Main Contribution

Define and measure two bias types for reference letters: lexical content (nouns/adjectives) and language style (formality, positivity, agency).

Introduce a hallucination-bias pipeline (Context-Sentence NLI) to detect biased hallucinated sentences and measure bias propagation/amplification.
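The sentence-level check behind a Context-Sentence NLI pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `NliFn`, `flag_hallucinations`, and the toy `keyword_nli` stand-in are hypothetical names, and treating every sentence that the context does not entail as a candidate hallucination is an assumption. In practice the `nli` callable would wrap a real NLI model (the summary lists RoBERTa-Large for NLI).

```python
import re
from typing import Callable, List

# Hypothetical interface: (premise, hypothesis) -> "entailment" | "neutral" | "contradiction".
NliFn = Callable[[str, str], str]

def split_sentences(text: str) -> List[str]:
    """Naive splitter on sentence-final punctuation; a real audit would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def flag_hallucinations(context: str, letter: str, nli: NliFn) -> List[str]:
    """Return the letter sentences that the context does not entail (assumed criterion)."""
    return [s for s in split_sentences(letter) if nli(context, s) != "entailment"]

def content_words(text: str) -> set:
    """Lowercased words longer than three letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def keyword_nli(premise: str, hypothesis: str) -> str:
    """Toy stand-in for an NLI model: 'entailment' only if every content
    word of the hypothesis also appears in the premise."""
    if content_words(hypothesis) <= content_words(premise):
        return "entailment"
    return "neutral"

# Made-up example inputs, for illustration only.
context = "Kelly led the robotics team and published two papers."
letter = "Kelly led the robotics team. Kelly is a warm and caring person."
flagged = flag_hallucinations(context, letter, keyword_nli)
print(flagged)  # ['Kelly is a warm and caring person.']
```

The flagged sentences can then be scored with the same style metrics as full letters, to see whether unsupported content leans agentic or communal by gender.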

Key Findings

Model-generated letters for men score far higher on agency than for women.

Numbers: ChatGPT agency t=10.47, p=1.02e-25 (Table 4).

Practical Use: If you auto-generate letters, male candidates will often receive more 'leader-like' wording; human review is required to avoid unfair impression differences.

Evidence Ref: Table 4

Generated letters for men are more positive and (often) more formal than those for women.

Numbers: ChatGPT positivity t=5.93, p=1.58e-09; ChatGPT formality t=1.48, p=0.07 (Table 4).

Practical Use: Positivity and formality gaps can influence selection outcomes; require post-editing or controlled prompts to equalize tone.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ChatGPT agency (language style)	t=10.47, p=1.02e-25 (male > female)	—	—	CBG generated letters	Table 4 shows large and highly significant agency gap favoring male-targeted letters.	Table 4
ChatGPT positivity (language style)	t=5.93, p=1.58e-09 (male > female)	—	—	CBG generated letters	Table 4 reports a significant positivity gap.	Table 4
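
The t values above compare per-letter style scores between male-targeted and female-targeted letters. A minimal sketch of such a comparison follows, assuming Welch's unequal-variance form of the two-sample t statistic (the summary does not say which variant the paper uses). The scores are made-up illustration data, not values from Table 4, and `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's two-sample t statistic; positive when group_a scores higher on average."""
    a, b = list(group_a), list(group_b)
    # Standard error of the difference in means, using sample variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Made-up per-letter agency scores (e.g. share of sentences classified as agentic).
male_scores = [0.62, 0.70, 0.66, 0.74, 0.68]
female_scores = [0.55, 0.60, 0.58, 0.63, 0.59]

t = welch_t(male_scores, female_scores)
print(f"t = {t:.2f}")
```

A positive t indicates higher scores for the first group; a p-value would additionally require the t distribution (for example from SciPy), which is left out to keep the sketch dependency-free.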

What To Try In 7 Days

Audit a sample of model-generated letters for gendered words (agentic vs communal) and tone differences.

Add simple prompt constraints that ask for neutral wording and comparable leadership-oriented adjectives for every candidate, regardless of gender.

Run an NLI-based check to flag and remove hallucinated sentences before use in real letters.
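
A first-pass audit of agentic versus communal wording could look like the sketch below. The two word lists are hypothetical and deliberately tiny; a real audit would need a vetted lexicon or a trained classifier (the summary lists a BERT agentic classifier). `style_counts` and `agentic_share` are illustrative names.

```python
import re
from collections import Counter

# Hypothetical, deliberately small word lists, for illustration only.
AGENTIC = {"leader", "ambitious", "confident", "decisive", "independent"}
COMMUNAL = {"warm", "caring", "helpful", "kind", "supportive"}

def style_counts(letter: str) -> Counter:
    """Count agentic and communal lexicon hits in one letter."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", letter.lower()):
        if token in AGENTIC:
            counts["agentic"] += 1
        elif token in COMMUNAL:
            counts["communal"] += 1
    return counts

def agentic_share(letters):
    """Fraction of lexicon hits that are agentic, or None if there are no hits."""
    total = Counter()
    for letter in letters:
        total += style_counts(letter)
    hits = total["agentic"] + total["communal"]
    return total["agentic"] / hits if hits else None

# Made-up sample letters grouped by the candidate's stated gender.
male_letters = ["Joseph is a confident and decisive leader."]
female_letters = ["Kelly is a warm, caring, and helpful colleague and a confident leader."]

print(agentic_share(male_letters))    # 1.0
print(agentic_share(female_letters))  # 0.4
```

A large gap in agentic share between the two groups, on letters generated from otherwise comparable inputs, is a signal to tighten prompts or route letters to human review.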

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/uclanlp/biases-llm-reference-letters

Data URLs

https://github.com/uclanlp/biases-llm-reference-letters

Risks & Boundaries

Limitations

Analysis is limited to binary gender due to dataset labels; other gender identities are not covered.

Only a few LLMs were tested (mainly ChatGPT and Alpaca); results may differ on other models.

When Not To Use

When you expect to use a generated letter without human review.

When precise factual accuracy of details is required (models hallucinate).

Failure Modes

Model hallucinations that inject biased or false facts into letters.

Gendered wording that shifts perception (communal vs agentic) even when context is identical.

Core Entities

Models

ChatGPT, Alpaca, Vicuna, StableLM, RoBERTa-Large (NLI), BERT (agentic classifier)

Metrics

Odds Ratio (lexical saliency), t-tests on formality/positivity/agency, WEAT scores, Generation success rate
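
For the odds-ratio metric, one common formulation compares the odds of a word appearing in male-targeted letters with its odds in female-targeted letters. The sketch below assumes that formulation and adds a small smoothing constant to avoid division by zero; both choices, the function name `odds_ratio`, and the counts are assumptions for illustration, not details confirmed by the summary.

```python
from collections import Counter

def odds_ratio(word: str, male_counts: Counter, female_counts: Counter,
               smoothing: float = 0.5) -> float:
    """Odds of `word` in male-targeted text divided by its odds in female-targeted text.

    Values above 1 mean the word is more salient in male-targeted letters,
    values below 1 mean it is more salient in female-targeted letters.
    """
    m_word = male_counts[word] + smoothing
    f_word = female_counts[word] + smoothing
    # Counts of all other words in each group, smoothed the same way.
    m_rest = sum(male_counts.values()) - male_counts[word] + smoothing
    f_rest = sum(female_counts.values()) - female_counts[word] + smoothing
    return (m_word / m_rest) / (f_word / f_rest)

# Made-up adjective/noun counts, for illustration only.
male_counts = Counter({"leader": 8, "warm": 2, "expert": 10})
female_counts = Counter({"leader": 3, "warm": 9, "expert": 8})

for w in ("leader", "warm"):
    print(w, round(odds_ratio(w, male_counts, female_counts), 2))
```

Ranking words by this ratio surfaces the most male-salient and most female-salient terms for manual inspection.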

Datasets

WikiBias (original), WikiBias-Aug (this paper), Bias in Bios (used to synthesize data), GYAFC (formality classifier training corpus), SST-2 (sentiment model training corpus)

Benchmarks

WEAT, Context-Sentence NLI (proposed)
