LLMs write biased recommendation letters: women as warm, men as leaders

October 13, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper reports multiple quantitative metrics with clear statistical significance on two popular LLMs, but the evaluation is limited to binary gender, a few models, and synthetic agentic labels.

Citations: 17

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 45%

Authors

Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, Nanyun Peng

Why It Matters For Business

Automatically generated recommendation letters can embed gendered tone and hallucinated details that harm applicants and expose organizations to unfair hiring decisions and reputational or legal risk.

Who Should Care

Summary TLDR

This paper measures gender bias in recommendation letters produced by large language models. The authors build a testbed of lexical and style metrics (formality, positivity, agency), a hallucination-detection pipeline (Context-Sentence NLI), and a balanced biography dataset (WikiBias-Aug). Evaluating ChatGPT and Alpaca, they find that male-targeted letters are consistently more agentic, more positive, and often more formal; the models also hallucinate biased content that amplifies those gaps. The results are statistically significant for agency and positivity (e.g., ChatGPT agency t=10.47, p≈1e-25). Use LLM outputs for real letters only after a careful audit and human editing.

Problem Statement

People use LLMs to draft recommendation letters, but model outputs may embed gendered language and hallucinated facts that disadvantage applicants. The paper asks: do LLMs produce different word choices and styles for male vs female candidates, and do hallucinations amplify those biases?

Main Contribution

Define and measure two bias types for reference letters: lexical content (nouns/adjectives) and language style (formality, positivity, agency).

Introduce a hallucination-bias pipeline (Context-Sentence NLI) to detect biased hallucinated sentences and measure bias propagation/amplification.
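A minimal sketch of how a context-sentence check of this kind might be wired up; it is not the authors' implementation. It assumes a sentence counts as hallucinated when an NLI model does not label it as entailed by the context. The `nli` callable, its label strings, the naive sentence splitter, and the toy word-overlap stand-in are all hypothetical. In practice `nli` would wrap a real NLI model, such as the RoBERTa-Large model listed under Core Entities.

```python
import re
from typing import Callable, List

# Hypothetical interface: nli(premise, hypothesis) returns
# "entailment", "neutral", or "contradiction".
NLIFn = Callable[[str, str], str]

def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., !, ? followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def flag_hallucinations(context: str, letter: str, nli: NLIFn) -> List[str]:
    """Return the letter sentences that `nli` does not mark as entailed by the context."""
    return [s for s in split_sentences(letter) if nli(context, s) != "entailment"]

def toy_nli(premise: str, hypothesis: str) -> str:
    """Toy stand-in, NOT a real NLI model.

    Returns "entailment" only if every hypothesis word appears in the premise.
    """
    premise_words = set(re.findall(r"[a-z']+", premise.lower()))
    hyp_words = set(re.findall(r"[a-z']+", hypothesis.lower()))
    return "entailment" if hyp_words <= premise_words else "neutral"

context = "Jane Doe is a physicist. She published two papers on optics."
letter = "Jane Doe is a physicist. She won a national teaching award."
flagged = flag_hallucinations(context, letter, toy_nli)
print(flagged)
```

Keeping the NLI model behind a plain callable means the flagging logic can be tested with a stub and later pointed at whichever model is actually deployed.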

Key Findings

Model-generated letters for men score far higher on agency than for women.

Numbers: ChatGPT agency t=10.47, p=1.02e-25 (Table 4).

Practical Use: If you auto-generate letters, male candidates will often receive more 'leader-like' wording; human review is required to avoid unfair impression differences.

Evidence Ref: Table 4

Generated letters for men are more positive and (often) more formal than those for women.

Numbers: ChatGPT positivity t=5.93, p=1.58e-09; ChatGPT formality t=1.48, p=0.07, which falls short of the conventional 0.05 threshold (Table 4).

Practical Use: Positivity and formality gaps can influence selection outcomes; require post-editing or controlled prompts to equalize tone.

Evidence Ref: Table 4
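The t values quoted above come from comparing style scores between male-targeted and female-targeted letters. As a rough illustration of that kind of comparison, and not the paper's exact procedure, the sketch below computes a two-sample t statistic in plain Python. The choice of Welch's variant, the function name `welch_t`, and the score lists are assumptions; the numbers are invented for illustration and are not results from the paper.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Invented per-letter agency scores, for illustration only.
male_scores = [0.62, 0.58, 0.71, 0.66, 0.60]
female_scores = [0.51, 0.55, 0.49, 0.57, 0.53]

t_stat, dof = welch_t(male_scores, female_scores)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A positive t here means the first group scores higher on average. To turn the statistic into a p-value, a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` is the usual route.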

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ChatGPT agency (language style) | t=10.47, p=1.02e-25 (male > female) | - | - | CBG generated letters | Table 4 shows a large and highly significant agency gap favoring male-targeted letters. | Table 4
ChatGPT positivity (language style) | t=5.93, p=1.58e-09 (male > female) | - | - | CBG generated letters | Table 4 reports a significant positivity gap. | Table 4

What To Try In 7 Days

Audit a sample of model-generated letters for gendered words (agentic vs communal) and tone differences.

Add simple prompt constraints asking for neutral wording and explicit leadership-equivalent adjectives for all genders.

Run an NLI-based check to flag and remove hallucinated sentences before use in real letters.
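A first-pass audit of the kind described in the first item above could look like the sketch below. The two word lists are tiny hypothetical examples, not the lexicons used in the paper, and the function names are illustrative. A real audit would use a vetted lexicon or a trained classifier.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons for illustration only; replace with a vetted list.
AGENTIC = {"leader", "ambitious", "confident", "independent", "decisive"}
COMMUNAL = {"warm", "caring", "helpful", "kind", "supportive"}

def tone_counts(letter: str) -> Counter:
    """Count agentic and communal lexicon hits in one letter (case-insensitive, whole words)."""
    counts = Counter(agentic=0, communal=0)
    for word in re.findall(r"[a-z]+", letter.lower()):
        if word in AGENTIC:
            counts["agentic"] += 1
        elif word in COMMUNAL:
            counts["communal"] += 1
    return counts

def audit(letters_by_group: dict) -> dict:
    """Average agentic and communal hits per letter for each group label."""
    report = {}
    for group, letters in letters_by_group.items():
        totals = Counter()
        for letter in letters:
            totals.update(tone_counts(letter))
        n = max(len(letters), 1)  # guard against an empty group
        report[group] = {k: totals[k] / n for k in ("agentic", "communal")}
    return report

sample = {
    "male": ["He is a confident and decisive leader."],
    "female": ["She is a warm, caring and helpful colleague."],
}
report = audit(sample)
print(report)
```

Large gaps between groups in either column are a signal to review the prompts or post-edit the letters; they are not proof of bias on their own, since a lexicon this small misses most wording.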

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Analysis is limited to binary gender due to dataset labels; other gender identities are not covered.

Only a few LLMs were tested (mainly ChatGPT and Alpaca); results may differ on other models.

When Not To Use

When you expect to use a generated letter without human review.

When precise factual accuracy of details is required (models hallucinate).

Failure Modes

Model hallucinations that inject biased or false facts into letters.

Gendered wording that shifts perception (communal vs agentic) even when context is identical.

Core Entities

Models

ChatGPT, Alpaca, Vicuna, StableLM, RoBERTa-Large (NLI), BERT (agentic classifier)

Metrics

Odds Ratio (lexical saliency), t-tests on formality/positivity/agency, WEAT scores, Generation success rate
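For the odds-ratio metric listed above, one common formulation compares the odds of a word appearing in male-targeted letters with its odds in female-targeted letters. The sketch below follows that formulation. Whether it matches the paper's exact definition is an assumption, and the add-one smoothing, the function name `odds_ratio`, and the toy counts are illustrative choices.

```python
from collections import Counter

def odds_ratio(word: str, male: Counter, female: Counter, smooth: float = 1.0) -> float:
    """Odds of `word` in male-targeted text divided by its odds in female-targeted text.

    Values above 1 mean the word is more salient in male letters, below 1 in female letters.
    `smooth` is add-k smoothing (an assumption) that avoids division by zero.
    """
    m_word = male[word] + smooth
    f_word = female[word] + smooth
    m_rest = sum(male.values()) - male[word] + smooth
    f_rest = sum(female.values()) - female[word] + smooth
    return (m_word / m_rest) / (f_word / f_rest)

# Toy word counts, invented for illustration only.
male_counts = Counter({"leader": 9, "warm": 1, "reliable": 10})
female_counts = Counter({"leader": 2, "warm": 8, "reliable": 10})

for w in ("leader", "warm", "reliable"):
    print(w, round(odds_ratio(w, male_counts, female_counts), 2))
```

Ranking words by this ratio, from highest to lowest, gives the most male-salient and most female-salient terms at the two ends of the list.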

Datasets

WikiBias (original), WikiBias-Aug (this paper), Bias in Bios (used to synthesize data), GYAFC (formality classifier training corpus), SST-2 (sentiment model training corpus)

Benchmarks

WEAT, Context-Sentence NLI (proposed)