Overview
Production Readiness
0.55
Novelty Score
0.65
Cost Impact Score
0.45
Citation Count
3
Why It Matters For Business
LLM-generated bios, reviews, and letters can systematically understate leadership for minority groups; this risks reputational harm, unfair downstream decisions, and regulatory scrutiny. Measuring agency bias and applying targeted rewrites reduces that risk.
Summary TLDR
This paper introduces LABE, a benchmark for measuring 'language agency' bias: whether text frames a person as agentic (leading, taking initiative) or communal (helping, supportive). The authors build a labeled dataset (LAC, 3,724 sentences), train a BERT agency classifier (91.7% test accuracy), and evaluate three LLMs (ChatGPT gpt-3.5-turbo-1106, Llama3-8B, Mistral-7B) on three tasks (biography, professor review, reference letter). Key findings: LLM outputs are more gender-biased than human text; intersectional groups (e.g., Black women) see the largest agency gaps; simple fairness prompts are unstable and can worsen bias; and a targeted Mitigation via Selective Rewrite (MSR) guided by the classifier is more reliable.
Problem Statement
Automatic text generators can encode social bias not only in words but in style: who is described as a leader (agentic) vs a helper (communal). Existing measures are brittle (string matching, sentiment) and there is no comprehensive benchmark that (1) tests gender, race, and intersectional agency bias across common generation tasks and (2) provides reliable automated scoring and mitigation.
Main Contribution
LABE: a template-based benchmark to measure gender, racial, and intersectional agency bias in LLM-generated biographies, professor reviews, and reference letters (5,400 prompts).
LAC: a 3,724-sentence labeled dataset of agentic vs communal sentences, used to train a high-accuracy BERT agency classifier (91.69% test accuracy).
Empirical evaluation of ChatGPT (gpt-3.5-turbo-1106), Llama3-8B, and Mistral-7B showing amplified gender and intersectional agency biases compared to human texts.
MSR (Mitigation via Selective Rewrite): a classifier-guided rewrite pipeline that identifies communal sentences and asks the model to rephrase them to be more agentic; more effective and stable than adding a fairness prompt.
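To illustrate the template-based design behind LABE, the sketch below crosses task templates with demographic descriptors to enumerate prompts. The template wording, the descriptor lists, the occupation, and the `build_prompts` helper are all hypothetical stand-ins; the benchmark's actual templates are not reproduced here.

```python
from itertools import product

# Hypothetical templates, one per task named in the summary.
TEMPLATES = {
    "biography": "Write a biography of a {race} {gender} {occupation}.",
    "professor_review": "Write a review of a {race} {gender} professor of {occupation}.",
    "reference_letter": "Write a reference letter for a {race} {gender} {occupation}.",
}


def build_prompts(templates, genders, races, occupations):
    """Return one prompt record per (task, gender, race, occupation) combination."""
    prompts = []
    for (task, template), gender, race, occupation in product(
        templates.items(), genders, races, occupations
    ):
        prompts.append(
            {
                "task": task,
                "gender": gender,
                "race": race,
                "occupation": occupation,
                "prompt": template.format(
                    race=race, gender=gender, occupation=occupation
                ),
            }
        )
    return prompts


# Illustrative descriptor values only; not the benchmark's actual lists.
example_prompts = build_prompts(
    TEMPLATES, ["female", "male"], ["Black", "White"], ["engineer"]
)
```

Keeping the demographic fields alongside each prompt makes it straightforward to group generated outputs by gender, race, or their intersection when scoring.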
Key Findings
LLM outputs show larger gender agency bias than comparable human texts.
Intersectional groups (race and gender combined) show the largest agency gaps; Black women are often described with the least agentic language.
Simple prompt-based fairness instructions are unstable and can worsen bias.
Selective rewrite (MSR) guided by the agency classifier reduces bias more reliably than prompt-only fixes.
Results
Average Gender Bias (Inter-group variance) — ChatGPT
Average Gender Bias — Mistral
Average Intersectional Bias — Llama3
Accuracy
Mitigation wins (MSR vs prompt)
Who Should Care
What To Try In 7 Days
Run LABE-style templated prompts on your generation pipeline to spot agentic vs communal gaps.
Train or reuse a small BERT agency classifier (LAC) to label outputs quickly.
Apply a selective-rewrite step (MSR) to low-agency sentences rather than only adding fairness instructions.
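The selective-rewrite step can be sketched as follows. This is a minimal illustration and not the authors' implementation: `toy_classify` is a keyword stand-in for the fine-tuned BERT agency classifier, `toy_rewrite` is a placeholder for a call to an LLM asked to rephrase a sentence more agentically, and the regex sentence splitter is an assumption.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def selective_rewrite(
    text: str,
    classify: Callable[[str], str],
    rewrite: Callable[[str], str],
) -> str:
    """Rewrite only the sentences labeled 'communal'; leave the rest untouched."""
    out = []
    for sentence in split_sentences(text):
        if classify(sentence) == "communal":
            out.append(rewrite(sentence))
        else:
            out.append(sentence)
    return " ".join(out)


# Hypothetical stand-ins so the sketch runs without a model.
COMMUNAL_CUES = {"helpful", "supportive", "caring", "kind"}


def toy_classify(sentence: str) -> str:
    """Keyword heuristic; a real pipeline would use the trained classifier."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return "communal" if words & COMMUNAL_CUES else "agentic"


def toy_rewrite(sentence: str) -> str:
    """Placeholder that only marks the sentence an LLM would be asked to rephrase."""
    return f"[REWRITE: {sentence}]"


result = selective_rewrite(
    "She led the team. She was always helpful.", toy_classify, toy_rewrite
)
```

Passing the classifier and rewriter as callables keeps the control flow testable on its own; swapping in a real classifier and an LLM call should not require changing `selective_rewrite`.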
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Binary gender and only four racial groups were analyzed; other identities are not covered.
- Human-written racial and intersectional datasets were scarce; reference-letter human data is a proxy.
- Mitigation experiments are small-scale (96 samples per task) and may not generalize.
- The LAC dataset was partially synthesized by an LLM, which risks propagating model biases despite human verification.
When Not To Use
- Do not rely on LABE as a final fairness check for non-binary genders or other racial groups not included here.
- Avoid using MSR alone for fully automated high-stakes decisions without human review.
- Do not assume prompt-based fairness instructions will suffice; they can worsen bias.
Failure Modes
- Agency classifier mistakes (false agentic/communal labels) can trigger inappropriate rewrites.
- MSR can boost majority groups more than minority groups, increasing variance across slices.
- Prompt-based mitigation can amplify bias or produce unpredictable changes depending on the model.
Core Entities
Models
- ChatGPT (gpt-3.5-turbo-1106)
- Llama3-8B-Instruct
- Mistral-7B-Instruct-v0.2
- BERT (fine-tuned agency classifier)
Metrics
- Agentic–Communal ratio gap (percent of sentences)
- Inter-group variance of ratio gaps (bias metric)
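One plausible reading of these two metrics is sketched below: the ratio gap for a group is the percentage of agentic sentences minus the percentage of communal sentences, and the bias score is the variance of those gaps across groups. The function names, the label strings, and the choice of population variance (rather than sample variance) are assumptions; the paper's exact formulation may differ.

```python
from statistics import pvariance
from typing import Dict, List, Tuple


def ratio_gap(labels: List[str]) -> float:
    """Percent of agentic sentences minus percent of communal sentences."""
    if not labels:
        raise ValueError("labels must not be empty")
    agentic = sum(1 for label in labels if label == "agentic")
    communal = sum(1 for label in labels if label == "communal")
    return 100.0 * (agentic - communal) / len(labels)


def inter_group_variance(
    labels_by_group: Dict[str, List[str]]
) -> Tuple[float, Dict[str, float]]:
    """Return (variance of ratio gaps across groups, per-group gaps)."""
    gaps = {group: ratio_gap(labels) for group, labels in labels_by_group.items()}
    # Population variance is an assumption made for this sketch.
    return pvariance(list(gaps.values())), gaps


# Made-up labels for two hypothetical groups, for illustration only.
variance, gaps = inter_group_variance(
    {
        "group_a": ["agentic", "agentic", "agentic", "communal"],
        "group_b": ["agentic", "communal", "communal", "communal"],
    }
)
```

In this toy input, `group_a` has a gap of +50 points and `group_b` a gap of -50 points; a variance of zero would mean every group receives the same balance of agentic and communal language.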
Datasets
- LABE (this paper's benchmark prompts)
- LAC (Language Agency Classification, 3,724 sentences)
- Bias in Bios (Wikipedia biographies)
- RateMyProfessor sample
- Wan et al. (2023a) reference letter dataset (used as human proxy)
Benchmarks
- LABE

