Overview
The benchmark and dataset are concrete and reproducible; results cover multiple models and tasks. However, evaluations are limited to binary gender, four races, three LLMs, and small mitigation samples, so further validation is needed before high-stakes deployment.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 55%
Novelty: 65%
Why It Matters For Business
LLM-generated bios, reviews, and letters can systematically understate leadership for minority groups; this risks reputational harm, unfair downstream decisions, and regulatory scrutiny. Measuring agency bias and applying targeted rewrites reduces that risk.
Summary TLDR
This paper introduces LABE, a benchmark to measure 'language agency' bias: whether text frames people as agentic (leader/initiative) or communal (helper/supportive). The authors build a labeled dataset (LAC, 3,724 sentences), train a BERT agency classifier (91.7% test accuracy), and evaluate three LLMs (ChatGPT gpt-3.5-turbo-1106, Llama3-8B, Mistral-7B) on three tasks (biography, professor review, reference letter). Key findings: LLM outputs are more gender-biased than human text; intersectional groups (e.g., Black women) are affected the most; simple fairness prompts are unstable and can worsen bias; and a targeted Mitigation via Selective Rewrite (MSR) that uses the classifier is more reliable. Code and data are available.
Problem Statement
Automatic text generators can encode social bias not only in words but in style: who is described as a leader (agentic) vs a helper (communal). Existing measures are brittle (string matching, sentiment) and there is no comprehensive benchmark that (1) tests gender, race, and intersectional agency bias across common generation tasks and (2) provides reliable automated scoring and mitigation.
Main Contribution
LABE: a template-based benchmark to measure gender, racial, and intersectional agency bias in LLM-generated biographies, professor reviews, and reference letters (5,400 prompts).
LAC: a 3,724-sentence labeled dataset of agentic vs communal sentences, used to train a high-accuracy BERT agency classifier (91.69% test accuracy).
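To make the template-based design concrete, here is a minimal sketch of how prompts can be expanded from task templates and demographic descriptors. The template wording, the race labels, and the roles are illustrative assumptions, not the benchmark's actual text, and this small grid does not reproduce the 5,400-prompt set.

```python
from itertools import product

# All templates, race labels, and roles below are illustrative assumptions,
# not the benchmark's actual wording.
TEMPLATES = {
    "biography": "Generate a personal biography. Subject: {race} {gender} {role}.",
    "professor_review": "Generate a student review of a professor. Subject: {race} {gender} {role}.",
    "reference_letter": "Generate a reference letter. Subject: {race} {gender} {role}.",
}
GENDERS = ["male", "female"]                      # binary gender, as in the summary
RACES = ["White", "Black", "Hispanic", "Asian"]   # assumed labels for the four groups
ROLES = ["chef", "engineer"]                      # hypothetical roles


def build_prompts():
    """Expand every (task, gender, race, role) combination into one prompt record."""
    prompts = []
    for (task, template), gender, race, role in product(
        TEMPLATES.items(), GENDERS, RACES, ROLES
    ):
        prompts.append(
            {
                "task": task,
                "gender": gender,
                "race": race,
                "role": role,
                "prompt": template.format(gender=gender, race=race, role=role),
            }
        )
    return prompts


prompts = build_prompts()
```

Keeping the demographic fields alongside each prompt makes it straightforward to slice generated outputs by gender, race, or their intersection later.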
Key Findings
LLM outputs show larger gender agency bias than comparable human texts.
Intersectional groups (race + gender) show the largest agency gaps; Black women often receive the least agentic language.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average Gender Bias (Inter-group variance) — ChatGPT | 34.62 | — | — | Average across biography, professor review, reference letter (Table 1) | Table 1 reports 34.62 average gender bias for ChatGPT | Table 1 |
| Average Gender Bias — Mistral | 51.99 | — | — | Average across tasks (Table 1) | Table 1 reports 51.99 average gender bias for Mistral | Table 1 |
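
The table labels its metric "inter-group variance" but this summary does not give the formula. As a rough sketch only, the code below assumes the score is the population variance of each group's agentic-sentence percentage; the paper's definition may differ (for example in normalization or in how tasks are averaged), so the numbers it produces are not comparable to Table 1.

```python
from statistics import pvariance


def agentic_share(labels):
    """Percentage (0-100) of sentences labeled 'agentic' for one group."""
    if not labels:
        raise ValueError("each group needs at least one labeled sentence")
    return 100.0 * sum(1 for label in labels if label == "agentic") / len(labels)


def inter_group_variance(labels_by_group):
    """Assumed metric: population variance of per-group agentic percentages."""
    shares = [agentic_share(labels) for labels in labels_by_group.values()]
    return pvariance(shares)


# Toy classifier outputs, invented for illustration.
example = {
    "male": ["agentic", "agentic", "agentic", "communal"],
    "female": ["agentic", "agentic", "communal", "communal"],
}
score = inter_group_variance(example)
```

A score of zero under this assumption means every group receives the same share of agentic sentences; larger values indicate a wider spread between groups.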
What To Try In 7 Days
Run LABE-style templated prompts on your generation pipeline to spot agentic vs communal gaps.
Train or reuse a small BERT agency classifier (LAC) to label outputs quickly.
Apply a selective-rewrite step (MSR) to low-agency sentences rather than only adding fairness instructions.
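The selective-rewrite step above might be wired up roughly as follows. This is a sketch under stated assumptions, not the authors' MSR implementation: `toy_classify` and `toy_rewrite` are hypothetical stand-ins for a trained agency classifier and an LLM rewrite call, and the regex sentence splitter is deliberately naive.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def selective_rewrite(
    text: str,
    classify: Callable[[str], str],
    rewrite: Callable[[str], str],
) -> str:
    """Rewrite only sentences labeled 'communal'; leave the rest untouched."""
    out = []
    for sentence in split_sentences(text):
        out.append(rewrite(sentence) if classify(sentence) == "communal" else sentence)
    return " ".join(out)


# Hypothetical stand-ins, for illustration only.
COMMUNAL_CUES = {"helpful", "supportive", "caring"}


def toy_classify(sentence: str) -> str:
    """Keyword stand-in for an agency classifier."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return "communal" if words & COMMUNAL_CUES else "agentic"


def toy_rewrite(sentence: str) -> str:
    """Marker stand-in for an LLM call that would rephrase the sentence."""
    return "[REWRITTEN] " + sentence


result = selective_rewrite(
    "She led the project. She was always helpful to others.",
    toy_classify,
    toy_rewrite,
)
```

Passing the classifier and rewriter as callables keeps the control flow testable without a model; in practice a human reviewer should still inspect rewritten sentences, since classifier errors propagate directly into rewrites.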
Risks & Boundaries
Limitations
Binary gender and only four racial groups were analyzed; other identities are not covered.
Human-written racial and intersectional datasets were scarce; reference-letter human data is a proxy.
When Not To Use
Do not rely on LABE as a final fairness check for non-binary genders or other racial groups not included here.
Avoid using MSR alone for fully automated high-stakes decisions without human review.
Failure Modes
Agency classifier mistakes (false agentic/communal labels) can trigger inappropriate rewrites.
MSR can unevenly boost majority groups more than minority groups, increasing variance across slices.
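One way to watch for the uneven-boost failure mode is to compare the spread of per-slice agentic percentages before and after mitigation. The sketch below is a simple guard, not something described in the paper; the slice names, percentages, and the `tolerance` parameter are hypothetical.

```python
def max_gap(share_by_slice):
    """Largest difference in agentic percentage between any two slices."""
    values = list(share_by_slice.values())
    if not values:
        raise ValueError("need at least one slice")
    return max(values) - min(values)


def mitigation_widened_gap(before, after, tolerance=0.0):
    """True if the max gap across slices grew by more than `tolerance` points."""
    return max_gap(after) > max_gap(before) + tolerance


# Invented agentic percentages per slice, for illustration only.
before = {"white_men": 60.0, "black_women": 50.0}
after_uneven = {"white_men": 80.0, "black_women": 55.0}
after_even = {"white_men": 70.0, "black_women": 65.0}

flag_uneven = mitigation_widened_gap(before, after_uneven)
flag_even = mitigation_widened_gap(before, after_even)
```

A flagged run is a prompt for human review of the rewritten slices rather than an automatic rollback, since a wider gap can also reflect classifier noise on small samples.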

