LABE: a benchmark, dataset, and rewrite method to find and reduce agency (leader vs helper) bias in LLM outputs

April 16, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark and dataset are concrete and reproducible; results cover multiple models and tasks. However, evaluations are limited to binary gender, four races, three LLMs, and small mitigation samples, so further validation is needed before high-stakes deployment.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 55%

Novelty: 65%

Authors

Yixin Wan, Kai-Wei Chang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLM-generated bios, reviews, and letters can systematically understate leadership for minority groups; this risks reputational harm, unfair downstream decisions, and regulatory scrutiny. Measuring agency bias and applying targeted rewrites reduces that risk.

Summary TLDR

This paper introduces LABE, a benchmark for measuring 'language agency' bias: whether text frames people as agentic (leader, initiative-taking) or communal (helper, supportive). The authors build a labeled dataset (LAC, 3,724 sentences), train a BERT agency classifier (91.7% test accuracy), and evaluate three LLMs (ChatGPT gpt-3.5-turbo-1106, Llama3-8B, Mistral-7B) on three tasks (biography, professor review, reference letter). Key findings: LLM outputs are more gender-biased than human text; intersectional groups (e.g., Black women) suffer the most; simple fairness prompts are unstable and can worsen bias; and a targeted Mitigation via Selective Rewrite (MSR) that uses the classifier is more reliable. Code and data are available.

Problem Statement

Automatic text generators can encode social bias not only in words but in style: who is described as a leader (agentic) vs a helper (communal). Existing measures are brittle (string matching, sentiment) and there is no comprehensive benchmark that (1) tests gender, race, and intersectional agency bias across common generation tasks and (2) provides reliable automated scoring and mitigation.

Main Contribution

LABE: a template-based benchmark to measure gender, racial, and intersectional agency bias in LLM-generated biographies, professor reviews, and reference letters (5,400 prompts).

LAC: a 3,724-sentence labeled dataset for agentic vs communal sentences; used to train a high-accuracy BERT agency classifier (91.69% test acc).
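A template-based benchmark of this kind can be approximated by expanding a few prompt templates over demographic attributes. The following is a minimal sketch only: the template wording, the race labels, and the `FIELDS` list are assumptions made for illustration, not the actual LABE templates or descriptor lists.

```python
from itertools import product

# All templates and attribute values below are illustrative placeholders,
# not the actual LABE templates or descriptor lists.
TEMPLATES = {
    "biography": "Write a short biography for the following person: {race} {gender} researcher in {field}.",
    "professor_review": "Write a student review for the following person: {race} {gender} professor who teaches {field}.",
    "reference_letter": "Write a reference letter for the following person: {race} {gender} applicant in {field}.",
}
GENDERS = ["female", "male"]
RACES = ["Asian", "Black", "Hispanic", "White"]  # assumed labels for the four groups
FIELDS = ["chemistry", "history"]  # hypothetical


def build_prompts():
    """Expand every template over the cross product of demographic attributes."""
    prompts = []
    for (task, template), gender, race, field in product(
        TEMPLATES.items(), GENDERS, RACES, FIELDS
    ):
        prompts.append(
            {
                "task": task,
                "gender": gender,
                "race": race,
                "field": field,
                "prompt": template.format(race=race, gender=gender, field=field),
            }
        )
    return prompts


prompts = build_prompts()
print(len(prompts))  # 3 tasks x 2 genders x 4 races x 2 fields = 48
```

Keeping the demographic attributes alongside each prompt makes it straightforward to group generated outputs by gender, race, or their intersection later. The real benchmark is much larger (5,400 prompts) than this toy expansion.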

Key Findings

LLM outputs show larger gender agency bias than comparable human texts.

Numbers: Gender bias avg (ChatGPT) 34.62; human biographies gender diff 10.12

Practical Use: Don't assume model outputs match human fairness: measure your model's agentic vs communal balance and compare to human baselines before deployment.

Evidence Ref: Table 1; Table 16; Figure 2

Intersectional groups (race+gender) suffer the largest agency gaps; Black women often receive the lowest agentic language.

Numbers: Intersectional bias avg (Llama3) 74.11; ChatGPT/Llama3 show lowest agency for Black female professors in reviews

Practical Use: Evaluate intersectional slices (e.g., race×gender) because overall averages hide the worst-off groups; prioritize mitigation for those slices.

Evidence Ref: Table 1; Figure 3; intersectional tables (23–28)
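The advice to inspect intersectional slices can be illustrated with a small aggregation helper. This is a sketch under assumptions: the record fields (`n_agentic`, `n_communal`) and the toy counts are hypothetical, and the gap is taken here as percent agentic minus percent communal sentences, which may differ in detail from the paper's definition.

```python
from collections import defaultdict


def gaps_by_slice(records, keys):
    """Aggregate sentence counts per slice and return the agentic-communal gap (%)."""
    totals = defaultdict(lambda: [0, 0])
    for r in records:
        slice_id = tuple(r[k] for k in keys)
        totals[slice_id][0] += r["n_agentic"]
        totals[slice_id][1] += r["n_communal"]
    gaps = {}
    for slice_id, (agentic, communal) in totals.items():
        n = agentic + communal
        gaps[slice_id] = 100.0 * (agentic - communal) / n if n else 0.0
    return gaps


# Toy counts, invented for illustration only.
records = [
    {"race": "Black", "gender": "female", "n_agentic": 4, "n_communal": 6},
    {"race": "Black", "gender": "male", "n_agentic": 7, "n_communal": 3},
    {"race": "White", "gender": "female", "n_agentic": 7, "n_communal": 3},
    {"race": "White", "gender": "male", "n_agentic": 7, "n_communal": 3},
]

by_gender = gaps_by_slice(records, ("gender",))
by_intersection = gaps_by_slice(records, ("race", "gender"))
worst_slice = min(by_intersection, key=by_intersection.get)
print(by_gender)
print(worst_slice, by_intersection[worst_slice])
```

In this toy data the gender-only view shows a positive gap for women overall, while the race×gender view exposes one slice with a negative gap. That is the kind of masking the finding warns about.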

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Average Gender Bias (inter-group variance), ChatGPT | 34.62 | | | Average across biography, professor review, reference letter (Table 1) | Table 1 reports 34.62 average gender bias for ChatGPT | Table 1 |
| Average Gender Bias, Mistral | 51.99 | | | Average across tasks (Table 1) | Table 1 reports 51.99 average gender bias for Mistral | Table 1 |

What To Try In 7 Days

Run LABE-style templated prompts on your generation pipeline to spot agentic vs communal gaps.

Train or reuse a small BERT agency classifier (LAC) to label outputs quickly.

Apply a selective-rewrite step (MSR) to low-agency sentences rather than only adding fairness instructions.
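The selective-rewrite step in the last item can be prototyped as a thin wrapper around a classifier and a rewriter. This is a rough sketch rather than the authors' MSR implementation: `toy_classify` and `toy_rewrite` are hypothetical stand-ins for a fine-tuned agency classifier and an LLM-based rewriter, and the regex sentence splitter is a simplification.

```python
import re


def selective_rewrite(text, classify, rewrite):
    """Rewrite only the sentences that the classifier labels as communal."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rewritten = [rewrite(s) if classify(s) == "communal" else s for s in sentences]
    return " ".join(rewritten)


def toy_classify(sentence):
    # Hypothetical keyword rule; a real setup would call a trained classifier.
    return "communal" if "helped" in sentence else "agentic"


def toy_rewrite(sentence):
    # Hypothetical rewrite; a real setup would prompt an LLM for a more agentic
    # phrasing and should be reviewed for factual drift.
    return sentence.replace("helped organize", "led")


original = "She helped organize the seminar. She founded the lab."
mitigated = selective_rewrite(original, toy_classify, toy_rewrite)
print(mitigated)
```

Passing the classifier and rewriter in as callables keeps the selection logic testable on its own. It also makes it easy to log which sentences were changed, which helps when checking for the classifier errors listed under Failure Modes.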

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Binary gender and only four racial groups were analyzed; other identities are not covered.

Human-written racial and intersectional datasets were scarce; reference-letter human data is a proxy.

When Not To Use

Do not rely on LABE as a final fairness check for non-binary genders or other racial groups not included here.

Avoid using MSR alone for fully automated high-stakes decisions without human review.

Failure Modes

Agency classifier mistakes (false agentic/communal labels) can trigger inappropriate rewrites.

MSR can unevenly boost majority groups more than minority groups, increasing variance across slices.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-1106)

Llama3-8B-Instruct

Mistral-7B-Instruct-v0.2

BERT (fine-tuned agency classifier)

Metrics

Agentic–Communal ratio gap (percent of sentences)

Inter-group variance of ratio gaps (bias metric)
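The two metrics above can be sketched in a few lines. This assumes the ratio gap is the percentage of agentic sentences minus the percentage of communal sentences, and that the bias score is the population variance of those gaps across groups. The paper's exact normalization may differ, and the labels below are toy values.

```python
from statistics import pvariance


def ratio_gap(labels):
    """Percent of agentic sentences minus percent of communal sentences."""
    if not labels:
        return 0.0
    agentic = sum(1 for label in labels if label == "agentic")
    communal = sum(1 for label in labels if label == "communal")
    return 100.0 * (agentic - communal) / len(labels)


def inter_group_variance(labels_by_group):
    """Population variance of the per-group ratio gaps (assumed bias score)."""
    gaps = [ratio_gap(labels) for labels in labels_by_group.values()]
    return pvariance(gaps)


# Toy sentence-level labels, invented for illustration only.
labels_by_group = {
    "female": ["agentic", "communal"],
    "male": ["agentic", "agentic", "communal", "agentic"],
}
female_gap = ratio_gap(labels_by_group["female"])
male_gap = ratio_gap(labels_by_group["male"])
bias = inter_group_variance(labels_by_group)
print(female_gap, male_gap, bias)
```

A score of zero means every group receives the same agentic-communal balance, and larger values mean the groups diverge more. Recomputing the score before and after a mitigation step is one way to check whether the rewrite narrowed or widened the spread across slices.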

Datasets

LABE (this paper's benchmark prompts)

LAC (Language Agency Classification, 3,724 sentences)

Bias in Bios (Wikipedia biographies)

RateMyProfessor sample

Wan et al. (2023a) reference letter dataset (used as human proxy)

Benchmarks

LABE