LABE: a benchmark, dataset, and rewrite method to find and reduce agency (leader vs helper) bias in LLM outputs

Overview

Decision Snapshot: Needs Validation

The benchmark and dataset are concrete and reproducible; results cover multiple models and tasks. However, evaluations are limited to binary gender, four races, three LLMs, and small mitigation samples, so further validation is needed before high-stakes deployment.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 55%

Novelty: 65%

Authors

Yixin Wan, Kai-Wei Chang

Why It Matters For Business

LLM-generated bios, reviews, and letters can systematically understate leadership for minority groups; this risks reputational harm, unfair downstream decisions, and regulatory scrutiny. Measuring agency bias and applying targeted rewrites reduces that risk.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Data Scientist

Summary TLDR

This paper introduces LABE, a benchmark to measure 'language agency' bias: whether text frames people as agentic (leader/initiative) or communal (helper/supportive). The authors build a labeled dataset (LAC, 3,724 sentences), a BERT agency classifier (91.7% test accuracy), and evaluate three LLMs (ChatGPT gpt-3.5-turbo-1106, Llama3-8B, Mistral-7B) on three tasks (biography, professor review, reference letter). Key findings: LLM outputs are more gender-biased than human text, intersectional groups (e.g., Black women) suffer the most, simple fairness prompts are unstable and can worsen bias, and a targeted Mitigation via Selective Rewrite (MSR) using the classifier is more reliable. Code and data are available.

Problem Statement

Automatic text generators can encode social bias not only in words but in style: who is described as a leader (agentic) vs a helper (communal). Existing measures are brittle (string matching, sentiment) and there is no comprehensive benchmark that (1) tests gender, race, and intersectional agency bias across common generation tasks and (2) provides reliable automated scoring and mitigation.

Main Contribution

LABE: a template-based benchmark to measure gender, racial, and intersectional agency bias in LLM-generated biographies, professor reviews, and reference letters (5,400 prompts).

LAC: a dataset of 3,724 sentences labeled agentic or communal, used to train a BERT agency classifier (91.69% test accuracy).
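To make the template-based design concrete, here is a minimal sketch of how prompts could be generated by crossing tasks with demographic descriptors. The template wording, the group labels, and the helper name `build_prompts` are all assumptions for illustration; they are not the benchmark's actual templates, and the paper's 5,400-prompt set is built from its own lists.

```python
from itertools import product

# Hypothetical templates: the benchmark's real wording is not reproduced here.
TEMPLATES = {
    "biography": "Write a short biography of {name}, a {race} {gender} scientist.",
    "professor_review": "Write a student review of Professor {name}, a {race} {gender} instructor.",
    "reference_letter": "Write a reference letter for {name}, a {race} {gender} job applicant.",
}
GENDERS = ["male", "female"]  # the paper evaluates binary gender only
# Illustrative four-group list; check the paper for the exact groups it uses.
RACES = ["Asian", "Black", "Hispanic", "White"]


def build_prompts(names):
    """Cross every task template with every gender, race, and name."""
    prompts = []
    for (task, template), gender, race, name in product(
        TEMPLATES.items(), GENDERS, RACES, names
    ):
        prompts.append(
            {
                "task": task,
                "gender": gender,
                "race": race,
                "prompt": template.format(name=name, race=race, gender=gender),
            }
        )
    return prompts


demo_prompts = build_prompts(["Alex"])
```

Keeping the task, gender, and race alongside each prompt string makes it straightforward to group the generated outputs by slice later.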

Key Findings

LLM outputs show larger gender agency bias than comparable human texts.

Numbers: Gender bias avg (ChatGPT) 34.62; human biographies gender diff 10.12

Practical Use: Don't assume model outputs match human fairness; measure your model's agentic vs communal balance and compare to human baselines before deployment.

Evidence Ref: Table 1; Table 16; Figure 2

Intersectional groups (race+gender) suffer the largest agency gaps; Black women often receive the lowest agentic language.

Numbers: Intersectional bias avg (Llama3) 74.11; ChatGPT/Llama3 show lowest agency for Black female professors in reviews

Practical Use: Evaluate intersectional slices (e.g., race×gender) because overall averages hide the worst-off groups; prioritize mitigation for those slices.

Evidence Ref: Table 1; Figure 3; intersectional tables (23–28)
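The advice to evaluate intersectional slices can be sketched as a simple group-by. This is an illustrative sketch, not the paper's evaluation code: it assumes each generated text has already been scored with a per-text agency gap (percent agentic minus percent communal sentences), and the group labels and numbers below are toy placeholders rather than reported results.

```python
from collections import defaultdict
from statistics import mean


def slice_means(records):
    """Average the agency gap per (race, gender) slice.

    records: iterable of (race, gender, gap) tuples, where gap is a
    per-text score such as percent agentic minus percent communal.
    """
    buckets = defaultdict(list)
    for race, gender, gap in records:
        buckets[(race, gender)].append(gap)
    return {key: mean(values) for key, values in buckets.items()}


def worst_slice(records):
    """Return the (slice, mean_gap) pair with the lowest mean agency gap."""
    means = slice_means(records)
    return min(means.items(), key=lambda kv: kv[1])


# Toy scores with placeholder group names (not results from the paper).
toy_records = [
    ("group_a", "male", 40.0),
    ("group_a", "female", 20.0),
    ("group_b", "male", 30.0),
    ("group_b", "female", -10.0),
    ("group_b", "female", 10.0),
]
overall_mean = mean(gap for _, _, gap in toy_records)
lowest = worst_slice(toy_records)
```

In the toy data the overall mean looks moderate while one slice sits well below it, which is the pattern the finding warns about.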

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average Gender Bias (Inter-group variance) — ChatGPT	34.62	—	—	Average across biography, professor review, reference letter (Table 1)	Table 1 reports 34.62 average gender bias for ChatGPT	Table 1
Average Gender Bias — Mistral	51.99	—	—	Average across tasks (Table 1)	Table 1 reports 51.99 average gender bias for Mistral	Table 1

What To Try In 7 Days

Run LABE-style templated prompts on your generation pipeline to spot agentic vs communal gaps.

Train or reuse a small BERT agency classifier (trained on LAC) to label outputs quickly.

Apply a selective-rewrite step (MSR) to low-agency sentences rather than only adding fairness instructions.
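The classify-then-rewrite step above could look roughly like the sketch below. It is an assumption-laden illustration of the general idea, not the authors' MSR implementation: `classify` and `rewrite` are hypothetical callables standing in for an agency classifier and an LLM rewrite call, and the regex sentence splitter is deliberately naive.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def selective_rewrite(
    text: str,
    classify: Callable[[str], str],
    rewrite: Callable[[str], str],
) -> str:
    """Rewrite only sentences the classifier labels 'communal'.

    Sentences labeled anything else are passed through unchanged.
    """
    out = []
    for sentence in split_sentences(text):
        if classify(sentence) == "communal":
            out.append(rewrite(sentence))
        else:
            out.append(sentence)
    return " ".join(out)


# Toy stand-ins for a real classifier and rewriter (hypothetical).
def toy_classify(sentence: str) -> str:
    return "communal" if "helped" in sentence else "agentic"


def toy_rewrite(sentence: str) -> str:
    return sentence.replace("helped", "led")


demo_output = selective_rewrite(
    "She helped the team. She founded a lab.", toy_classify, toy_rewrite
)
```

Passing the classifier and rewriter in as callables keeps the control flow testable without a model, and makes it easy to log which sentences were changed for human review.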

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/elainew728/labe-agency

Data URLs

https://github.com/elainew728/labe-agency

Risks & Boundaries

Limitations

Binary gender and only four racial groups were analyzed; other identities are not covered.

Human-written racial and intersectional datasets were scarce; reference-letter human data is a proxy.

When Not To Use

Do not rely on LABE as a final fairness check for non-binary genders or other racial groups not included here.

Avoid using MSR alone for fully automated high-stakes decisions without human review.

Failure Modes

Agency classifier mistakes (false agentic/communal labels) can trigger inappropriate rewrites.

MSR can unevenly boost majority groups more than minority groups, increasing variance across slices.
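One way to watch for the uneven-boost failure mode is to compare the spread of per-group agency gaps before and after mitigation. The sketch below is a hypothetical guard, not something described in the paper: the function names, the use of population variance, and the toy numbers are all assumptions.

```python
from statistics import pvariance


def variance_change(before, after):
    """Change in variance of per-group agency gaps (after minus before).

    before, after: dicts mapping group name -> agency gap for that group.
    """
    if set(before) != set(after):
        raise ValueError("before and after must cover the same groups")
    return pvariance(list(after.values())) - pvariance(list(before.values()))


def mitigation_widened_spread(before, after, tolerance=0.0):
    """True if mitigation increased the spread across groups beyond tolerance."""
    return variance_change(before, after) > tolerance


# Toy numbers with placeholder group names (not results from the paper).
before = {"g1": 10.0, "g2": 0.0}
uneven_after = {"g1": 30.0, "g2": 5.0}  # both rise, but g1 rises far more
even_after = {"g1": 12.0, "g2": 10.0}   # gap between groups narrows
```

A check like this flags the case where every group's agency rises but the gap between groups grows, which an average-only metric would report as an improvement.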

Core Entities

Models

ChatGPT (gpt-3.5-turbo-1106)

Llama3-8B-Instruct

Mistral-7B-Instruct-v0.2

BERT (fine-tuned agency classifier)

Metrics

Agentic–Communal ratio gap (percent of sentences)

Inter-group variance of ratio gaps (bias metric)
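As a rough illustration of how these two metrics fit together, the sketch below computes a ratio gap per group from sentence labels and then takes the variance of those gaps across groups. The exact formulas are assumptions: the summary does not say whether the paper uses population or sample variance or how it normalizes, so values from this sketch are not expected to reproduce the reported numbers.

```python
from statistics import pvariance


def ratio_gap(labels):
    """Percent of agentic sentences minus percent of communal sentences."""
    if not labels:
        raise ValueError("labels must be non-empty")
    agentic = sum(1 for label in labels if label == "agentic")
    communal = sum(1 for label in labels if label == "communal")
    return 100.0 * (agentic - communal) / len(labels)


def inter_group_variance(labels_by_group):
    """Variance of ratio gaps across groups (population variance assumed)."""
    gaps = {group: ratio_gap(labels) for group, labels in labels_by_group.items()}
    return pvariance(list(gaps.values())), gaps


# Toy sentence labels (not data from the paper).
toy_labels = {
    "male": ["agentic", "agentic", "agentic", "communal"],
    "female": ["agentic", "communal", "communal", "communal"],
}
toy_variance, toy_gaps = inter_group_variance(toy_labels)
```

Here the toy groups have gaps of +50 and -50 percentage points, so the variance is large even though the pooled text is evenly split between agentic and communal sentences.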

Datasets

LABE (this paper's benchmark prompts)

LAC (Language Agency Classification, 3,724 sentences)

Bias in Bios (Wikipedia biographies)

RateMyProfessor sample

Wan et al. (2023a) reference letter dataset (used as human proxy)

Benchmarks

LABE
