FairPy: Open toolkit to measure and reduce token-level bias in common language models

Overview

Decision Snapshot: Needs Validation

The toolkit is practical and runs on common models, but it has compatibility gaps and its empirical examples cover only two models (GPT-2 and BERT); the evidence is moderate and mostly demonstration-level.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Hrishikesh Viswanath, Tianyi Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

FairPy makes bias audits repeatable and faster across multiple metrics and models, but mitigation effects depend on the metric, so teams must validate fixes with several tests before deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

This paper surveys token-level bias metrics and mitigation methods and releases FairPy, an open Python toolkit that plugs into HuggingFace models (and custom models) to run a set of bias tests (WEAT/SEAT, Hellinger, StereoSet, Honest, log-likelihood) and apply mitigation methods (dropout retraining, nullspace projection, diff-pruning, self-debias, counterfactual data augmentation). The authors show empirical runs on GPT-2 and BERT where retraining on a counterfactually augmented Yelp subset produced mixed metric changes, illustrating that a single mitigation does not uniformly improve all bias measures. Code: https://github.com/HrishikeshVish/Fairpy.

Problem Statement

Large pretrained language models inherit statistical biases from training corpora. Existing bias metrics and debiasing methods are scattered, often tied to specific model types or templates, and are hard to plug into standard development workflows. Practitioners need a unified, modular toolkit to run multiple bias tests and mitigation methods on common models.

Main Contribution

A modular Python toolkit (FairPy) that runs many existing bias-detection metrics and mitigation methods on HuggingFace models and custom models.

Decouples metrics from particular model internals and evaluation scripts, offering plug-and-play model and tokenizer interfaces.

Key Findings

FairPy collects common bias metrics and mitigation methods into one toolkit.

Practical Use: Try FairPy to run many standard bias checks and mitigation methods quickly on your model instead of wiring many separate scripts.

Evidence Ref: System Overview; Section 3; toolkit lists metrics and mitigation methods

Dropout-based retraining on a counterfactually augmented Yelp subset changed GPT-2 bias metrics inconsistently.

Numbers: WEAT 1.15→0.80; Hellinger 0.14→0.35; StereoSet 47.91→58.40 (Table 1)

Practical Use: Don't assume one mitigation uniformly reduces bias; measure several metrics, because some improve while others worsen.

Evidence Ref: Table 1 (Empirical Analysis)
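For readers unfamiliar with the Hellinger figures quoted above, here is a minimal sketch of the Hellinger distance between two discrete probability distributions. It uses the standard textbook definition; which distributions FairPy actually compares is not stated in this summary, so the example probabilities and variable names below are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are equal-length sequences of probabilities over the same
    outcomes. The result lies in [0, 1]: 0 means identical distributions,
    1 means they share no probability mass.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared_gap = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_gap) / math.sqrt(2)

# Hypothetical next-token probabilities over the same three candidate words,
# conditioned on two prompts that differ only in a demographic term.
p_group_a = [0.6, 0.3, 0.1]
p_group_b = [0.2, 0.3, 0.5]
print(round(hellinger(p_group_a, p_group_b), 3))
```

Under this definition a larger value means the two distributions diverge more, which is why the rise from 0.14 to 0.35 reads as a worsening on this metric.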

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hellinger Distance (GPT-2)	biased 0.14 → debiased 0.35	biased 0.14	+0.21	Yelp-small counterfactually augmented, Table 1	Table 1 shows Hellinger increased after dropout retraining	Table 1
WEAT Score (GPT-2)	biased 1.15 → debiased 0.80	biased 1.15	-0.35	Yelp-small counterfactually augmented, Table 1	Table 1 shows WEAT effect size decreased after retraining	Table 1
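The WEAT row reports an effect size. As a rough illustration of what that number measures, the sketch below computes a WEAT-style effect size from toy embedding vectors. It follows the commonly cited formulation (difference of mean associations divided by the standard deviation over all target words) and assumes the sample standard deviation; FairPy's own implementation may differ, and the vectors here are invented for the example.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, attr_a, attr_b):
    """How much closer word vector w is to attribute set A than to set B."""
    return mean(cosine(w, a) for a in attr_a) - mean(cosine(w, b) for b in attr_b)

def weat_effect_size(targets_x, targets_y, attr_a, attr_b):
    """WEAT-style effect size for target sets X, Y and attribute sets A, B."""
    s_x = [association(x, attr_a, attr_b) for x in targets_x]
    s_y = [association(y, attr_a, attr_b) for y in targets_y]
    # Assumption: sample standard deviation over the pooled target words.
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)

# Toy 2-D vectors (hypothetical): X leans toward attribute A, Y toward B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
print(weat_effect_size(X, Y, A, B))
```

An effect size closer to zero indicates weaker differential association, so a drop from 1.15 to 0.80 counts as an improvement on this metric even while other metrics move the opposite way.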

What To Try In 7 Days

Run FairPy on a production or research model to get a baseline across WEAT/SEAT, StereoSet, Honest, Hellinger, and log-likelihood.

Compare results across at least three metrics and split results by semantic category (the toolkit supports category splits).

Retrain with dropout on a small counterfactually augmented subset, then re-run the metrics to check for mixed effects before scaling the change.
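The counterfactual augmentation step in the list above can be sketched as follows. This is a minimal word-swap illustration, not FairPy's implementation: the swap table, function names, and example sentences are all hypothetical, and ambiguous pronouns such as "her" (which maps to either "his" or "him") are deliberately left out because they need part-of-speech information to swap correctly.

```python
import re

# Hypothetical, deliberately small swap table of unambiguous term pairs.
SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
    "father": "mother", "mother": "father",
    "son": "daughter", "daughter": "son",
}

def swap_gendered_terms(text, swaps=SWAPS):
    """Return text with each listed term replaced by its counterpart."""
    def replace(match):
        word = match.group(0)
        target = swaps.get(word.lower())
        if target is None:
            return word
        # Preserve an initial capital letter from the original word.
        return target.capitalize() if word[0].isupper() else target
    return re.sub(r"\b\w+\b", replace, text)

def augment(corpus):
    """Keep every original sentence and add its counterfactual when it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        counterfactual = swap_gendered_terms(sentence)
        if counterfactual != sentence:
            augmented.append(counterfactual)
    return augmented

reviews = ["He said the waiter was rude.", "The food was great."]
for line in augment(reviews):
    print(line)
```

Keeping the original sentences alongside their counterfactuals balances the corpus rather than flipping it, which is the usual motivation for this kind of augmentation before retraining.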

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/HrishikeshVish/Fairpy

Risks & Boundaries

Limitations

Compatibility issues with models that use subword tokenization or nonstandard final-layer naming

Not all published debiasing methods are included due to availability or compatibility

When Not To Use

For languages or models outside the supported HuggingFace list without first checking tokenization

As a single source of truth for fairness: metrics can disagree and need human judgment

Failure Modes

A metric fails silently if tokens are split into subword pieces (metric expects whole-word probabilities)

NullSpace Projection may be impossible if final output embeddings are not exposed or named inconsistently
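Because the subword failure mode is silent, it can help to check a word list against the tokenizer before trusting a whole-word metric. The sketch below is one way to do that, assuming only that the tokenizer is exposed as a callable that maps a string to a list of pieces; the helper name and the toy tokenizer are hypothetical stand-ins. With a Hugging Face tokenizer, its `tokenize` method could be passed in as the callable.

```python
def find_split_words(words, tokenize):
    """Return {word: pieces} for every word that is not a single token."""
    split = {}
    for word in words:
        pieces = tokenize(word)
        if len(pieces) != 1:
            split[word] = pieces
    return split

def toy_tokenize(word):
    """Toy stand-in for a real subword tokenizer (hypothetical vocabulary)."""
    vocab = {"nurse", "doctor"}
    if word in vocab:
        return [word]
    # Out-of-vocabulary words are broken into two pieces, WordPiece-style.
    return [word[:3], "##" + word[3:]]

probe_words = ["nurse", "doctor", "engineer"]
problem_words = find_split_words(probe_words, toy_tokenize)
print(problem_words)
```

Any word reported here should be dropped from the test vocabulary or handled explicitly before its score is interpreted.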

Core Entities

Models

CTRL, GPT-2, GPT, TransfoXL, BERT, DistilBERT, RoBERTa, XLM, XLNet, ALBERT

Metrics

Hellinger Distance, WEAT/SEAT, StereoSet Score, Honest Score, Log Likelihood, F1 Score

Datasets

Yelp, Reddit, Wikipedia (English), StereoSet, CrowS-pairs, WinoBias

Benchmarks

StereoSet, CrowS-pairs, WinoBias
