FairPy: Open toolkit to measure and reduce token-level bias in common language models

February 10, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The toolkit is practical and runnable on common models, but it has compatibility gaps and its empirical examples cover only two models (GPT-2 and BERT); evidence is moderate and mostly demonstration-level.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Hrishikesh Viswanath, Tianyi Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

FairPy makes bias audits repeatable and faster across multiple metrics and models, but mitigation effects are metric-dependent so teams must validate fixes with several tests before deployment.

Summary TLDR

This paper surveys token-level bias metrics and mitigation methods and releases FairPy, an open Python toolkit that plugs into HuggingFace models (and custom models) to run a set of bias tests (WEAT/SEAT, Hellinger, StereoSet, Honest, log-likelihood) and apply mitigation methods (dropout retraining, nullspace projection, diff-pruning, self-debias, counterfactual data augmentation). The authors show empirical runs on GPT-2 and BERT where retraining on a counterfactually augmented Yelp subset produced mixed metric changes, illustrating that a single mitigation does not uniformly improve all bias measures. Code: https://github.com/HrishikeshVish/Fairpy.

Problem Statement

Large pretrained language models inherit statistical biases from training corpora. Existing bias metrics and debiasing methods are scattered, often tied to specific model types or templates, and are hard to plug into standard development workflows. Practitioners need a unified, modular toolkit to run multiple bias tests and mitigation methods on common models.

Main Contribution

A modular Python toolkit (FairPy) that runs many existing bias-detection metrics and mitigation methods on HuggingFace models and custom models.

Decouples metrics from particular model internals and evaluation scripts, offering plug-and-play model and tokenizer interfaces.

Key Findings

FairPy collects common bias metrics and mitigation methods into one toolkit.

Practical Use: Try FairPy to run many standard bias checks and mitigation methods quickly on your model instead of wiring together many separate scripts.

Evidence Ref: System Overview; Section 3; toolkit lists metrics and mitigation methods

Dropout-based retraining on a counterfactually augmented Yelp subset changed GPT-2 bias metrics inconsistently.

Numbers: WEAT 1.15 → 0.80; Hellinger 0.14 → 0.35; StereoSet 47.91 → 58.40 (Table 1)

Practical Use: Don't assume one mitigation uniformly reduces bias. Measure several metrics, because some improve while others worsen.

Evidence Ref: Table 1 (Empirical Analysis)
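To make the WEAT numbers above easier to interpret, here is a minimal sketch of the standard WEAT effect size: the difference in mean association between two target sets, divided by the standard deviation of the associations over both sets. It is not FairPy's own code. The function names and the toy 2-D vectors are assumptions for illustration, and the use of the sample standard deviation is one common convention (implementations differ on this point).

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, scaled by the spread over X and Y."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Sample standard deviation (n - 1); an assumption, some code uses the population form.
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy 2-D "embeddings" (hypothetical values, not taken from any real model).
X = [[1.0, 0.0], [0.9, 0.1]]   # target set X
Y = [[0.0, 1.0], [0.1, 0.9]]   # target set Y
A = [[1.0, 0.0]]               # attribute set A
B = [[0.0, 1.0]]               # attribute set B
print(weat_effect_size(X, Y, A, B))
```

An effect size near zero means the two target sets associate with the attribute sets about equally; a drop in magnitude, such as the 1.15 → 0.80 change reported for GPT-2, indicates a weaker measured association under this particular test only.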

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hellinger Distance (GPT-2) | biased 0.14 → debiased 0.35 | biased 0.14 | +0.21 | Yelp-small counterfactually augmented, Table 1 | Table 1 shows Hellinger increased after dropout retraining | Table 1
WEAT Score (GPT-2) | biased 1.15 → debiased 0.80 | biased 1.15 | -0.35 | Yelp-small counterfactually augmented, Table 1 | Table 1 shows WEAT effect size decreased after retraining | Table 1
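For readers unfamiliar with the Hellinger rows, the following sketch computes the Hellinger distance between two discrete probability distributions. This summary does not say which distributions FairPy compares. The example assumes, for illustration only, that they are next-token probabilities for two counterfactual prompts; the probability values are made up.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same outcomes.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no overlapping support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical next-token probabilities over the same three candidate words
# for two prompts that differ only in a gendered term.
probs_prompt_a = [0.70, 0.20, 0.10]
probs_prompt_b = [0.40, 0.35, 0.25]
print(hellinger(probs_prompt_a, probs_prompt_b))
```

Under this reading, a larger distance means the model's predictions diverge more between the two prompts, which is why the +0.21 delta counts as a worsening on this metric.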

What To Try In 7 Days

Run FairPy on a production or research model to get a baseline across WEAT/SEAT, StereoSet, Honest, Hellinger, and log-likelihood.

Compare results across at least three metrics and split results by semantic category (the toolkit supports category splits).

Try a small counterfactual data augmentation + dropout retrain on a subset and re-run metrics to see mixed effects before scaling changes.
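The counterfactual data augmentation step mentioned above can be sketched as a word-swap pass over a corpus. This is an illustrative simplification, not FairPy's implementation: the word-pair list, the function names, and the example sentences are all assumptions, and ambiguous pairs such as "her" (which can map to "him" or "his") are deliberately left out.

```python
import re

# Hypothetical, deliberately small list of gendered word pairs.
PAIRS = [
    ("he", "she"), ("man", "woman"), ("men", "women"),
    ("father", "mother"), ("son", "daughter"),
    ("boy", "girl"), ("husband", "wife"),
]

# Build a bidirectional lookup so each word maps to its counterpart.
SWAP = {}
for first, second in PAIRS:
    SWAP[first] = second
    SWAP[second] = first

WORD_RE = re.compile(r"\b[A-Za-z]+\b")


def counterfactual(sentence):
    """Return the sentence with every listed gendered word swapped."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower())
        if swapped is None:
            return word
        # Preserve a leading capital letter.
        return swapped.capitalize() if word[0].isupper() else swapped
    return WORD_RE.sub(replace, sentence)


def augment(corpus):
    """Keep every original sentence and add its counterfactual when it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented


reviews = ["The food was great.", "My husband loved it."]
for line in augment(reviews):
    print(line)
```

The augmented corpus would then be used for the retraining step; re-running all metrics afterwards is what reveals the mixed effects described in the findings.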

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Compatibility issues with models that use subword tokenization or nonstandard final-layer naming

Not all published debiasing methods are included due to availability or compatibility

When Not To Use

For languages or models outside the supported HuggingFace list without first checking tokenization

As a single source of truth for fairness: metrics can disagree and need human judgment
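One way to do the tokenization check mentioned above is to list which test words a tokenizer breaks into more than one piece before running any whole-word metric. The helper below is a hypothetical sketch, not part of FairPy. It accepts any callable that maps a word to a list of tokens, and the toy greedy tokenizer and its vocabulary exist only so the example runs without downloading a model.

```python
# Toy vocabulary for the stand-in tokenizer (assumption, not a real model's vocab).
TOY_VOCAB = {"nurse", "doctor", "engine", "##er"}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match tokenizer in the WordPiece style, for illustration only."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            chunk = word[start:end]
            piece = chunk if start == 0 else "##" + chunk
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No vocabulary entry matches at this position.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def find_split_words(words, tokenize):
    """Return {word: pieces} for every word that is not a single token."""
    split = {}
    for word in words:
        pieces = tokenize(word)
        if len(pieces) != 1:
            split[word] = pieces
    return split


print(find_split_words(["nurse", "engineer", "doctor"], toy_tokenize))
```

With a Hugging Face tokenizer, the `tokenize` method of the loaded tokenizer object can be passed as the callable. For byte-level BPE tokenizers such as GPT-2's, whether the word is preceded by a space can change how it is split, so it is worth checking words in the form the metric will actually feed them.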

Failure Modes

Metrics that expect whole-word probabilities can fail silently when a word is split into subword pieces

NullSpace Projection may be impossible if final output embeddings are not exposed or named inconsistently
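To show why nullspace projection needs direct access to the embeddings, here is a sketch of the core linear-algebra step: removing the component of a vector along a given "bias direction". It covers only a single projection with a direction supplied by hand. Published nullspace-projection methods typically learn such directions with linear classifiers and repeat the step, and this is not FairPy's implementation. The function names and the vectors are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(x, direction):
    """Project x onto the nullspace of `direction`.

    Equivalent to applying P = I - (w w^T) / (w^T w) to x, so the result
    has zero component along the direction w.
    """
    denom = dot(direction, direction)
    if denom == 0:
        raise ValueError("direction must be a non-zero vector")
    scale = dot(x, direction) / denom
    return [xi - scale * wi for xi, wi in zip(x, direction)]


# Hypothetical 2-D embedding and bias direction.
embedding = [3.0, 1.0]
bias_direction = [1.0, 1.0]
cleaned = project_out(embedding, bias_direction)
print(cleaned, dot(cleaned, bias_direction))
```

Applying the step to every output embedding requires reading and writing the model's final-layer weights, which is why inconsistent naming or hidden output embeddings can block the method.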

Core Entities

Models

CTRL, GPT-2, GPT, TransfoXL, BERT, DistilBERT, RoBERTa, XLM, XLNet, ALBERT

Metrics

Hellinger Distance, WEAT/SEAT, StereoSet Score, Honest Score, Log Likelihood, F1 Score

Datasets

Yelp, Reddit, Wikipedia (English), StereoSet, CrowS-pairs, WinoBias

Benchmarks

StereoSet, CrowS-pairs, WinoBias