Overview
The toolkit is practical and runs on common models, but it has compatibility gaps. Its empirical examples cover only two models, so the evidence is moderate and mostly demonstration-level.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 2/4
Findings with evidence refs: 4/4
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
FairPy makes bias audits repeatable and faster across multiple metrics and models. Mitigation effects are metric-dependent, however, so teams should validate any fix against several tests before deployment.
Who Should Care
Summary TLDR
This paper surveys token-level bias metrics and mitigation methods and releases FairPy, an open Python toolkit that plugs into HuggingFace models (and custom models) to run a set of bias tests (WEAT/SEAT, Hellinger, StereoSet, Honest, log-likelihood) and apply mitigation methods (dropout retraining, nullspace projection, diff-pruning, self-debias, counterfactual data augmentation). The authors show empirical runs on GPT-2 and BERT where retraining on a counterfactually augmented Yelp subset produced mixed metric changes, illustrating that a single mitigation does not uniformly improve all bias measures. Code: https://github.com/HrishikeshVish/Fairpy.
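To make one of the listed metrics concrete, here is a minimal sketch of the WEAT effect size as it is usually defined: the difference in mean association between two target word sets, divided by the standard deviation of associations over both sets. This is not FairPy's implementation. The function names, the toy 2-D vectors, and the use of the sample standard deviation are assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, normalized by the pooled spread.

    Assumption: sample standard deviation (n - 1) over all target words.
    """
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy 2-D "embeddings" (hypothetical, not taken from any model).
X = [[1.0, 0.1], [0.9, 0.2]]   # target set X leans toward attribute A
Y = [[0.1, 1.0], [0.2, 0.9]]   # target set Y leans toward attribute B
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]

effect = weat_effect_size(X, Y, A, B)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value means set X is more associated with A than set Y is; swapping X and Y flips the sign.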
Problem Statement
Large pretrained language models inherit statistical biases from training corpora. Existing bias metrics and debiasing methods are scattered, often tied to specific model types or templates, and are hard to plug into standard development workflows. Practitioners need a unified, modular toolkit to run multiple bias tests and mitigation methods on common models.
Main Contribution
A modular Python toolkit (FairPy) that runs many existing bias-detection metrics and mitigation methods on HuggingFace models and custom models.
Decouples metrics from particular model internals and evaluation scripts, offering plug-and-play model and tokenizer interfaces.
Key Findings
FairPy collects common bias metrics and mitigation methods into one toolkit.
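Dropout-based retraining on a counterfactually augmented Yelp subset moved GPT-2 bias metrics in opposite directions: the WEAT score fell from 1.15 to 0.80, while the Hellinger distance rose from 0.14 to 0.35 (Table 1).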
Results
| Metric | Debiased value | Baseline (biased) | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Hellinger Distance (GPT-2) | 0.35 | 0.14 | +0.21 | Yelp-small, counterfactually augmented | Hellinger distance increased after dropout retraining | Table 1 |
| WEAT Score (GPT-2) | 0.80 | 1.15 | -0.35 | Yelp-small, counterfactually augmented | WEAT effect size decreased after retraining | Table 1 |
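For readers unfamiliar with the first metric in the table, the sketch below computes the standard Hellinger distance between two discrete probability distributions and re-derives the two deltas reported above. The toy distributions are illustrative assumptions; this summary does not specify which distributions FairPy compares, so the function is a generic definition and not the toolkit's code.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical distributions over the same three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(f"Hellinger distance: {hellinger(p, q):.3f}")

# Deltas recomputed from the values reported in Table 1 (debiased - biased).
hellinger_delta = round(0.35 - 0.14, 2)
weat_delta = round(0.80 - 1.15, 2)
print(hellinger_delta, weat_delta)
```

Identical distributions give 0 and distributions with disjoint support give 1, which is a useful sanity check when interpreting the scale of the reported values.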
What To Try In 7 Days
Run FairPy on a production or research model to get a baseline across WEAT/SEAT, StereoSet, Honest, Hellinger, and log-likelihood.
Compare results across at least three metrics and split results by semantic category (the toolkit supports category splits).
Try a small counterfactual data augmentation plus dropout retraining run on a subset, then re-run the metrics to check for mixed effects before scaling the change.
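The counterfactual augmentation step above can be sketched as a simple word swap: each sentence that contains a gendered term gets a mirrored copy with the term replaced. This is a minimal illustration under stated assumptions, not the procedure the authors used. The word-pair list and the function names `counterfactual` and `augment` are hypothetical, and ambiguous pairs such as "his"/"her" are deliberately left out because they need part-of-speech information to swap correctly.

```python
import re

# Hypothetical, deliberately small pair list; real lists are much longer.
PAIRS = [("he", "she"), ("man", "woman"), ("men", "women"),
         ("father", "mother"), ("son", "daughter"), ("king", "queen")]

# Build a two-way lookup so each word maps to its counterpart.
SWAPS = {}
for left, right in PAIRS:
    SWAPS[left] = right
    SWAPS[right] = left

# Match whole words only, so "the" or "woman" are not hit by "he" or "man".
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b",
                     re.IGNORECASE)


def counterfactual(text):
    """Return text with each listed gendered word swapped, keeping case."""
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        if word.isupper():
            return replacement.upper()
        if word[0].isupper():
            return replacement.capitalize()
        return replacement
    return PATTERN.sub(swap, text)


def augment(corpus):
    """Keep every sentence and add its counterfactual when it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        mirrored = counterfactual(sentence)
        if mirrored != sentence:
            augmented.append(mirrored)
    return augmented


reviews = ["The food was great.", "She loved the service."]
for line in augment(reviews):
    print(line)
```

Sentences without any listed term pass through unchanged, so the augmented corpus grows only where a swap actually applies.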
Risks & Boundaries
Limitations
Compatibility issues with models that use subword tokenization or nonstandard final-layer naming
Not all published debiasing methods are included due to availability or compatibility
When Not To Use
For languages or models outside the supported HuggingFace list without first checking tokenization
As a single source of truth for fairness: metrics can disagree and need human judgment
Failure Modes
A metric can fail silently when a word is split into subword pieces, because the metric expects whole-word probabilities
Nullspace projection may be impossible if the final output embeddings are not exposed or are named inconsistently
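One way to guard against the subword failure mode is to check, before running a metric, which words in a test list the tokenizer splits into more than one piece. The sketch below is an assumption-laden illustration, not part of FairPy: `find_split_words` is a hypothetical helper, and `toy_tokenize` is a stand-in WordPiece-style tokenizer with a made-up vocabulary. With a HuggingFace tokenizer, its `tokenize` method could be passed in instead.

```python
# Made-up vocabulary for the stand-in tokenizer below.
TOY_VOCAB = {"doctor", "nurse", "engine", "##er"}


def toy_tokenize(word):
    """Greedy longest-match split, WordPiece style, over TOY_VOCAB."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            chunk = word[start:end]
            piece = chunk if start == 0 else "##" + chunk
            if piece in TOY_VOCAB:
                break
            end -= 1
        if end == start:          # no vocabulary entry fits at this position
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def find_split_words(words, tokenize):
    """Map each word that tokenizes into more than one piece to its pieces."""
    split = {}
    for word in words:
        pieces = tokenize(word)
        if len(pieces) != 1:
            split[word] = pieces
    return split


test_words = ["doctor", "nurse", "engineer"]
problem_words = find_split_words(test_words, toy_tokenize)
print(problem_words)
```

Any word that shows up in the result is a candidate for the silent failure described above and should be replaced, or scored with a method that handles multi-token words.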

