Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
6
Why It Matters For Business
FairPy makes bias audits repeatable and faster across multiple metrics and models. Mitigation effects are metric-dependent, however, so teams should validate any fix against several tests before deployment.
Summary TLDR
This paper surveys token-level bias metrics and mitigation methods and releases FairPy, an open Python toolkit that plugs into HuggingFace models and custom models. The toolkit runs a set of bias tests (WEAT/SEAT, Hellinger distance, StereoSet, Honest, log-likelihood) and applies mitigation methods (dropout retraining, nullspace projection, diff-pruning, self-debias, counterfactual data augmentation). In empirical runs on GPT-2 and BERT, retraining on a counterfactually augmented Yelp subset produced mixed metric changes, illustrating that a single mitigation does not uniformly improve all bias measures. Code: https://github.com/HrishikeshVish/Fairpy.
Problem Statement
Large pretrained language models inherit statistical biases from training corpora. Existing bias metrics and debiasing methods are scattered, often tied to specific model types or templates, and are hard to plug into standard development workflows. Practitioners need a unified, modular toolkit to run multiple bias tests and mitigation methods on common models.
Main Contribution
A modular Python toolkit (FairPy) that runs many existing bias-detection metrics and mitigation methods on HuggingFace models and custom models.
A design that decouples metrics from particular model internals and evaluation scripts, offering plug-and-play model and tokenizer interfaces.
An empirical survey and demonstration showing how several metrics and mitigation methods behave on GPT-2 and BERT.
Key Findings
FairPy collects common bias metrics and mitigation methods into one toolkit.
Dropout-based retraining on a counterfactually augmented Yelp subset changed GPT-2 bias metrics inconsistently.
BERT showed reduced WEAT effect size but mixed changes across other metrics after retraining.
Some metrics fail on models with subword tokenization or nonstandard final embedding layers.
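To make the WEAT effect size in the findings above concrete, here is a minimal sketch of the standard WEAT effect-size formula on toy 2-D vectors. It is not FairPy's implementation: the function names and the toy vectors are hypothetical, and the use of the sample standard deviation is an assumption (some implementations use the population form).

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, scaled by the
    standard deviation of associations over X and Y together.
    Assumption: sample standard deviation (n - 1 denominator)."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy embeddings (hypothetical): X leans toward A, Y leans toward B.
X = [[1.0, 0.1], [0.9, 0.2]]
Y = [[0.1, 1.0], [0.2, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]

effect = weat_effect_size(X, Y, A, B)
swapped = weat_effect_size(Y, X, A, B)
```

A positive value means the X targets sit closer to attribute set A than the Y targets do. Swapping the target sets flips the sign, and a value nearer zero after mitigation would correspond to the kind of WEAT reduction reported for BERT.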
Results
Hellinger Distance (GPT-2)
WEAT Score (GPT-2)
StereoSet Score (GPT-2)
WEAT Score (BERT)
Log Probability (BERT)
F1 Score (BERT)
Who Should Care
What To Try In 7 Days
Run FairPy on a production or research model to get a baseline across WEAT/SEAT, StereoSet, Honest, Hellinger, and log-likelihood.
Compare results across at least three metrics and split results by semantic category (the toolkit supports category splits).
Run a small counterfactual data augmentation plus dropout retrain on a data subset, then re-run the metrics to check for mixed effects before scaling the change.
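The counterfactual data augmentation step can be sketched as a word-pair swap over a small corpus. This is an illustration under stated assumptions, not FairPy's code: the pair list, function names, and example sentences are hypothetical, and the list deliberately omits ambiguous words such as "her", which maps to both "his" and "him".

```python
import re

# Hypothetical, deliberately small pair list; real lists are much longer.
SWAP_PAIRS = [
    ("he", "she"),
    ("man", "woman"),
    ("men", "women"),
    ("father", "mother"),
    ("son", "daughter"),
]

# Build a two-way lookup so each word maps to its counterpart.
SWAP = {}
for first, second in SWAP_PAIRS:
    SWAP[first] = second
    SWAP[second] = first

# Whole-word, case-insensitive match on any listed word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in SWAP) + r")\b", re.IGNORECASE
)


def _swap_word(match):
    """Replace a matched word with its counterpart, keeping simple casing."""
    word = match.group(0)
    replacement = SWAP[word.lower()]
    if word.isupper():
        return replacement.upper()
    if word[0].isupper():
        return replacement.capitalize()
    return replacement


def counterfactual(sentence):
    """Return the sentence with every listed word swapped."""
    return _PATTERN.sub(_swap_word, sentence)


def augment(corpus):
    """Keep each original sentence and add its counterfactual when it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented


corpus = ["The food was great.", "She is a mother."]
augmented = augment(corpus)
```

Sentences with no listed words pass through unchanged, so the augmented set only grows where a swap applies. Retraining on the result and re-running the metrics is a separate step that this sketch does not cover.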
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Compatibility issues with models that use subword tokenization or nonstandard final-layer naming
- Not all published debiasing methods are included due to availability or compatibility
- No web UI or CI badge integration at time of writing
- No support for cascading or concurrent application of multiple mitigation methods
When Not To Use
- For languages or models outside the supported HuggingFace list without first checking tokenization
- As a single source of truth for fairness: metrics can disagree and need human judgment
- For regulatory compliance without additional audits and downstream testing
Failure Modes
- A metric fails silently when a word is split into subword pieces (the metric expects whole-word probabilities)
- NullSpace Projection may be impossible if final output embeddings are not exposed or are named inconsistently
- A mitigation reduces one bias metric while worsening others
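One way to guard against the subword failure mode above is to check, before running a metric, which test words the tokenizer splits. The sketch below is a hypothetical helper, not part of FairPy. It accepts any `tokenize` callable that maps a string to a list of pieces, and uses a toy stand-in tokenizer so that it runs without downloads. With a HuggingFace tokenizer, the `tokenize` method could be passed instead.

```python
def find_split_words(words, tokenize):
    """Return {word: pieces} for every word that does not map to exactly one token."""
    split = {}
    for word in words:
        pieces = tokenize(word)
        if len(pieces) != 1:
            split[word] = pieces
    return split


# Toy stand-in tokenizer (assumption): known words stay whole, unknown
# words are cut into 4-character pieces with "##" continuation markers.
TOY_VOCAB = {"nurse", "doctor"}


def toy_tokenize(word):
    if word in TOY_VOCAB:
        return [word]
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


test_words = ["nurse", "engineer", "doctor"]
problem_words = find_split_words(test_words, toy_tokenize)
```

Any word reported here would need to be dropped from the test set or scored with a multi-token strategy, rather than passed to a metric that assumes one token per word.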
Core Entities
Models
- CTRL
- GPT-2
- GPT
- TransfoXL
- BERT
- DistilBERT
- RoBERTa
- XLM
- XLNet
- ALBERT
Metrics
- Hellinger Distance
- WEAT/SEAT
- StereoSet Score
- Honest Score
- Log Likelihood
- F1 Score
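Of the metrics listed, Hellinger distance has a compact closed form for discrete distributions: H(P, Q) = (1/√2) · sqrt(Σ (√p_i − √q_i)²), which ranges from 0 (identical) to 1 (disjoint support). The sketch below computes that formula. Which distributions FairPy compares is not stated here, so the example probabilities are made up for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions
    given as equal-length sequences of non-negative probabilities."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical next-token probabilities over the same three candidate
# words, for two prompts that differ only in a demographic term.
probs_prompt_a = [0.7, 0.2, 0.1]
probs_prompt_b = [0.3, 0.3, 0.4]

distance = hellinger(probs_prompt_a, probs_prompt_b)
```

Both inputs are assumed to sum to 1 already. Under this reading, a smaller distance means the model's predictions change less when the demographic term is swapped.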
Datasets
- Yelp
- Wikipedia (English)
- StereoSet
- CrowS-pairs
- WinoBias
Benchmarks
- StereoSet
- CrowS-pairs
- WinoBias

