FairPy: Open toolkit to measure and reduce token-level bias in common language models

February 10, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

6

Authors

Hrishikesh Viswanath, Tianyi Zhang

Links

Abstract / PDF

Why It Matters For Business

FairPy makes bias audits repeatable and faster across multiple metrics and models. Mitigation effects are metric-dependent, however, so teams must validate any fix against several tests before deployment.

Summary TLDR

This paper surveys token-level bias metrics and mitigation methods and releases FairPy, an open Python toolkit that plugs into HuggingFace models (and custom models) to run a set of bias tests (WEAT/SEAT, Hellinger, StereoSet, Honest, log-likelihood) and apply mitigation methods (dropout retraining, nullspace projection, diff-pruning, self-debias, counterfactual data augmentation). The authors show empirical runs on GPT-2 and BERT where retraining on a counterfactually augmented Yelp subset produced mixed metric changes, illustrating that a single mitigation does not uniformly improve all bias measures. Code: https://github.com/HrishikeshVish/Fairpy.
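Of the mitigation methods named above, nullspace projection has the most compact core operation: remove from a representation its component along a direction that encodes the protected attribute. The sketch below shows only that single projection step on plain Python lists. It is not FairPy's implementation, and the vectors and the name `project_out` are hypothetical; the iterative variant described in the literature repeats this step for a series of learned linear-classifier directions.

```python
def project_out(v, w):
    """Return v with its component along direction w removed.

    The result lies in the nullspace of w, i.e. its dot product with w is 0.
    """
    dot_vw = sum(a * b for a, b in zip(v, w))
    dot_ww = sum(b * b for b in w)
    if dot_ww == 0:
        raise ValueError("direction w must be non-zero")
    scale = dot_vw / dot_ww
    return [a - scale * b for a, b in zip(v, w)]


# Hypothetical 3-d embedding and a hypothetical "bias direction".
embedding = [3.0, 4.0, 1.0]
bias_direction = [1.0, 0.0, 0.0]

cleaned = project_out(embedding, bias_direction)
print(cleaned)
```

Applying the projection a second time leaves the vector unchanged, which is a quick sanity check that the direction has been removed.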

Problem Statement

Large pretrained language models inherit statistical biases from training corpora. Existing bias metrics and debiasing methods are scattered, often tied to specific model types or templates, and are hard to plug into standard development workflows. Practitioners need a unified, modular toolkit to run multiple bias tests and mitigation methods on common models.

Main Contribution

A modular Python toolkit (FairPy) that runs many existing bias-detection metrics and mitigation methods on HuggingFace models and custom models.

Decouples metrics from particular model internals and evaluation scripts, offering plug-and-play model and tokenizer interfaces.

An empirical survey and demonstration showing how several metrics and mitigation methods behave on GPT-2 and BERT.
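As an illustration of the kind of embedding-based metric the toolkit runs, the WEAT effect size can be sketched in a few lines. This follows the standard published definition (difference of mean associations between two target sets, divided by the standard deviation over both sets), not FairPy's own code. The 2-d vectors are toy values chosen for illustration, and the use of the sample standard deviation is an assumption, since implementations differ on that point.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)


def weat_effect_size(X, Y, A, B):
    """Effect size for target sets X, Y and attribute sets A, B."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    # Assumption: sample standard deviation over the union of both target sets.
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy embeddings (hypothetical): X leans toward A, Y leans toward B.
X = [(1.0, 0.1), (0.9, 0.2)]
Y = [(0.1, 1.0), (0.2, 0.9)]
A = [(1.0, 0.0)]
B = [(0.0, 1.0)]

print(weat_effect_size(X, Y, A, B))
```

A positive value means X is more strongly associated with A than Y is; swapping the two target sets flips the sign.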

Key Findings

FairPy collects common bias metrics and mitigation methods into one toolkit.

Dropout-based retraining on a counterfactually augmented Yelp subset changed GPT-2 bias metrics inconsistently.

Numbers: WEAT 1.15→0.80; Hellinger 0.14→0.35; StereoSet 47.91→58.40 (Table 1)

BERT showed reduced WEAT effect size but mixed changes across other metrics after retraining.

Numbers: WEAT 1.13→0.68; LogProb 57.25→54.58; F1 64.4→65.6 (Table 2)

Some metrics fail on models with subword tokenization or nonstandard final embedding layers.

Results

Hellinger Distance (GPT-2)

Value: biased 0.14 → debiased 0.35

Baseline: biased 0.14

WEAT Score (GPT-2)

Value: biased 1.15 → debiased 0.80

Baseline: biased 1.15

StereoSet Score (GPT-2)

Value: biased 47.91 → debiased 58.40

Baseline: biased 47.91

WEAT Score (BERT)

Value: biased 1.13 → debiased 0.68

Baseline: biased 1.13

Log Probability (BERT)

Value: biased 57.25 → debiased 54.58

Baseline: biased 57.25

F1 Score (BERT)

Value: biased 64.40 → debiased 65.60

Baseline: biased 64.40
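To compare these before/after values on a common scale, one option is to express each as a relative change. The short script below does only that arithmetic on the numbers reported above. Whether an increase counts as an improvement depends on each metric's definition, which this summary does not state, so the script reports direction and size without judging them.

```python
# Before/after values exactly as reported in this summary (Tables 1 and 2).
reported = {
    ("GPT-2", "Hellinger"): (0.14, 0.35),
    ("GPT-2", "WEAT"): (1.15, 0.80),
    ("GPT-2", "StereoSet"): (47.91, 58.40),
    ("BERT", "WEAT"): (1.13, 0.68),
    ("BERT", "LogProb"): (57.25, 54.58),
    ("BERT", "F1"): (64.40, 65.60),
}


def relative_change(before, after):
    """Percentage change from the biased baseline to the debiased value."""
    return (after - before) / before * 100.0


changes = {key: relative_change(*pair) for key, pair in reported.items()}

for (model, metric), pct in changes.items():
    print(f"{model:6s} {metric:10s} {pct:+7.1f}%")
```

The signs alone make the inconsistency visible: on GPT-2 the WEAT value falls while the Hellinger and StereoSet values rise.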

Who Should Care

What To Try In 7 Days

Run FairPy on a production or research model to get a baseline across WEAT/SEAT, StereoSet, Honest, Hellinger, and log-likelihood.

Compare results across at least three metrics and split results by semantic category (the toolkit supports category splits).

Try a small counterfactual data augmentation + dropout retrain on a subset, then re-run the metrics to check for mixed effects before scaling the change.
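For the augmentation step, a minimal word-swap sketch gives a sense of what counterfactual data augmentation involves. The swap list below is hypothetical and deliberately tiny; it avoids ambiguous pairs such as "his"/"her", which need part-of-speech information to swap correctly. This is not FairPy's augmentation code.

```python
import re

# Hypothetical, intentionally small swap list; a real list would be much longer.
SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
}

# Whole-word, case-insensitive match on any key in SWAPS.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b", re.IGNORECASE
)


def counterfactual(sentence):
    """Return the sentence with every listed word replaced by its counterpart."""
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        # Preserve a leading capital so sentence-initial words stay capitalized.
        return replacement.capitalize() if word[0].isupper() else replacement
    return _PATTERN.sub(swap, sentence)


def augment(corpus):
    """Keep each original sentence and add its counterfactual if it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        flipped = counterfactual(sentence)
        if flipped != sentence:
            augmented.append(flipped)
    return augmented


print(augment(["He thanked the woman.", "The food was great."]))
```

Sentences without any listed word pass through once, so the augmented subset grows only where a swap actually applies.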

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Compatibility issues with models that use subword tokenization or nonstandard final-layer naming
  • Not all published debiasing methods are included due to availability or compatibility
  • No web UI or CI badge integration at time of writing
  • No support for cascading or concurrent application of multiple mitigation methods

When Not To Use

  • For languages or models outside the supported HuggingFace list without first checking tokenization
  • As a single source of truth for fairness: metrics can disagree and need human judgment
  • For regulatory compliance without additional audits and downstream testing

Failure Modes

  • A metric fails silently if the tokenizer splits a word into subword pieces (the metric expects whole-word probabilities)
  • NullSpace Projection may be impossible if final output embeddings are not exposed or named inconsistently
  • A mitigation reduces one bias metric while worsening others
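The subword failure mode above can be screened for before running a metric. The sketch below flags every word in a list that a tokenizer breaks into more than one piece. `split_words` is a hypothetical helper, not part of FairPy, and `toy_tokenize` is a stand-in so the example runs without downloading a model.

```python
def split_words(words, tokenize):
    """Return {word: pieces} for each word the tokenizer splits into >1 piece."""
    flagged = {}
    for word in words:
        pieces = tokenize(word)
        if len(pieces) != 1:
            flagged[word] = pieces
    return flagged


# Toy WordPiece-style vocabulary (hypothetical), used only for this demo.
TOY_VOCAB = {"nurse", "doctor", "engine", "##er"}


def toy_tokenize(word):
    """Whole word if known, else a known prefix plus a '##' suffix, else characters."""
    if word in TOY_VOCAB:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        if word[:i] in TOY_VOCAB and "##" + word[i:] in TOY_VOCAB:
            return [word[:i], "##" + word[i:]]
    return list(word)


print(split_words(["nurse", "doctor", "engineer"], toy_tokenize))
```

With a HuggingFace tokenizer, the same check could be run by passing its `tokenize` method as the second argument. For byte-level BPE tokenizers such as GPT-2's, a leading space changes the tokenization, so words should be checked in the form the metric actually feeds them to the model.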

Core Entities

Models

  • CTRL
  • GPT-2
  • GPT
  • TransfoXL
  • BERT
  • DistilBERT
  • RoBERTa
  • XLM
  • XLNet
  • ALBERT

Metrics

  • Hellinger Distance
  • WEAT/SEAT
  • StereoSet Score
  • Honest Score
  • Log Likelihood
  • F1 Score
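Of the metrics listed, Hellinger distance has a simple closed form for discrete distributions: H(P, Q) = (1/√2) · ‖√P − √Q‖₂, ranging from 0 (identical) to 1 (disjoint support). The sketch below implements that formula. Which distributions FairPy compares is not stated in this summary, so the two next-token distributions in the example are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical next-token probabilities over the same three candidate words,
# conditioned on two prompts that differ only in a demographic term.
dist_prompt_a = [0.7, 0.2, 0.1]
dist_prompt_b = [0.4, 0.3, 0.3]

print(hellinger(dist_prompt_a, dist_prompt_b))
```

The distance is symmetric, so the order of the two prompts does not matter.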

Datasets

  • Yelp
  • Reddit
  • Wikipedia (English)
  • StereoSet
  • CrowS-pairs
  • WinoBias

Benchmarks

  • StereoSet
  • CrowS-pairs
  • WinoBias