A decision tree + open-source toolkit that maps your LLM use case and prompt sample to concrete bias and fairness metrics.

July 15, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Dylan Bouchard

Links

Abstract / PDF

Why It Matters For Business

Fairness risk depends on your real prompts. Running the right metrics on your own prompt sample gives more realistic risk estimates than off-the-shelf benchmarks and helps avoid costly deployment errors or reputational harm.

Summary TLDR

The paper introduces a practical decision framework and Python library (LangFair) that tells you which bias/fairness metrics to run for a specific LLM deployment. It emphasizes evaluating on your actual prompt population, adds counterfactual similarity and stereotype-classifier metrics, and shows that prompt choice often matters more than model choice for fairness risk.

Problem Statement

Existing fairness checks focus on model-level benchmarks and lack principled guidance for which metrics matter for a given deployment. That leads teams to miss prompt-specific risks and pick irrelevant metrics.

Main Contribution

A decision framework that maps a use case (model + prompt population) to applicable fairness metrics based on three inputs: task type, whether prompts mention protected attributes (fairness through unawareness, FTU), and stakeholder priorities.

New, output-only metrics: counterfactual adaptations of ROUGE, BLEU, cosine similarity, and a stereotype-classifier-derived score; plus a taxonomy linking risks to task archetypes.

An open-source Python package (LangFair) that generates responses, creates counterfactual pairs, and computes the recommended metrics; experiments across 5 LLMs × 5 prompt populations illustrate context dependence of fairness.
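The mapping from use case to metrics can be pictured as a small lookup function. The sketch below is a simplified illustration assembled from the metric names listed in this summary; it is not the paper's actual decision tree and not LangFair's API. The `UseCase` class, the `recommend_metrics` function, the three task labels, and the rule that recommendation metrics apply only when prompts mention protected attributes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UseCase:
    """Hypothetical description of one deployment (model + prompt population)."""
    task: str                  # "generation", "classification", or "recommendation"
    mentions_protected: bool   # True if sampled prompts name a protected attribute (FTU fails)
    error_based: bool = False  # classification only: stakeholders prioritize error rates


def recommend_metrics(use_case: UseCase) -> List[str]:
    """Return candidate metrics; a simplified stand-in for the paper's framework."""
    if use_case.task == "generation":
        metrics = ["Toxic Fraction", "Stereotype Fraction", "COBS", "SA"]
        # Counterfactual checks are only meaningful when prompts mention
        # protected attributes, i.e. when FTU is not satisfied.
        if use_case.mentions_protected:
            metrics += ["C-ROUGE-L", "C-BLEU", "C-Cosine", "CSP"]
        return metrics
    if use_case.task == "classification":
        # Stakeholder priorities pick between error-based and representational metrics.
        if use_case.error_based:
            return ["False Positive/Negative/Omission/Discovery Rate Differences"]
        return ["Demographic Parity", "Disparate Impact"]
    if use_case.task == "recommendation":
        # Assumption: list-similarity metrics compare counterfactual prompt pairs.
        return ["Jaccard-K", "SERP-K", "PRAG-K"] if use_case.mentions_protected else []
    raise ValueError(f"unknown task type: {use_case.task!r}")


print(recommend_metrics(UseCase(task="generation", mentions_protected=True)))
```

Keeping the three inputs explicit in one structure makes it easy to record, per deployment, why a given metric was or was not run.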

Key Findings

Fairness risk depends far more on prompt population than on model choice.

Numbers: Toxic Fraction varied up to 60× and 129× across prompt sets (GPT-4o: 0.181→0.003; Gemini-2.5-Flash-Lite: 0.645→0.005).

Stereotype detection choice matters: classifier-based metrics flagged much higher stereotyping on targeted prompts.

Numbers: Gemini-2.5-Flash produced stereotypical outputs for 28.4% of responses on DT-Stereo vs 5.0% on RTP-N (Stereotype Fraction).

Counterfactual similarity captures distinct risks even where toxicity is low.

Numbers: Gemini-2.5-Flash-Lite C-Cosine dropped from 0.900 (DialogSum) to 0.510 (Open-CF), a 43% reduction, despite near-zero toxicity.

No model was uniformly safe across all prompt populations.

Numbers: All five models produced non-zero Toxic Fraction and non-zero Stereotype Fraction on at least some datasets (Tables 2–3).
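The ratios quoted in the findings above follow directly from the reported values. The short check below recomputes them; the helper names `fold_change` and `relative_drop` are introduced here for illustration only.

```python
# Toxic Fraction per model as (RTP-C, RTP-N), copied from the figures quoted above.
toxic_fraction = {
    "GPT-4o": (0.181, 0.003),
    "Gemini-2.5-Flash-Lite": (0.645, 0.005),
}


def fold_change(high: float, low: float) -> float:
    """How many times larger the high value is than the low value."""
    return high / low


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


for model, (rtp_c, rtp_n) in toxic_fraction.items():
    print(f"{model}: {fold_change(rtp_c, rtp_n):.0f}x")
# GPT-4o: 60x
# Gemini-2.5-Flash-Lite: 129x

# C-Cosine for Gemini-2.5-Flash-Lite: DialogSum vs Open-CF.
print(f"C-Cosine drop: {relative_drop(0.900, 0.510):.0%}")
# C-Cosine drop: 43%
```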

Results

Toxic Fraction

Value: GPT-4o: 0.181 on RTP-C; 0.003 on RTP-N

Toxic Fraction

Value: Gemini-2.5-Flash-Lite: 0.645 on RTP-C; 0.005 on RTP-N

Stereotype Fraction

Value: Gemini-2.5-Flash: 0.284 on DT-Stereo; 0.050 on RTP-N

C-Cosine Similarity

Value: Gemini-2.5-Flash-Lite: 0.900 (DialogSum) vs 0.510 (Open-CF)
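To make the C-Cosine numbers concrete, here is a minimal sketch of a counterfactual cosine score: the mean cosine similarity between responses to paired prompts that differ only in a protected-attribute term. It swaps in a bag-of-words vectorizer so that it runs with the standard library alone; that choice is an assumption for illustration, and a sentence-embedding model would normally take its place. The function names are hypothetical and this is not LangFair's implementation.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va = Counter(re.findall(r"[a-z0-9']+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9']+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def counterfactual_cosine(pairs: List[Tuple[str, str]]) -> float:
    """Mean similarity over (response_to_original, response_to_counterfactual) pairs."""
    return sum(bow_cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical responses to a prompt pair that differs only in "he" vs "she".
pairs = [
    ("He is a strong candidate for the engineering role.",
     "She is a strong candidate for the engineering role."),
    ("He should lead the project.",
     "She might be better suited to a support position."),
]
print(round(counterfactual_cosine(pairs), 3))
```

A score near 1 means the model responds almost identically to both versions of the prompt; lower scores indicate the protected attribute changed the output.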

Who Should Care

What To Try In 7 Days

Run LangFair on a representative sample of your production prompts and review Toxic Fraction, Stereotype Fraction, and a counterfactual similarity metric.

Check FTU (fairness through unawareness): identify whether prompts contain protected attribute terms and decide if counterfactual invariance is required.

Set simple alerts: flag outputs with high toxicity or low counterfactual similarity for human review and iterate thresholds with stakeholders.
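The alerting step above can be prototyped in a few lines once each response has a toxicity score and, where applicable, a counterfactual similarity score. The sketch below assumes those scores already exist (for example from a toxicity classifier); the record fields, function names, and the 0.5 and 0.7 thresholds are hypothetical placeholders to be tuned with stakeholders, not values from the paper.

```python
from typing import Dict, List, Optional


def toxic_fraction(scores: List[float], threshold: float = 0.5) -> float:
    """Share of responses whose toxicity score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def flag_for_review(
    records: List[Dict],
    tox_threshold: float = 0.5,   # placeholder; tune with stakeholders
    sim_threshold: float = 0.7,   # placeholder; tune with stakeholders
) -> List[str]:
    """Return ids of responses that are too toxic or too dissimilar to their counterfactual."""
    flagged = []
    for rec in records:
        too_toxic = rec["toxicity"] >= tox_threshold
        sim: Optional[float] = rec.get("cf_similarity")  # None when no counterfactual exists
        too_different = sim is not None and sim < sim_threshold
        if too_toxic or too_different:
            flagged.append(rec["id"])
    return flagged


# Hypothetical scored sample of production responses.
sample = [
    {"id": "r1", "toxicity": 0.02, "cf_similarity": 0.91},
    {"id": "r2", "toxicity": 0.81, "cf_similarity": 0.88},
    {"id": "r3", "toxicity": 0.05, "cf_similarity": 0.42},
    {"id": "r4", "toxicity": 0.10},
]
print(toxic_fraction([r["toxicity"] for r in sample]))  # 0.25
print(flag_for_review(sample))                          # ['r2', 'r3']
```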

Reproducibility

Data Urls

  • RealToxicityPrompts (Gehman et al., 2020)
  • DialogSum (Chen et al., 2021)
  • DecodingTrust (Wang et al., 2024a)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Text-only, single-turn scope: does not handle multi-turn dynamics or multimodal inputs.
  • Requires a known/representative prompt population; public chatbots need runtime monitoring instead.
  • Counterfactual methods depend on lexicons; lexicons are imperfect across cultures and evolving identities.
  • Framework suggests metrics but does not set operational thresholds; thresholds need stakeholder input.

When Not To Use

  • Open-ended public chatbots where you cannot sample a representative prompt population (use response-level monitoring instead).
  • Multimodal or multi-turn agent pipelines without adapting metrics per stage.

Failure Modes

  • Overreliance on co-occurrence metrics that under-detect prompt-driven stereotype effects.
  • Choosing metrics inconsistent with stakeholder priorities (e.g., using representational metrics when error-based fairness matters).
  • Incomplete lexicons causing missed counterfactual pairs and false sense of safety.
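The lexicon dependence behind the last failure mode is easy to see in a toy version of the FTU check and counterfactual substitution. The sketch below uses a deliberately tiny, hypothetical gender lexicon (`GENDER_SWAPS`); the function names are illustrative and do not reflect LangFair's API or its actual word lists.

```python
import re
from typing import List, Tuple

# Deliberately small, hypothetical lexicon; real lexicons are far larger.
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
}
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, GENDER_SWAPS)) + r")\b", re.IGNORECASE
)


def mentions_protected(prompt: str) -> bool:
    """True if the prompt contains any lexicon term."""
    return _PATTERN.search(prompt) is not None


def satisfies_ftu(prompts: List[str]) -> bool:
    """FTU holds for the sample if no prompt mentions a lexicon term."""
    return not any(mentions_protected(p) for p in prompts)


def swap_terms(prompt: str) -> str:
    """Replace each lexicon term with its counterpart, keeping initial capitalization."""
    def repl(match: "re.Match") -> str:
        word = match.group(0)
        new = GENDER_SWAPS[word.lower()]
        return new.capitalize() if word[0].isupper() else new
    return _PATTERN.sub(repl, prompt)


def counterfactual_pairs(prompts: List[str]) -> List[Tuple[str, str]]:
    """Build (original, counterfactual) pairs for prompts that mention a lexicon term."""
    return [(p, swap_terms(p)) for p in prompts if mentions_protected(p)]


prompts = ["The man said he was late.", "Her husband called the clinic."]
print(counterfactual_pairs(prompts))
# [('The man said he was late.', 'The woman said she was late.')]
```

The second prompt clearly refers to gender, yet it produces no pair because "her" and "husband" are missing from the toy lexicon. That is the false sense of safety the failure mode describes.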

Core Entities

Models

  • GPT-4o
  • GPT-4o-mini
  • Gemini-2.5-Flash
  • Gemini-2.5-Flash-Lite
  • Gemini-2.5-Pro

Metrics

  • Toxic Fraction
  • Stereotype Fraction
  • Co-Occurrence Bias Score (COBS)
  • Stereotypical Associations (SA)
  • Counterfactual ROUGE-L (C-ROUGE-L)
  • Counterfactual BLEU (C-BLEU)
  • Counterfactual Cosine Similarity (C-Cosine)
  • Counterfactual Sentiment Parity (CSP)
  • Demographic Parity
  • Disparate Impact
  • False Positive/Negative/Omission/Discovery Rate Differences
  • Jaccard-K
  • SERP-K
  • PRAG-K
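Several of the listed metrics have simple textbook forms. The sketch below shows common definitions of Demographic Parity (as a difference in positive-prediction rates), Disparate Impact (as a ratio of those rates), and a Jaccard similarity over top-K recommendation lists. These are standard formulations written for illustration; the paper and LangFair may define or normalize them differently, and the function names are hypothetical.

```python
from typing import List, Sequence


def positive_rate(preds: Sequence[int]) -> float:
    """Share of positive (1) predictions in a group."""
    return sum(preds) / len(preds)


def demographic_parity_difference(group_a: Sequence[int], group_b: Sequence[int]) -> float:
    """Absolute gap in positive-prediction rates between two groups (0 is parity)."""
    return abs(positive_rate(group_a) - positive_rate(group_b))


def disparate_impact(unprivileged: Sequence[int], privileged: Sequence[int]) -> float:
    """Ratio of positive-prediction rates (1 is parity)."""
    return positive_rate(unprivileged) / positive_rate(privileged)


def jaccard_at_k(recs_a: List[str], recs_b: List[str], k: int) -> float:
    """Overlap of the top-K items recommended for two counterfactual prompts."""
    a, b = set(recs_a[:k]), set(recs_b[:k])
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical binary predictions for two groups.
group_a = [1, 1, 0, 0]   # positive rate 0.5
group_b = [1, 0, 0, 0]   # positive rate 0.25
print(demographic_parity_difference(group_a, group_b))  # 0.25
print(disparate_impact(group_b, group_a))               # 0.5

# Hypothetical top-3 recommendations for a counterfactual prompt pair.
print(jaccard_at_k(["a", "b", "c"], ["b", "c", "d"], k=3))  # 0.5
```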

Datasets

  • RealToxicityPrompts (RTP-C and RTP-N)
  • DialogSum
  • DecodingTrust Stereotype (DT-Stereo)
  • Open-Counterfactual (Open-CF)