Tune lightweight prompts with counterfactual contrastive loss to reduce gender bias on downstream tasks

October 19, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Xiangjue Dong, Ziwei Zhu, Zhuoer Wang, Maria Teleki, James Caverlee

Why It Matters For Business

Co^2PT offers a low-cost way to reduce downstream gender bias: it freezes the main model, tunes small prompts, and avoids costly full-model retraining while lowering fairness gaps on real tasks.

Summary TLDR

Co^2PT is a parameter-efficient method that freezes a pretrained model and learns continuous prompts that (1) use counterfactually augmented training examples (swap demographic terms) and (2) apply a contrastive loss to make representations of counterfactual pairs similar. On three downstream, extrinsic bias benchmarks (Bias-STS-B, Bias-NLI, Bias-in-Bios) Co^2PT substantially reduces measured gender bias versus standard prompt tuning and several prior debiasing baselines while keeping task performance near prior baselines. The code and data are published.
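The counterfactual augmentation step (swapping demographic terms) can be sketched as below. This is a minimal illustration, not the authors' code: the term list `SWAP_PAIRS` and the helper names `counterfactual` and `augment` are hypothetical, and a real term list would be far larger and would have to handle ambiguous words.

```python
import re

# Hypothetical, deliberately tiny term list (an assumption for illustration).
# A real list would be much larger and must deal with ambiguous words
# such as "her", which can map to "his" or "him".
SWAP_PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother"), ("boy", "girl")]

# Bidirectional lookup so each term maps to its counterpart.
SWAP = {}
for a, b in SWAP_PAIRS:
    SWAP[a] = b
    SWAP[b] = a

_WORD = re.compile(r"\b\w+\b")


def counterfactual(text):
    """Return text with every listed demographic term swapped for its counterpart."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower())
        if swapped is None:
            return word
        # Preserve simple capitalization, e.g. "He" -> "She".
        return swapped.capitalize() if word[0].isupper() else swapped
    return _WORD.sub(repl, text)


def augment(examples):
    """Pair each (text, label) example with its counterfactual; the label is unchanged."""
    return [(text, counterfactual(text), label) for text, label in examples]
```

Each returned triple holds an original sentence, its counterfactual, and the shared task label, which is the shape a contrastive objective over counterfactual pairs would consume.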

Problem Statement

Pretrained language models often encode social biases that can re-emerge or amplify during downstream fine-tuning. Existing "debias-then-finetune" approaches focus on upstream changes and intrinsic metrics, but they do not reliably prevent bias in downstream tasks. The paper asks: can we efficiently inject debiasing directly during prompt tuning so downstream tasks stay fair without retraining the whole model?

Main Contribution

Co^2PT: a debias-while-prompt-tuning method that (a) freezes the PLM, (b) adds continuous prompts at every layer, (c) builds counterfactual pairs by swapping demographic terms, and (d) uses a contrastive loss between counterfactual pairs while optimizing task loss.

Empirical evidence on three extrinsic bias benchmarks (Bias-STS-B, Bias-NLI, Bias-in-Bios) that Co^2PT reduces downstream bias metrics substantially versus prompt tuning and many prior debiasing baselines.

A demonstration that Co^2PT can be applied on top of existing upstream debiased models to further lower downstream bias, together with ablations that identify which components matter most (counterfactual augmentation + contrastive loss).
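To make step (d) concrete, here is a minimal sketch of a contrastive loss over counterfactual pairs combined with a task loss. It assumes an InfoNCE-style objective with in-batch negatives and a joint loss of the form task loss + α × contrastive loss; the exact formulation in the paper may differ, and the function names are hypothetical. Plain Python lists stand in for the sentence representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(orig, cf, tau=0.05):
    """InfoNCE-style loss (assumed form): orig[i] should be most similar to
    its own counterfactual cf[i] among all counterfactuals in the batch."""
    total = 0.0
    for i, h in enumerate(orig):
        logits = [cosine(h, g) / tau for g in cf]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(orig)


def joint_loss(task_loss, orig, cf, alpha=1.0, tau=0.05):
    """Assumed combination: task loss plus alpha-weighted contrastive loss."""
    return task_loss + alpha * contrastive_loss(orig, cf, tau)
```

In a real setup only the prompt parameters would receive gradients from this loss, since the pretrained model stays frozen.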

Key Findings

On Bias-STS-B, Co^2PT cuts the average absolute similarity-difference (Diff) from PT's 0.321 to 0.058.

Numbers: Diff: PT 0.321 -> Co^2PT 0.058 (Table 2)

On Bias-NLI, Co^2PT raises neutrality metrics: Net Neutral (NN) from PT's 0.741 to 0.877 and Fraction Neutral (FN) from 0.812 to 0.965.

Numbers: NN: 0.741 -> 0.877; FN: 0.812 -> 0.965 (Table 3)

On Bias-in-Bios, Co^2PT reduces the per-occupation gender TPR gap (GAP TPR) from PT's 3.171 to 2.537 while keeping accuracy near prior baselines.

Numbers: GAP TPR: 3.171 -> 2.537; Acc roughly 0.82 (Table 4)

Ablations show that both modules matter: simple counterfactual data augmentation (PT+CDA) reduces Diff from 0.321 to 0.291 and unsupervised contrastive learning (PT+SCL) reduces it to 0.161, but combining task-specific counterfactual pairs with contrastive prompt tuning (Co^2PT) performs best at 0.058.

Numbers: PT 0.321; PT+CDA 0.291; PT+SCL 0.161; Co^2PT 0.058 (Table 6)

Co^2PT can be applied on top of existing debiased models; e.g., Context-Debias Diff drops from 0.332 to 0.088 after adding Co^2PT.

Numbers: Context-Debias: 0.332 -> +Co^2PT 0.088 (Table 5)
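For readers unfamiliar with the Bias-STS-B numbers, the sketch below shows one way to compute Diff (the average absolute similarity difference between gender-swapped sentence pairs) and the fraction of pairs whose difference exceeds a threshold. The function name and the example scores are hypothetical; the benchmark's own scripts should be treated as the reference.

```python
def diff_metrics(male_scores, female_scores, threshold=0.1):
    """Return (average absolute difference, fraction of pairs above threshold).

    male_scores[i] and female_scores[i] are the model's similarity scores for
    the same sentence pair with only the gendered term swapped.
    """
    gaps = [abs(m - f) for m, f in zip(male_scores, female_scores)]
    avg_diff = sum(gaps) / len(gaps)
    frac_above = sum(g > threshold for g in gaps) / len(gaps)
    return avg_diff, frac_above


# Made-up scores for illustration only.
male = [0.9, 0.5, 0.3, 0.7]
female = [0.6, 0.5, 0.25, 0.7]
avg_diff, frac_above = diff_metrics(male, female)
```

A perfectly gender-insensitive model would score both versions of every pair identically, giving a Diff of 0 and no pairs above the threshold.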

Results

Bias-STS-B Diff (lower is better)

Value: 0.058

Baseline: PT 0.321

Bias-STS-B τ>0.1 fraction (lower is better)

Value: 0.167

Baseline: PT 0.749

Bias-NLI Net Neutral (NN, higher is better)

Value: 0.877

Baseline: PT 0.741

Bias-NLI Fraction Neutral (FN, higher is better)

Value: 0.965

Baseline: PT 0.812

Bias-in-Bios GAP TPR (lower is better)

Value: 2.537

Baseline: PT 3.171
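The values above are easier to compare as relative changes against the PT baseline. The short script below does that arithmetic; the numbers are copied from this section, while the dictionary layout and the helper name `relative_change` are just illustrative choices.

```python
# (PT baseline, Co^2PT) pairs copied from the Results values above.
RESULTS = {
    "Bias-STS-B Diff": (0.321, 0.058),
    "Bias-STS-B fraction above 0.1": (0.749, 0.167),
    "Bias-NLI NN": (0.741, 0.877),
    "Bias-NLI FN": (0.812, 0.965),
    "Bias-in-Bios GAP TPR": (3.171, 2.537),
}


def relative_change(baseline, value):
    """Signed change relative to the baseline (negative means a reduction)."""
    return (value - baseline) / baseline


for name, (baseline, value) in RESULTS.items():
    print(f"{name}: {relative_change(baseline, value):+.1%}")
```

This works out to roughly an 82% reduction in Diff, a 78% reduction in the above-threshold fraction, gains of about 18% and 19% on NN and FN, and a 20% reduction in GAP TPR.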

Who Should Care

What To Try In 7 Days

Clone the authors' repo and run their Bias-STS-B/Bias-NLI evaluation on your BERT-based pipeline.

Build simple counterfactual swaps for your task (swap demographic tokens) and add contrastive prompt tuning with α=1, τ=0.05 and prompt length 20.

If you already use an upstream debiased checkpoint, apply Co^2PT during downstream tuning and measure TPR gaps per group.
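For the last step, measuring TPR gaps per group, a minimal sketch follows. It assumes binary gender labels `"M"` and `"F"` and computes a per-label TPR gap plus a root-mean-square aggregate; the record format and function names are hypothetical, and the exact GAP TPR and GAP RMS definitions used in the paper may differ in detail.

```python
import math
from collections import defaultdict


def tpr_gaps(records):
    """records: iterable of (true_label, predicted_label, gender).

    Returns {label: TPR_male - TPR_female} for labels seen with both genders.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_label, predicted, gender in records:
        totals[(true_label, gender)] += 1
        hits[(true_label, gender)] += int(predicted == true_label)

    gaps = {}
    labels = {label for (label, _) in list(totals)}
    for label in labels:
        n_m, n_f = totals[(label, "M")], totals[(label, "F")]
        if n_m and n_f:  # skip labels missing one of the groups
            gaps[label] = hits[(label, "M")] / n_m - hits[(label, "F")] / n_f
    return gaps


def gap_rms(gaps):
    """Root-mean-square of the per-label gaps."""
    return math.sqrt(sum(g * g for g in gaps.values()) / len(gaps))
```

Running this before and after adding Co^2PT on the same held-out set gives a like-for-like view of whether the gap actually shrinks on your task.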

Optimization Features

System Optimization

  • single-GPU experiments (RTX A5000, 24GB)

Training Optimization

  • parameter-efficient prompt tuning (freeze model, tune prompts)
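To give a feel for why this is cheap, the sketch below estimates how many parameters deep prompts add. It assumes a BERT-base-sized backbone (12 layers, hidden size 768, roughly 110M parameters), a prompt length of 20, and two prompt vectors per position per layer (one each for attention keys and values, as in prefix-style tuning). The two-vectors assumption and the function name are illustrative, and the task head is not counted.

```python
def deep_prompt_params(num_layers, hidden_size, prompt_length, vectors_per_position=2):
    """Trainable prompt parameters when prompts are added at every layer.

    vectors_per_position=2 assumes separate key and value prompts per layer.
    """
    return num_layers * prompt_length * hidden_size * vectors_per_position


BERT_BASE_PARAMS = 110_000_000  # approximate size of the frozen backbone

prompt_params = deep_prompt_params(num_layers=12, hidden_size=768, prompt_length=20)
fraction = prompt_params / BERT_BASE_PARAMS
print(f"{prompt_params} trainable prompt parameters ({fraction:.2%} of the backbone)")
```

Under these assumptions the prompts amount to well under 1% of the backbone's parameters, which is what makes single-GPU tuning practical.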

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation focuses on binary gender and English; non-binary and non-English cases are not empirically addressed.
  • Co^2PT uses prompt tuning, which can underperform full fine-tuning on some small models or datasets.
  • Effectiveness depends on the quality of the counterfactual term lists and on the presence of demographic tokens in the training data.

When Not To Use

  • If your target bias attribute does not appear in the training text or cannot be reliably swapped to build counterfactuals.
  • If you need debiasing that has already been validated on non-English languages; applying the method there would require additional work.
  • If you can afford full fine-tuning and use small models where fine-tuning outperforms prompt tuning.

Failure Modes

  • Too-short prompts reduce debiasing power (prompt length sensitivity shown).
  • A high contrastive temperature τ (e.g., 0.5) or a low α weakens the contrastive signal and reduces the benefit.
  • Using average-token pooling instead of [CLS] made it harder for the prompts to learn debiasing (appendix).
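One way to see why a high temperature can weaken the contrastive signal: in a softmax-based contrastive objective, τ divides the similarities, so a larger τ flattens the distribution over candidates. The sketch below illustrates this with made-up similarity values; it is an intuition aid under that assumption, not a reproduction of the paper's analysis.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def positive_probability(sim_pos, sim_negs, tau):
    """Softmax probability assigned to the true counterfactual pair."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    return softmax(logits)[0]


# Made-up cosine similarities: one positive pair and three in-batch negatives.
sim_pos, sim_negs = 0.9, [0.7, 0.6, 0.5]
p_low_tau = positive_probability(sim_pos, sim_negs, tau=0.05)
p_high_tau = positive_probability(sim_pos, sim_negs, tau=0.5)
```

With these values the positive pair gets a probability of roughly 0.98 at τ=0.05 but only about 0.37 at τ=0.5, so the same similarity gap separates positives from negatives far less sharply at the higher temperature.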

Core Entities

Models

  • Co^2PT
  • PT (deep prompt tuning)
  • BERT
  • BERT+CDA
  • ZariCDA
  • ZariDO
  • ADELE
  • ADELE-TA
  • Context-Debias
  • Auto-Debias
  • MABEL

Metrics

  • Diff (avg absolute similarity difference)
  • τ thresholds (τ=0.1, 0.3)
  • Pearson / Spearman correlations
  • Net Neutral (NN)
  • Fraction Neutral (FN)
  • Threshold T:0.5 / T:0.7
  • GAP TPR
  • GAP RMS
  • Accuracy
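The Bias-NLI neutrality metrics in this list can be computed from per-example class probabilities, as sketched below: Net Neutral is the mean probability of the neutral class, Fraction Neutral is the share of examples whose top prediction is neutral, and the T:0.5 / T:0.7 metrics are the share whose neutral probability exceeds the threshold. The function name and the label order (`neutral_index=1`) are assumptions; check the label mapping of your own model.

```python
def nli_neutrality(probs, neutral_index=1, threshold=0.5):
    """Return (net_neutral, fraction_neutral, fraction_above_threshold).

    probs: list of per-example class-probability lists.
    neutral_index: position of the neutral class (assumed; model-dependent).
    """
    n = len(probs)
    net_neutral = sum(p[neutral_index] for p in probs) / n
    fraction_neutral = sum(
        max(range(len(p)), key=p.__getitem__) == neutral_index for p in probs
    ) / n
    above = sum(p[neutral_index] > threshold for p in probs) / n
    return net_neutral, fraction_neutral, above


# Made-up probabilities for illustration only.
example_probs = [
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]
nn, fn, t05 = nli_neutrality(example_probs)
```

Because Bias-NLI pairs are constructed so that the correct answer is always neutral, values closer to 1 on all three metrics indicate less gender-driven inference.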

Datasets

  • STS-B
  • SNLI
  • Bias-STS-B
  • Bias-NLI
  • Bias-in-Bios
  • Common Crawl (source of the bios)

Benchmarks

  • Bias-STS-B
  • Bias-NLI
  • Bias-in-Bios