Tune lightweight prompts with counterfactual contrastive loss to reduce gender bias on downstream tasks

October 19, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is easy to add to BERT-style pipelines and was tested on three real downstream bias benchmarks with multiple runs and ablations, but results are shown only for BERT-size models and focus on binary gender in English.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Xiangjue Dong, Ziwei Zhu, Zhuoer Wang, Maria Teleki, James Caverlee

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Co^2PT offers a low-cost way to reduce downstream gender bias: it freezes the main model, tunes small prompts, and avoids costly full-model retraining while lowering fairness gaps on real tasks.

Who Should Care

Summary TLDR

Co^2PT is a parameter-efficient method that freezes a pretrained model and learns continuous prompts that (1) use counterfactually augmented training examples (swap demographic terms) and (2) apply a contrastive loss to make representations of counterfactual pairs similar. On three downstream, extrinsic bias benchmarks (Bias-STS-B, Bias-NLI, Bias-in-Bios) Co^2PT substantially reduces measured gender bias versus standard prompt tuning and several prior debiasing baselines while keeping task performance near prior baselines. The code and data are published.
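
The counterfactual augmentation step (swapping demographic terms) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `SWAP` word list, the function names `counterfactual` and `build_pairs`, and the choice to map the ambiguous "her" to "his" are all assumptions made for the example. A real pipeline would likely use a much larger curated word list.

```python
import re

# Hypothetical, deliberately tiny word list (an assumption for illustration).
# "her" is ambiguous (his/him); this sketch always maps it to "his".
SWAP = {
    "he": "she", "she": "he",
    "him": "her", "his": "her", "her": "his",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
    "father": "mother", "mother": "father",
}

# Match whole words only, so "he" inside "the" or "here" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE
)


def counterfactual(text):
    """Return text with each listed gendered term replaced by its counterpart."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return _PATTERN.sub(repl, text)


def build_pairs(texts):
    """Keep (original, counterfactual) pairs only where a swap happened."""
    pairs = []
    for text in texts:
        swapped = counterfactual(text)
        if swapped != text:
            pairs.append((text, swapped))
    return pairs


print(build_pairs(["The nurse is here.", "He thanked his mother."]))
```

Sentences with no listed term produce no pair, which matches the boundary noted later: the approach only helps when the target attribute appears in the training text and can be swapped reliably.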

Problem Statement

Pretrained language models often encode social biases that can re-emerge or amplify during downstream fine-tuning. Existing "debias-then-finetune" approaches focus on upstream changes and intrinsic metrics, but they do not reliably prevent bias in downstream tasks. The paper asks: can we efficiently inject debiasing directly during prompt tuning so downstream tasks stay fair without retraining the whole model?

Main Contribution

Co^2PT: a debias-while-prompt-tuning method that (a) freezes the PLM, (b) adds continuous prompts at every layer, (c) builds counterfactual pairs by swapping demographic terms, and (d) uses a contrastive loss between counterfactual pairs while optimizing task loss.

Empirical evidence on three extrinsic bias benchmarks (Bias-STS-B, Bias-NLI, Bias-in-Bios) that Co^2PT reduces downstream bias metrics substantially versus prompt tuning and many prior debiasing baselines.
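
To make the objective in the first contribution concrete, here is a sketch of a contrastive loss over counterfactual pairs combined with a task loss. It assumes an InfoNCE-style loss with cosine similarity and in-batch negatives, and a total loss of the form task loss + α × contrastive loss; this summary does not spell out the exact formulation, so treat both as assumptions. The function names are hypothetical, and the code only shows the arithmetic on plain lists. In actual training the vectors would be sentence representations from the frozen PLM, and gradients would flow only into the prompts.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(orig, cf, tau=0.05):
    """InfoNCE-style loss: orig[i] should be closest to cf[i] within the batch.

    orig and cf are equal-length lists of vectors; cf[j] for j != i act as
    in-batch negatives (an assumption of this sketch).
    """
    total = 0.0
    for i, h in enumerate(orig):
        logits = [cosine(h, c) / tau for c in cf]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(orig)


def total_loss(task_loss, orig, cf, alpha=1.0, tau=0.05):
    """Assumed combined objective: task loss plus alpha times contrastive loss."""
    return task_loss + alpha * contrastive_loss(orig, cf, tau)


orig = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # counterfactuals close to their originals
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # counterfactuals far from their originals
print(contrastive_loss(orig, aligned), contrastive_loss(orig, misaligned))
```

The loss is near zero when each counterfactual representation sits next to its original and grows when the pair drifts apart, which is the pressure that pulls counterfactual pairs together.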

Key Findings

On Bias-STS-B, Co^2PT cuts the average absolute similarity-difference (Diff) from PT's 0.321 to 0.058.

Numbers: Diff, PT 0.321 -> Co^2PT 0.058 (Table 2)

Practical Use: If you use prompt tuning on semantic-similarity tasks, add counterfactual pairs + contrastive prompt tuning to avoid large bias amplification.

Evidence Ref: Table 2

On Bias-NLI, Co^2PT raises the neutrality metrics: Net Neutral (NN) from PT's 0.741 to 0.877 and Fraction Neutral (FN) from 0.812 to 0.965.

Numbers: NN 0.741 -> 0.877; FN 0.812 -> 0.965 (Table 3)

Practical Use: For NLI-style tasks, Co^2PT makes the model predict neutral labels much more often for gender-neutral inputs, reducing gender-occupation bias in outputs.

Evidence Ref: Table 3
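
For readers who want to compute the neutrality metrics on their own model, the sketch below assumes the definitions commonly used with Bias-NLI: NN is the mean probability assigned to the neutral label, FN is the fraction of examples whose top prediction is neutral, and the threshold metric T is the fraction of examples whose neutral probability exceeds T. This summary does not state the definitions, so check them against the paper before comparing numbers. The `NEUTRAL` label index and the function name are hypothetical.

```python
NEUTRAL = 1  # hypothetical label index; depends on your model's label mapping


def neutrality_metrics(probs, neutral=NEUTRAL, thresholds=(0.5, 0.7)):
    """Compute NN, FN and threshold fractions from per-example class probabilities."""
    n = len(probs)
    # Net Neutral: average probability mass on the neutral label.
    net_neutral = sum(p[neutral] for p in probs) / n
    # Fraction Neutral: share of examples whose argmax is the neutral label.
    predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
    fraction_neutral = sum(1 for label in predicted if label == neutral) / n
    metrics = {"NN": net_neutral, "FN": fraction_neutral}
    # Threshold T: share of examples with neutral probability above T.
    for t in thresholds:
        metrics[f"T:{t}"] = sum(1 for p in probs if p[neutral] > t) / n
    return metrics


# Toy probabilities over (entailment, neutral, contradiction).
probs = [
    [0.10, 0.80, 0.10],
    [0.50, 0.40, 0.10],
    [0.20, 0.60, 0.20],
    [0.05, 0.90, 0.05],
]
print(neutrality_metrics(probs))
```

Under these definitions, higher values are better on all three metrics, since the Bias-NLI inputs are constructed so that the neutral label is the unbiased answer.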

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Bias-STS-B Diff (lower is better) | 0.058 | PT 0.321 | -0.263 | Bias-STS-B | Co^2PT Diff 0.058 vs PT 0.321 (Table 2) | Table 2
Bias-STS-B τ>0.1 fraction (lower is better) | 0.167 | PT 0.749 | -0.582 | Bias-STS-B | Co^2PT 0.167 vs PT 0.749 (Table 2) | Table 2
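
The two Bias-STS-B metrics above can be computed from paired similarity scores roughly as follows. The sketch assumes each test item yields two model scores, one for the male variant of a sentence and one for the female variant, each compared against the same occupation sentence; Diff is then the mean absolute gap, and the τ fraction is the share of items whose gap exceeds τ. The function name and the toy scores are hypothetical, and the score scale depends on how your model's outputs are normalized.

```python
def bias_sts_metrics(male_scores, female_scores, thresholds=(0.1, 0.3)):
    """Mean absolute similarity gap plus the fraction of gaps above each threshold."""
    if len(male_scores) != len(female_scores):
        raise ValueError("score lists must be paired and equal in length")
    gaps = [abs(m - f) for m, f in zip(male_scores, female_scores)]
    metrics = {"Diff": sum(gaps) / len(gaps)}
    for tau in thresholds:
        metrics[f"tau>{tau}"] = sum(1 for g in gaps if g > tau) / len(gaps)
    return metrics


# Toy similarity scores for four sentence pairs (made up for illustration).
male = [4.0, 3.5, 2.0, 4.5]
female = [4.0, 3.0, 2.25, 4.0]
print(bias_sts_metrics(male, female))
```

Note that τ here is a threshold on the score gap; it is unrelated to the contrastive temperature τ=0.05 mentioned in the tuning recipe.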

What To Try In 7 Days

Clone the authors' repo and run their Bias-STS-B/Bias-NLI evaluation on your BERT-based pipeline.

Build simple counterfactual swaps for your task (swap demographic tokens) and add contrastive prompt tuning with α=1, τ=0.05 and prompt length 20.

If you already use an upstream debiased checkpoint, apply Co^2PT during downstream tuning and measure TPR gaps per group.
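
For the last step, measuring TPR gaps per group, a minimal sketch is shown below. It assumes GAP TPR is the absolute difference in overall true positive rate between two groups and GAP RMS is the root mean square of the per-class TPR differences, which is how these metrics are usually described for Bias-in-Bios; confirm against the paper before comparing with its tables. The function name, the group labels "M" and "F", and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def tpr_gaps(y_true, y_pred, groups, group_a="M", group_b="F"):
    """Return (GAP TPR, GAP RMS, per-class gap) under the assumptions above."""
    hits = defaultdict(int)    # correct predictions per (class, group)
    totals = defaultdict(int)  # gold examples per (class, group)
    for gold, pred, group in zip(y_true, y_pred, groups):
        totals[(gold, group)] += 1
        hits[(gold, group)] += int(gold == pred)

    def overall_tpr(group):
        n = sum(count for (_, g), count in totals.items() if g == group)
        k = sum(count for (_, g), count in hits.items() if g == group)
        return k / n

    gap_tpr = abs(overall_tpr(group_a) - overall_tpr(group_b))

    per_class = {}
    for cls in sorted({c for c, _ in totals}):
        n_a = totals.get((cls, group_a), 0)
        n_b = totals.get((cls, group_b), 0)
        if n_a and n_b:  # skip classes missing one of the groups
            per_class[cls] = hits[(cls, group_a)] / n_a - hits[(cls, group_b)] / n_b
    if not per_class:
        raise ValueError("no class contains examples from both groups")
    gap_rms = math.sqrt(sum(v * v for v in per_class.values()) / len(per_class))
    return gap_tpr, gap_rms, per_class


y_true = ["nurse"] * 4 + ["surgeon"] * 4
groups = ["F", "F", "M", "M", "F", "F", "M", "M"]
y_pred = ["nurse", "nurse", "nurse", "surgeon",
          "surgeon", "nurse", "surgeon", "surgeon"]
print(tpr_gaps(y_true, y_pred, groups))
```

In this toy data the overall gap is zero while each occupation shows a gap of 0.5 in opposite directions, which is why it helps to report a per-class aggregate alongside the overall figure.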

Optimization Features

System Optimization: single-GPU experiments (RTX A5000, 24GB)
Training Optimization: parameter-efficient prompt tuning (freeze model, tune prompts)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

https://github.com/dongxiangjue/Co2PT (data and scripts)

Public datasets: STS-B, SNLI, Bias-in-Bios via referenced links

Risks & Boundaries

Limitations

Evaluation focuses on binary gender and English; non-binary and non-English cases are not empirically addressed.

Co^2PT relies on prompt tuning, which can underperform full fine-tuning on some small models or datasets.

When Not To Use

If your target bias attribute does not appear in the training text or cannot be reliably swapped to build counterfactual pairs.

If you need debiasing that has already been validated for non-English languages; that would require additional work.

Failure Modes

Too-short prompts reduce debiasing power (prompt length sensitivity shown).

A high contrastive temperature τ (e.g., 0.5) or a low α weakens the contrastive signal and reduces the benefit.

Core Entities

Models

Co2PT, PT (deep prompt tuning), BERT, BERT+CDA, ZariCDA, ZariDO, ADELE, ADELE-TA, Context-Debias, Auto-Debias, MABEL

Metrics

Diff (avg absolute similarity difference), τ thresholds (τ=0.1, 0.3), Pearson / Spearman correlations, Net Neutral (NN), Fraction Neutral (FN), Threshold T:0.5 / T:0.7, GAP TPR, GAP RMS, Accuracy

Datasets

STS-B, SNLI, Bias-STS-B, Bias-NLI, Bias-in-Bios, Common Crawl (for bios source)

Benchmarks

Bias-STS-B, Bias-NLI, Bias-in-Bios